Use case

Data extraction at scale

Extract from authenticated, multi-step pages — and stop re-paying the model to read the same site every run.

The problem

Read-only scrapers handle public pages well, but break the moment data lives behind a login, a multi-step flow, or a stateful session. Agent-driven extraction fixes that, but it re-pays the LLM to re-read the same site on every run, so cost scales with volume exactly when you want it to fall.

See how Twin works

app.example.com: Data extraction at scale

Log in via the credential vault: done
Compile the multi-step flow: running
Read the token-efficient DOM: queued
Replay extraction (cache hit): queued
Return rows as JSON: queued

A Twin run for data extraction at scale — compile once, then replay on a cache hit.

The wedge

How Twin solves it

Twin compiles an authenticated extraction flow into a skill once, then replays it deterministically. The token-efficient DOM map keeps extraction cheap on context, the credential vault handles the login, and the semantic dispatch cache means a re-phrased extraction request matches the skill you already compiled instead of cold-starting. For bulk read-only ingestion you can pair Twin with a dedicated scraper; for stateful, logged-in extraction, Twin is the layer that bends cost down.

1. Describe the extraction as a goal; Twin logs in via the credential vault and compiles the flow into a skill.
2. The DOM-to-indexed-state compiler returns a compact, numerically-indexed view, so even a 50-step flow stays within a tight token budget (illustratively ~3k tokens).
3. Repeat and re-worded extractions hit the semantic cache and replay with zero LLM calls.
4. The cross-tenant skill corpus means common site patterns are already compiled, raising your hit rate.
5. Proxy support (IPRoyal) and session video keep large runs observable and resilient.

In practice

One call, then it gets cheaper

Compile an authenticated, multi-step extraction once. Re-phrased pulls hit the semantic cache and replay with zero LLM calls instead of re-reading the site.

run.ts

import Twin from '@twin-browser/sdk';

const twin = new Twin({ apiKey: process.env.TWIN_API_KEY });

const run = await twin.agents.run({
  goal: 'Log in and export this month\'s transactions as rows',
  url: 'https://billing.example.com',
  credentials: 'billing-account',  // from the per-tenant vault
});

console.log(run.cached);       // true on a cache hit — no re-read cost
console.log(run.tokensUsed);   // a 50-step flow is illustratively ~3k tokens, not raw HTML
console.log(run.result.rows);  // [{ id, amount, status }, ...]

What happens on this call

Twin compiles the goal into a deterministic, replayable skill.
The next re-phrased request matches it in the semantic dispatch cache.
Matched runs replay with zero LLM calls, so credits per run drop back toward ~1.
Every call is authenticated, billed, and written to the audit log.
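
To see whether repeat runs are actually landing on the cache, you can tally the `cached` and `tokensUsed` fields that run.ts reads from each run. The sketch below is a minimal, self-contained tally, not part of the SDK: the `RunSummaryInput` and `RunStats` types and the `summarizeRuns` helper are hypothetical names, and the sample numbers (including zero tokens on a cache hit) are illustrative assumptions.

```typescript
// Fields mirrored from the run object in run.ts; the type names are hypothetical.
interface RunSummaryInput {
  cached: boolean;
  tokensUsed: number;
}

interface RunStats {
  runs: number;
  cacheHits: number;
  hitRate: number; // fraction in [0, 1]
  totalTokens: number;
}

// Tally cache hits and token spend across a batch of finished runs.
function summarizeRuns(runs: RunSummaryInput[]): RunStats {
  const cacheHits = runs.filter((r) => r.cached).length;
  const totalTokens = runs.reduce((sum, r) => sum + r.tokensUsed, 0);
  return {
    runs: runs.length,
    cacheHits,
    hitRate: runs.length === 0 ? 0 : cacheHits / runs.length,
    totalTokens,
  };
}

// Illustrative numbers only: one cold run, then three replays assumed to use no tokens.
const exampleStats = summarizeRuns([
  { cached: false, tokensUsed: 3000 },
  { cached: true, tokensUsed: 0 },
  { cached: true, tokensUsed: 0 },
  { cached: true, tokensUsed: 0 },
]);
console.log(exampleStats.hitRate, exampleStats.totalTokens);
```

A rising `hitRate` with flat `totalTokens` is the pattern the steps above describe.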

Read the API docs

Under the hood

The machinery that bends the cost curve

Every use case runs on the same primitives — the wedge that makes browser work cheaper the more your agents run.

Semantic dispatch cache

Re-phrased requests fuzzy-match a skill you already compiled, so they skip the planner LLM entirely.
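
As a rough illustration of how a re-phrased goal can land on an existing skill, the sketch below scores goals by word overlap. This page does not describe Twin's actual matcher, so token-set Jaccard similarity is swapped in here as a simple stand-in; `CompiledSkill`, `dispatch`, the stop-word list, and the 0.5 threshold are all assumptions.

```typescript
// Hypothetical stand-in for semantic matching: token-set Jaccard similarity.
const STOP_WORDS = new Set(["a", "an", "and", "as", "for", "in", "of", "the", "this", "to"]);

// Lowercase the goal, split it into alphanumeric words, and drop filler words.
function tokenize(goal: string): Set<string> {
  const words: string[] = goal.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => !STOP_WORDS.has(w)));
}

// Shared tokens divided by total distinct tokens; 0 when both sets are empty.
function jaccard(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

interface CompiledSkill {
  id: string;
  goal: string; // the goal text the skill was originally compiled from
}

// Return the best-matching skill at or above the threshold, or null on a miss.
function dispatch(goal: string, skills: CompiledSkill[], threshold = 0.5): CompiledSkill | null {
  const query = tokenize(goal);
  let best: CompiledSkill | null = null;
  let bestScore = 0;
  for (const skill of skills) {
    const score = jaccard(query, tokenize(skill.goal));
    if (score > bestScore) {
      best = skill;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? best : null;
}

const exampleSkills: CompiledSkill[] = [
  { id: "export-transactions", goal: "Log in and export this month's transactions as rows" },
];
// Re-worded request: shares 4 of 6 distinct tokens, so it clears the 0.5 threshold.
console.log(dispatch("Export transactions for this month as rows", exampleSkills)?.id);
// Unrelated request: no shared tokens, so it falls through to a cold start (null).
console.log(dispatch("Download the latest invoice PDF", exampleSkills));
```

A null result is the cold-start path: the request goes to the planner and compiles a new skill.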


Deterministic replay

Matched skills replay the same way every time — a pass is a pass, and the marginal cost trends toward zero.
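
One way to picture deterministic replay is a skill stored as a fixed list of steps that an interpreter walks in order, with no model call in the loop. The sketch below assumes that framing; the `Step` shapes, the `PageDriver` interface, and the in-memory fake page are hypothetical, and a real browser driver would be asynchronous.

```typescript
// A compiled skill modeled as plain data: each step targets an element by numeric index.
type Step =
  | { kind: "goto"; url: string }
  | { kind: "fill"; target: number; value: string }
  | { kind: "click"; target: number }
  | { kind: "extract"; target: number; as: string };

// Hypothetical, synchronous page interface (kept sync to make the sketch easy to run).
interface PageDriver {
  goto(url: string): void;
  fill(target: number, value: string): void;
  click(target: number): void;
  read(target: number): string;
}

// Walk the steps in order; the same steps against the same page give the same result.
function replay(steps: Step[], page: PageDriver): Record<string, string> {
  const extracted: Record<string, string> = {};
  for (const step of steps) {
    switch (step.kind) {
      case "goto":
        page.goto(step.url);
        break;
      case "fill":
        page.fill(step.target, step.value);
        break;
      case "click":
        page.click(step.target);
        break;
      case "extract":
        extracted[step.as] = page.read(step.target);
        break;
    }
  }
  return extracted;
}

// In-memory fake page that records every action it receives.
function makeFakePage(actions: string[]): PageDriver {
  return {
    goto: (url) => { actions.push(`goto ${url}`); },
    fill: (target) => { actions.push(`fill #${target}`); },
    click: (target) => { actions.push(`click #${target}`); },
    read: (target) => { actions.push(`read #${target}`); return `value-of-${target}`; },
  };
}

const exampleSteps: Step[] = [
  { kind: "goto", url: "https://billing.example.com" },
  { kind: "click", target: 3 },
  { kind: "extract", target: 7, as: "total" },
];
const firstActions: string[] = [];
const secondActions: string[] = [];
replay(exampleSteps, makeFakePage(firstActions));
replay(exampleSteps, makeFakePage(secondActions));
console.log(firstActions.join("|") === secondActions.join("|")); // true
```

Because the steps are data rather than a fresh plan, two replays issue the identical action sequence.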


Token-efficient DOM state

A live page becomes a compact, numerically-indexed map of interactive elements instead of raw HTML.
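
A minimal sketch of the idea, under stated assumptions: keep only interactive elements, give each a numeric index, and stop when a token budget is reached. The `PageElement` shape, the line format, and the four-characters-per-token estimate are illustrative choices, not Twin's compiler.

```typescript
// Simplified element record; a real page would supply many more attributes.
interface PageElement {
  tag: string;
  label: string;
  interactive: boolean;
}

// Rough heuristic (assumption): about four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Emit one compact line per interactive element until the budget would be exceeded.
function toIndexedState(elements: PageElement[], tokenBudget: number): string[] {
  const lines: string[] = [];
  let used = 0;
  let index = 0;
  for (const el of elements) {
    if (!el.interactive) continue; // static text and layout are dropped
    const line = `[${index}] ${el.tag} "${el.label}"`;
    const cost = estimateTokens(line);
    if (used + cost > tokenBudget) break;
    lines.push(line);
    used += cost;
    index++;
  }
  return lines;
}

const examplePage: PageElement[] = [
  { tag: "div", label: "Welcome back to your billing portal", interactive: false },
  { tag: "input", label: "Email", interactive: true },
  { tag: "input", label: "Password", interactive: true },
  { tag: "button", label: "Sign in", interactive: true },
];
console.log(toIndexedState(examplePage, 100).join("\n"));
// [0] input "Email"
// [1] input "Password"
// [2] button "Sign in"
```

The model can then refer to an element by its index (for example, "click 2") instead of by a selector or a block of HTML.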


Human-in-the-loop handoff

Blocked steps — approvals, MFA on an authorized flow — pause for a person, then resume cleanly.
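
The pause-and-resume behavior can be pictured as a small state machine. The sketch below is one hedged way to model it: the `RunState` union, the `FlowStep` shape, and the `advance` and `resume` helpers are hypothetical names, and the actual work of each step is left out so only the control flow shows.

```typescript
// Hypothetical run states: running, paused on a blocked step, or finished.
type RunState =
  | { status: "running"; nextStep: number }
  | { status: "paused"; blockedAt: number; reason: string }
  | { status: "done" };

interface FlowStep {
  label: string;
  needsHuman?: string; // why a person is required, e.g. an MFA code
}

// Run forward until a step needs a person, or until the flow finishes.
function advance(steps: FlowStep[], state: RunState): RunState {
  if (state.status !== "running") return state;
  for (let i = state.nextStep; i < steps.length; i++) {
    const reason = steps[i]?.needsHuman;
    if (reason !== undefined) {
      return { status: "paused", blockedAt: i, reason };
    }
  }
  return { status: "done" };
}

// After the person completes the blocked step, continue from the following step.
function resume(state: RunState): RunState {
  if (state.status !== "paused") return state;
  return { status: "running", nextStep: state.blockedAt + 1 };
}

const exampleFlow: FlowStep[] = [
  { label: "Submit login form" },
  { label: "Enter MFA code", needsHuman: "MFA on an authorized flow" },
  { label: "Export transactions" },
];
const pausedRun = advance(exampleFlow, { status: "running", nextStep: 0 });
console.log(pausedRun.status); // "paused"
const finishedRun = advance(exampleFlow, resume(pausedRun));
console.log(finishedRun.status); // "done"
```

Keeping the blocked step's index in the paused state is what lets the run pick up where it stopped rather than starting over.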


The outcome

On agent-driven infrastructure, authenticated, repeated extraction re-pays the model on every run. On Twin it settles into deterministic replay instead: illustratively, cost per 1,000 extractions falls ~5x after warmup rather than scaling with volume.
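
The arithmetic behind a figure like ~5x can be sketched with a two-price model: a cold run that pays for the planner LLM and a cached replay that does not. The numbers below are assumptions chosen for illustration (10 credits cold, 1 credit per replay, a 90% hit rate after warmup), not published pricing; `CostModel` and `costPerBatch` are hypothetical names.

```typescript
interface CostModel {
  coldCredits: number;   // assumed credits for a run that needs the planner LLM
  replayCredits: number; // assumed credits for a cached, deterministic replay
}

// Total credits for a batch, given the fraction of runs that hit the cache.
function costPerBatch(batchSize: number, hitRate: number, model: CostModel): number {
  const hits = Math.round(batchSize * hitRate);
  const misses = batchSize - hits;
  return misses * model.coldCredits + hits * model.replayCredits;
}

const exampleModel: CostModel = { coldCredits: 10, replayCredits: 1 };
const alwaysCold = costPerBatch(1000, 0, exampleModel);    // 1000 * 10 = 10000
const afterWarmup = costPerBatch(1000, 0.9, exampleModel); // 100 * 10 + 900 * 1 = 1900
console.log(alwaysCold, afterWarmup, alwaysCold / afterWarmup); // ratio is a little over 5
```

Under these assumptions the saving depends almost entirely on hit rate: at a 50% hit rate the same model gives 5,500 credits per 1,000 runs, under a 2x improvement.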

Go deeper

Twin vs. Bright Data

Twin vs. Firecrawl

Glossary: Token-efficient DOM

FAQ

Data extraction at scale on Twin — common questions

Is Twin a scraping tool like Bright Data or Firecrawl?

They are built for different jobs. Bright Data and Firecrawl excel at large-scale, read-only public extraction. Twin targets authenticated, stateful, repeated extraction (logging in, multi-step flows, human handoff) and bends cost down with a semantic skill cache instead of billing per gigabyte or per page. The two compose well.

How does Twin keep token cost low on big pages?

Instead of feeding raw HTML to the model, Twin’s DOM-to-indexed-state compiler produces a compact, numerically-indexed map of interactive elements under a token budget, so extraction stays cheap on context even on heavy pages.

Can Twin extract behind a login?

Yes. Credentials live in a per-tenant credential vault with default-deny row-level security (RLS), so authenticated, multi-step extraction runs without hard-coding secrets in your code.

More ways teams use Twin

AI agents

Give your LLM agent a real browser it can drive — and stop paying the model on every single run.

Internal workflow automation

Automate the internal tools and vendor portals that have no API — with audit logging and human approval built in.

RPA replacement

Replace brittle, selector-keyed RPA bots with skills that adapt to the page and get cheaper the more they run.

Put data extraction at scale on autopilot.

Start free, compile your first skill, and watch the marginal cost per run trend toward zero.

Start free

Read the guides
