Use case

Data extraction at scale

Extract from authenticated, multi-step pages — and stop re-paying the model to read the same site every run.

The problem

Read-only scrapers handle public pages well but break the moment data lives behind a login, a multi-step flow, or a stateful session. Agent-driven extraction fixes that, but it re-pays the LLM to re-read the same site on every run, so cost scales with volume exactly when you want it to fall.

app.example.com
  1. Log in via the credential vault (done)
  2. Compile the multi-step flow (running)
  3. Read the token-efficient DOM (queued)
  4. Replay extraction, cache hit (queued)
  5. Return rows as JSON (queued)

A Twin run for data extraction at scale — compile once, then replay on a cache hit.

The wedge

How Twin solves it

Twin compiles an authenticated extraction flow into a skill once, then replays it deterministically. The token-efficient DOM map keeps extraction cheap on context, the credential vault handles the login, and the semantic dispatch cache means a re-phrased extraction request matches the skill you already compiled instead of cold-starting. For bulk read-only ingestion you can pair Twin with a dedicated scraper; for stateful, logged-in extraction, Twin is the layer that bends cost down.

  1. Describe the extraction as a goal; Twin logs in via the credential vault and compiles the flow into a skill.
  2. The DOM-to-indexed-state compiler returns a compact, numerically-indexed view, so even a 50-step flow stays within a tight token budget (illustratively ~3k tokens).
  3. Repeat and re-worded extractions hit the semantic cache and replay with zero LLM calls.
  4. The cross-tenant skill corpus means common site patterns are already compiled, raising your hit rate.
  5. Proxy support (IPRoyal) and session video keep large runs observable and resilient.
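
To make step 2 concrete, here is a minimal sketch of what a compact, numerically-indexed view under a token budget could look like. It is not Twin's compiler: the `PageElement` and `IndexedState` shapes, the `buildIndexedState` function, and the four-characters-per-token estimate are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; none of these names come from the Twin SDK.
interface PageElement {
  tag: string;
  label: string;
  interactive: boolean;
}

interface IndexedState {
  lines: string[];        // one short line per interactive element, e.g. "[0] input: Email"
  tokenEstimate: number;  // running estimate of the tokens used by `lines`
  truncated: boolean;     // true if the budget cut the view short
}

// Rough estimate of about 4 characters per token. This is an assumption,
// not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function buildIndexedState(elements: PageElement[], tokenBudget: number): IndexedState {
  const lines: string[] = [];
  let tokenEstimate = 0;
  let truncated = false;
  let index = 0;

  for (const el of elements) {
    if (!el.interactive) continue; // drop markup the agent cannot act on
    const line = `[${index}] ${el.tag}: ${el.label}`;
    const cost = estimateTokens(line);
    if (tokenEstimate + cost > tokenBudget) {
      truncated = true; // stop before the budget would be exceeded
      break;
    }
    lines.push(line);
    tokenEstimate += cost;
    index += 1;
  }

  return { lines, tokenEstimate, truncated };
}

const samplePage: PageElement[] = [
  { tag: 'div', label: 'Marketing banner', interactive: false },
  { tag: 'input', label: 'Email', interactive: true },
  { tag: 'input', label: 'Password', interactive: true },
  { tag: 'button', label: 'Sign in', interactive: true },
];

const indexedState = buildIndexedState(samplePage, 50);
console.log(indexedState.lines.join('\n'));
```

The model would then refer to elements by index rather than by raw HTML, which is what keeps the context small in this sketch.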
In practice

One call, then it gets cheaper

Compile an authenticated, multi-step extraction once. Re-phrased pulls hit the semantic cache and replay with zero LLM calls instead of re-reading the site.

run.ts
import Twin from '@twin-browser/sdk';

const twin = new Twin({ apiKey: process.env.TWIN_API_KEY });

const run = await twin.agents.run({
  goal: 'Log in and export this month\'s transactions as rows',
  url: 'https://billing.example.com',
  credentials: 'billing-account',  // from the per-tenant vault
});

console.log(run.cached);       // true on a cache hit — no re-read cost
console.log(run.tokensUsed);   // illustratively, a 50-step flow is ~3k tokens, not raw HTML
console.log(run.result.rows);  // [{ id, amount, status }, ...]

What happens on this call

  • Twin compiles the goal into a deterministic, replayable skill.
  • The next re-phrased request matches it in the semantic dispatch cache.
  • Matched runs replay with zero LLM calls — credits drop back toward ~1.
  • Every call is authenticated, billed, and written to the audit log.
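
The cache-matching step in the bullets above can be sketched as a lookup that compares a new goal against the goals of already-compiled skills. This is a simplified stand-in, not Twin's semantic dispatch cache: it uses plain word overlap (Jaccard similarity) where a production system would more likely use embeddings, and `CompiledSkill`, `findCachedSkill`, the stop-word list, and the 0.6 threshold are all hypothetical.

```typescript
// Hypothetical sketch: word-overlap matching stands in for real semantic matching.
interface CompiledSkill {
  id: string;
  goal: string;
}

const STOP_WORDS = ['a', 'an', 'and', 'as', 'for', 'in', 's', 'the', 'then', 'this', 'to'];

// Lowercase the goal, split it into words, and drop stop words and duplicates.
function tokenize(goal: string): string[] {
  const words = goal.toLowerCase().match(/[a-z0-9]+/g) || [];
  const unique: string[] = [];
  for (const w of words) {
    if (STOP_WORDS.indexOf(w) === -1 && unique.indexOf(w) === -1) unique.push(w);
  }
  return unique;
}

// Jaccard similarity: shared words divided by the size of the union.
function jaccard(a: string[], b: string[]): number {
  const shared = a.filter((w) => b.indexOf(w) !== -1).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 0 : shared / union;
}

// Return the best-matching skill, or null when nothing clears the threshold
// (a miss would mean compiling a new skill).
function findCachedSkill(
  goal: string,
  skills: CompiledSkill[],
  threshold: number,
): CompiledSkill | null {
  const query = tokenize(goal);
  let best: CompiledSkill | null = null;
  let bestScore = 0;
  for (const skill of skills) {
    const score = jaccard(query, tokenize(skill.goal));
    if (score > bestScore) {
      best = skill;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? best : null;
}

const skills: CompiledSkill[] = [
  { id: 'skill-transactions', goal: "Log in and export this month's transactions as rows" },
  { id: 'skill-invoices', goal: 'Download all open invoices as PDF files' },
];

const hit = findCachedSkill("Sign in, then export this month's transactions as rows", skills, 0.6);
const miss = findCachedSkill('Update the shipping address on the account', skills, 0.6);

console.log(hit ? hit.id : 'cache miss');
console.log(miss ? miss.id : 'cache miss');
```

Word overlap is brittle against heavier rewording, which is one reason to prefer a semantic comparison in practice; the threshold trades false hits against unnecessary recompiles.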

The outcome

On agent-driven infrastructure, authenticated, repeated extraction re-pays the model on every run. With Twin it settles into deterministic replay instead: illustratively, cost per 1,000 extractions falls ~5x after warmup rather than scaling with volume.
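
As a rough way to reason about that figure, the arithmetic below blends a cold-run cost and a replay cost by cache hit rate. The credit values are assumptions picked only to mirror the illustrative ~5x ratio (5 credits for a run that calls the model, 1 for a cached replay); they are not published pricing, and `costPerThousand` is a hypothetical helper.

```typescript
// Hypothetical cost model; the credit values are assumptions, not published pricing.
interface CostAssumptions {
  coldCredits: number;   // credits for a run that calls the model
  replayCredits: number; // credits for a cached, deterministic replay
}

// Expected credits for 1,000 runs at a given cache hit rate (0 to 1).
function costPerThousand(hitRate: number, c: CostAssumptions): number {
  const perRun = hitRate * c.replayCredits + (1 - hitRate) * c.coldCredits;
  return 1000 * perRun;
}

const assumptions: CostAssumptions = { coldCredits: 5, replayCredits: 1 };

for (const hitRate of [0, 0.5, 0.9, 1]) {
  const credits = costPerThousand(hitRate, assumptions);
  console.log(`hit rate ${hitRate}: ${credits.toFixed(0)} credits per 1,000 runs`);
}
```

Under these assumptions the full 5x only appears when every run is a cache hit; at lower hit rates the saving shrinks proportionally, so the hit rate after warmup is the number to watch.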

FAQ

Data extraction at scale on Twin — common questions

Is Twin a scraping tool like Bright Data or Firecrawl?
They are built for different jobs. Bright Data and Firecrawl excel at large-scale, read-only public extraction. Twin targets authenticated, stateful, repeated extraction (logging in, multi-step flows, human handoff) and bends cost down with a semantic skill cache instead of billing per gigabyte or per page. The two compose well.
How does Twin keep token cost low on big pages?
Instead of feeding raw HTML to the model, Twin’s DOM-to-indexed-state compiler produces a compact, numerically-indexed map of interactive elements under a token budget, so extraction stays cheap on context even on heavy pages.
Can Twin extract behind a login?
Yes. Credentials live in a per-tenant credential vault with default-deny row-level security (RLS), so authenticated, multi-step extraction runs without hard-coding secrets in your code.

Put data extraction at scale on autopilot.

Start free, compile your first skill, and watch the marginal cost per run trend toward zero.