Data extraction at scale
Extract from authenticated, multi-step pages — and stop re-paying the model to read the same site every run.
The problem
Read-only scrapers handle public pages well, but they break the moment data lives behind a login, a multi-step flow, or a stateful session. Agent-driven extraction fixes that, but it pays the LLM to re-read the same site on every run, so cost scales with volume exactly when you want it to fall.
- Log in via the credential vault (done)
- Compile the multi-step flow (running)
- Read the token-efficient DOM (queued)
- Replay extraction, cache hit (queued)
- Return rows as JSON (queued)
A Twin run for data extraction at scale — compile once, then replay on a cache hit.
How Twin solves it
Twin compiles an authenticated extraction flow into a skill once, then replays it deterministically. The token-efficient DOM map keeps extraction cheap on context, the credential vault handles the login, and the semantic dispatch cache means a re-phrased extraction request matches the skill you already compiled instead of cold-starting. For bulk read-only ingestion you can pair Twin with a dedicated scraper; for stateful, logged-in extraction, Twin is the layer that bends cost down.
1. Describe the extraction as a goal; Twin logs in via the credential vault and compiles the flow into a skill.
2. The DOM-to-indexed-state compiler returns a compact, numerically-indexed view, so even a 50-step flow stays within a tight token budget (illustratively ~3k tokens).
3. Repeat and re-worded extractions hit the semantic cache and replay with zero LLM calls.
4. The cross-tenant skill corpus means common site patterns are already compiled, raising your hit rate.
5. Proxy support (IPRoyal) and session video keep large runs observable and resilient.
One call, then it gets cheaper
Compile an authenticated, multi-step extraction once. Re-phrased pulls hit the semantic cache and replay with zero LLM calls instead of re-reading the site.
import Twin from '@twin-browser/sdk';

const twin = new Twin({ apiKey: process.env.TWIN_API_KEY });

const run = await twin.agents.run({
  goal: 'Log in and export this month\'s transactions as rows',
  url: 'https://billing.example.com',
  credentials: 'billing-account', // from the per-tenant vault
});

console.log(run.cached); // true on a cache hit, so no re-read cost
console.log(run.tokensUsed); // a 50-step flow is roughly 3k tokens, not raw HTML
console.log(run.result.rows); // [{ id, amount, status }, ...]

What happens on this call
- Twin compiles the goal into a deterministic, replayable skill.
- The next re-phrased request matches it in the semantic dispatch cache.
- Matched runs replay with zero LLM calls — credits drop back toward ~1.
- Every call is authenticated, billed, and written to the audit log.
The machinery that bends the cost curve
Every use case runs on the same primitives — the wedge that makes browser work cheaper the more your agents run.
Semantic dispatch cache
Re-phrased requests fuzzy-match a skill you already compiled, so they skip the planner LLM entirely.
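To make "fuzzy-match" concrete, here is a minimal sketch of matching a re-phrased goal against goals that were already compiled. It is a toy stand-in that uses token-set Jaccard similarity; the page does not describe how Twin actually scores matches, and every name below (`tokenize`, `jaccard`, `matchSkill`, the 0.5 threshold) is a hypothetical chosen for illustration.

```typescript
// Toy stand-in for semantic matching: token-set Jaccard similarity.
// All names here are hypothetical and are not part of the Twin SDK.
const STOP_WORDS = new Set(["a", "an", "and", "as", "the", "this", "to", "in", "of"]);

// Lowercase the goal, split it into alphanumeric words, drop stop words.
function tokenize(goal: string): Set<string> {
  const words = goal.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set(words.filter((w) => !STOP_WORDS.has(w)));
}

// Size of the intersection divided by the size of the union.
function jaccard(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  for (const token of a) {
    if (b.has(token)) shared += 1;
  }
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Returns the id of the best-matching compiled skill, or null on a miss.
function matchSkill(
  goal: string,
  compiled: Map<string, string>, // skill id -> goal it was compiled from
  threshold = 0.5, // assumed cut-off, picked only for this example
): string | null {
  const query = tokenize(goal);
  let bestId: string | null = null;
  let bestScore = 0;
  for (const [id, compiledGoal] of compiled) {
    const score = jaccard(query, tokenize(compiledGoal));
    if (score > bestScore) {
      bestScore = score;
      bestId = id;
    }
  }
  return bestScore >= threshold ? bestId : null;
}

const compiled = new Map([
  ["export-transactions", "Log in and export this month's transactions as rows"],
]);

// A re-worded request still lands on the compiled skill; an unrelated one does not.
const hit = matchSkill("Export transactions for this month as rows", compiled);
const miss = matchSkill("Update the billing address", compiled);
```

A production matcher would more plausibly compare embeddings than raw tokens, but the control flow is the same: score the request against known skills and only fall back to planning when nothing clears the threshold.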
Deterministic replay
Matched skills replay the same way every time — a pass is a pass, and the marginal cost trends toward zero.
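The compile-once, replay-after idea can be sketched as a small cache around a planner. This is an illustration under stated assumptions, not Twin's implementation: `planWithLlm` is a hypothetical stand-in for the expensive LLM planning step, the `Step` and `Skill` shapes are invented, and the lookup is an exact key match to keep the example short.

```typescript
// Hypothetical shapes for illustration; not the Twin SDK's types.
type Step = { action: "goto" | "fill" | "click" | "read"; target: string };
type Skill = { goal: string; steps: Step[] };

let plannerCalls = 0;

// Stand-in for the expensive path: an LLM plans the steps for a goal.
function planWithLlm(goal: string): Skill {
  plannerCalls += 1;
  return {
    goal,
    steps: [
      { action: "goto", target: "https://billing.example.com" },
      { action: "fill", target: "login form" },
      { action: "click", target: "Export" },
      { action: "read", target: "transactions table" },
    ],
  };
}

const skills = new Map<string, Skill>();

// Compile on the first request, then replay the recorded steps verbatim.
function run(goal: string): { cached: boolean; trace: string[] } {
  let skill = skills.get(goal);
  const cached = skill !== undefined;
  if (skill === undefined) {
    skill = planWithLlm(goal);
    skills.set(goal, skill);
  }
  const trace = skill.steps.map((s) => `${s.action} ${s.target}`);
  return { cached, trace };
}

const first = run("export transactions"); // compiles: one planner call
const second = run("export transactions"); // replays: no planner call
```

Because the second run reads the stored steps instead of planning again, both runs produce the same trace and the planner counter stays at one.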
Token-efficient DOM state
A live page becomes a compact, numerically-indexed map of interactive elements instead of raw HTML.
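As a rough picture of what a numerically-indexed map might look like, the sketch below filters a page down to its interactive elements and numbers them. The `PageElement` shape and `toIndexedState` are assumptions made for this example; a real compiler would walk the live DOM, and the page does not specify Twin's actual output format.

```typescript
// Hypothetical element shape; a real implementation would walk the live DOM.
type PageElement = { tag: string; label: string; interactive: boolean };

// Keep only interactive elements and give each one a stable numeric index.
function toIndexedState(elements: PageElement[]): string[] {
  return elements
    .filter((el) => el.interactive)
    .map((el, index) => `[${index}] ${el.tag} "${el.label}"`);
}

const page: PageElement[] = [
  { tag: "div", label: "Welcome back", interactive: false },
  { tag: "input", label: "Email", interactive: true },
  { tag: "input", label: "Password", interactive: true },
  { tag: "p", label: "Forgot your password?", interactive: false },
  { tag: "button", label: "Sign in", interactive: true },
];

const state = toIndexedState(page);
// state holds three short lines:
// [0] input "Email"
// [1] input "Password"
// [2] button "Sign in"
```

An agent working from this view can refer to an element by its index (for example, "click 2") instead of reading the full markup, which is where the token savings would come from.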
Human-in-the-loop handoff
Blocked steps — approvals, MFA on an authorized flow — pause for a person, then resume cleanly.
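One way to picture pause-and-resume is a runner that stops at the first step needing a person and reports where to pick up again. This is a minimal sketch with invented names (`FlowStep`, `FlowResult`, `runFlow`); it is not the Twin API, and it models approval as a simple set of step names.

```typescript
// Hypothetical step and result shapes, for illustration only.
type FlowStep = { name: string; needsHuman: boolean };
type FlowResult =
  | { status: "done"; completed: string[] }
  | { status: "paused"; completed: string[]; resumeAt: number };

// Runs steps from `startAt`; stops at the first step that needs a person
// unless that step has already been approved.
function runFlow(
  steps: FlowStep[],
  startAt = 0,
  approved = new Set<string>(),
): FlowResult {
  const completed: string[] = [];
  for (let i = startAt; i < steps.length; i += 1) {
    const step = steps[i];
    if (step.needsHuman && !approved.has(step.name)) {
      return { status: "paused", completed, resumeAt: i };
    }
    completed.push(step.name);
  }
  return { status: "done", completed };
}

const flow: FlowStep[] = [
  { name: "log in", needsHuman: false },
  { name: "enter MFA code", needsHuman: true },
  { name: "export rows", needsHuman: false },
];

// First pass pauses at the MFA step; the second pass resumes from that index
// once a person has approved it.
const paused = runFlow(flow);
const resumed =
  paused.status === "paused"
    ? runFlow(flow, paused.resumeAt, new Set(["enter MFA code"]))
    : paused;
```

Returning the resume index rather than restarting means the steps that already succeeded are not repeated after the handoff.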
The outcome
On agent-driven infrastructure, authenticated, repeated extraction re-pays the model on every run. On Twin it settles into deterministic replay instead: illustratively, cost per 1,000 extractions falls about 5x after warmup rather than scaling with volume.
Data extraction at scale on Twin — common questions
Is Twin a scraping tool like Bright Data or Firecrawl?
How does Twin keep token cost low on big pages?
Can Twin extract behind a login?
More ways teams use Twin
AI agents
Give your LLM agent a real browser it can drive — and stop paying the model on every single run.
Internal workflow automation
Automate the internal tools and vendor portals that have no API — with audit logging and human approval built in.
RPA replacement
Replace brittle, selector-keyed RPA bots with skills that adapt to the page and get cheaper the more they run.
Put data extraction at scale on autopilot.
Start free, compile your first skill, and watch the marginal cost per run trend toward zero.