OpenTranscription/ Blog
2026-07-03 · ANALYSIS

ElevenLabs Scribe v2: a top-tier transcription product built on an undisclosed model

What Scribe v2 actually changed from v1: features, pricing, benchmark results, API limits, and the architecture details ElevenLabs still won't publish.

Scribe v2 is ElevenLabs' current flagship batch speech-to-text model, and it arrived with an unusual profile: exhaustively documented as a product, almost entirely undocumented as a neural network. It launched on January 9, 2026 as the successor to the original Scribe (now effectively Scribe v1, launched February 26, 2025), and it sits alongside Scribe v2 Realtime, released November 11, 2025 for live applications. The positioning shift was deliberate. v1 was "the first Scribe," strong on multilingual batch transcription. v2 was framed as more stable and more accurate on long, messy recordings. v2 Realtime got split out as the low-latency sibling for agents and live captioning. And ElevenLabs is not hedging on the succession: as of June 8, 2026, scribe_v1 is deprecated, with removal scheduled for July 9, 2026.

The v1-to-v2 story is not just "better WER." The publicly documented additions include context-aware keyterm prompting, native entity detection with timestamps, automatic multilingual transcription inside a single file, smart speaker diarization, and dynamic audio-event tagging. Later updates added entity redaction, Indic-English code-switching improvements, and a No Verbatim mode for cleaner text output. On the operational side, ElevenLabs exposes file uploads up to 5 GB, a 100 ms minimum audio length, up to 32 speakers, up to 5 multichannel tracks, async webhooks, a zero-retention mode for enterprise customers, and a separate realtime WebSocket stack with VAD and manual commit controls.

What ElevenLabs has not published is anything about the model itself. There is no public statement of Scribe v2's model family, encoder/decoder design, parameter count, tokenizer, training-hours total, or training-data composition, and no confirmation of whether it uses a transducer, CTC, an encoder-decoder transformer, or a hybrid. Compare that with OpenAI's Whisper launch, which disclosed a model trained on 680,000 hours of multilingual supervised data. For Scribe v2, the most you can reconstruct from public sources is the service pipeline and API behavior, not the underlying neural architecture.

Even so, the market position is real. Scribe v2 is a serious top-tier transcription product now, not a bolt-on from a TTS company adding STT. ElevenLabs' own marketing shows Scribe v2 beating GPT-4o Transcribe, Gemini 2.5 Pro, and AssemblyAI on batch accuracy, and Artificial Analysis' February 2026 AA-WER v2 benchmark placed Scribe v2 first at 2.3% AA-WER. The current Artificial Analysis non-streaming leaderboard snapshot, though, shows Scribe v2 second at 2.2%, behind Fun-Realtime-ASR-preview at 1.7%. Leaderboard leadership shifts with benchmark version, provider versioning, and dataset choice, and Scribe v2 is a case study in exactly that.

The short version for buyers: Scribe v2 looks strongest for multilingual batch transcription, subtitling, compliance-sensitive transcripts, and teams already building on ElevenLabs' broader audio stack. It is weaker where you need deep architectural transparency, public custom-model training or fine-tuning options, or a more mature independent streaming-first STT platform that does not depend on a separate "Realtime" product line.

How the Scribe line got here

ElevenLabs' STT roadmap moved in three visible phases. The company launched Scribe as its first speech-to-text model in February 2025, emphasizing multilingual accuracy and structured transcript output. It shipped Scribe v2 Realtime in November 2025 as a dedicated low-latency model for agents and live use cases. Then it launched Scribe v2 in January 2026 as the higher-accuracy, long-form, batch-oriented successor to v1. The 2026 updates since then have focused on privacy, transcript cleanup, and workflow convenience rather than any new public architecture disclosure.

| Date | Event | Why it mattered |
| --- | --- | --- |
| Feb 26, 2025 | Original Scribe launched. | First ElevenLabs STT model; marketed as supporting 99 languages, with word-level timestamps, diarization, and audio-event tagging. The launch post also published an internal contributions list, which is unusually useful for attribution. |
| Nov 11, 2025 | Scribe v2 Realtime launched. | Separated the live/agent use case from batch STT; ElevenLabs claimed ~150 ms latency and positioned it for meetings, agents, and real-time captioning. |
| Jan 9, 2026 | Scribe v2 launched. | Formal batch successor to v1; ElevenLabs said it improved stability and accuracy on long-form audio, pauses, tone changes, and extended silences. |
| Apr 1, 2026 | source_url added to the STT endpoint. | Batch workflows no longer required direct file upload; hosted URLs, YouTube, TikTok, and similar media sources became first-class inputs. |
| Apr 2, 2026 | Scribe v2 upgrade: entity redaction, Indic-English code-switching, No Verbatim, keyterms limit raised from 100 to 1,000. | The most substantial post-launch feature drop for v2, and it materially changed Scribe's privacy and transcript-formatting story. |
| Jun 2026 | audio_duration_secs added to STT responses. | Small but useful API ergonomics improvement for analytics and billing reconciliation. |
| Jun 8, 2026 | scribe_v1 deprecated; removal scheduled for Jul 9, 2026. | Confirms that v2 is not a sidegrade: ElevenLabs is actively retiring v1. The same changelog also changed the default Agents ASR provider toward scribe_realtime. |

Scribe's relevance widened as ElevenLabs wired it into more of its own stack. The v2 launch made it the transcription engine inside ElevenLabs Studio for subtitles, captions, and transcripts across marketing, media, research, training, and compliance workflows, while customer stories such as Wockhardt Hospitals show it handling multilingual clinical documentation. Scribe v2 is not only an API feature; it is part of ElevenLabs' broader platform strategy across creators, enterprises, and voice-agent builders.

What actually changed from v1

The clearest way to read the v1-to-v2 transition: v1 established multilingual transcription quality, and v2 added production-grade controllability and workflow intelligence. ElevenLabs' public messaging moved from "our first STT model is very accurate" to "this version is better on long recordings and real enterprise pipelines."

Two conclusions matter most. v2's advantage is product depth as much as raw recognition quality. And the public record is much stronger on externally visible behavior than on model internals. If you are comparing specs line by line with a regulated procurement mindset, v2 is richer than v1 on features and controls, but it is not more transparent on architecture or training disclosures.

| Dimension | Scribe v1 | Scribe v2 | Assessment |
| --- | --- | --- | --- |
| Release status | Launched Feb. 26, 2025; deprecated Jun. 8, 2026; removal planned Jul. 9, 2026. | Launched Jan. 9, 2026; current main batch model. | v2 is the strategic successor, not an experimental branch. |
| Primary optimization | Multilingual batch ASR with timestamps, diarization, audio-event tags. | Batch transcription, subtitling, and captioning at scale; specifically improved for long-form audio, pauses, tone changes, and extended silences. | v2 shifts from "accurate ASR" to "production batch transcription." |
| Accuracy claims | Launch post claimed best results on FLEURS and Common Voice across 99 languages. | Launch post claimed the lowest WER recorded on industry-standard benchmarks; official marketing says it beats GPT-4o Transcribe, Gemini 2.5 Pro, and AssemblyAI in batch accuracy. | Official claims favor v2, but only third-party benchmarks let you compare across vendors on a common methodology. |
| Latency | Batch latency publicly unspecified; launch post said a low-latency version was "coming soon." | Batch latency publicly unspecified; live use case is delegated to Scribe v2 Realtime at ~150 ms. | For batch, ElevenLabs still does not publish hard latency targets. |
| Languages | Launch materials said 99 languages. | Current docs say 90+ languages; marketing pages sometimes still say 99. v2 adds automatic multi-language transcription within one file. | Official language count is currently inconsistent; safest wording is 90+, with v1 launch copy historically using 99. |
| Diarization | Publicly available at launch. | Publicly available; current docs advertise up to 32 speakers. | v2's diarization is better documented and more configurable. |
| Audio-event tagging | Publicly available at launch. | Publicly available; current docs call it dynamic audio tagging. | Feature continues forward. |
| Entity detection / redaction | No public v1 launch support found. | Native entity detection with timestamps at launch; redaction added Apr. 2026. Up to 56 categories are documented in marketing. | One of the biggest v2 differentiators. |
| Punctuation / formatting | Publicly unspecified beyond standard transcript output. | Better handling of pauses and silences; No Verbatim removes filler words, repeated phrases, stuttering, and disfluencies for cleaner transcripts. | v2 is much more opinionated about readable output. |
| Code-switching | Multilingual support at launch, but no special public claim for code-switching. | Automatic multilingual transcription at launch; Apr. 2026 upgrade specifically improved Indic-English code-switching and preserved English words in Latin script. | v2 is materially better for mixed-language conversation. |
| Prompting / vocabulary control | Public status is unclear. Current pricing groups v1/v2 together, but current docs only explicitly describe keyterm prompting for v2 batch and v2 Realtime. | Keyterm prompting launched with 100 terms and expanded to 1,000 on Apr. 2, 2026; context-aware rather than blind insertion. | Public docs are inconsistent on whether late-era v1 shares this feature; v2 support is explicit. |
| Timestamps / alignment output | Word-level timestamps at launch. | Word-level timestamps; current API also exposes character timestamp granularity. | v2's public API is richer at the response layer. |
| API surface | Same REST STT endpoint family, model chosen by model_id. | Same REST endpoint with model_id="scribe_v2"; async webhooks, source_url, multichannel, entity features, no_verbatim, speaker-role detection, speaker library support. | v2 is much more fully instrumented for production workflows. |
| Pricing | Current official pricing groups Scribe v1 / v2 at $0.22/hour batch. | Same current batch pricing: $0.22/hour; keyterm prompting adds $0.05/hour and entity detection/redaction $0.07/hour. | Public pricing no longer distinguishes v1 from v2. |
| Hard limits | Current shared endpoint supports audio files under 5 GB and at least 100 ms long. | Same shared endpoint limits; plus up to 32 diarized speakers, 5 channels for multichannel, 1,000 keyterms, and minimum billable 20 seconds when keyterms exceed 100. | v2's operational limits are much better specified. |
| Model size | Unspecified. | Unspecified. | ElevenLabs does not publish parameter counts for either model. |
| Training data | Public roles for v1 included pre-training data and fine-tuning data, but no sizes or corpus breakdowns were published. | Unspecified publicly. No training-hours figure or dataset description found in official v2 materials. | This is a major transparency gap. |
| User fine-tuning / custom training | Unspecified publicly. | Unspecified publicly. Keyterm prompting exists, but customer model fine-tuning/custom ASR training is not documented. | Prompting is not the same as fine-tuning. |
| Confidence signals | Current shared response schema exposes language_probability and per-word logprob; model-specific calibration is undocumented. | Same. | ElevenLabs exposes useful scoring primitives, but not a clear calibrated confidence API. |

What the API surface tells you about the architecture

Public documentation is enough to reconstruct Scribe v2 as a service pipeline, but not as a fully described neural network. ElevenLabs documents input handling, language selection, timestamping, diarization controls, entity and redaction passes, multichannel behavior, logging controls, and realtime commit/VAD flows. It does not document the underlying model class, whether batch and realtime share encoders, the tokenizer, the decoder type, the parameter count, or the exact training regime. That is the central technical limitation in any rigorous public analysis of Scribe v2. Everything below describes the public interface, not the hidden neural layout, and draws on the current STT API reference, feature docs, and v2 launch materials.

At the preprocessing layer, Scribe v2 accepts direct file upload or source_url; current docs say it supports all major audio and video formats, with files under 5 GB and at least 100 ms long. If audio is passed as 16-bit PCM at 16 kHz mono little-endian (pcm_s16le_16), ElevenLabs says latency is lower than when sending encoded audio. That tells you there is a front-end audio decoding and normalization stage, but not whether batch inference includes an explicit, publicly documented VAD stage.
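
The documented input limits are easy to check client-side before a job is submitted. The sketch below is a minimal illustration, not ElevenLabs code: `check_wav` and `make_silence` are hypothetical helper names, a WAV container stands in for the raw pcm_s16le_16 stream, and reading "5 GB" as decimal gigabytes is an assumption.

```python
import io
import wave

MAX_FILE_BYTES = 5 * 1000**3  # "under 5 GB"; decimal gigabytes is an assumption
MIN_DURATION_SECS = 0.1       # documented 100 ms minimum audio length


def check_wav(data: bytes) -> dict:
    """Inspect WAV bytes and report whether they fit the documented limits."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
        duration = wav.getnframes() / rate
    return {
        "size_ok": len(data) < MAX_FILE_BYTES,
        "duration_ok": duration >= MIN_DURATION_SECS,
        # 16-bit, 16 kHz, mono: the layout the docs describe as lower latency
        "is_16bit_16k_mono": rate == 16000 and channels == 1 and sample_width == 2,
        "duration_secs": duration,
    }


def make_silence(duration_secs: float, rate: int = 16000) -> bytes:
    """Build a silent 16-bit mono WAV in memory, only to exercise check_wav."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(duration_secs * rate))
    return buf.getvalue()


if __name__ == "__main__":
    print(check_wav(make_silence(1.0)))
    print(check_wav(make_silence(0.05)))  # shorter than the 100 ms minimum
```

A check like this only catches obviously invalid inputs early; the service remains the authority on what it accepts.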

The language path is partly explicit. If language_code is omitted, the API predicts language automatically and returns both language_code and language_probability; v2 launch materials also say a single file can contain multiple languages and the model will automatically detect and transcribe them without manual segmentation. The April 2026 upgrade further documented better Indic-English code-switching, specifically preserving English words in Latin script even inside Indic-language utterances.

The alignment and output layer is better documented than the acoustic model. Batch Scribe returns a transcript plus words[] entries containing text, start, end, speaker_id, type, and logprob. Current docs expose word and character timestamp granularity. That strongly suggests a timing-alignment layer after or during decoding, but the exact method is unspecified. ElevenLabs also has a separate Forced Alignment capability in its broader product lineup, yet the public Scribe docs do not state that Scribe v2 internally calls that system.
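
Word-level timing is what makes the subtitle use case practical. As a rough sketch, the function below turns a words[] list into SRT cues. The field names text, start, end, and type come from the documented response; the sample values, the assumption that spoken words carry type "word", and the `max_words` and `max_gap` cue-breaking rules are all illustrative choices, not ElevenLabs behavior.

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7, max_gap=0.8):
    """Group word entries into cues, breaking on cue length or a long pause."""
    cues, current = [], []
    for w in words:
        # Assumption: spoken words have type "word"; other entries are skipped.
        if w.get("type", "word") != "word":
            continue
        long_pause = current and w["start"] - current[-1]["end"] > max_gap
        if current and (len(current) >= max_words or long_pause):
            cues.append(current)
            current = []
        current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for i, cue in enumerate(cues, 1):
        text = " ".join(w["text"] for w in cue)
        blocks.append(
            f"{i}\n{fmt_ts(cue[0]['start'])} --> {fmt_ts(cue[-1]['end'])}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [  # hypothetical response fragment
        {"text": "Hello", "start": 0.0, "end": 0.4, "type": "word"},
        {"text": "world", "start": 0.45, "end": 0.9, "type": "word"},
        {"text": "(laughter)", "start": 1.0, "end": 1.5, "type": "audio_event"},
        {"text": "Again", "start": 2.5, "end": 3.0, "type": "word"},
    ]
    print(words_to_srt(sample))
```

Real caption pipelines usually add line-length and reading-speed rules on top of this; the point here is only that the documented timestamps are sufficient input.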

Speaker handling is explicit at the API level. Batch STT can diarize speakers, accept num_speakers up to 32, and optionally tune a diarization_threshold. The docs explain the threshold tradeoff in concrete terms: higher values reduce the risk of splitting one speaker into multiple clusters but increase the risk of merging different speakers; lower values do the opposite. There is also a separate multichannel mode for audio where each channel contains a single speaker; this supports up to 5 channels, returns channel_index per word, and bills each channel independently for the full duration. So ElevenLabs exposes both speaker clustering and channel-aware decomposition, but leaves the specific diarization model undisclosed.
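
On the consuming side, diarized output usually needs to be folded into speaker turns. The sketch below assumes each word entry carries the documented speaker_id, start, end, and text fields; the `speaker_turns` helper and the sample speaker labels are hypothetical.

```python
def speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for w in words:
        if turns and turns[-1]["speaker_id"] == w["speaker_id"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + w["text"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append({
                "speaker_id": w["speaker_id"],
                "start": w["start"],
                "end": w["end"],
                "text": w["text"],
            })
    return turns


if __name__ == "__main__":
    sample = [  # hypothetical diarized words; label format is an assumption
        {"text": "Good", "start": 0.0, "end": 0.3, "speaker_id": "speaker_0"},
        {"text": "morning", "start": 0.35, "end": 0.9, "speaker_id": "speaker_0"},
        {"text": "Hi", "start": 1.2, "end": 1.4, "speaker_id": "speaker_1"},
        {"text": "Shall", "start": 1.8, "end": 2.1, "speaker_id": "speaker_0"},
    ]
    for turn in speaker_turns(sample):
        print(f'{turn["speaker_id"]} [{turn["start"]:.2f}-{turn["end"]:.2f}]: {turn["text"]}')
```

If diarization_threshold is set so that one speaker gets split into several clusters, the symptom in output like this is a run of very short turns alternating between two labels.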

Transcript enrichment is where v2 looks most differentiated. The publicly documented post-processing and sidecar analysis features include dynamic audio tagging for non-speech events, entity detection across categories such as PII, PHI, PCI, other, and offensive language, entity redaction with several replacement modes, and No Verbatim cleanup for filler words and disfluencies. These make Scribe v2 look less like a bare ASR endpoint and more like a transcript-operations service.

Confidence scoring is present, but only partially. What the API actually exposes is per-word logprob and overall language_probability. ElevenLabs does not currently document a calibrated transcript-level or word-level confidence metric comparable to the explicit confidence fields some competitors document. That matters in quality assurance pipelines: logprob is useful, but it is not the same thing as a documented, business-facing confidence score.
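
In practice, a QA pipeline can still get a rough review signal out of logprob. The sketch below assumes logprob is a natural-log probability, which the public docs do not confirm, and the 0.5 cutoff and `low_confidence_words` helper are arbitrary illustrations rather than a calibrated threshold.

```python
import math


def low_confidence_words(words, min_prob=0.5):
    """Return (text, probability) pairs for words below an uncalibrated cutoff."""
    flagged = []
    for w in words:
        # Assumption: logprob is a natural log, so exp() recovers a probability.
        prob = math.exp(w["logprob"])
        if prob < min_prob:
            flagged.append((w["text"], round(prob, 3)))
    return flagged


if __name__ == "__main__":
    sample = [  # hypothetical values
        {"text": "patient", "logprob": -0.01},
        {"text": "Wockhardt", "logprob": -1.5},
    ]
    print(low_confidence_words(sample))
```

Because the score is uncalibrated, a cutoff like this is best tuned against a hand-checked sample of your own audio rather than read as a true error probability.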

The clearest documentation of VAD and streaming control exists on Scribe v2 Realtime, not batch Scribe v2. Realtime exposes manual or VAD-based commit strategies, vad_threshold, vad_silence_threshold_secs, minimum speech and silence durations, partial versus committed transcript events, predictive transcription, and text conditioning for reconnect continuity. These are family-level clues about ElevenLabs' speech stack, but they should not be over-read as proof that the batch model shares the same online segmentation design.
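
To make the VAD-based commit idea concrete, here is a toy model of commit-on-silence. It is not ElevenLabs' implementation. The parameter names vad_threshold and vad_silence_threshold_secs are borrowed from the realtime docs, but their exact semantics are assumed, and `speech_probs` and `frame_secs` are hypothetical stand-ins for whatever per-frame signal the real system uses.

```python
import math


def commit_points(speech_probs, frame_secs=0.1,
                  vad_threshold=0.5, vad_silence_threshold_secs=0.3):
    """Return frame indices where a transcript segment would be committed.

    A commit fires once enough consecutive silent frames follow some speech.
    """
    # Convert to whole milliseconds first to avoid float drift in the division.
    needed = max(1, math.ceil(round(vad_silence_threshold_secs * 1000)
                              / round(frame_secs * 1000)))
    commits = []
    silent_frames = 0
    has_speech = False
    for i, p in enumerate(speech_probs):
        if p >= vad_threshold:
            has_speech = True
            silent_frames = 0
        else:
            silent_frames += 1
            if has_speech and silent_frames >= needed:
                commits.append(i)
                has_speech = False  # wait for new speech before committing again
    return commits


if __name__ == "__main__":
    probs = [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.8, 0.2, 0.2, 0.2]
    print(commit_points(probs))
```

Raising the silence threshold in this toy delays commits and merges short pauses into one segment, which is the same latency-versus-fragmentation tradeoff the realtime controls expose.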

Error modes are practical and well documented. Batch STT returns 422 Unprocessable Entity for malformed requests; realtime exposes auth, quota, throttling, rate-limit, and "unaccepted terms" error events. On the product side, ElevenLabs also warns that automated entity redaction may not identify or remove all sensitive information, so manual review is still required for high-stakes workflows. That is an important admission for compliance-sensitive use cases.

Who built it, as far as the public record goes

ElevenLabs is not a transparent "paper-first" research lab in the way some open-model ASR groups are, but the company does provide enough public attribution to identify the founders, visible ASR researchers, and public SDK contributors around Scribe. The strongest direct attributions are actually in the v1 launch post, which names specific contributors and roles. The v2 launch post does not publish an equivalent contributor section, so direct v2-batch authorship beyond visible authors and public product writers is unspecified.

| Person | Public role / bio | Publicly linked relevance to Scribe / ElevenLabs |
| --- | --- | --- |
| Piotr Dąbkowski | Cofounder; leads research and engineering teams developing ElevenLabs' AI audio models. Previously worked on ML at Google; studied at Cambridge. | Named as a research contributor on the original Scribe launch and publicly described by ElevenLabs as leading its research team. |
| Mati Staniszewski | Cofounder; leads teams building AI that can communicate at human level. Previously at Palantir; studied Mathematics at Imperial. | Not publicly credited as a model engineer on Scribe, but central to company/product direction and platform expansion. |
| Flavio Schneider | ElevenLabs research team; focuses on ASR and Music. | Publicly credited in the Scribe launch post as research lead, training, architecture for v1. This is the strongest public model-attribution source ElevenLabs has published for Scribe. |
| Tim von Känel | ElevenLabs research team; focuses on ASR and Music. | Publicly credited in the Scribe launch post as project lead, pre-training data, fine-tuning data. |
| Maximiliano Levi | Public bio not retrieved in official materials reviewed. | Publicly credited on the Scribe launch as responsible for inference and optimizations. |
| Johan Nordberg | Public bio not retrieved in official materials reviewed. | Listed as a research contributor on the Scribe launch. |
| Austin Malerba | Public bio not retrieved in official materials reviewed. | Credited for frontend on the Scribe launch. |
| Hristo Stoychev | Public bio not retrieved in official materials reviewed. | Credited for backend on the Scribe launch. |
| Alex George | Public bio not retrieved in official materials reviewed. | Credited for data acquisition on the Scribe launch. |
| Joe Reeve | Growth team; focused on helping developers get the most out of ElevenLabs' frontier audio models. | Authored the April 2026 Scribe v2 upgrade post, suggesting a visible product/developer-relations role around Scribe's external rollout. |
| Tadas Petra | Public bio page not retrieved, but publicly listed as article author. | Authored How Scribe v2 Realtime Works, the most technical public explainer of the Scribe family's live stack. |

Public GitHub evidence is strongest on the SDK and platformization side, not on the core STT model. In the official ElevenLabs SDK release notes, @kraenhansen is explicitly credited with adding keyterms and no_verbatim support to the Scribe realtime API in both the JS and Python SDKs, and @PaulAsjes appears repeatedly on Speech Engine and SDK release work. Those are real contributions to how developers access Scribe, but they are not contributions to the hidden ASR model itself.

The attribution caveat is simple. ElevenLabs has published a concrete contributor list for Scribe v1, but not for Scribe v2 batch. Any claim that "person X built Scribe v2" should be treated as unspecified unless it comes directly from ElevenLabs.

Where it sits in the market

ElevenLabs positions Scribe v2 as a premium, accuracy-first, multilingual transcription model that inherits platform advantages from the rest of ElevenLabs: Studio, API, Agents, security and compliance packaging, and adjacent audio products. Official marketing places Scribe v2 against GPT-4o Transcribe, Gemini 2.5 Pro, and AssemblyAI for batch accuracy, and places Scribe v2 Realtime against Gemini Flash 2.5, GPT-4o Mini, and Deepgram Nova 3 for live speech. That tells you how ElevenLabs wants the market to think: v2 for long-form clean transcripts, v2 Realtime for agents.

Independent benchmarking is directionally favorable to ElevenLabs, but not uniformly decisive. Artificial Analysis' AA-WER v2 report from February 2026 put Scribe v2 first at 2.3% AA-WER, and specifically first on its proprietary AA-AgentTalk voice-agent dataset and the Earnings22 corporate-calls set. But the current Artificial Analysis non-streaming leaderboard snapshot shows Scribe v2 second at 2.2%, behind Fun-Realtime-ASR-preview. Separately, AssemblyAI's own benchmark page ranks Universal-3 Pro ahead of ElevenLabs Scribe v2 on AssemblyAI's selected dataset mix. The right takeaway is not "who is universally best." Scribe v2 is firmly in the top band, and final rankings depend on benchmark composition and update timing.
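
For readers who want to sanity-check a vendor on their own audio, plain word error rate is simple to compute. The sketch below implements the standard word-level edit-distance definition; it is not Artificial Analysis' AA-WER methodology, whose normalization and weighting rules are not described here, and the whitespace tokenization is a deliberate simplification.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Casing, punctuation, and number formatting can swing this figure substantially, which is one reason leaderboard numbers from different evaluators are not directly comparable.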

Pricing is aggressive for a batch premium model. ElevenLabs currently lists Scribe v1/v2 at $0.22/hour and Scribe v2 Realtime at $0.39/hour, with add-on charges for entity detection/redaction and keyterm prompting. That places the batch price well below Google Cloud STT v2's headline $0.016/minute (~$0.96/hour), roughly level with AssemblyAI Universal-3 Pro's $0.21/hour, below OpenAI's request-response gpt-4o-transcribe at an estimated $0.006/minute (~$0.36/hour), and far below OpenAI's cited realtime-whisper price of $0.017/minute (~$1.02/hour).
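
The per-hour figures above are just per-minute prices multiplied by 60, and the add-ons stack on the base rate. The sketch below uses only prices quoted in this article ($0.22/hour base, $0.05/hour keyterms, $0.07/hour entity detection/redaction); the function names are hypothetical, and it ignores billing details such as minimum billable durations.

```python
SCRIBE_BATCH_PER_HOUR = 0.22  # base batch rate quoted in this article
KEYTERMS_PER_HOUR = 0.05      # keyterm prompting add-on
ENTITIES_PER_HOUR = 0.07      # entity detection/redaction add-on


def per_minute_to_per_hour(price_per_minute: float) -> float:
    """Convert a per-minute list price to its per-hour equivalent."""
    return price_per_minute * 60


def scribe_batch_cost(audio_hours: float, keyterms: bool = False,
                      entities: bool = False) -> float:
    """Estimate batch cost in USD for a given number of audio hours."""
    rate = SCRIBE_BATCH_PER_HOUR
    if keyterms:
        rate += KEYTERMS_PER_HOUR
    if entities:
        rate += ENTITIES_PER_HOUR
    return round(audio_hours * rate, 2)


if __name__ == "__main__":
    hourly = {
        "ElevenLabs Scribe v2": SCRIBE_BATCH_PER_HOUR,
        "AssemblyAI Universal-3 Pro": 0.21,
        "OpenAI gpt-4o-transcribe": per_minute_to_per_hour(0.006),
        "Google Cloud STT v2": per_minute_to_per_hour(0.016),
    }
    for name, price in sorted(hourly.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${price:.2f}/hour")
    print(scribe_batch_cost(100, keyterms=True, entities=True))
```

With both add-ons enabled the effective Scribe v2 rate is $0.34/hour, which narrows the gap to the OpenAI request-response price considerably.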

The comparison below is interpretive, not vendor-declared. It reflects the combined picture from public features, pricing, benchmarks, and product framing.

| Provider | Product | Public pricing retrieved | Publicly visible strengths | Main watch-outs versus Scribe v2 | Best fit |
| --- | --- | --- | --- | --- | --- |
| ElevenLabs | Scribe v2 | $0.22/hour batch; keyterms $0.05/hour; entity detection/redaction $0.07/hour. Realtime sibling: $0.39/hour. | 90+ languages; up to 1,000 keyterms; entity detection/redaction; audio-event tagging; diarization up to 32 speakers; Studio + Agents + security/compliance packaging. | Batch latency unspecified; architecture/training/model size undisclosed; customer fine-tuning not publicly documented. | Multilingual long-form transcription, subtitles/captions, compliance-sensitive transcripts, teams already on ElevenLabs. |
| OpenAI | gpt-4o-transcribe / gpt-4o-transcribe-diarize | gpt-4o-transcribe: $0.006/min (~$0.36/hour). Realtime Whisper: $0.017/min (~$1.02/hour). | Low batch price; simple integration in the OpenAI stack; diarization offered via a dedicated gpt-4o-transcribe-diarize model. | Request-response and streaming are split across different model families; current file uploads limited to 25 MB; retrieved docs show less transcript-enrichment depth than Scribe v2. | Cost-sensitive developers already standardized on OpenAI APIs. |
| Deepgram | Nova-3 / Flux | Official pricing page publishes Nova-3 pay-as-you-go pricing and add-ons; retrieved snippet explicitly shows $0.29/hour for monolingual streaming, with additional published multilingual/add-on pricing on the pricing page. | Strong live and multilingual docs; code-switching; diarization; multichannel up to 20; self-serve vocabulary adaptation "without model retraining." | Product choice is more fragmented now that Deepgram increasingly pushes Flux for voice agents while Nova-3 remains the general transcription model; add-ons are separately priced. | Realtime-heavy voice AI teams that want explicit STT specialization and operational controls. |
| AssemblyAI | Universal-3 Pro | $0.21/hour batch for Universal-3 Pro; $0.15/hour for Universal-2. Keyterms prompting add-on shown publicly. | Rich transcript intelligence; prompting/custom spelling; strong meeting and medical workflows; streaming model marketed with native code switching and sub-300 ms latency. | Universal-3 Pro currently supports 6 languages out of the box, while broader 99-language coverage sits on Universal-2. | Buyers who care as much about downstream transcript intelligence as raw ASR. |
| Google Cloud | Chirp 3 / STT v2 | $0.016/minute (~$0.96/hour) headline STT v2 price shown on the product page. | Streaming + batch; diarization; automatic language detection; speech adaptation; cloud governance and data residency; strong enterprise procurement story. | Google docs warn that Chirp 3 word-level timestamps may degrade results, and that word-level "confidence" is not truly a confidence score. | Large GCP estates, region-sensitive deployments, teams wanting managed cloud control planes. |
| AWS | Amazon Transcribe | Public pricing is usage-based per second, with a 15-second minimum charge and a free tier; exact numeric tier was not fully captured in retrieved excerpts. | Mature cloud integration; automatic language identification, multilingual detection, custom language models, PII redaction, and medical transcription product line. | Feature availability is fragmented by language and workflow; Amazon Transcribe Medical is US English only. | AWS-native organizations prioritizing cloud consistency over frontier-benchmark positioning. |

The strongest customer pull for Scribe v2 shows up in a few specific places: global subtitle and caption pipelines; regulated teams that value redaction and data controls; buyers who want an end-to-end audio platform and already use ElevenLabs for TTS, Agents, or Studio; healthcare and contact-center transcription where entity detection and diarization matter; and multilingual mixed-language audio. It is less obviously the right pick when the buyer primarily wants public architectural transparency, documented custom model training, or a single-model streaming-first voice-agent stack from a company whose original core business is STT.

What ElevenLabs still won't tell you

Across the official v2 launch materials, pricing pages, API docs, changelog, and public product pages reviewed here, ElevenLabs does not publicly disclose Scribe v2's model family, encoder/decoder structure, tokenizer, parameter count, training-hours total, dataset mix, or whether customer fine-tuning or custom ASR model training is supported. It also does not publish a clear batch-latency target for Scribe v2, and official materials are somewhat inconsistent on language count and on whether some prompting features should be thought of as v2-only or shared with late-era v1 documentation.

So the rigorous bottom line: Scribe v2 is well documented as a product, weakly documented as a model. Public evidence strongly supports that it is a top-tier multilingual batch transcription system with unusually rich workflow features. Public evidence does not yet support any detailed claim about its hidden neural architecture beyond what the API surface implies.
