OpenTranscription/ Blog
2026-07-03 · MODEL PROFILE

ElevenLabs Scribe v2: model profile

Reference profile of ElevenLabs Scribe v2, a batch speech-to-text model released January 9, 2026: features, benchmarks, pricing, limits, and sources.

Tags: model-profile · speech-to-text · elevenlabs · transcription · asr

Scribe v2 is ElevenLabs' batch speech-to-text model, released on January 9, 2026, as the successor to the original Scribe (Scribe v1).

Specifications

Developer: ElevenLabs
Released: January 9, 2026
Model type: Batch speech-to-text (ASR); underlying architecture not publicly disclosed.
Training data: Not publicly disclosed. No training-hours figure or dataset description in official v2 materials.
Languages: Current docs state 90+ languages; some marketing pages state 99.
Modes (batch / streaming): Batch. Live use is delegated to the separate Scribe v2 Realtime model.
Latency: Batch latency not publicly disclosed. Scribe v2 Realtime: vendor-claimed ~150 ms.
Deployment: ElevenLabs REST STT API (model_id="scribe_v2") with async webhooks, source_url input, and multichannel support; ElevenLabs Studio integration.
Pricing: $0.22/hour batch; keyterm prompting +$0.05/hour; entity detection/redaction +$0.07/hour. Scribe v2 Realtime: $0.39/hour.

Not disclosed: Parameters · Throughput / concurrency · License


Overview

Scribe v2 is ElevenLabs' current flagship batch speech-to-text model. It follows the original Scribe model, now designated Scribe v1, which launched on February 26, 2025, and it sits alongside Scribe v2 Realtime, released on November 11, 2025, for live applications. ElevenLabs positioned v2 as more stable and more accurate on long recordings, and separated v2 Realtime as the low-latency model for agents and live captioning. As of June 8, 2026, ElevenLabs has deprecated scribe_v1 and states it will be removed on July 9, 2026.

The v2 launch made Scribe the transcription engine inside ElevenLabs Studio for subtitles, captions, and transcripts across marketing, media, research, training, and compliance workflows. Customer stories such as Wockhardt Hospitals describe use in multilingual clinical documentation.

ElevenLabs has not publicly specified Scribe v2's model family, encoder/decoder design, parameter count, tokenizer, training-hours total, training-data composition, or whether it uses a transducer, CTC, encoder-decoder transformer, or hybrid architecture. Public sources document the service pipeline and API behavior, not the underlying neural architecture.

Capabilities and features

Publicly documented additions in v2 over v1 include context-aware keyterm prompting, native entity detection with timestamps, automatic multilingual transcription inside a single file, smart speaker diarization, dynamic audio-event tagging, and, added later, entity redaction, Indic-English code-switching improvements, and No Verbatim mode for cleaner text output.

Documented API and product features:

  • Input: direct file upload or source_url; docs state support for all major audio and video formats, files under 5 GB, and audio at least 100 ms long. The April 1, 2026 source_url addition made hosted URLs, YouTube, TikTok, and similar media sources first-class inputs.
  • Audio format note: if audio is passed as 16-bit PCM at 16 kHz mono little-endian (pcm_s16le_16), ElevenLabs states latency is lower than when sending encoded waveforms.
  • Language handling: if language_code is omitted, the API predicts the language automatically and returns language_code and language_probability. A single file can contain multiple languages and the model detects and transcribes them without manual segmentation.
  • Output: transcript plus words[] entries containing text, start, end, speaker_id, type, and logprob; word and character timestamp granularity.
  • Diarization: up to 32 speakers via num_speakers, with an optional diarization_threshold. Docs state higher values reduce the risk of splitting one speaker into multiple clusters but increase the risk of merging different speakers; lower values do the opposite.
  • Multichannel mode: for audio where each channel contains a single speaker; supports up to 5 channels, returns channel_index per word, and bills each channel independently for the full duration.
  • Keyterm prompting: launched with 100 terms, expanded to 1,000 on April 2, 2026; described as context-aware rather than blind insertion. Minimum billable duration is 20 seconds when keyterms exceed 100.
  • Entity detection: native detection with timestamps across category groups such as PII, PHI, PCI, offensive language, and other; marketing materials document up to 56 categories.
  • Entity redaction: added April 2026, with several replacement modes.
  • No Verbatim mode: removes filler words, repeated phrases, stuttering, and disfluencies.
  • Dynamic audio tagging for non-speech events.
  • Operational: async webhooks, zero-retention mode for enterprise customers, speaker-role detection, and speaker library support.
  • Response metadata: audio_duration_secs added to STT responses in June 2026.
  • Confidence signals: per-word logprob and overall language_probability. ElevenLabs does not currently document a calibrated transcript-level or word-level confidence metric comparable to explicit confidence fields documented by some competitors.
  • Error modes: batch STT returns 422 Unprocessable Entity for malformed requests; the realtime API exposes auth, quota, throttling, rate-limit, and "unaccepted terms" error events.
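The words[] fields listed above can be post-processed without any SDK. The sketch below groups consecutive words into speaker turns and converts logprob to a probability. It assumes that spoken tokens carry type == "word" and that logprob is a natural-log probability; neither detail is confirmed here, and the sample payload and speaker labels are hypothetical.

```python
import math
from itertools import groupby


def speaker_turns(words):
    """Group consecutive word entries by speaker_id into turns."""
    # Assumption: spoken tokens have type == "word"; other entry types are skipped.
    spoken = [w for w in words if w.get("type") == "word"]
    turns = []
    for speaker, group in groupby(spoken, key=lambda w: w.get("speaker_id")):
        group = list(group)
        turns.append({
            "speaker_id": speaker,
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["text"] for w in group),
            # exp(logprob) is a raw model probability (assuming natural log),
            # not a calibrated confidence score.
            "min_prob": min(math.exp(w["logprob"]) for w in group),
        })
    return turns


# Hypothetical payload shaped like the documented words[] entries.
sample_words = [
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0",
     "type": "word", "logprob": -0.05},
    {"text": "there", "start": 0.5, "end": 0.8, "speaker_id": "speaker_0",
     "type": "word", "logprob": -0.70},
    {"text": "Hi", "start": 1.2, "end": 1.4, "speaker_id": "speaker_1",
     "type": "word", "logprob": -0.01},
]

for turn in speaker_turns(sample_words):
    print(turn["speaker_id"], f'{turn["start"]:.1f}-{turn["end"]:.1f}', turn["text"])
```

Tracking the lowest per-word probability in each turn is one simple way to flag segments for human review; it is a heuristic, not a substitute for a calibrated confidence field.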

VAD and streaming controls are documented on Scribe v2 Realtime, not batch Scribe v2: manual or VAD-based commit strategies, vad_threshold, vad_silence_threshold_secs, minimum speech and silence durations, partial versus committed transcript events, predictive transcription, and text conditioning for reconnect continuity. These controls should not be read as evidence that the batch model shares the same online segmentation design.

Language support

Current documentation states 90+ languages; marketing pages sometimes still state 99, and Scribe v1 launch materials used 99. Scribe v2 adds automatic multi-language transcription within one file. The April 2026 upgrade documented improved Indic-English code-switching, specifically preserving English words in Latin script inside Indic-language utterances.

Performance and benchmarks

Vendor-reported: the v2 launch post claimed the lowest WER recorded on industry-standard benchmarks, and ElevenLabs' marketing states Scribe v2 beats GPT-4o Transcribe, Gemini 2.5 Pro, and AssemblyAI in batch accuracy. The Scribe v1 launch post claimed best results on FLEURS and Common Voice across 99 languages.

Third-party evaluation: Artificial Analysis' AA-WER v2 report from February 2026 placed Scribe v2 first at 2.3% AA-WER, and first on its proprietary AA-AgentTalk voice-agent dataset and the Earnings22 corporate-calls set. The current Artificial Analysis non-streaming leaderboard snapshot shows Scribe v2 second at 2.2%, behind Fun-Realtime-ASR-preview at 1.7%. AssemblyAI's own benchmark page ranks Universal-3 Pro ahead of ElevenLabs Scribe v2 on AssemblyAI's selected dataset mix.
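For readers unfamiliar with the metric behind these rankings, word error rate is the word-level edit distance between a reference and a hypothesis, divided by the reference length. The sketch below is a minimal textbook version; it does not reproduce the text normalization or dataset weighting used by Artificial Analysis or any vendor, and the lowercase-and-split tokenization is an assumption made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because normalization choices (casing, punctuation, number formatting) move WER by meaningful amounts, figures from different leaderboards are not directly comparable.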

Latency and throughput

Batch latency for Scribe v2 is not publicly disclosed; ElevenLabs does not publish a batch-latency target. The Scribe v1 launch post said a low-latency version was "coming soon." Live use cases are delegated to Scribe v2 Realtime, for which ElevenLabs claims approximately 150 ms latency. Sending audio as pcm_s16le_16 is documented as lower latency than sending encoded waveforms. Throughput and concurrency figures are not publicly disclosed.
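To check whether a local recording already matches the 16-bit, 16 kHz, mono PCM layout that pcm_s16le_16 describes, the standard-library wave module is enough. This is a sketch only: it inspects a WAV container, and whether the endpoint expects headerless samples or a WAV file for that format is not stated here, so the extraction helper is an assumption.

```python
import wave


def is_pcm_s16_16k_mono(path: str) -> bool:
    """Return True if a WAV file holds 16-bit, 16 kHz, mono, uncompressed PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1          # mono
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16-bit
            and wav.getframerate() == 16000  # 16 kHz
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )


def raw_pcm_bytes(path: str) -> bytes:
    """Read the sample data without the WAV header (little-endian in RIFF WAV)."""
    with wave.open(path, "rb") as wav:
        return wav.readframes(wav.getnframes())
```

Files that fail the check would need resampling or downmixing with an audio tool before they match the layout.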

Deployment and integrations

Scribe v2 is accessed through the same REST STT endpoint family as v1, with the model selected by model_id ("scribe_v2"). The endpoint supports async webhooks, source_url input, multichannel mode, entity features, no_verbatim, speaker-role detection, and speaker library support. Zero-retention mode is available for enterprise customers. Scribe v2 is integrated into ElevenLabs Studio for subtitles, captions, and transcripts, and sits alongside ElevenLabs' API, Agents, and security/compliance offerings. A June 8, 2026 changelog entry also changed the default Agents ASR provider toward scribe_realtime.
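A client can validate the documented limits before it sends anything. The sketch below only assembles request fields and performs no network call. The names model_id, source_url, language_code, num_speakers, diarization_threshold, and no_verbatim come from this profile; the keyterms field name, its list encoding, and the rule that a file upload and source_url are mutually exclusive are assumptions.

```python
MAX_SPEAKERS = 32    # documented diarization ceiling
MAX_KEYTERMS = 1000  # documented keyterm limit since April 2, 2026


def build_stt_fields(source_url=None, uploading_file=False, language_code=None,
                     num_speakers=None, diarization_threshold=None,
                     keyterms=None, no_verbatim=False):
    """Assemble form fields for a batch Scribe v2 request (hypothetical helper)."""
    # Assumption: exactly one input source is supplied per request.
    if bool(source_url) == bool(uploading_file):
        raise ValueError("provide exactly one of source_url or a file upload")

    fields = {"model_id": "scribe_v2"}
    if source_url:
        fields["source_url"] = source_url
    if language_code:
        # Omitting language_code lets the API predict the language.
        fields["language_code"] = language_code
    if num_speakers is not None:
        if not 1 <= num_speakers <= MAX_SPEAKERS:
            raise ValueError(f"num_speakers must be between 1 and {MAX_SPEAKERS}")
        fields["num_speakers"] = num_speakers
    if diarization_threshold is not None:
        fields["diarization_threshold"] = diarization_threshold
    if keyterms:
        if len(keyterms) > MAX_KEYTERMS:
            raise ValueError(f"at most {MAX_KEYTERMS} keyterms are allowed")
        fields["keyterms"] = list(keyterms)  # assumed field name and encoding
    if no_verbatim:
        fields["no_verbatim"] = True
    return fields


fields = build_stt_fields(
    source_url="https://example.com/interview.mp3",  # placeholder URL
    num_speakers=2,
    keyterms=["Wockhardt"],
)
print(fields)
```

Failing fast on the client side avoids spending a round trip on a request the endpoint would reject with 422.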

In the official ElevenLabs SDK release notes, @kraenhansen is credited with adding keyterms and no_verbatim support to the Scribe realtime API in the JS and Python SDKs, and @PaulAsjes appears repeatedly on Speech Engine and SDK release work.

Pricing

Item | Price
Scribe v1 / v2 batch transcription | $0.22/hour
Keyterm prompting add-on | $0.05/hour
Entity detection / redaction add-on | $0.07/hour
Scribe v2 Realtime | $0.39/hour
Multichannel billing | Each channel billed independently for the full duration
Minimum billable duration with more than 100 keyterms | 20 seconds

Public pricing no longer distinguishes v1 from v2.
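The listed rates combine into a rough batch cost estimate. The sketch below rests on several assumptions that the public pricing does not spell out: add-on rates are added to the base hourly rate, add-ons and the 20-second minimum apply per channel, and no rounding or plan discounts are applied. Treat the result as a planning figure, not an invoice.

```python
BATCH_PER_HOUR = 0.22          # Scribe v1 / v2 batch transcription
KEYTERM_ADDON_PER_HOUR = 0.05  # keyterm prompting add-on
ENTITY_ADDON_PER_HOUR = 0.07   # entity detection / redaction add-on
MIN_BILLABLE_SECS = 20         # applies when more than 100 keyterms are sent


def estimate_batch_cost(duration_secs, channels=1, keyterm_count=0,
                        entity_features=False):
    """Estimate USD cost for one batch job (hypothetical helper)."""
    billable_secs = duration_secs
    if keyterm_count > 100:
        billable_secs = max(billable_secs, MIN_BILLABLE_SECS)

    rate = BATCH_PER_HOUR
    if keyterm_count > 0:
        rate += KEYTERM_ADDON_PER_HOUR
    if entity_features:
        rate += ENTITY_ADDON_PER_HOUR

    # Multichannel mode bills each channel for the full duration.
    return channels * (billable_secs / 3600) * rate


# A one-hour stereo call with keyterms and entity redaction.
print(round(estimate_batch_cost(3600, channels=2, keyterm_count=50,
                                entity_features=True), 2))
```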

Comparison prices retrieved for this profile: Google Cloud STT v2 headline price $0.016/minute (~$0.96/hour); AssemblyAI Universal-3 Pro $0.21/hour and Universal-2 $0.15/hour; OpenAI gpt-4o-transcribe estimated at $0.006/minute (~$0.36/hour) and realtime Whisper at $0.017/minute (~$1.02/hour); Deepgram Nova-3 monolingual streaming $0.29/hour. Amazon Transcribe is usage-based per second with a 15-second minimum charge and a free tier; its exact per-tier rates were not captured.
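The per-hour equivalents above are plain per-minute rates multiplied by 60. A small sketch of that normalization follows, using only the prices quoted in this section. The list mixes batch and streaming products, so the ordering shows headline prices only and is not a like-for-like comparison.

```python
def per_hour(per_minute_usd: float) -> float:
    """Convert a per-minute USD rate to a per-hour rate, rounded to cents."""
    return round(per_minute_usd * 60, 2)


quoted_per_minute = {
    "Google Cloud STT v2": 0.016,
    "OpenAI gpt-4o-transcribe": 0.006,
    "OpenAI realtime Whisper": 0.017,
}

hourly = {name: per_hour(rate) for name, rate in quoted_per_minute.items()}
# Prices already quoted per hour in this section.
hourly.update({
    "ElevenLabs Scribe v2 (batch)": 0.22,
    "AssemblyAI Universal-3 Pro": 0.21,
    "AssemblyAI Universal-2": 0.15,
    "Deepgram Nova-3 (monolingual streaming)": 0.29,
})

for name in sorted(hourly, key=hourly.get):
    print(f"{name}: ${hourly[name]:.2f}/hour")
```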

Development and ownership

Scribe v2 is developed by ElevenLabs. The Scribe v1 launch post published a contributor list; the v2 launch post did not publish an equivalent contributor section, so direct v2 authorship beyond visible authors and public product writers is unspecified.

Person | Public role / bio | Publicly linked relevance to Scribe / ElevenLabs
Piotr Dąbkowski | Cofounder; leads research and engineering teams developing ElevenLabs' AI audio models. Previously worked on ML at Google; studied at Cambridge. | Named as a research contributor on the original Scribe launch and publicly described by ElevenLabs as leading its research team.
Mati Staniszewski | Cofounder; leads teams building AI that can communicate at human level. Previously at Palantir; studied Mathematics at Imperial. | Not publicly credited as a model engineer on Scribe, but central to company/product direction and platform expansion.
Flavio Schneider | ElevenLabs research team; focuses on ASR and Music. | Publicly credited in the Scribe launch post as research lead, training, architecture for v1. The strongest public model-attribution source ElevenLabs has published for Scribe.
Tim von Känel | ElevenLabs research team; focuses on ASR and Music. | Publicly credited in the Scribe launch post as project lead, pre-training data, fine-tuning data.
Maximiliano Levi | Public bio not retrieved in official materials reviewed. | Publicly credited on the Scribe launch as responsible for inference and optimizations.
Johan Nordberg | Public bio not retrieved in official materials reviewed. | Listed as a research contributor on the Scribe launch.
Austin Malerba | Public bio not retrieved in official materials reviewed. | Credited for frontend on the Scribe launch.
Hristo Stoychev | Public bio not retrieved in official materials reviewed. | Credited for backend on the Scribe launch.
Alex George | Public bio not retrieved in official materials reviewed. | Credited for data acquisition on the Scribe launch.
Joe Reeve | Growth team; focused on helping developers get the most out of ElevenLabs' frontier audio models. | Authored the April 2026 Scribe v2 upgrade post.
Tadas Petra | Public bio page not retrieved, but publicly listed as article author. | Authored How Scribe v2 Realtime Works, the most technical public explainer of the Scribe family's live stack.

Release history

Date | Event | Details
Feb 26, 2025 | Original Scribe launched. | First ElevenLabs STT model; marketed as supporting 99 languages, with word-level timestamps, diarization, and audio-event tagging. The launch post published an internal contributions list.
Nov 11, 2025 | Scribe v2 Realtime launched. | Separated the live/agent use case from batch STT; ElevenLabs claimed ~150 ms latency and positioned it for meetings, agents, and real-time captioning.
Jan 9, 2026 | Scribe v2 launched. | Formal batch successor to v1; ElevenLabs said it improved stability and accuracy on long-form audio, pauses, tone changes, and extended silences.
Apr 1, 2026 | source_url added to the STT endpoint. | Batch workflows no longer required direct file upload; hosted URLs, YouTube, TikTok, and similar media sources became first-class inputs.
Apr 2, 2026 | Scribe v2 upgrade: entity redaction, Indic-English code-switching, No Verbatim, keyterms limit raised from 100 to 1,000. | The most substantial post-launch feature drop for v2; changed Scribe's privacy and transcript-formatting capabilities.
Jun 2026 | audio_duration_secs added to STT responses. | API ergonomics addition for analytics and billing reconciliation.
Jun 8, 2026 | scribe_v1 deprecated; removal scheduled for Jul 9, 2026. | The same changelog also changed the default Agents ASR provider toward scribe_realtime.

Scribe v1 versus Scribe v2

Dimension | Scribe v1 | Scribe v2 | Source note
Release status | Launched Feb 26, 2025; deprecated Jun 8, 2026; removal planned Jul 9, 2026. | Launched Jan 9, 2026; current main batch model. | v2 is the designated successor; v1 is being retired.
Primary optimization | Multilingual batch ASR with timestamps, diarization, audio-event tags. | Batch transcription, subtitling, and captioning at scale; improved for long-form audio, pauses, tone changes, and extended silences. | Vendor framing shifted from accuracy to production batch transcription.
Accuracy claims | Launch post claimed best results on FLEURS and Common Voice across 99 languages. | Launch post claimed the lowest WER recorded on industry-standard benchmarks; marketing says it beats GPT-4o Transcribe, Gemini 2.5 Pro, and AssemblyAI in batch accuracy. | Vendor claims favor v2; third-party benchmarks compare vendors on a common methodology.
Latency | Batch latency publicly unspecified; launch post said a low-latency version was "coming soon." | Batch latency publicly unspecified; live use case is delegated to Scribe v2 Realtime at ~150 ms. | ElevenLabs does not publish batch latency targets.
Languages | Launch materials said 99 languages. | Current docs say 90+ languages; marketing pages sometimes still say 99. v2 adds automatic multi-language transcription within one file. | Official language count is currently inconsistent.
Diarization | Publicly available at launch. | Publicly available; current docs state up to 32 speakers. | v2 diarization is more configurable in public docs.
Audio-event tagging | Publicly available at launch. | Publicly available; current docs call it dynamic audio tagging. | Feature continues forward.
Entity detection / redaction | No public v1 launch support found. | Native entity detection with timestamps at launch; redaction added Apr 2026. Up to 56 categories documented in marketing. | New in v2.
Punctuation / formatting | Publicly unspecified beyond standard transcript output. | Better handling of pauses and silences; No Verbatim removes filler words, repeated phrases, stuttering, and disfluencies. | v2 adds transcript cleanup options.
Code-switching | Multilingual support at launch, no special public claim for code-switching. | Automatic multilingual transcription at launch; Apr 2026 upgrade improved Indic-English code-switching and preserved English words in Latin script. | v2 documents specific code-switching support.
Prompting / vocabulary control | Public status unclear. Current pricing groups v1/v2 together, but current docs only explicitly describe keyterm prompting for v2 batch and v2 Realtime. | Keyterm prompting launched with 100 terms, expanded to 1,000 on Apr 2, 2026; context-aware rather than blind insertion. | Public docs are inconsistent on whether late-era v1 shares this feature; v2 support is explicit.
Timestamps / alignment output | Word-level timestamps at launch. | Word-level timestamps; current API also exposes character timestamp granularity. | v2's response layer exposes more granularity.
API surface | Same REST STT endpoint family, model chosen by model_id. | Same REST endpoint with model_id="scribe_v2"; async webhooks, source_url, multichannel, entity features, no_verbatim, speaker-role detection, speaker library support. | v2 has more production controls.
Pricing | Current official pricing groups Scribe v1 / v2 at $0.22/hour batch. | Same current batch pricing: $0.22/hour; keyterm prompting adds $0.05/hour and entity detection/redaction $0.07/hour. | Public pricing no longer distinguishes v1 from v2.
Hard limits | Current shared endpoint supports audio files under 5 GB and at least 100 ms long. | Same shared endpoint limits; plus up to 32 diarized speakers, 5 channels for multichannel, 1,000 keyterms, and minimum billable 20 seconds when keyterms exceed 100. | v2's operational limits are specified in more detail.
Model size | Unspecified. | Unspecified. | ElevenLabs does not publish parameter counts for either model.
Training data | Public roles for v1 included pre-training data and fine-tuning data, but no sizes or corpus breakdowns were published. | Unspecified publicly. No training-hours figure or dataset description found in official v2 materials. | Not publicly disclosed for either model.
User fine-tuning / custom training | Unspecified publicly. | Unspecified publicly. Keyterm prompting exists, but customer model fine-tuning/custom ASR training is not documented. | Prompting is documented; fine-tuning is not.
Confidence signals | Current shared response schema exposes language_probability and per-word logprob; model-specific calibration is undocumented. | Same. | Scoring primitives are exposed; a calibrated confidence API is not documented.

Market comparison

Provider | Product | Public pricing retrieved | Publicly visible strengths | Documented constraints versus Scribe v2 | Stated fit
ElevenLabs | Scribe v2 | $0.22/hour batch; keyterms $0.05/hour; entity detection/redaction $0.07/hour. Realtime sibling: $0.39/hour. | 90+ languages; up to 1,000 keyterms; entity detection/redaction; audio-event tagging; diarization up to 32 speakers; Studio + Agents + security/compliance packaging. | Batch latency unspecified; architecture/training/model size undisclosed; customer fine-tuning not publicly documented. | Multilingual long-form transcription, subtitles/captions, compliance-sensitive transcripts, teams already on ElevenLabs.
OpenAI | gpt-4o-transcribe / gpt-4o-transcribe-diarize | gpt-4o-transcribe: $0.006/min (~$0.36/hour). Realtime Whisper: $0.017/min (~$1.02/hour). | Low batch price; integration in the OpenAI stack; diarization via a dedicated gpt-4o-transcribe-diarize model. | Request-response and streaming are split across different model families; current file uploads limited to 25 MB; retrieved docs show less transcript-enrichment depth than Scribe v2. | Cost-sensitive developers already standardized on OpenAI APIs.
Deepgram | Nova-3 / Flux | Official pricing page publishes Nova-3 pay-as-you-go pricing and add-ons; retrieved snippet shows $0.29/hour for monolingual streaming, with additional published multilingual/add-on pricing. | Live and multilingual docs; code-switching; diarization; multichannel up to 20; self-serve vocabulary adaptation "without model retraining." | Product choice is split as Deepgram pushes Flux for voice agents while Nova-3 remains the general transcription model; add-ons are separately priced. | Realtime-heavy voice AI teams that want explicit STT specialization and operational controls.
AssemblyAI | Universal-3 Pro | $0.21/hour batch for Universal-3 Pro; $0.15/hour for Universal-2. Keyterms prompting add-on shown publicly. | Transcript intelligence; prompting/custom spelling; meeting and medical workflows; streaming model marketed with native code switching and sub-300 ms latency. | Universal-3 Pro currently supports 6 languages out of the box, while broader 99-language coverage sits on Universal-2. | Buyers who care as much about downstream transcript intelligence as raw ASR.
Google Cloud | Chirp 3 / STT v2 | $0.016/minute (~$0.96/hour) headline STT v2 price on the product page. | Streaming + batch; diarization; automatic language detection; speech adaptation; cloud governance and data residency; enterprise procurement. | Google docs warn that Chirp 3 word-level timestamps may degrade results, and that word-level "confidence" is not truly a confidence score. | Large GCP estates, region-sensitive deployments, teams wanting managed cloud control planes.
AWS | Amazon Transcribe | Usage-based per second, 15-second minimum charge, free tier; exact numeric tier not fully captured in retrieved excerpts. | Cloud integration; automatic language identification, multilingual detection, custom language models, PII redaction, and a medical transcription product line. | Feature availability is fragmented by language and workflow; Amazon Transcribe Medical is US English only. | AWS-native organizations prioritizing cloud consistency.

© 2026 OpenTranscription