OpenTranscription/ Blog
2026-07-03 · ANALYSIS

Scribe v2 Realtime: ElevenLabs makes its play for live speech-to-text

ElevenLabs' Scribe v2 Realtime claims sub-150 ms latency, 93.5% accuracy in 30 languages, and $0.39/hr pricing. What the public record actually supports.

[Image: abstract illustration of a live audio waveform resolving into committed signal blocks, in slate teal and amber]

ElevenLabs released Scribe v2 Realtime on November 11, 2025, and pointed it squarely at voice agents, meeting assistants, live captioning, and other interfaces where transcription has to keep pace with speech. The headline claims: latency under 150 ms, 93.5% accuracy across 30 common European and Asian languages, support for 90+ languages, and direct availability through the ElevenLabs API, SDKs, and Agents platform.

The individual numbers are interesting. The package is what matters. ElevenLabs shipped a multilingual streaming STT model that plugs into its existing voice and agents stack, with first-party JavaScript and React support, single-use token auth so browser clients never hold permanent credentials, enterprise privacy controls, and public pricing that undercuts Google Cloud Speech-to-Text, Azure Speech real-time transcription, and OpenAI's current realtime transcription rate, while landing in the same cost band as specialist speech vendors like Deepgram and AssemblyAI.

One caveat up front, because it colors everything below. ElevenLabs discloses system-level behavior, not a full neural architecture. The public materials describe a streaming-first design, predictive transcription, text conditioning, manual or VAD-based commit strategies, and word-level timestamps. They do not disclose the backbone, parameter count, training corpus size, or decoder class for Scribe v2 Realtime. Attribution is also incomplete: ElevenLabs published a detailed contributor list for the original Scribe launch but nothing equally explicit for the realtime model.

What it is and why the release matters

Scribe v2 Realtime is the realtime member of the Scribe family. ElevenLabs' model catalog keeps Scribe v2 for batch transcription and Scribe v2 Realtime for live use, and calls the latter its "fastest and most accurate live speech recognition model," built for conversational settings like live meeting transcription, AI agents, and multilingual recognition. The realtime WebSocket API streams partial transcripts as speech arrives, then commits a final transcript when a segment is done.

Context makes the release easier to read. When ElevenLabs launched the original Scribe in February 2025, it said outright that a low-latency version for real-time applications was coming. That promise became the November 2025 launch. By June 2026, ElevenAgents had switched its default ASR provider from elevenlabs to scribe_realtime, which is about as strong a signal as you get that the realtime stack became strategic infrastructure for the agents platform rather than a side feature.

Multilingual breadth at low latency is the second reason to pay attention. ElevenLabs' realtime pages consistently say 90+ languages for Scribe v2 Realtime, while the broader Scribe product family is marketed at 99 languages. The distinction is worth noticing: 90+ appears to be the safe public number for the realtime model specifically, and 99 refers to the wider Scribe brand or the batch model. The realtime model is clearly multilingual either way, but the public materials are inconsistent about exactly how far the list extends.

Then there is price. The public API pricing page lists Speech to Text at $0.39/hour pay-as-you-go, with 2.5 hours included on the free/pay-as-you-go tier and a 20% premium for keyterm prompting. The realtime product page separately advertises $0.28/hour and lower on annual Business plans. That puts Scribe v2 Realtime below Google's standard STT v2 rate, below Azure real-time transcription, and below OpenAI's current realtime transcription model, while sitting near specialist streaming STT pricing.

How it got here

ElevenLabs' speech-to-text line started with the original Scribe launch in February 2025, which emphasized multilingual batch transcription, word-level timestamps, diarization, and audio-event tagging, and previewed a future low-latency version. In April 2025 the company shipped scribe_v1_experimental, which improved handling of multilingual files, reduced hallucinations around silence, and improved audio tags. Scribe v2 Realtime arrived in November 2025, and Scribe v2 for batch transcription followed in January 2026. Shipping realtime v2 before batch v2 reads as a prioritization signal: voice-agent use cases came first. That is my inference from the release order, not a company statement.

Through 2026 the work shifted from launch to maturation. The March 2026 engineering explainer documented how Scribe v2 Realtime works at the product and API level. Changelog entries from January, April, and May 2026 show developer-facing refinements: better useScribe cleanup, keyterms and no_verbatim support, context deduplication, microphone device options, and native mute/unmute in the client packages. By June 2026, ElevenAgents had made scribe_realtime the default ASR provider, and ElevenLabs formally deprecated scribe_v1 with a July 9, 2026 removal date.

The highest-confidence public milestones, all from ElevenLabs' own materials:

| Date | Milestone | Why it matters |
| --- | --- | --- |
| Feb 26, 2025 | Original Scribe launched | First STT model; realtime version promised |
| Apr 7, 2025 | scribe_v1_experimental preview | Improved multilingual files, silence handling, audio tags |
| Nov 11, 2025 | Scribe v2 Realtime released | Official release date for the live model |
| Jan 9, 2026 | Scribe v2 released | Batch/long-form v2 arrives after realtime v2 |
| Jan 19, 2026 | SDK improvements around useScribe | First visible post-launch package hardening |
| Mar 4, 2026 | "How Scribe v2 Realtime Works" published | Best public technical explanation |
| Apr to May 2026 | keyterms, no_verbatim, context, mute/unmute added | Realtime usability and control improved |
| Jun 8, 2026 | scribe_v1 deprecated; scribe_realtime default in ElevenAgents | Realtime becomes the default ASR direction inside agents |

[Image: timeline of Scribe releases rendered as spaced signal pulses along a horizontal track]

Who built it, as far as the public record goes

ElevenLabs has not published a Scribe v2 Realtime contributor roster comparable to the original Scribe announcement, so the best public attribution comes in layers. The original Scribe launch names the core contributors behind the underlying speech-to-text program: Flavio Schneider as research lead for training and architecture, Tim von Känel as project lead for pre-training and fine-tuning data, Maximiliano Levi for inference and optimizations, Johan Nordberg and Piotr Dabkowski as research contributors, Austin Malerba on frontend, Hristo Stoychev on backend, and Alex George on data acquisition. ElevenLabs author pages also identify Schneider and von Känel as research team members focused on ASR and music.

For Scribe v2 Realtime specifically, the clearest named public contributor is Tadas Petra, who wrote the official technical deep-dive "How Scribe v2 Realtime Works" in March 2026 and appears on ElevenLabs' author pages as the public-facing writer for that material. That points to developer-relations or developer-platform involvement in the productization and rollout, though ElevenLabs does not publish a role label for him.

Visible SDK work is attributable too. The Python SDK release v2.46.0 credits @kraenhansen for adding keyterms and no_verbatim support to the Scribe realtime API, and Kræn Hansen's GitHub profile describes his work as "Building Developer Experiences @elevenlabs." That makes him a clearly public contributor to the developer-facing tooling, if not necessarily to the model research.

At the team level, Scribe v2 Realtime spans at least three publicly visible groups: Research, ElevenAPI/developer platform, and ElevenAgents. The evidence is circumstantial but consistent: the original Scribe contributor list maps to research and engineering, the explainer and SDK work map to platform, and the launch materials plus the June 2026 changelog place Scribe Realtime inside ElevenAgents' runtime. This is a reading of public materials, not an org chart.

| Public attribution layer | Named people / group | Publicly stated or inferred role |
| --- | --- | --- |
| Core Scribe research foundation | Flavio Schneider | Research lead; training and architecture |
| Core Scribe research foundation | Tim von Känel | Project lead; pre-training and fine-tuning data |
| Core Scribe research foundation | Maximiliano Levi | Inference and optimizations |
| Core Scribe research foundation | Johan Nordberg, Piotr Dabkowski | Research contributors |
| Core Scribe engineering | Austin Malerba, Hristo Stoychev, Alex George | Frontend, backend, data acquisition |
| Realtime technical rollout | Tadas Petra | Author of official Scribe v2 Realtime technical guide |
| SDK/productization | Kræn Hansen | Realtime SDK contributor; developer experience |
| Publicly visible teams | Research, ElevenAPI/developer platform, ElevenAgents | Inference from docs/blog/changelog ownership and integration |

What the technical record actually says

From the public sources, Scribe v2 Realtime is a cloud-first streaming STT service exposed primarily as a WebSocket API. Audio chunks go up as input_audio_chunk messages, and the service returns partial and committed transcripts, including timestamped variants. Auth works with an API key or a single-use token, and the official client-side path recommends generating that token server-side so browser clients never expose permanent credentials. First-party JavaScript and React support ships as Scribe.connect() in @elevenlabs/client and the useScribe hook in @elevenlabs/react.
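To make the upload side concrete, here is a minimal sketch of how a client might slice raw PCM audio into input_audio_chunk messages. Only the message name input_audio_chunk comes from the public materials cited above. The JSON key names (message_type, audio_base_64, sample_rate), the 100 ms chunk size, and the helper name audio_chunk_messages are assumptions for illustration, so check them against the current API reference before relying on them. The sketch builds the frames only and does not open a connection.

```python
import base64
import json


def audio_chunk_messages(pcm: bytes, chunk_bytes: int = 3200, sample_rate: int = 16000):
    """Yield JSON text frames for successive slices of raw PCM audio.

    Only the message name "input_audio_chunk" is taken from the article.
    Every key name below is an assumption made for this sketch.
    """
    for start in range(0, len(pcm), chunk_bytes):
        piece = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "message_type": "input_audio_chunk",  # key name assumed
            "audio_base_64": base64.b64encode(piece).decode("ascii"),  # key name assumed
            "sample_rate": sample_rate,  # key name assumed
        })


# 0.25 s of 16 kHz, 16-bit mono silence is 8000 bytes.
# At 3200 bytes (100 ms) per chunk that gives slices of 3200, 3200 and 1600 bytes.
frames = list(audio_chunk_messages(bytes(8000)))
print(len(frames))  # 3
```

In a real client each frame would be sent over the WebSocket as it is produced. Keeping chunks small (here about 100 ms of audio) is what lets partial transcripts come back while the speaker is still talking.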

The most meaningful disclosures sit at the system behavior level. ElevenLabs says the model uses a streaming-first architecture and predictive transcription to anticipate likely next words and punctuation, which is how it explains the latency claim. It documents text conditioning, letting the model continue from previous context after a reconnect, and two finalization modes: manual commit and Voice Activity Detection. That combination matters in practice because it separates fast partial text from more accurate committed text, and lets you choose which one your product optimizes for.
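The partial/committed split is easier to reason about with a small client-side model. The sketch below is my own illustration, not ElevenLabs' SDK: the class TranscriptBuffer, the event labels "partial" and "committed", and the context_for_reconnect helper are all hypothetical names. It shows the two behaviors the public materials describe: partial text is provisional and gets overwritten, while committed text is final and can be replayed as conditioning context after a reconnect.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptBuffer:
    """Client-side view of a live transcript (hypothetical helper, not an SDK class)."""

    committed: List[str] = field(default_factory=list)
    partial: str = ""

    def on_event(self, kind: str, text: str) -> None:
        if kind == "partial":
            # Each partial supersedes the previous one for the open segment.
            self.partial = text
        elif kind == "committed":
            # A commit (manual or VAD-triggered) finalizes the segment.
            self.committed.append(text)
            self.partial = ""
        else:
            raise ValueError(f"unknown event kind: {kind!r}")

    def display(self) -> str:
        """Final text so far, followed by the current provisional text."""
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)

    def context_for_reconnect(self, max_chars: int = 200) -> str:
        """Tail of the committed text, to hand back as previous context."""
        return " ".join(self.committed)[-max_chars:]


buf = TranscriptBuffer()
buf.on_event("partial", "hello wor")
buf.on_event("partial", "hello world")
buf.on_event("committed", "Hello, world.")
buf.on_event("partial", "how are")
print(buf.display())                # Hello, world. how are
print(buf.context_for_reconnect())  # Hello, world.
```

A UI that optimizes for speed renders display() on every event. A pipeline that optimizes for accuracy, such as one feeding an LLM, would act only on committed segments.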

What has not been disclosed is just as important. I found no published parameter count, training-hours figure, decoder type, backbone family, or architecture diagram equivalent to what OpenAI published for Whisper. OpenAI's Whisper page describes an encoder-decoder Transformer operating on 30-second chunks of log-Mel spectrograms. ElevenLabs has released nothing comparable for Scribe v2 Realtime in the sources reviewed.

On accuracy, the realtime launch post makes the clearest quantitative claim: 93.5% accuracy across 30 commonly used European and Asian languages. The marketing pages also show Scribe v2 Realtime beating Gemini Flash 2.5, GPT-4o Mini, and Deepgram Nova 3 on a benchmark of "500 hard samples," but they do not publish enough methodology to make that chart reproducible. For vendor-neutral context, Artificial Analysis' non-streaming benchmark currently shows the Scribe v2 family with a stronger WER position than GPT-4o Transcribe, GPT-4o Mini Transcribe, Deepgram Nova-3, and Rev AI. Useful, but note that it applies to batch Scribe v2, not specifically to the realtime model.
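Since accuracy percentages and WER figures get mixed freely in this space, it helps to pin down what WER measures. The sketch below is the standard word-level edit-distance definition, with lowercasing as the only normalization. It is not ElevenLabs' or Artificial Analysis' scoring code, and neither publishes enough detail in the sources reviewed to reproduce their exact normalization. If accuracy is read as 1 minus WER, which is an assumption on my part, 93.5% accuracy would correspond to a 6.5% WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one rolling row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (fox -> box) plus one insertion (over) against 5 reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps over"))  # 0.4
```

Real benchmarks differ mainly in the normalization applied before this step (punctuation, casing, numerals), which is one reason vendor numbers are hard to compare directly.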

On languages, 90+ is the safest public number for Scribe v2 Realtime. On deployment, the GA offering is clearly cloud: API, SDKs, React/JS clients, and Agents. ElevenLabs' broader speech-to-text marketing says Scribe supports cloud and on-premise configurations, and the company runs an early-access on-prem and on-device program for selected models, but those materials never name Scribe v2 Realtime. So the public evidence firmly supports cloud deployment today. For anything local, it amounts to qualified, enterprise early-access signals that do not mention this model by name.

Privacy and security are unusually well documented for this category. ElevenLabs states that data is encrypted in transit and at rest, that it supports SOC 2, GDPR, and HIPAA BAA for qualifying enterprises, and that it offers EU, India, and Singapore data residency. Zero Retention Mode is exposed for Speech-to-Text by setting enable_logging=false on /v1/speech-to-text/* endpoints, which keeps requests out of history and limits logging for sensitive workloads. For enterprise buyers this is one of the product's strongest differentiators.
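As a small illustration of the Zero Retention Mode switch, the sketch below appends enable_logging=false to a Speech-to-Text URL. The parameter name and the /v1/speech-to-text path prefix come from the materials cited above. The host api.elevenlabs.io, the choice to pass the flag as a query parameter, and the helper name with_zero_retention are assumptions for this example, and Zero Retention Mode may also require enterprise enablement on the account.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

API_BASE = "https://api.elevenlabs.io"  # assumed host, not stated in the article


def with_zero_retention(url: str) -> str:
    """Return `url` with enable_logging=false set (hypothetical helper)."""
    parts = urlsplit(url)
    if not parts.path.startswith("/v1/speech-to-text"):
        raise ValueError("expected a /v1/speech-to-text/* endpoint")
    # Drop any existing enable_logging value so the flag appears exactly once.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "enable_logging"
    ]
    query.append(("enable_logging", "false"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_zero_retention(API_BASE + "/v1/speech-to-text"))
# https://api.elevenlabs.io/v1/speech-to-text?enable_logging=false
```

Centralizing the flag in one helper makes it harder for a sensitive workload to ship a request path that silently logs.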

Pricing and limits: $0.39/hour base on the API pricing page, keyterm prompting adds 20%, and the realtime page advertises $0.28/hour and lower on annual Business plans. Limit information is thinner. The only explicit realtime concurrency number I found is an FAQ stating 30+ concurrent sessions for enterprise clients, so enterprise concurrency is public but a general self-serve policy is not. The same FAQ says realtime diarization is not a priority at the moment and dual-channel support is not planned. If you run a call center or transcribe two-channel telephony, read that twice before committing.

The use cases ElevenLabs itself pushes are consistent across its materials: voice agents, meeting assistants, real-time captioning, multilingual live transcription, meeting note-taking, and live translation. The March 2026 explainer builds a realtime translator demo from Scribe v2 Realtime plus the Chrome Translator API, which captures the intended niche well: speech interfaces that need to understand before the speaker has finished talking.

[Image: abstract signal-flow diagram of audio chunks streaming through a lattice into partial and committed output paths]

The competitive picture

Comparing STT products fairly is hard because vendors publish different kinds of evidence. Some give a headline latency figure, some just say "low latency," some publish vendor benchmarks, and some publish no directly comparable WER at all. The table separates public latency claims, public accuracy signals, and pricing, and flags where quantitative evidence is missing.

| Provider / model | Public latency | Public accuracy signal | Language support | Real-time capability | Public pricing | Notable strengths | Notable weaknesses |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ElevenLabs Scribe v2 Realtime | <150 ms | 93.5% accuracy across 30 common European and Asian languages | 90+ languages | Yes, WebSocket streaming, partial + committed transcripts | $0.39/hr PAYG; lower on annual Business; keyterms +20% | Very strong latency claim; multilingual; tight ElevenAgents/TTS integration; strong privacy controls | No public full architecture; limited realtime diarization/dual-channel story; enterprise concurrency only partially disclosed |
| Google Cloud Speech-to-Text Chirp 3 | Streaming supported; no single ms figure in reviewed docs | Google says Chirp 3 improves accuracy and speed; no headline public WER in reviewed docs | Official Chirp 3 page lists 111 transcription locales / language codes across GA + Preview | Yes, StreamingRecognize supported in STT v2 | $0.016/min starting tier ($0.96/hr) | Broad locale coverage; GCP-native; diarization, auto language detection, speech adaptation | Public docs reviewed do not provide a simple apples-to-apples WER or latency figure |
| OpenAI gpt-realtime-whisper / whisper-1 | Low-latency realtime path with tunable delay; no fixed ms figure published in reviewed docs | No single public WER on reviewed OpenAI realtime docs; Whisper trained on 680k hours; standard transcription docs list 57 supported languages and note Whisper was trained on 98 | 57 listed in standard transcription docs; Whisper trained on 98 languages | Yes for gpt-realtime-whisper; whisper-1 is not natively streaming in the same way | $0.017/min realtime ($1.02/hr); standard gpt-4o-mini-transcribe is $0.003/min but not the realtime path | Strong OpenAI ecosystem fit; tunable latency/accuracy tradeoff; clean API | No public fixed ms headline; realtime prompt steering limitations; public accuracy evidence less standardized in official docs |
| Microsoft Azure Speech | "Instant transcription with intermediate results"; no reviewed public ms figure | No headline public WER; Azure emphasizes customization and custom-speech optimization | 140+ languages and dialects | Yes, real-time, batch, and fast transcription | Search snippet shows $1/hr standard realtime, $0.18/hr batch, $1.20/hr custom realtime | Broad language coverage; enterprise stack; fine-tuning/custom speech | Public pricing page can be opaque by region/UI; no simple public ms/WER headline in reviewed sources |
| Deepgram Nova-3 | Sub-300 ms streaming | Deepgram says 54.2% WER reduction for streaming vs competitors; Artificial Analysis shows 5.2% AA-WER for Nova-3 (non-streaming benchmark) | 45+ languages on Nova models | Yes, streaming | $0.0077/min monolingual streaming ($0.462/hr); $0.0092/min multilingual streaming ($0.552/hr) | Mature streaming stack; strong multilingual and noisy-audio positioning; keyword prompting and diarization ecosystem | Language breadth lower than ElevenLabs/Google/Azure; flagship multilingual streaming is pricier than monolingual |
| AssemblyAI Universal-3 Pro Streaming | ~300 ms P50 / sub-300 ms | Vendor says best-in-class / most accurate streaming model; no single official WER figure in reviewed sources | 6 languages on flagship U3 Pro Streaming; 99 on Universal-2 async | Yes, secure WebSocket streaming | Official AssemblyAI materials put U3 Pro Streaming at $0.45/hr; lower-cost universal streaming at $0.15/hr | Strong streaming ergonomics; no hard caps on concurrent streams; good voice-agent fit | Flagship streaming language set is much narrower than ElevenLabs' 90+ claim |
| Rev AI | Real-time streaming with low latency; no reviewed public ms figure | Rev markets high accuracy in noisy/far-field/telephony and cites "up to 77.4% gains" in challenging conditions; Artificial Analysis shows 5.9% AA-WER | 58+ async languages; 9+ streaming languages | Yes, realtime streaming + async | $0.20/hr English Reverb, $0.10/hr Reverb Turbo, $0.30/hr foreign language | Very simple pricing; inexpensive; broad async availability | Streaming language breadth is much narrower; public latency disclosure is light |

A few comparative judgments hold up.

On latency, ElevenLabs' under-150 ms claim beats the public official numbers I found for Deepgram (sub-300 ms) and AssemblyAI (roughly 300 ms P50). Google, Azure, OpenAI, and Rev all support live or low-latency transcription, but in the sources reviewed none of them publishes a single comparably explicit millisecond headline for their core STT offerings.

On accuracy, ElevenLabs has a solid official realtime claim, but the most convincing vendor-neutral data is still batch. Artificial Analysis currently places Scribe v2 at 2.2% AA-WER, ahead of GPT-4o Transcribe (4.0%), GPT-4o Mini Transcribe (4.5%), Deepgram Nova-3 (5.2%), and Rev AI (5.9%) on its non-streaming benchmark. That does not prove the realtime model wins by the same margins, but it does support the idea that the Scribe family is among the stronger STT systems overall.

On language breadth, the realtime 90+ claim is one of ElevenLabs' biggest advantages over specialist streaming rivals. AssemblyAI's flagship streaming model covers six languages, Deepgram markets 45+ on Nova models, and Rev AI's streaming FAQ says 9+ streaming languages. Google and Azure still look broader at the full cloud-platform level, though their public evidence reads more like locale lists than an agents-first, low-latency multilingual pitch.

On economics, Scribe v2 Realtime is very competitive. Using the public rates reviewed here, it is cheaper than Google Cloud STT v2 entry pricing, Azure standard realtime transcription, OpenAI's realtime whisper model, and AssemblyAI U3 Pro Streaming, and it sits below Deepgram's Nova-3 multilingual streaming rate while staying above Rev AI's very low headline Reverb pricing. That makes it especially attractive when you want one vendor for STT plus TTS plus agents instead of best-of-breed procurement across multiple layers.
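The ranking above depends on converting per-minute rates to per-hour and applying the keyterm premium, so here is the arithmetic as a short sketch. The prices are the public list rates quoted in this post. The Azure figure comes from a search snippet rather than a pricing page, and the dictionary labels and the per_hour helper are my own naming.

```python
# Public list prices quoted in this post, in USD, with their billing unit.
RATES = {
    "ElevenLabs Scribe v2 Realtime (PAYG)": (0.39, "hr"),
    "Google Cloud STT v2 (starting tier)": (0.016, "min"),
    "OpenAI gpt-realtime-whisper": (0.017, "min"),
    "Azure Speech (standard realtime)": (1.00, "hr"),
    "Deepgram Nova-3 (monolingual streaming)": (0.0077, "min"),
    "Deepgram Nova-3 (multilingual streaming)": (0.0092, "min"),
    "AssemblyAI U3 Pro Streaming": (0.45, "hr"),
    "Rev AI Reverb (English)": (0.20, "hr"),
}
KEYTERM_PREMIUM = 0.20  # ElevenLabs lists a 20% premium for keyterm prompting


def per_hour(price: float, unit: str) -> float:
    """Normalize a list price to USD per audio hour."""
    if unit not in ("min", "hr"):
        raise ValueError(f"unknown unit: {unit!r}")
    return round(price * 60 if unit == "min" else price, 4)


hourly = {name: per_hour(*rate) for name, rate in RATES.items()}
hourly["ElevenLabs Scribe v2 Realtime (PAYG + keyterms)"] = round(
    0.39 * (1 + KEYTERM_PREMIUM), 4
)

# Print cheapest first.
for name, usd in sorted(hourly.items(), key=lambda item: item[1]):
    print(f"${usd:.3f}/hr  {name}")
```

One detail worth noticing on these list prices: with the keyterm premium applied, Scribe v2 Realtime comes to $0.468/hour, which is slightly above AssemblyAI U3 Pro Streaming ($0.45) and Deepgram's monolingual Nova-3 rate ($0.462), though still below Deepgram's multilingual rate ($0.552).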

[Image: comparison of provider latencies drawn as staggered abstract waveform bars of different lengths]

Where it fits, and where it doesn't

If you are choosing an STT stack primarily for conversational AI or live agents, Scribe v2 Realtime is one of the strongest current options when you need very low latency, broad multilingual coverage, and tight integration with the rest of the speech stack at the same time. The case is strongest if you already use ElevenLabs for agents or TTS, because realtime recognition, token auth, privacy controls, and agent runtime stay inside one platform.

It fits live captioning, agent assist, in-product voice UIs, multilingual support bots, and meeting assistants particularly well. The partial/committed transcript split, VAD and manual commit controls, text conditioning, and single-use browser token flow all line up with those product patterns. If your app needs transcripts to appear immediately but still benefit from context-based correction, this API design beats a plain file-transcription workflow.

I would pick Google or Azure when cloud-platform alignment matters more than raw STT specialization. If your team already depends heavily on GCP IAM, logging, and regioning, or on Azure enterprise procurement and custom-speech workflows, those platforms may still be operationally simpler even though their public speech-specific latency story is less aggressively marketed.

I would pick Deepgram when you want a mature dedicated speech vendor with a broad ecosystem of streaming features, your language needs fit inside its supported set, and your architecture is already built around specialized STT. Deepgram is especially competitive if you want general-purpose streaming STT alongside a separate turn-aware conversational model like Flux.

I would pick AssemblyAI when you want strong streaming plus generous concurrency behavior and broader downstream speech-intelligence workflows, and six-language flagship streaming is enough. OpenAI makes sense when STT is one step inside a larger OpenAI-native realtime or multimodal workflow. Rev AI is worth considering when cost and a simple API matter more than multilingual streaming breadth.

The honest framing is narrower than the marketing. Public evidence does not support calling Scribe v2 Realtime the single best STT model in every setting. What it does support: Scribe v2 Realtime is currently one of the best public options for multilingual, low-latency, cloud-delivered STT in voice-agent workflows, and arguably the most compelling one if you are already buying into the ElevenLabs ecosystem.

What we still don't know

ElevenLabs has not disclosed a full architecture description comparable to Whisper's published encoder-decoder Transformer writeup, and it has not published a dedicated realtime contributor roster with the detail it gave the original Scribe launch. The team attribution above is strongest for the underlying Scribe program and the public-facing productization work, and should not be read as a complete staffing map.

Cross-vendor accuracy is also imperfectly comparable. The public record mixes official marketing claims, vendor benchmarks, and vendor-neutral non-streaming data. There is no single public apples-to-apples benchmark covering Scribe v2 Realtime, Google Chirp 3, Azure Speech, Deepgram Nova-3/Flux, AssemblyAI U3 Pro Streaming, OpenAI gpt-realtime-whisper, and Rev AI streaming on the same realtime dataset with the same methodology. Where that evidence was missing, I said so rather than forcing the comparison.

A few ElevenLabs pages also contradict each other slightly. Realtime language claims are usually 90+, while broader Scribe pages sometimes say 99. Pricing is $0.39/hour on the pricing page but rounds to $0.40/hour on a speech-to-text marketing page. One realtime product page mentions under 100 ms, while the launch and docs narrative standardizes on under 150 ms. This post uses the more conservative, better-documented realtime figures throughout.

