Scribe v2 Realtime: model profile
Reference profile of Scribe v2 Realtime, ElevenLabs' streaming speech-to-text model released November 11, 2025: specs, benchmarks, pricing, limits.
Scribe v2 Realtime is ElevenLabs' streaming speech-to-text model for live transcription, released on November 11, 2025 and delivered through the ElevenLabs API, SDKs, and Agents platform.
Specifications
| Specification | Value |
|---|---|
| Developer | ElevenLabs |
| Released | November 11, 2025 |
| Model type | Streaming speech-to-text with predictive transcription; architecture not publicly described |
| Languages | 90+ for the realtime model; broader Scribe family marketed at 99 |
| Modes (batch / streaming) | Streaming; batch transcription is handled by the separate Scribe v2 model |
| Latency | Vendor-reported: under 150 ms; one realtime product page states under 100 ms |
| Throughput / concurrency | 30+ concurrent sessions for enterprise clients; general self-serve concurrency policy not published |
| Deployment | Cloud: WebSocket API, SDKs, JavaScript and React clients, ElevenAgents; on-prem early-access materials do not explicitly name this model |
| Pricing | $0.39/hour pay-as-you-go; $0.28/hour and lower on annual Business plans; keyterm prompting adds 20% |
Not disclosed: Parameters · Training data · License
Overview
Scribe v2 Realtime is the realtime member of the Scribe model family. ElevenLabs' model catalog distinguishes Scribe v2 for batch transcription and Scribe v2 Realtime for live use, and describes the latter as its "fastest and most accurate live speech recognition model," built for conversational settings such as live meeting transcription, AI agents, and multilingual recognition. The realtime WebSocket API streams partial transcripts first and committed transcripts when a segment is finalized.
ElevenLabs positions the model for voice agents, meeting assistants, live captioning, and other low-latency speech interfaces. Its central public claims are latency under 150 ms, 93.5% accuracy across 30 common European and Asian languages, and support for 90+ languages.
ElevenLabs discloses system-level behaviors rather than a full neural architecture. Public materials describe a streaming-first architecture, predictive transcription, text conditioning, manual or VAD commit strategies, and word-level timestamps, but they do not disclose the backbone, parameter count, training corpus size, or decoder class for Scribe v2 Realtime.
Capabilities and features
Scribe v2 Realtime is a cloud streaming speech-to-text service exposed primarily as a WebSocket API. Audio chunks are sent as input_audio_chunk messages, and the service returns partial and committed transcripts, including timestamped variants. Authentication uses an API key or a single-use token; the documented client-side path recommends generating the token server-side so browser clients do not expose permanent credentials. ElevenLabs provides first-party JavaScript and React support, including Scribe.connect() in @elevenlabs/client and the useScribe hook in @elevenlabs/react.
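As a rough illustration of the audio-upload side, the sketch below splits raw PCM audio into chunks and wraps each one in an `input_audio_chunk` message. Only the message name comes from the text above; the JSON field names (`message_type`, `audio_base_64`, `sample_rate`) and the chunk size are assumptions made for illustration and should be checked against the Realtime API reference before use.

```python
import base64
import json
from typing import Iterator


def audio_chunk_messages(
    pcm: bytes,
    chunk_bytes: int = 3200,   # 100 ms of 16 kHz, 16-bit mono audio
    sample_rate: int = 16000,
) -> Iterator[str]:
    """Yield one JSON string per audio chunk, ready to send over a WebSocket."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            # Message name taken from the profile text.
            "message_type": "input_audio_chunk",
            # Field names below are assumed, not confirmed by the profile.
            "audio_base_64": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": sample_rate,
        })


if __name__ == "__main__":
    silence = b"\x00" * 8000  # 250 ms of silent 16 kHz, 16-bit mono audio
    for message in audio_chunk_messages(silence):
        print(len(message), "characters of JSON")
```

Each yielded string would be passed to whatever WebSocket client is in use; the connection and authentication steps are deliberately left out because the profile does not specify them.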
Documented system behaviors:
- Streaming-first architecture and predictive transcription that anticipates likely next words and punctuation, which is how ElevenLabs explains the latency claim.
- Text conditioning, allowing the model to continue transcription from previous context after a reconnect.
- Two transcript finalization modes: manual commit and Voice Activity Detection. This separates fast partial text from committed text.
- Word-level timestamps.
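The split between fast partial text and committed text can be sketched on the client side as follows. This is a minimal illustration, not ElevenLabs' client logic: the event dictionaries, their `type` and `text` keys, and the `TranscriptBuffer` class are all hypothetical stand-ins for whatever the service actually returns.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TranscriptBuffer:
    """Keeps finalized segments separate from the latest in-progress partial."""

    committed: List[str] = field(default_factory=list)
    partial: str = ""

    def handle(self, event: Dict[str, str]) -> None:
        # Hypothetical event shape: {"type": "partial" | "committed", "text": ...}
        kind = event.get("type")
        text = event.get("text", "")
        if kind == "partial":
            # A newer partial replaces the previous one for the same segment.
            self.partial = text
        elif kind == "committed":
            # Committed text is final; the pending partial is cleared.
            self.committed.append(text)
            self.partial = ""

    def display(self) -> str:
        """Text to show the user: all committed segments plus any live partial."""
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)

    def context(self, max_chars: int = 200) -> str:
        """Tail of the committed text, e.g. to pass as context after a reconnect."""
        return " ".join(self.committed)[-max_chars:]


if __name__ == "__main__":
    buf = TranscriptBuffer()
    buf.handle({"type": "partial", "text": "hello wor"})
    buf.handle({"type": "committed", "text": "hello world"})
    buf.handle({"type": "partial", "text": "how are"})
    print(buf.display())
```

Only committed segments feed `context()`, on the assumption that text conditioning should be seeded with finalized text rather than a partial that may still change.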
Later client and API additions include keyterms and no_verbatim support, context deduplication, microphone device options, and native mute/unmute support in the client packages.
Use cases named in ElevenLabs materials: voice agents, meeting assistants, real-time captioning, multilingual live transcription, meeting note-taking, and live language translation. The March 2026 technical explainer includes a realtime translator demo built with Scribe v2 Realtime plus the Chrome Translator API.
An ElevenLabs FAQ states that realtime diarization is not a priority at the moment and that dual-channel support is not planned.
Language support
ElevenLabs' realtime pages state 90+ languages for Scribe v2 Realtime. Broader Scribe product pages market 99 languages for the Scribe family; the 99 figure refers to the wider Scribe brand or batch model rather than the realtime model specifically.
Performance and benchmarks
Vendor-reported: the launch post claims 93.5% accuracy across 30 commonly used European and Asian languages.
Vendor benchmark: realtime marketing pages depict Scribe v2 Realtime outperforming Gemini Flash 2.5, GPT-4o Mini, and Deepgram Nova 3 on a benchmark involving "500 hard samples." The published material does not include enough methodology detail to make that chart independently reproducible.
Third-party evaluation: Artificial Analysis' non-streaming benchmark places Scribe v2 at 2.2% AA-WER, ahead of GPT-4o Transcribe (4.0%), GPT-4o Mini Transcribe (4.5%), Deepgram Nova-3 (5.2%), and Rev AI (5.9%). This result applies to the batch, non-streaming Scribe v2 model, not specifically to Scribe v2 Realtime.
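For readers comparing the WER figures above, the sketch below computes a textbook word error rate: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is an assumption that this resembles what sits behind AA-WER; the reviewed material does not describe Artificial Analysis' text normalization or weighting, nor whether the vendor's 93.5% accuracy figure is defined as one minus WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")
```

Real benchmarks usually normalize punctuation, casing, and numerals before scoring, so absolute numbers from this sketch are not comparable to published leaderboards.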
The source's cross-vendor comparison, separating public latency claims, accuracy signals, and pricing:
| Provider / model | Public latency | Public accuracy signal | Language support | Real-time capability | Public pricing | Notable strengths | Notable weaknesses |
|---|---|---|---|---|---|---|---|
| ElevenLabs Scribe v2 Realtime | <150 ms | 93.5% accuracy across 30 common European and Asian languages | 90+ languages | Yes, WebSocket streaming, partial + committed transcripts | $0.39/hr PAYG; lower on annual Business; keyterms +20% | Low latency claim; multilingual; ElevenAgents/TTS integration; documented privacy controls | No public full architecture; publicly limited realtime diarization/dual-channel story; enterprise concurrency only partially disclosed |
| Google Cloud Speech-to-Text Chirp 3 | Streaming supported; no single ms figure in reviewed docs | Google says Chirp 3 improves accuracy and speed; no headline public WER in reviewed docs | Official Chirp 3 page lists 111 transcription locales / language codes across GA + Preview | Yes, StreamingRecognize supported in STT v2 | $0.016/min starting tier ($0.96/hr) | Broad locale coverage; GCP-native; diarization, auto language detection, speech adaptation | Public docs reviewed do not provide a simple apples-to-apples WER or latency figure |
| OpenAI gpt-realtime-whisper / whisper-1 | Low-latency realtime path with tunable delay; no fixed ms figure published in reviewed docs | No single public WER on reviewed OpenAI realtime docs; Whisper trained on 680k hours; standard transcription docs list 57 supported languages and note Whisper was trained on 98 | 57 listed in standard transcription docs; Whisper trained on 98 languages | Yes for gpt-realtime-whisper; whisper-1 is not natively streaming in the same way | $0.017/min realtime ($1.02/hr); standard gpt-4o-mini-transcribe is $0.003/min but not the realtime path | OpenAI ecosystem fit; tunable latency/accuracy tradeoff | No public fixed ms headline; realtime prompt steering limitations; public accuracy evidence less standardized in official docs |
| Microsoft Azure Speech | "Instant transcription with intermediate results"; no reviewed public ms figure | No headline public WER; Azure emphasizes customization and custom-speech optimization | 140+ languages and dialects | Yes, real-time, batch, and fast transcription | Search snippet shows $1/hr standard realtime, $0.18/hr batch, $1.20/hr custom realtime | Broad language coverage; enterprise stack; fine-tuning/custom speech | Public pricing page can be opaque by region/UI; no simple public ms/WER headline in reviewed sources |
| Deepgram Nova-3 | Sub-300 ms streaming | Deepgram says 54.2% WER reduction for streaming vs competitors; Artificial Analysis shows 5.2% AA-WER for Nova-3 (non-streaming benchmark) | 45+ languages on Nova models | Yes, streaming | $0.0077/min monolingual streaming ($0.462/hr); $0.0092/min multilingual streaming ($0.552/hr) | Mature streaming stack; multilingual and noisy-audio positioning; keyword prompting and diarization ecosystem | Language breadth lower than ElevenLabs/Google/Azure; flagship multilingual streaming is pricier than monolingual |
| AssemblyAI Universal-3 Pro Streaming | ~300 ms P50 / sub-300 ms | Vendor says best-in-class / most accurate streaming model; no single official WER figure in reviewed sources | 6 languages on flagship U3 Pro Streaming; 99 on Universal-2 async | Yes, secure WebSocket streaming | Official AssemblyAI materials put U3 Pro Streaming at $0.45/hr; lower-cost universal streaming at $0.15/hr | Streaming ergonomics; no hard caps on concurrent streams; voice-agent fit | Flagship streaming language set is much narrower than ElevenLabs' 90+ claim |
| Rev AI | Real-time streaming with low latency; no reviewed public ms figure | Rev markets high accuracy in noisy/far-field/telephony and cites "up to 77.4% gains" in challenging conditions; Artificial Analysis shows 5.9% AA-WER | 58+ async languages; 9+ streaming languages | Yes, realtime streaming + async | $0.20/hr English Reverb, $0.10/hr Reverb Turbo, $0.30/hr foreign language | Simple pricing; inexpensive; broad async availability | Streaming language breadth is much narrower; public latency disclosure is light |
Latency and throughput
Vendor-reported latency is under 150 ms; one realtime product page mentions under 100 ms, while the core launch and documentation narrative standardizes on under 150 ms.
For comparison, official public figures cited in the source are sub-300 ms for Deepgram streaming and about 300 ms P50 for AssemblyAI streaming. Google, Azure, OpenAI, and Rev support live or low-latency transcription but do not publish a single comparably explicit millisecond headline for their core STT offerings in the reviewed sources.
The only explicit realtime concurrency figure in the reviewed materials is an FAQ stating 30+ concurrent sessions for enterprise clients. A general self-serve concurrency policy is not published.
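Because no self-serve concurrency policy is published, a client may want to enforce its own ceiling on simultaneous sessions. The sketch below does that with an `asyncio.Semaphore`. The default of 30 simply echoes the enterprise figure quoted above and is not a documented limit for any particular plan; `run_sessions` and the placeholder session body are hypothetical.

```python
import asyncio
from typing import Iterable, List, Tuple


async def run_sessions(
    session_ids: Iterable[int],
    max_concurrent: int = 30,  # assumed ceiling; not a documented self-serve limit
) -> Tuple[List[int], int]:
    """Run all sessions while keeping at most `max_concurrent` open at once.

    Returns the finished session ids (in input order) and the peak number
    of sessions that were open simultaneously.
    """
    limit = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def one_session(session_id: int) -> int:
        nonlocal active, peak
        async with limit:
            active += 1
            peak = max(peak, active)
            # Placeholder for real work: open the socket, stream audio,
            # and read transcripts until the session ends.
            await asyncio.sleep(0)
            active -= 1
        return session_id

    results = await asyncio.gather(*(one_session(s) for s in session_ids))
    return list(results), peak


if __name__ == "__main__":
    finished, peak_open = asyncio.run(run_sessions(range(100)))
    print(len(finished), "sessions finished; peak concurrency", peak_open)
```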
Deployment and integrations
The generally available offering is cloud based: the ElevenLabs API, SDKs, JavaScript and React clients, and the Agents platform. ElevenLabs' broader speech-to-text marketing states that Scribe supports cloud and on-premise configurations, and the company has an early-access on-prem / on-device deployment program for selected models, but the on-prem materials do not explicitly name Scribe v2 Realtime.
By June 2026, ElevenAgents had changed its default ASR provider from elevenlabs to scribe_realtime.
Privacy and security disclosures: data is encrypted in transit and at rest; ElevenLabs states SOC 2 and GDPR compliance, offers HIPAA BAAs to qualifying enterprises, and provides EU, India, and Singapore data residency. Zero Retention Mode is exposed for Speech-to-Text by setting enable_logging=false on /v1/speech-to-text/* endpoints, which prevents request history from appearing and limits logging for sensitive workloads.
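A small helper can make the Zero Retention setting hard to forget. In the sketch below, the `enable_logging` name and the `/v1/speech-to-text/` path prefix come from the text above; treating the setting as a URL query parameter, the `api.example.com` host, the `realtime` path segment, and the `model_id` parameter are assumptions used only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_zero_retention(url: str) -> str:
    """Return `url` with enable_logging=false set, keeping other parameters."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    # Overrides any existing value so the result is always "false".
    query["enable_logging"] = "false"
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Placeholder host and illustrative path; substitute the real endpoint.
    base = "https://api.example.com/v1/speech-to-text/realtime?model_id=example"
    print(with_zero_retention(base))
```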
Pricing
- Pay-as-you-go: $0.39/hour for Speech to Text, with 2.5 hours included on the free/pay-as-you-go tier.
- Annual Business plans: $0.28/hour and lower, per the realtime product page.
- Keyterm prompting carries a 20% premium.
- One speech-to-text marketing page rounds the base price to $0.40/hour; the pricing page lists $0.39/hour.
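The list prices above translate into a simple cost estimate. The sketch below uses the $0.39/hour pay-as-you-go rate and the 20% keyterm premium from this section. Two things are assumptions: that the premium multiplies the hourly rate, and that any included hours are subtracted before billing. Actual invoices may be computed differently.

```python
from decimal import ROUND_HALF_UP, Decimal

PAYG_RATE = Decimal("0.39")        # USD per hour, pay-as-you-go list price
KEYTERM_PREMIUM = Decimal("0.20")  # 20% surcharge for keyterm prompting


def estimate_cost(
    hours,
    keyterms: bool = False,
    rate: Decimal = PAYG_RATE,
    included_hours: Decimal = Decimal("0"),
) -> Decimal:
    """Estimated USD cost for `hours` of audio, rounded to cents."""
    billable = max(Decimal(str(hours)) - included_hours, Decimal("0"))
    # Assumption: the keyterm premium scales the hourly rate.
    multiplier = Decimal("1") + (KEYTERM_PREMIUM if keyterms else Decimal("0"))
    total = billable * rate * multiplier
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(estimate_cost(100))                      # 39.00
    print(estimate_cost(100, keyterms=True))       # 46.80
    print(estimate_cost(100, rate=Decimal("0.28")))  # 28.00
```

Passing `included_hours=Decimal("2.5")` models the included allowance under the subtraction assumption stated above.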
Per the source's comparison of public rates, this pricing is below Google Cloud STT v2 entry pricing ($0.96/hr), Azure standard realtime transcription ($1/hr), OpenAI's realtime whisper model ($1.02/hr), AssemblyAI U3 Pro Streaming ($0.45/hr), and Deepgram's Nova-3 multilingual streaming rate ($0.552/hr), and above Rev AI's Reverb headline pricing ($0.20/hr English).
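The hourly figures in this comparison mix per-minute and per-hour list prices, so it helps to normalize them before ranking. The sketch below converts the per-minute rates quoted in the cross-vendor table to hourly rates and splits providers into those cheaper and those more expensive than the $0.39/hour baseline. The rates are the ones cited in this profile; the dictionary labels and helper name are illustrative.

```python
from decimal import Decimal


def per_hour(per_minute: str) -> Decimal:
    """Convert a per-minute USD rate (given as a string) to a per-hour rate."""
    return Decimal(per_minute) * 60


BASELINE = Decimal("0.39")  # Scribe v2 Realtime pay-as-you-go, USD per hour

# Hourly rates as cited in this profile; per-minute prices are converted.
RATES = {
    "Google Cloud STT v2 (entry)": per_hour("0.016"),
    "Azure standard realtime": Decimal("1.00"),
    "OpenAI realtime whisper": per_hour("0.017"),
    "AssemblyAI U3 Pro Streaming": Decimal("0.45"),
    "Deepgram Nova-3 monolingual": per_hour("0.0077"),
    "Deepgram Nova-3 multilingual": per_hour("0.0092"),
    "Rev AI Reverb (English)": Decimal("0.20"),
}

cheaper = sorted(name for name, rate in RATES.items() if rate < BASELINE)
pricier = sorted(
    (name for name, rate in RATES.items() if rate > BASELINE),
    key=lambda name: RATES[name],
)

if __name__ == "__main__":
    print("Cheaper than baseline:", cheaper)
    for name in pricier:
        print(f"{name}: ${RATES[name]:.3f}/hr")
```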
Development and ownership
Scribe v2 Realtime is developed by ElevenLabs. The company has not published a Scribe v2 Realtime-specific contributor roster comparable to the original Scribe announcement, so public attribution comes in layers.
The original Scribe launch names the core contributors to the underlying speech-to-text program: Flavio Schneider (research lead, training and architecture), Tim von Känel (project lead, pre-training and fine-tuning data), Maximiliano Levi (inference and optimizations), Johan Nordberg and Piotr Dabkowski (research contributors), Austin Malerba (frontend), Hristo Stoychev (backend), and Alex George (data acquisition). ElevenLabs author pages identify Flavio Schneider and Tim von Känel as members of the research team focused on ASR and music.
For Scribe v2 Realtime specifically, Tadas Petra authored the official technical deep-dive "How Scribe v2 Realtime Works" in March 2026. ElevenLabs does not publish a separate role label for him on its author page.
In SDK work, ElevenLabs' Python SDK release v2.46.0 credits @kraenhansen for adding keyterms and no_verbatim support to the Scribe realtime API; Kræn Hansen's GitHub profile describes his work as "Building Developer Experiences @elevenlabs."
| Public attribution layer | Named people / group | Publicly stated or inferred role |
|---|---|---|
| Core Scribe research foundation | Flavio Schneider | Research lead; training and architecture |
| Core Scribe research foundation | Tim von Känel | Project lead; pre-training and fine-tuning data |
| Core Scribe research foundation | Maximiliano Levi | Inference and optimizations |
| Core Scribe research foundation | Johan Nordberg, Piotr Dabkowski | Research contributors |
| Core Scribe engineering | Austin Malerba, Hristo Stoychev, Alex George | Frontend, backend, data acquisition |
| Realtime technical rollout | Tadas Petra | Author of official Scribe v2 Realtime technical guide |
| SDK/productization | Kræn Hansen | Realtime SDK contributor; developer experience |
| Publicly visible teams | Research, ElevenAPI/developer platform, ElevenAgents | Inference from docs/blog/changelog ownership and integration |
The source describes the team decomposition (Research, ElevenAPI/developer platform, ElevenAgents) as an inference from public materials rather than a published org chart.
Release history
The original Scribe launched in February 2025 with multilingual batch transcription, word-level timestamps, diarization, and audio-event tagging, and explicitly previewed a future low-latency version. In April 2025, ElevenLabs shipped scribe_v1_experimental. Scribe v2 Realtime was released in November 2025, followed by the batch Scribe v2 in January 2026. In June 2026, ElevenLabs formally deprecated scribe_v1 with a July 9, 2026 removal date, and ElevenAgents made scribe_realtime the default ASR provider.
| Date | Milestone | Why it matters |
|---|---|---|
| Feb 26, 2025 | Original Scribe launched | First STT model; realtime version promised |
| Apr 7, 2025 | scribe_v1_experimental preview | Improved multilingual files, silence handling, audio tags |
| Nov 11, 2025 | Scribe v2 Realtime released | Official release date for the live model |
| Jan 9, 2026 | Scribe v2 released | Batch/long-form v2 arrives after realtime v2 |
| Jan 19, 2026 | SDK improvements around useScribe | First visible post-launch package hardening |
| Mar 4, 2026 | "How Scribe v2 Realtime Works" published | Public technical explanation |
| Apr-May 2026 | keyterms, no_verbatim, context, mute/unmute added | Realtime usability and control improved |
| Jun 8, 2026 | scribe_v1 deprecated; scribe_realtime default in ElevenAgents | Realtime becomes the default ASR direction inside agents |
Sources
- Introducing Scribe v2 Realtime - https://elevenlabs.io/blog/introducing-scribe-v2-realtime
- ElevenLabs - https://elevenlabs.io/realtime-speech-to-text
- Speech to Text - Most Accurate Speech to Text Model - https://elevenlabs.io/speech-to-text
- Models | ElevenLabs Documentation - https://elevenlabs.io/docs/overview/models
- ElevenLabs - Meet Scribe the world's most accurate ASR model - https://elevenlabs.io/blog/meet-scribe
- ElevenAPI Pricing for creators and businesses of all sizes - https://elevenlabs.io/pricing/api
- How Scribe v2 Realtime Works - https://elevenlabs.io/blog/how-scribe-v2-realtime-works
- April 7, 2025 | ElevenLabs Documentation - https://elevenlabs.io/docs/changelog/2025/4/7
- Introducing Scribe v2 - https://elevenlabs.io/blog/introducing-scribe-v2
- January 19, 2026 | ElevenLabs Documentation - https://elevenlabs.io/docs/changelog/2026/1/19
- Changelog | ElevenLabs Documentation - https://elevenlabs.io/docs/changelog
- June 8, 2026 | ElevenLabs Documentation - https://elevenlabs.io/docs/changelog/2026/6/8
- Releases · elevenlabs/elevenlabs-python · GitHub - https://github.com/elevenlabs/elevenlabs-python/releases
- Realtime | ElevenLabs Documentation - https://elevenlabs.io/docs/api-reference/speech-to-text/v-1-speech-to-text-realtime
- Introducing Whisper | OpenAI - https://openai.com/index/whisper/
- ElevenAPI - ElevenLabs AI audio APIs - https://elevenlabs.io/api
- Chirp 3 Transcription: Enhanced multilingual accuracy | Cloud Speech-to-Text | Google Cloud Documentation - https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
- Realtime transcription | OpenAI API - https://developers.openai.com/api/docs/guides/realtime-transcription
- Speech to Text Overview - Speech Service - Foundry Tools - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
- Measuring STT Latency | Deepgram's Docs - https://developers.deepgram.com/docs/measuring-streaming-latency
- Realtime Speech-to-Text API | AssemblyAI - https://www.assemblyai.com/products/streaming-speech-to-text
- Speech-to-Text API At Scale - https://www.rev.ai/speech-to-text
- ElevenLabs: API Provider Benchmarking & Analysis - https://artificialanalysis.ai/speech-to-text/models/elevenlabs
- Models & Languages Overview | Deepgram's Docs - https://developers.deepgram.com/docs/models-languages-overview