OpenTranscription/ Blog
2026-07-03 · MODEL PROFILE

Ink-Whisper: model profile

Reference profile of Ink-Whisper, Cartesia's Whisper-derived streaming speech-to-text model for real-time voice agents, launched June 10, 2025.

Ink-Whisper is a proprietary speech-to-text model from Cartesia, derived from OpenAI's whisper-large-v3-turbo and adapted for real-time conversational transcription.

Specifications

Developer: Cartesia
Released: June 10, 2025 (announcement); stable snapshot ink-whisper-2025-06-04 dated June 4, 2025
Model type: Speech-to-text; variant of OpenAI whisper-large-v3-turbo with dynamic chunking
Training data: Not publicly disclosed beyond the Whisper baseline; the original Whisper paper describes an encoder-decoder Transformer trained on 680,000 hours of multilingual, multitask supervision
Languages: 100 supported languages (ISO-639-1 codes)
Modes (batch / streaming): Batch (/stt) and realtime streaming (/stt/websocket)
Latency: Vendor-reported: 66 ms median TTCT (June 2025). Third-party evaluation: 266 ms median time to final segment (June 2026)
Deployment: Hosted API; self-hosting in customer cloud or on-prem, including Kubernetes and air-gapped installs with an offline license
Pricing: 1 credit per second realtime, 1 credit per 2 seconds batch; marketed at launch as about $0.13/hour on the Scale plan
License: Proprietary API service; terms prohibit downloading, copying, licensing, selling, or creating derivative works from models obtained through the service, except where expressly permitted

Not disclosed: Parameters · Throughput / concurrency

Overview

Ink-Whisper was Cartesia's first publicly launched speech-to-text model in the Ink family, announced on June 10, 2025. Cartesia's documentation lists the stable snapshot as ink-whisper-2025-06-04. Cartesia describes the model as a variant of OpenAI's whisper-large-v3-turbo. The key disclosed modification is dynamic chunking, which lets the model process semantically meaningful, variable-length audio fragments instead of relying on Whisper's native 30-second chunks.

Cartesia positioned Ink-Whisper as the input layer for production voice agents rather than as a generic transcription model for archival audio. The company's stated motivation was that standard Whisper was built for bulk processing rather than live dialogue, and that voice AI agents need to respond naturally after a person stops speaking. Launch messaging centered on time-to-complete-transcript (TTCT) rather than classic throughput.

On May 22, 2026, Cartesia released ink-2, and its documentation moved Ink-Whisper into the "Older Models" section while keeping it marked stable.

Capabilities and features

  • Real-time streaming transcription optimized for conversational voice agents, with TTCT as the primary latency measure.
  • Dynamic chunking: processing of shorter, semantically meaningful segments instead of fixed 30-second chunks. Cartesia states this reduces errors and hallucinations during silence and fragmented real-time audio.
  • Batch transcription through the /stt endpoint.
  • Support for common audio containers: WAV, MP3, FLAC, OGG, M4A, MP4, and WebM.
  • Optional word-level timestamps. Outputs include transcript text, language, and duration.
  • Intended use cases documented in official materials: customer-service voice agents, structured-data capture, phone-call transcription, and enterprise voice workflows involving dates, alphanumerics, IDs, proper nouns, and noisy telephony conditions.
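
Cartesia has not published how dynamic chunking is implemented, so the following is only a toy illustration of the general idea named in the list above: cutting audio into variable-length segments at pauses instead of fixed windows. Energy-threshold silence detection is a stand-in technique chosen for brevity, not Cartesia's method, and every name and default here (split_on_silence, the 20 ms frame, the thresholds) is hypothetical.

```python
from typing import List, Tuple


def split_on_silence(
    frame_energies: List[float],
    frame_ms: int = 20,
    silence_threshold: float = 0.01,
    min_silence_frames: int = 10,
    max_chunk_ms: int = 30_000,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) chunks cut at pauses, capped at max_chunk_ms."""
    chunks: List[Tuple[int, int]] = []
    start = None      # index of the first speech frame in the open chunk
    silent_run = 0    # consecutive silent frames seen inside the open chunk
    for i, energy in enumerate(frame_energies):
        is_speech = energy >= silence_threshold
        if start is None:
            if is_speech:          # a chunk only opens on speech
                start = i
                silent_run = 0
            continue
        silent_run = 0 if is_speech else silent_run + 1
        too_long = (i + 1 - start) * frame_ms >= max_chunk_ms
        if silent_run >= min_silence_frames or too_long:
            end = i + 1 - silent_run   # drop the trailing silence
            chunks.append((start * frame_ms, end * frame_ms))
            start = None
            silent_run = 0
    if start is not None:              # flush the chunk still open at the end
        end = len(frame_energies) - silent_run
        chunks.append((start * frame_ms, end * frame_ms))
    return chunks


# One second of speech, a 300 ms pause, then half a second of speech.
energies = [0.5] * 50 + [0.0] * 15 + [0.5] * 25 + [0.0] * 5
print(split_on_silence(energies))  # [(0, 1000), (1300, 1800)]
```

The max_chunk_ms cap mirrors the 30-second ceiling of the fixed-window approach, so a long utterance without pauses still gets cut; a production system would presumably use a learned voice-activity or semantic boundary signal rather than raw energy.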

Confidence scores and token-level posteriors are not listed among the documented outputs.
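
For readers wiring the documented outputs into an agent, a small typed container can make the shape explicit. This is a sketch under assumptions: the JSON field names (text, language, duration, words, word, start, end) are illustrative guesses and should be checked against Cartesia's API reference. Consistent with the note above, there is no confidence field.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    word: str
    start: float  # seconds from the start of the audio (assumed unit)
    end: float


@dataclass
class Transcript:
    text: str
    language: Optional[str]
    duration: Optional[float]
    words: List[Word] = field(default_factory=list)  # empty unless timestamps were requested


def parse_transcript(payload: str) -> Transcript:
    """Parse a JSON transcript payload using the assumed field names."""
    data = json.loads(payload)
    words = [
        Word(w["word"], float(w["start"]), float(w["end"]))
        for w in data.get("words") or []
    ]
    return Transcript(
        text=data["text"],
        language=data.get("language"),
        duration=data.get("duration"),
        words=words,
    )


example = '{"text": "order A47", "language": "en", "duration": 1.2}'
print(parse_transcript(example).text)  # order A47
```

Treating words as optional keeps the same parser usable whether or not word-level timestamps are requested.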

Language support

Cartesia's documentation lists 100 supported languages for ink-whisper, including English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Hindi, Arabic, Hebrew, Thai, Welsh, Bengali, Swahili, Yiddish, Javanese, Sundanese, and Cantonese (yue). Per-language quality breakdowns and language-specific fine-tuning details are not publicly disclosed.

Performance and benchmarks

Vendor-reported: launch benchmark (June 2025)

Cartesia's launch benchmark reported a 66 ms median TTCT for Ink-Whisper, compared with 74 ms for Deepgram Nova3 Streaming, 70 ms for Fireworks Whisper Streaming, and 737 ms for AssemblyAI Universal Streaming.

Cartesia's launch-time WER slices (lower is better):

| Official launch-time WER slices | Ink-Whisper | Deepgram Nova3 Streaming | Fireworks Whisper Streaming | Assembly Streaming |
| --- | --- | --- | --- | --- |
| Phone calls | 0.19 | 0.18 | 0.28 | 0.23 |
| Proper nouns | 0.065 | 0.045 | 0.071 | 0.044 |
| Background noises | 0.033 | 0.038 | 0.099 | 0.027 |
| Disfluencies | 0.064 | 0.055 | 0.156 | 0.137 |
| Speech Accent Archive subset | 0.015 | 0.024 | 0.014 | 0.016 |

In these vendor-reported slices as tabulated above, Ink-Whisper is competitive rather than lowest on every test: it ranks second on phone calls (0.19 versus 0.18 for Deepgram Nova3 Streaming), background noise (0.033 versus 0.027 for Assembly Streaming), disfluencies (0.064 versus 0.055 for Deepgram Nova3 Streaming), and accent data (0.015 versus 0.014 for Fireworks Whisper Streaming), and third on proper nouns (0.065). Cartesia also states that Ink-Whisper outperformed baseline whisper-large-v3-turbo, but the launch post does not publish the baseline's row-by-row scores.
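
For orientation, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, so a value of 0.19 would read as roughly 19 errors per 100 reference words. The sketch below implements that textbook definition. The text normalization used in Cartesia's benchmark is not described in this profile, so the lowercase-and-split step is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()   # assumed normalization: lowercase + whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("my order id is a b four seven",
                      "my order id is a b for seven"))  # 0.125
```

A single substituted word in an eight-word reference gives 1/8, which is why short alphanumeric utterances can swing slice-level WER sharply.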

Third-party evaluation: public Pipecat benchmark as republished by Soniox (June 2026)

The Soniox benchmark page republishes results from the open Pipecat STT benchmark, which uses 1,000 real-world streaming samples and reports semantic WER plus time to final segment.

| External benchmark snapshot in June 2026 | Price per hour | Mean semantic WER | Perfect transcripts | Median time to final segment |
| --- | --- | --- | --- | --- |
| Cartesia ink-2 | $0.43 | 1.47% | 84.2% | 299 ms |
| Deepgram nova-3-general | $0.55 | 1.71% | 76.5% | 247 ms |
| ElevenLabs scribe_v2_realtime | $0.39 | 3.16% | 81.3% | 281 ms |
| OpenAI gpt-4o-transcribe | Not listed | 3.24% | 75.9% | 637 ms |
| Cartesia ink-whisper | Not listed | 3.92% | 60.5% | 266 ms |
| Google latest-long | $0.96 | 2.84% | 69.0% | 878 ms |

In this June 2026 snapshot, Ink-Whisper recorded a 266 ms median time to final segment, a 3.92% mean semantic WER, and a 60.5% perfect-transcript rate, behind newer systems including Cartesia ink-2, Deepgram Nova-3, and ElevenLabs Scribe on the accuracy measures.
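
The three headline columns are aggregates over per-sample results. The sketch below shows one plausible way such a summary could be computed. It does not reproduce the Pipecat benchmark's code: the record keys (semantic_wer, final_segment_ms) are hypothetical, the semantic scoring itself is not implemented, and treating a "perfect transcript" as a sample with zero semantic WER is an assumption.

```python
from statistics import mean, median
from typing import Dict, List


def summarize(samples: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-sample scores into the three reported columns."""
    if not samples:
        raise ValueError("need at least one sample")
    wers = [s["semantic_wer"] for s in samples]           # fraction per sample
    latencies = [s["final_segment_ms"] for s in samples]  # milliseconds per sample
    perfect = sum(1 for w in wers if w == 0.0)            # assumed definition of "perfect"
    return {
        "mean_semantic_wer_pct": 100 * mean(wers),
        "perfect_transcripts_pct": 100 * perfect / len(samples),
        "median_final_segment_ms": median(latencies),
    }


# Illustrative numbers only, not benchmark data.
demo = [
    {"semantic_wer": 0.0, "final_segment_ms": 250},
    {"semantic_wer": 0.0, "final_segment_ms": 266},
    {"semantic_wer": 0.10, "final_segment_ms": 300},
    {"semantic_wer": 0.02, "final_segment_ms": 240},
]
print(summarize(demo))
```

Mean WER and perfect-transcript rate can diverge, as they do for ElevenLabs and Deepgram in the table: a model can be error-free on more samples yet have a higher mean if its failures are larger.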

Research usage

There is no dedicated Cartesia paper for Ink-Whisper in the sources reviewed. The model is included as an API-hosted STT component in EVA-Bench, an end-to-end voice-agent evaluation framework on arXiv.

Latency and throughput

Vendor-reported: 66 ms median TTCT at launch (June 2025), against 74 ms for Deepgram Nova3 Streaming, 70 ms for Fireworks Whisper Streaming, and 737 ms for AssemblyAI Universal Streaming.

Third-party evaluation: 266 ms median time to final segment in the public Pipecat benchmark as of June 2026.

Throughput and concurrency figures are not publicly disclosed.
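
As a rough guide to what a TTCT-style number captures, the sketch below takes the gap between the moment the speaker stops and the moment the complete transcript arrives, then reports the median across utterances. That reading of TTCT, the function name, and the event format are assumptions made for illustration; Cartesia's exact measurement procedure is not described in this profile.

```python
from statistics import median
from typing import Iterable, Tuple


def median_ttct_ms(events: Iterable[Tuple[float, float]]) -> float:
    """Median gap in ms between end of speech and the final transcript.

    events: (speech_end_s, final_transcript_s) pairs measured on the same clock.
    """
    deltas = []
    for speech_end, final_at in events:
        if final_at < speech_end:
            raise ValueError("final transcript cannot precede end of speech")
        deltas.append((final_at - speech_end) * 1000.0)
    if not deltas:
        raise ValueError("need at least one utterance")
    return median(deltas)


# Illustrative timings only, not measured data.
print(round(median_ttct_ms([(1.000, 1.060), (2.500, 2.566), (4.000, 4.090)])))  # 66
```

The 66 ms and 266 ms figures above come from different benchmarks with differently named metrics, so a shared formula like this one does not make them directly comparable.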

Deployment and integrations

  • Hosted API endpoints: batch transcription on /stt and realtime streaming on /stt/websocket.
  • Self-hosting in customer cloud or on-prem environments; "Ink Whisper" is listed as supported on Kubernetes; some self-hosted scenarios can run air-gapped with an offline license.
  • Enterprise customers can opt into Zero Data Retention (ZDR) for STT and TTS inference, under which audio input and transcript output are not retained, although operational metadata still is.
  • Integrations at or after launch: Vapi (same-week availability), Pipecat, and LiveKit; Cartesia's docs maintain partner pages for LiveKit, Pipecat, Tencent RTC, and Thoughtly; ServiceNow later publicized use of Ink-Whisper in its AI Voice Agents stack.
  • By August 2025, Cartesia said tens of thousands of developers had built voice agents with Sonic and Ink over the prior year, before launching Line as a code-first voice-agent platform.
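
To make the realtime endpoint in the first bullet concrete, the sketch below builds a connection URL for /stt/websocket and slices raw PCM audio into fixed-duration frames for streaming. Only the path comes from this profile. The host, the query parameter names (model, language, encoding, sample_rate), and the 100 ms frame size are assumptions to verify against Cartesia's API reference, and authentication is deliberately left out.

```python
from typing import Iterator
from urllib.parse import urlencode

BASE_WS = "wss://api.cartesia.ai"  # assumed host; not stated in this profile


def build_stream_url(
    model: str = "ink-whisper",
    language: str = "en",
    encoding: str = "pcm_s16le",   # assumed parameter name and value
    sample_rate: int = 16000,
) -> str:
    """Build the realtime endpoint URL with assumed query parameters."""
    query = urlencode({
        "model": model,
        "language": language,
        "encoding": encoding,
        "sample_rate": sample_rate,
    })
    return f"{BASE_WS}/stt/websocket?{query}"


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,
    frame_ms: int = 100,
    bytes_per_sample: int = 2,     # 16-bit mono samples
) -> Iterator[bytes]:
    """Yield consecutive frames of frame_ms audio; the last one may be shorter."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


print(build_stream_url())
print([len(f) for f in pcm_frames(bytes(8000))])  # [3200, 3200, 1600]
```

In a real client each frame would go out as a binary WebSocket message. The client library and the end-of-stream signalling are omitted because this profile does not document them.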

Documentation inconsistency: Cartesia's endpoint-comparison page says /stt/turns/websocket supports ink-2 only, while the pricing page lists a realtime-turns price for ink-whisper. The safest reading is that Ink-Whisper is supported on the manual realtime and batch endpoints, and that auto-turn endpoint support was either transitional or inconsistently documented at the time of review.

Pricing

Pricing has two layers: credits and subscription tiers.

| Item | Rate |
| --- | --- |
| Realtime STT (/stt/websocket and /stt/turns/websocket) | 1 credit per second |
| Batch STT (/stt) | 1 credit per 2 seconds |

Subscription plans on the public pricing page: Free (20K credits), Pro (100K credits plus a commercial-use license), Startup (1.25M credits), Scale (8M credits), and Enterprise. At launch, Cartesia marketed Ink-Whisper as "just 1 credit per second," or about $0.13/hour on the Scale plan.
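
The credit rates and the marketed hourly figure can be tied together with a little arithmetic. The sketch below derives an implied per-credit price from the $0.13/hour launch claim rather than from any plan price, since plan prices are not given here. Rounding usage up to a whole credit is an assumption about billing, not a documented rule.

```python
import math

# Rates from the pricing table: 1 credit/s realtime, 1 credit per 2 s batch.
CREDITS_PER_SECOND = {"realtime": 1.0, "batch": 0.5}
MARKETED_USD_PER_REALTIME_HOUR = 0.13  # launch-time figure for the Scale plan


def credits_for(duration_s: float, mode: str) -> int:
    """Credits consumed for an audio duration; rounding up is an assumption."""
    return math.ceil(duration_s * CREDITS_PER_SECOND[mode])


def implied_usd_per_credit() -> float:
    """Per-credit price implied by the marketed realtime hourly figure."""
    return MARKETED_USD_PER_REALTIME_HOUR / credits_for(3600, "realtime")


def estimated_cost_usd(duration_s: float, mode: str) -> float:
    return credits_for(duration_s, mode) * implied_usd_per_credit()


print(credits_for(3600, "realtime"))                  # 3600
print(credits_for(3600, "batch"))                     # 1800
print(round(estimated_cost_usd(3600, "batch"), 3))    # 0.065
print(round(8_000_000 / credits_for(3600, "realtime")))  # 2222
```

On these assumptions, batch transcription costs half as much per hour of audio as realtime, and the Scale plan's 8M credits cover roughly 2,222 hours of realtime audio.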

Default data-rights terms: unless otherwise agreed, Cartesia's terms allow the company to use inputs, outputs, and user interactions to train and improve its models. ZDR changes retention behavior but is restricted to enterprise plans.

Development and ownership

Ink-Whisper was developed by Cartesia, founded in 2023 out of the Stanford AI Lab. The company states its mission as "architecting AI that learns and interacts like humans" and positions state space models (SSMs) as its architectural approach; its research lineage includes HiPPO, Efficiently Modeling Long Sequences with Structured State Spaces, and Mamba. Cartesia's product line comprises Sonic (text-to-speech), Ink (speech-to-text), and Line (voice agents), with additional capabilities including voice cloning, voice changing, and self-hosted deployments.

| Person | Public role | Relevant connection to Ink-Whisper | Affiliation context |
| --- | --- | --- | --- |
| Karan Goel | CEO & founder | Leads Cartesia strategy and product direction | Stanford AI Lab background; coauthor on Cartesia audio/SSM research |
| Albert Gu | Chief Scientist & co-founder | Scientific leadership across Cartesia's model architecture work | Key figure in Mamba and earlier SSM research |
| Arjun Desai | Co-founder | Authored Ink launch post and public launch announcement | Stanford PhD; affiliated with Stanford AI Lab, CRFM, AIMI via personal site |
| Brandon Yang | Co-founder | Founding leadership across product/company evolution | Publicly presented as part of founding team in company/investor materials |
| Chris Ré | Founding team member | Research and Stanford lab lineage behind company formation | Stanford professor; associated with Cartesia's academic roots |

Cartesia does not publicly identify a named "Ink-Whisper lab" or publish a model-specific author list; the launch post and public launch announcement are attributed to Arjun Desai.

Funding: TechCrunch reported a $22 million round led by Index Ventures in December 2024, bringing total raised to $27 million. Fortune reported a $64 million Series A in March 2025, bringing total funding to $91 million, and said Cartesia counted Quora, Cresta, and Rasa among its customers at that stage.

Release history

| Release item | Public date | Status as of 2026-06-15 | Publicly documented notes |
| --- | --- | --- | --- |
| Ink family announcement | 2025-06-10 | Historical milestone | Ink introduced as Cartesia's STT family; Ink-Whisper was the debut model |
| ink-whisper-2025-06-04 | 2025-06-04 | Stable, older model | Docs list one stable snapshot; 100 languages; positioned as most affordable Ink model |
| ink-whisper alias | 2025-06-10 onward | Active model family alias in APIs/docs | Used in batch and manual realtime docs |
| ink-2 | 2026-05-22 | Stable flagship | Successor marketed as faster and more accurate, with native turn detection |

The stable snapshot ink-whisper-2025-06-04 predates the June 10, 2025 announcement by six days. After the May 22, 2026 release of ink-2, Cartesia's documentation lists Ink-Whisper under "Older Models" while keeping it marked stable.
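
The six-day gap can be checked directly from the snapshot identifier. This small sketch assumes the trailing YYYY-MM-DD in the ID is the snapshot date, which matches how the date is reported above; the helper name is hypothetical.

```python
import re
from datetime import date


def snapshot_date(model_id: str) -> date:
    """Extract a trailing YYYY-MM-DD date from a snapshot identifier."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})$", model_id)
    if match is None:
        raise ValueError(f"no date suffix in {model_id!r}")
    year, month, day = map(int, match.groups())
    return date(year, month, day)


announcement = date(2025, 6, 10)
gap_days = (announcement - snapshot_date("ink-whisper-2025-06-04")).days
print(gap_days)  # 6
```

Pinning the dated snapshot rather than the bare ink-whisper alias is the usual way to keep transcription behavior fixed if the alias is later repointed.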
