Ink-Whisper: how Cartesia rebuilt Whisper for real-time voice agents
What Cartesia's Ink-Whisper got right on latency, where its accuracy fell behind by 2026, and why it mattered more as a stepping stone than a benchmark.

Ink-Whisper was Cartesia's first public speech-to-text model, announced on June 10, 2025 as a Whisper derivative reworked for live conversation rather than bulk transcription. It had a short run at the front. By May 22, 2026 Cartesia had shipped ink-2 and moved Ink-Whisper into the "Older Models" section of its docs, still marked stable but clearly superseded. That arc, from headline launch to supported legacy model in under a year, is worth studying if you build voice agents, because Ink-Whisper is a clean case of a company optimizing for one metric (time to complete transcript) and winning on it, while the field moved the goalposts on accuracy underneath it.
It was also the bridge product. Cartesia made its name with Sonic, an ultra-low-latency text-to-speech model, and Ink-Whisper was the move from voice-out to voice-in, the missing input layer that let the company later pitch a full voice-agent platform.
Where Ink-Whisper came from
Cartesia did not build an STT model on a whim. Its own State of Voice AI report from late 2024 laid out the diagnosis: modern voice agents run on an STT-to-LLM-to-TTS pipeline, and speech recognition still had unsolved pain points around domain-specific terms and far-field audio. Ink-Whisper reads as a direct answer to that report. Sonic had already made Cartesia a preferred choice for developers who cared about speed and realism on the output side; Ink extended the same latency-first philosophy to the input side.
The dates are slightly messier than the press cycle suggests. The formal launch was June 10, 2025, but Cartesia's docs list the stable snapshot as ink-whisper-2025-06-04, so the release artifact predated the announcement by six days. Here is the documented lineage:
| Release item | Public date | Status as of 2026-06-15 | Publicly documented notes |
|---|---|---|---|
| Ink family announcement | 2025-06-10 | Historical milestone | Ink introduced as Cartesia's STT family; Ink-Whisper was the debut model |
| ink-whisper-2025-06-04 | 2025-06-04 | Stable, older model | Docs list one stable snapshot; 100 languages; positioned as most affordable Ink model |
| ink-whisper alias | 2025-06-10 onward | Active model family alias in APIs/docs | Used in batch and manual realtime docs |
| ink-2 | 2026-05-22 | Stable flagship | Successor marketed as faster and more accurate, with native turn detection |
The positioning of that last row matters. Cartesia never framed Ink-Whisper as an abandoned experiment. It stayed a supported legacy model that a new flagship had overtaken.
What Cartesia actually disclosed, and what it didn't
The public technical record on Ink-Whisper is thin, and it pays to be precise about the boundary. Cartesia says it started from OpenAI's whisper-large-v3-turbo, chosen because Whisper was already widely used, open source, and comparatively cheap to run. The one modification Cartesia chose to describe in public is dynamic chunking. Stock Whisper was trained on 30-second audio segments, and the original Whisper paper is explicit that long-form transcription needs buffered chunking strategies layered on top. Ink-Whisper instead processes shorter, semantically meaningful variable-length fragments, which Cartesia claims reduces errors and hallucinations during silence and fragmented real-time audio.
That is roughly the whole disclosure. There is no Ink-Whisper paper in Cartesia's research index, no public model card, no parameter count, and no training-data description beyond the Whisper baseline comparison. Separating what is on the record from what isn't:
| Aspect | Publicly disclosed | Unspecified in reviewed public sources |
|---|---|---|
| Base model | Variant of OpenAI whisper-large-v3-turbo | Exact checkpointing flow, whether additional architectures were added |
| Core optimization | Dynamic chunking for variable-length audio and semantically meaningful boundaries | Full decoding strategy, batching policy, endpointing internals |
| Original Whisper basis | Encoder-decoder Transformer trained on 680,000 hours of multilingual, multitask supervision | How much of Whisper's original data/objective remains unchanged after Cartesia adaptation |
| Languages | 100 ISO-639-1 language codes in docs for ink-whisper | Per-language quality breakdown, language-specific fine-tuning details |
| Outputs | Transcript text, language, duration, optional word timestamps | Confidence scores, token-level posteriors |
| Data transparency | None beyond "baseline Whisper" comparison and use-case eval sets | Training corpus size, provenance, licensing, annotation pipeline, model size |
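Because Cartesia has not published its chunking algorithm, the contrast between fixed 30-second windows and variable-length fragments can only be illustrated generically. The sketch below is a plain pause-based segmenter, not Ink-Whisper's method: the function names, the energy threshold, the 20 ms frame size, and the 200 ms pause rule are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def variable_length_chunks(
    samples: List[float],
    sample_rate: int = 16_000,
    frame_ms: int = 20,                # hypothetical analysis frame
    silence_threshold: float = 1e-4,   # hypothetical energy cutoff
    min_silence_frames: int = 10,      # 10 x 20 ms = 200 ms pause closes a chunk
    max_chunk_s: float = 30.0,         # stay within Whisper's 30-second window
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of speech chunks cut at pauses."""
    frame_len = sample_rate * frame_ms // 1000
    max_frames = int(max_chunk_s * 1000 / frame_ms)
    n_frames = len(samples) // frame_len
    chunks: List[Tuple[int, int]] = []
    start = None      # first frame of the open chunk, None while idle
    quiet_run = 0     # consecutive quiet frames inside the open chunk

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        is_speech = frame_energy(frame) >= silence_threshold
        if start is None:
            if is_speech:              # speech onset opens a new chunk
                start, quiet_run = i, 0
            continue
        quiet_run = 0 if is_speech else quiet_run + 1
        paused = quiet_run >= min_silence_frames
        too_long = (i + 1 - start) >= max_frames
        if paused or too_long:
            end = i + 1 - quiet_run    # trim trailing silence
            chunks.append((start * frame_len, end * frame_len))
            start, quiet_run = None, 0

    if start is not None:              # flush a chunk still open at end of audio
        end = n_frames - quiet_run
        chunks.append((start * frame_len, end * frame_len))
    return chunks
```

A production system would likely use a trained voice-activity or boundary model rather than raw energy, but the shape is the same: chunks end where the speaker pauses, so silence is never padded out to a fixed window and handed to the decoder.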
On the practical surface, the docs are more forthcoming. Ink-Whisper supports 100 languages, including English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Hindi, Arabic, Hebrew, Thai, Welsh, Bengali, Swahili, Yiddish, Javanese, Sundanese, and Cantonese (yue). The STT API accepts common containers such as WAV, MP3, FLAC, OGG, M4A, MP4, and WebM, with optional word-level timestamps.

The benchmark story: fast at launch, overtaken within a year
Cartesia built the launch around a metric most STT vendors were not leading with: time to complete transcript (TTCT), the delay between the last word a person speaks and the finished transcript. For a voice agent, that delay decides whether the system feels attentive and interruptible or feels like it is waiting for a fax. Cartesia's official benchmark claimed a 66 ms median TTCT for Ink-Whisper, against 74 ms for Deepgram Nova3 Streaming, 70 ms for Fireworks Whisper Streaming, and 737 ms for AssemblyAI Universal Streaming.
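The metric itself is simple to compute once you log two timestamps per utterance. The following is a minimal sketch, assuming you record the end of speech and the arrival of the final transcript in milliseconds; the function name and the sample timestamps are hypothetical and are not drawn from Cartesia's benchmark harness.

```python
from statistics import median
from typing import Iterable, Tuple


def median_ttct_ms(events: Iterable[Tuple[int, int]]) -> float:
    """Median delay between end of speech and final transcript.

    Each event is (speech_end_ms, final_transcript_ms) for one utterance.
    """
    delays = []
    for speech_end_ms, final_transcript_ms in events:
        if final_transcript_ms < speech_end_ms:
            raise ValueError("final transcript cannot precede end of speech")
        delays.append(final_transcript_ms - speech_end_ms)
    return median(delays)


# Made-up timestamps for three utterances; the third one stalls.
example = [(1_000, 1_060), (2_000, 2_070), (3_000, 3_500)]
print(median_ttct_ms(example))  # 70
```

Reporting the median rather than the mean keeps one stalled utterance from dominating the number: in the made-up data above the mean delay is 210 ms while the median is 70 ms.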
The accuracy numbers in the same launch post are more honest than the headline framing. Cartesia published word error rate (WER) slices across five conditions:
| Condition (launch-time WER) | Ink-Whisper | Deepgram Nova3 Streaming | Fireworks Whisper Streaming | AssemblyAI Universal Streaming |
|---|---|---|---|---|
| Phone calls | 0.19 | 0.18 | 0.28 | 0.23 |
| Proper nouns | 0.065 | 0.045 | 0.071 | 0.044 |
| Background noises | 0.033 | 0.038 | 0.099 | 0.027 |
| Disfluencies | 0.064 | 0.055 | 0.156 | 0.137 |
| Speech Accent Archive subset | 0.015 | 0.024 | 0.014 | 0.016 |
Lower is better. As tabulated, Ink-Whisper sat close to the best score on background noise (0.033 against AssemblyAI's 0.027) and disfluent speech (0.064 against Deepgram's 0.055), the messy-audio conditions that break stock Whisper deployments, and it was far ahead of Fireworks Whisper Streaming on both. On phone calls, proper nouns, and accented speech it was competitive rather than dominant. Cartesia also stated separately that Ink-Whisper beat baseline whisper-large-v3-turbo, though the post never published the baseline's row-by-row scores.
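The table can also be read mechanically rather than by eye. The sketch below copies the figures exactly as tabulated above, ranks the vendors within each condition, and measures Ink-Whisper's WER against Fireworks Whisper Streaming, the other Whisper-based streaming entry. The shortened vendor labels and the helper names are mine, not Cartesia's.

```python
# WER figures copied from the launch-time table above (lower is better).
WER = {
    "Phone calls": {"Ink-Whisper": 0.19, "Deepgram": 0.18, "Fireworks": 0.28, "AssemblyAI": 0.23},
    "Proper nouns": {"Ink-Whisper": 0.065, "Deepgram": 0.045, "Fireworks": 0.071, "AssemblyAI": 0.044},
    "Background noises": {"Ink-Whisper": 0.033, "Deepgram": 0.038, "Fireworks": 0.099, "AssemblyAI": 0.027},
    "Disfluencies": {"Ink-Whisper": 0.064, "Deepgram": 0.055, "Fireworks": 0.156, "AssemblyAI": 0.137},
    "Speech Accent Archive subset": {"Ink-Whisper": 0.015, "Deepgram": 0.024, "Fireworks": 0.014, "AssemblyAI": 0.016},
}


def rank(model: str, scores: dict) -> int:
    """1-based position of `model` when vendors are sorted by ascending WER."""
    return sorted(scores, key=scores.get).index(model) + 1


def relative_reduction(model: str, baseline: str, scores: dict) -> float:
    """Fraction by which `model` lowers WER relative to `baseline`."""
    return 1.0 - scores[model] / scores[baseline]


def summarize(model: str = "Ink-Whisper", baseline: str = "Fireworks") -> dict:
    """Per condition: (rank of model, best vendor, reduction vs. baseline)."""
    return {
        condition: (
            rank(model, scores),
            min(scores, key=scores.get),
            round(relative_reduction(model, baseline, scores), 3),
        )
        for condition, scores in WER.items()
    }


for condition, (position, best, reduction) in summarize().items():
    print(f"{condition}: rank {position}, best={best}, vs Fireworks {reduction:+.1%}")
```

Ranks hide the size of the gaps, which is why the relative reduction is printed alongside them: a one-place difference can mean a thousandth of a point of WER or several hundredths.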
Then the field moved. By June 2026, the Soniox benchmark page, which republishes results from the open Pipecat STT benchmark on 1,000 real-world streaming samples, showed Ink-Whisper still respectably fast but well behind on quality:
| Model (external benchmark, June 2026) | Price per hour | Mean semantic WER | Perfect transcripts | Median time to final segment |
|---|---|---|---|---|
| Cartesia ink-2 | $0.43 | 1.47% | 84.2% | 299 ms |
| Deepgram nova-3-general | $0.55 | 1.71% | 76.5% | 247 ms |
| ElevenLabs scribe_v2_realtime | $0.39 | 3.16% | 81.3% | 281 ms |
| OpenAI gpt-4o-transcribe | n/a | 3.24% | 75.9% | 637 ms |
| Cartesia ink-whisper | n/a | 3.92% | 60.5% | 266 ms |
| Google latest-long | $0.96 | 2.84% | 69.0% | 878 ms |
A 266 ms median time to final segment is still quick. A 3.92% semantic WER and a 60.5% perfect-transcript rate, against ink-2's 1.47% and 84.2%, is not close. Cartesia's own successor had lapped it decisively, and Deepgram Nova-3 and ElevenLabs Scribe had pulled ahead on benchmarked quality too.
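To put numbers on "not close", the snapshot can be reduced to a few ratios and ranks. This is a small sketch using only the figures in the table above; the `Row` type and helper names are illustrative.

```python
from typing import NamedTuple


class Row(NamedTuple):
    wer_pct: float        # mean semantic WER, percent
    perfect_pct: float    # share of perfect transcripts, percent
    latency_ms: int       # median time to final segment


# Figures copied from the June 2026 snapshot table above.
SNAPSHOT = {
    "ink-2": Row(1.47, 84.2, 299),
    "nova-3-general": Row(1.71, 76.5, 247),
    "scribe_v2_realtime": Row(3.16, 81.3, 281),
    "gpt-4o-transcribe": Row(3.24, 75.9, 637),
    "ink-whisper": Row(3.92, 60.5, 266),
    "latest-long": Row(2.84, 69.0, 878),
}


def compare(a: str, b: str) -> dict:
    """WER ratio, perfect-transcript gap (points), latency delta (ms) of a vs. b."""
    ra, rb = SNAPSHOT[a], SNAPSHOT[b]
    return {
        "wer_ratio": ra.wer_pct / rb.wer_pct,
        "perfect_gap_points": ra.perfect_pct - rb.perfect_pct,
        "latency_delta_ms": ra.latency_ms - rb.latency_ms,
    }


def position(model: str, field: str) -> int:
    """1-based rank of `model` by ascending value of `field` (lower is better)."""
    ordered = sorted(SNAPSHOT, key=lambda name: getattr(SNAPSHOT[name], field))
    return ordered.index(model) + 1


print(compare("ink-whisper", "ink-2"))
print(position("ink-whisper", "wer_pct"), position("ink-whisper", "latency_ms"))
```

On these figures Ink-Whisper's mean semantic WER is about 2.7 times ink-2's, its perfect-transcript rate is 23.7 points lower, and its median time to final segment is 33 ms shorter. Among the six systems listed it ranks last on WER and second on latency.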
What it was built for
Cartesia's stated motivation was simple: developers were already using Whisper everywhere, but standard Whisper had the wrong performance shape for interactive dialogue. In Cartesia's words, it was fundamentally built for bulk processing rather than live conversation, and the company wanted a model tuned for voice agents that must respond naturally the moment a person stops speaking.
The intended uses in the official materials cluster tightly around production conversational systems: customer-service voice agents, structured-data capture, phone-call transcription, and enterprise voice workflows where the model has to handle dates, alphanumerics, IDs, proper nouns, and noisy telephony audio. Cartesia later generalized this into company-level solution areas covering customer service, recruiting, finance, healthcare, and code-first voice agents through Line. Nobody at Cartesia pitched Ink-Whisper as an archival transcription model, and judging it on that axis misses the design intent. The whole bet was that end-to-end conversation feel, not raw ASR quality on clean audio, was the metric voice-agent builders would actually pay for.

The team and the money behind it
Cartesia's founders met as PhD students at the Stanford AI Lab, and the company is a direct commercial descendant of the state space model research line that includes HiPPO, S4 (Efficiently Modeling Long Sequences with Structured State Spaces), and Mamba. Karan Goel is CEO and founder, Albert Gu is Chief Scientist and co-founder, and Arjun Desai and Brandon Yang are co-founders. Fortune reports that Chris Ré, the Stanford professor whose lab has spun out several major AI companies, was part of the founding team when Cartesia emerged from the lab in 2023.
| Person | Public role | Relevant connection to Ink-Whisper | Affiliation context |
|---|---|---|---|
| Karan Goel | CEO & founder | Leads Cartesia strategy and product direction | Stanford AI Lab background; coauthor on Cartesia audio/SSM research |
| Albert Gu | Chief Scientist & co-founder | Scientific leadership across Cartesia's model architecture work | Key figure in Mamba and earlier SSM research |
| Arjun Desai | Co-founder | Authored Ink launch post and public launch announcement | Stanford PhD; affiliated with Stanford AI Lab, CRFM, AIMI via personal site |
| Brandon Yang | Co-founder | Founding leadership across product/company evolution | Publicly presented as part of founding team in company/investor materials |
| Chris Ré | Founding team member | Research and Stanford lab lineage behind company formation | Stanford professor; associated with Cartesia's academic roots |
For Ink-Whisper specifically, the cleanest attribution is to Arjun Desai, who authored the official Ink launch post and announced it personally on LinkedIn. Cartesia never published a model-specific author list or named a separate Ink team, so the defensible read is that the model came out of the company's general speech effort.
The funding timeline explains the pace. TechCrunch reported a $22 million round led by Index Ventures in December 2024, bringing the total raised to $27 million. Fortune then reported a $64 million Series A in March 2025, taking total funding to $91 million, with Quora, Cresta, and Rasa among the customers at that stage. Ink-Whisper shipped three months after that Series A. There is a slight irony in the architecture choice: a company whose entire pitch is that state space models beat transformers for low-latency AI shipped its first STT model as a fine-tuned transformer, because Whisper was where the developers already were.
Pricing, licensing, and the fine print
Ink-Whisper is a proprietary API service, not an open-weight release, and the distance from its upstream inspiration is stark here. OpenAI released Whisper openly; Cartesia's terms prohibit downloading, copying, licensing, selling, or creating derivative works from models obtained through the service, except where expressly permitted.
Pricing runs on two layers, credits and subscription tiers. Current docs meter Ink-Whisper realtime STT at 1 credit per second on both /stt/websocket and /stt/turns/websocket, and batch STT at 1 credit per 2 seconds on /stt. The pricing page lists Free, Pro, Startup, Scale, and Enterprise plans: Free includes 20K credits, Pro 100K credits plus a commercial-use license, Startup 1.25M credits, and Scale 8M credits. At launch Cartesia marketed the model as "just 1 credit per second," which worked out to roughly $0.13 per hour on the Scale plan. That price was a genuine wedge; in a Reddit thread on cost-effective voice-agent stacks, commenters recommended Cartesia Ink as fast and attractively priced relative to Deepgram, with one user saying they were "a big fan" of its speed.
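The credit arithmetic is easy to check. The sketch below uses only the metering rates and plan allowances quoted above; plan dollar prices are not given here, so the per-credit price is back-derived from the quoted figure of roughly $0.13 per hour rather than taken from a price list. All names are illustrative.

```python
SECONDS_PER_HOUR = 3600

# Metering rates quoted above: realtime is 1 credit/s, batch is 1 credit per 2 s.
CREDITS_PER_SECOND = {"realtime": 1.0, "batch": 0.5}

# Included credits per plan, as listed above.
PLAN_CREDITS = {"Free": 20_000, "Pro": 100_000, "Startup": 1_250_000, "Scale": 8_000_000}


def credits_needed(audio_seconds: float, mode: str) -> float:
    """Credits consumed by `audio_seconds` of audio in the given mode."""
    return audio_seconds * CREDITS_PER_SECOND[mode]


def hours_covered(plan: str, mode: str) -> float:
    """Hours of audio a plan's included credits pay for."""
    return PLAN_CREDITS[plan] / (CREDITS_PER_SECOND[mode] * SECONDS_PER_HOUR)


def implied_usd_per_credit(usd_per_hour: float = 0.13, mode: str = "realtime") -> float:
    """Per-credit price implied by an hourly price (back-derived, not a list price)."""
    return usd_per_hour / (CREDITS_PER_SECOND[mode] * SECONDS_PER_HOUR)


for plan in PLAN_CREDITS:
    print(f"{plan}: {hours_covered(plan, 'realtime'):.1f} h realtime, "
          f"{hours_covered(plan, 'batch'):.1f} h batch")
```

An hour of realtime audio consumes 3,600 credits, so the Free plan's 20K credits cover about 5.6 hours of realtime transcription or about 11.1 hours of batch.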
Deployment options skew enterprise. Cartesia supports self-hosting in customer cloud or on-prem environments, lists Ink-Whisper as supported on Kubernetes, and says some self-hosted scenarios can run fully air-gapped with an offline license. Enterprise customers can opt into Zero Data Retention for STT and TTS inference, under which audio input and transcript output are not retained, though operational metadata still is.
The default data posture deserves more attention than it usually gets. Unless otherwise agreed, Cartesia's terms allow the company to use inputs, outputs, and user interactions to train and improve its models. Zero Data Retention (ZDR) changes retention behavior, but only on enterprise plans. If you are a regulated buyer, your privacy posture depends on your contract tier, not on anything in the default API experience.
One documentation wrinkle is worth flagging for anyone still running the model. Cartesia's endpoint-comparison page says /stt/turns/websocket supports ink-2 only, while the pricing page still lists a realtime-turns price for ink-whisper. The API reference and comparison guide are the more operationally specific sources, so the safe conclusion is that Ink-Whisper definitely works on the manual realtime and batch endpoints, while auto-turn support was either transitional or inconsistently documented at the time of review.
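One way to keep that ambiguity out of production code is to encode only what the sources agree on and fail loudly on the rest. The sketch below is a hypothetical guard, not part of any Cartesia SDK; the endpoint paths are the ones quoted in this article, and the mapping of "manual realtime" to /stt/websocket is my reading of the pricing text.

```python
from enum import Enum


class Support(Enum):
    DOCUMENTED = "documented"      # sources reviewed here agree
    CONFLICTING = "conflicting"    # sources reviewed here disagree


# Only combinations this article can speak to; anything absent is unknown.
SUPPORT = {
    ("ink-whisper", "/stt"): Support.DOCUMENTED,                   # batch
    ("ink-whisper", "/stt/websocket"): Support.DOCUMENTED,         # manual realtime
    ("ink-whisper", "/stt/turns/websocket"): Support.CONFLICTING,  # pricing vs. comparison page
    ("ink-2", "/stt/turns/websocket"): Support.DOCUMENTED,         # auto-turn
}


def pick_endpoint(model: str, realtime: bool, auto_turns: bool = False) -> str:
    """Return an endpoint path only when its support is unambiguous."""
    if auto_turns:
        path = "/stt/turns/websocket"
    elif realtime:
        path = "/stt/websocket"
    else:
        path = "/stt"
    status = SUPPORT.get((model, path))
    if status is Support.CONFLICTING:
        raise ValueError(f"{model} on {path}: sources conflict, verify before use")
    if status is None:
        raise ValueError(f"{model} on {path}: not established by the sources summarized here")
    return path


print(pick_endpoint("ink-whisper", realtime=True))   # /stt/websocket
print(pick_endpoint("ink-whisper", realtime=False))  # /stt
```

Requesting auto-turn handling with ink-whisper raises instead of silently connecting, which turns the documentation inconsistency into an explicit decision point rather than a runtime surprise.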

Adoption and how the ecosystem received it
Distribution moved fast. Launch materials named Vapi, Pipecat, and LiveKit as day-one integrations; Vapi announced same-week availability; Cartesia's docs now maintain partner pages for LiveKit, Pipecat, Tencent RTC, and Thoughtly; and ServiceNow later publicized its use of Ink-Whisper inside AI Voice Agents. By August 2025, Cartesia said tens of thousands of developers had built voice agents with Sonic and Ink over the prior year, and it launched Line, its code-first voice-agent platform, on top of that base.
The launch reception was warm inside the realtime voice-developer world. Arjun Desai's post drew builders planning immediate tests, Vapi's announcement emphasized the one-line config change and praised handling of noisy calls, and Pipecat's CEO publicly noted launch-day support. None of that is a neutral benchmark, but it does show the model landed as a real release rather than a press-release model.
The longer-run reception cooled. Beyond the June 2026 benchmark slide, practitioner feedback from the Voice AI Summit ecosystem included one builder who described Cartesia Ink as feeling like a faster Whisper-style model whose transcription accuracy, in their testing, was not yet good enough for production voice agents, especially in harder language settings. That is one anecdote, not a verdict, but it names the exact tradeoff the model embodied: aggressive realtime responsiveness with less staying power on accuracy than the newer streaming leaders.
Academically, the footprint is instrumental rather than foundational. There is no dedicated Cartesia paper for Ink-Whisper, but the model does appear as a concrete API-hosted STT component in EVA-Bench, an end-to-end voice-agent evaluation framework on arXiv. Researchers used it as a tool for comparing voice-agent stacks even though it never got a paper of its own.
The honest verdict
Ink-Whisper's limitations are easy to list because they are structural, not subtle. Transparency is thin: no model card, no parameter count, no training-data recipe beyond "Whisper variant with dynamic chunking." The model is proprietary and API-bound. The default terms permit training on customer content unless an enterprise agreement says otherwise. The docs carry at least one real inconsistency on endpoint support. And by 2026 it had been overtaken internally by ink-2 and externally by a newer generation of streaming STT models.
Set against that, the model did what it was built to do. It took the most widely deployed open ASR model in the world, fixed the specific failure mode (fixed 30-second chunking) that made it awkward for live dialogue, priced it at roughly $0.13 an hour on the Scale plan, and plugged it into every major voice-agent framework within a week of launch. Its lasting significance is strategic: it moved Cartesia from a TTS vendor into a full-stack voice company, and it gave the ink-2 generation something to be measured against. As a benchmark leader it lasted about a year. As a bridge, it held.
Sources
- Cartesia, "Introducing Ink: speech-to-text models for real-time conversation": https://cartesia.ai/blog/introducing-ink-speech-to-text/
- Pricing, Cartesia Docs: https://docs.cartesia.ai/pricing
- Cartesia, "Announcing Sonic: a low-latency voice model for lifelike speech": https://cartesia.ai/blog/sonic/
- Cartesia, Research: https://cartesia.ai/research/
- Batch Speech-to-Text, Cartesia Docs: https://docs.cartesia.ai/api-reference/stt/transcribe
- Speech-to-text benchmarks, Soniox: https://soniox.com/benchmarks
- Older Models, Cartesia Docs: https://docs.cartesia.ai/build-with-cartesia/stt-models/older-models
- Cartesia, "State of voice AI 2024": https://cartesia.ai/blog/state-of-voice-ai-2024/
- Cartesia, Company: https://cartesia.ai/company/
- Fortune, "Exclusive: Cartesia, voice AI startup, raises $64 million Series A": https://fortune.com/2025/03/11/exclusive-cartesia-voice-ai-startup-raises-64-million-series-a/
- TechCrunch, "Cartesia claims its AI is efficient enough to run pretty much anywhere": https://techcrunch.com/2024/12/12/cartesia-claims-its-ai-is-efficient-enough-to-run-pretty-much-anywhere/
- Cartesia, Terms of Service: https://cartesia.ai/legal/terms/
- Introduction, Cartesia Docs (self-hosted): https://docs.cartesia.ai/self-hosted/introduction
- Compare STT Endpoints, Cartesia Docs: https://docs.cartesia.ai/use-the-api/compare-stt-endpoints
- Arjun Desai, Ink launch announcement on LinkedIn: https://www.linkedin.com/posts/arjun-desai-4a731ba4_at-cartesia-we-are-reimagining-the-fundamental-activity-7338301026401251329-qHi6
- "Need Suggestions & Advice - Best Stack for Cost Effective Voice Agent", r/AI_Agents: https://www.reddit.com/r/AI_Agents/comments/1n5uk8l/need_suggestions_advice_best_stack_for_cost/