Ink-Whisper: model profile

Ink-Whisper is a proprietary speech-to-text model from Cartesia, derived from OpenAI's whisper-large-v3-turbo and adapted for real-time conversational transcription.

Specifications

Developer	Cartesia
Released	June 10, 2025 (announcement); stable snapshot ink-whisper-2025-06-04 dated June 4, 2025
Model type	Speech-to-text; variant of OpenAI whisper-large-v3-turbo with dynamic chunking
Training data	Not publicly disclosed beyond the Whisper baseline; the original Whisper paper describes an encoder-decoder Transformer trained on 680,000 hours of multilingual, multitask supervision
Languages	100 supported languages (ISO-639-1 codes)
Modes (batch / streaming)	Batch (/stt) and realtime streaming (/stt/websocket)
Latency	Vendor-reported: 66 ms median time-to-complete-transcript, or TTCT (June 2025). Third-party evaluation: 266 ms median time to final segment (June 2026)
Deployment	Hosted API; self-hosting in customer cloud or on-prem, including Kubernetes and air-gapped installs with an offline license
Pricing	1 credit per second realtime, 1 credit per 2 seconds batch; marketed at launch as about $0.13/hour on the Scale plan
License	Proprietary API service; terms prohibit downloading, copying, licensing, selling, or creating derivative works from models obtained through the service, except where expressly permitted

Not disclosed: Parameters · Throughput / concurrency

Known limitations

Limited technical transparency: no public model card, no disclosed parameter count, and no training-data description beyond the "Whisper variant" characterization with dynamic chunking. Fine-tuning regime, dataset composition, safety filtering, and evaluation methodology details are unspecified in the reviewed public sources.
Proprietary and API-bound: terms prohibit downloading, copying, licensing, selling, or creating derivative works from models obtained through the service, except where expressly permitted.
Default terms allow Cartesia to use customer inputs, outputs, and interactions to train and improve its models unless otherwise agreed; Zero Data Retention is restricted to enterprise plans.
Documentation inconsistency around /stt/turns/websocket support: the endpoint-comparison page lists ink-2 only, while the pricing page lists a realtime-turns price for ink-whisper.
Vendor-reported WER slices were mixed at launch: Ink-Whisper led on background-noise and disfluency tests but not on phone calls, proper nouns, or the accent subset.
In the June 2026 public Pipecat benchmark snapshot, Ink-Whisper recorded a 3.92% mean semantic WER and 60.5% perfect-transcript rate, behind ink-2, Deepgram Nova-3, and ElevenLabs Scribe on those measures.
Superseded internally by ink-2, released May 22, 2026 and marketed as faster and more accurate with native turn detection.

The following fields are not publicly disclosed in the sources reviewed: parameter count, training corpus size and provenance, annotation pipeline, model size, exact checkpointing flow, full decoding strategy, batching policy, endpointing internals, confidence scores and token-level posteriors, per-language quality breakdowns, and throughput or concurrency figures.

Full technical breakdown

Overview

Ink-Whisper was Cartesia's first publicly launched speech-to-text model in the Ink family, announced on June 10, 2025. Cartesia's documentation lists the stable snapshot as ink-whisper-2025-06-04. The model is described by Cartesia as a variant of OpenAI's whisper-large-v3-turbo, with the key disclosed modification being dynamic chunking, which lets the model process semantically meaningful variable-length audio fragments instead of relying on Whisper's native 30-second chunking behavior.

Cartesia positioned Ink-Whisper as the input layer for production voice agents rather than as a generic transcription model for archival audio. The company's stated motivation was that standard Whisper was built for bulk processing rather than live dialogue, and that voice AI agents need to respond naturally after a person stops speaking. Launch messaging centered on time-to-complete-transcript (TTCT) rather than classic throughput.

On May 22, 2026, Cartesia released ink-2, and its documentation moved Ink-Whisper into the "Older Models" section while keeping it marked stable.

Capabilities and features

Real-time streaming transcription optimized for conversational voice agents, with TTCT as the primary latency measure.
Dynamic chunking: processing of shorter, semantically meaningful segments instead of fixed 30-second chunks. Cartesia states this reduces errors and hallucinations during silence and fragmented real-time audio.
Batch transcription through the /stt endpoint.
Support for common audio containers: WAV, MP3, FLAC, OGG, M4A, MP4, and WebM.
Optional word-level timestamps. Outputs include transcript text, language, and duration.
Intended use cases documented in official materials: customer-service voice agents, structured-data capture, phone-call transcription, and enterprise voice workflows involving dates, alphanumerics, IDs, proper nouns, and noisy telephony conditions.
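Cartesia has not published how dynamic chunking is implemented. As a rough illustration of the general idea only, the sketch below cuts audio at pauses rather than at fixed 30-second windows, using a simple energy threshold. The function names, the threshold, the pause length, and the energy measure are all assumptions made for this example and should not be read as Cartesia's method.

```python
from typing import List, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean absolute amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(abs(x) for x in frame) / len(frame))
    return energies


def dynamic_chunks(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 20,              # assumed analysis frame size
    silence_threshold: float = 0.01,  # assumed energy floor for "speech"
    min_silence_ms: int = 200,       # assumed pause length that ends a chunk
    max_chunk_s: float = 30.0,       # hard cap, mirroring Whisper's 30 s window
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of variable-length speech chunks."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_silence_frames = max(1, min_silence_ms // frame_ms)
    max_chunk_frames = int(max_chunk_s * 1000 / frame_ms)
    energies = frame_energies(samples, frame_len)

    chunks: List[Tuple[int, int]] = []
    chunk_start = None  # frame index where the current chunk began
    silent_run = 0      # consecutive silent frames inside the current chunk
    for i, energy in enumerate(energies):
        is_speech = energy >= silence_threshold
        if chunk_start is None:
            if is_speech:  # skip leading silence, open a chunk on speech
                chunk_start = i
                silent_run = 0
            continue
        silent_run = 0 if is_speech else silent_run + 1
        length = i - chunk_start + 1
        if silent_run >= min_silence_frames or length >= max_chunk_frames:
            end = i + 1 - silent_run  # drop the trailing silence
            chunks.append((chunk_start * frame_len,
                           min(end * frame_len, len(samples))))
            chunk_start = None
            silent_run = 0
    if chunk_start is not None:  # flush a chunk still open at end of audio
        end = len(energies) - silent_run
        chunks.append((chunk_start * frame_len,
                       min(end * frame_len, len(samples))))
    return chunks
```

Each returned span could then be transcribed on its own, so silence between spans is never passed to the recognizer. A production system would more likely use a trained voice-activity detector than a fixed amplitude threshold.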

Confidence scores and token-level posteriors are not listed among the documented outputs.

Language support

Cartesia's documentation lists 100 supported languages for ink-whisper, including English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Hindi, Arabic, Hebrew, Thai, Welsh, Bengali, Swahili, Yiddish, Javanese, Sundanese, and Cantonese (yue). Per-language quality breakdowns and language-specific fine-tuning details are not publicly disclosed.

Performance and benchmarks

Vendor-reported: launch benchmark (June 2025)

Cartesia's launch benchmark reported a 66 ms median TTCT for Ink-Whisper, compared with 74 ms for Deepgram Nova3 Streaming, 70 ms for Fireworks Whisper Streaming, and 737 ms for AssemblyAI Universal Streaming.

Cartesia's launch-time WER slices (lower is better):

Test slice (official launch-time WER)	Ink-Whisper	Deepgram Nova3 Streaming	Fireworks Whisper Streaming	AssemblyAI Universal Streaming
Phone calls	0.19	0.18	0.28	0.23
Proper nouns	0.065	0.045	0.071	0.044
Background noises	0.033	0.038	0.099	0.027
Disfluencies	0.064	0.055	0.156	0.137
Speech Accent Archive subset	0.015	0.024	0.014	0.016

In these vendor-reported slices, Ink-Whisper posted the lowest error rates on the background-noise and disfluency tests among the compared streaming systems, and was competitive rather than lowest on phone calls, proper nouns, and accent data. Cartesia also states that Ink-Whisper outperformed baseline whisper-large-v3-turbo, but the launch post does not publish the baseline's row-by-row scores.
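For readers unfamiliar with the metric, classic word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below is a minimal implementation of that standard definition. The text normalization Cartesia applied is not described in the sources reviewed, so the lowercasing and whitespace tokenization here are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Classic WER: word-level Levenshtein distance / reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Read this way, a table value of 0.19 would correspond to roughly 19 errors per 100 reference words.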

Third-party evaluation: public Pipecat benchmark as republished by Soniox (June 2026)

The Soniox benchmark page republishes results from the open Pipecat STT benchmark, which uses 1,000 real-world streaming samples and reports semantic WER plus time to final segment.

Model (external benchmark snapshot, June 2026)	Price per hour	Mean semantic WER	Perfect transcripts	Median time to final segment
Cartesia ink-2	$0.43	1.47%	84.2%	299 ms
Deepgram nova-3-general	$0.55	1.71%	76.5%	247 ms
ElevenLabs scribe_v2_realtime	$0.39	3.16%	81.3%	281 ms
OpenAI gpt-4o-transcribe	Not listed	3.24%	75.9%	637 ms
Cartesia ink-whisper	Not listed	3.92%	60.5%	266 ms
Google latest-long	$0.96	2.84%	69.0%	878 ms

In this June 2026 snapshot, Ink-Whisper recorded a 266 ms median time to final segment, a 3.92% mean semantic WER, and a 60.5% perfect-transcript rate, behind newer systems including Cartesia ink-2, Deepgram Nova-3, and ElevenLabs Scribe on the accuracy measures.

Research usage

There is no dedicated Cartesia paper for Ink-Whisper in the sources reviewed. The model is included as an API-hosted STT component in EVA-Bench, an end-to-end voice-agent evaluation framework on arXiv.

Latency and throughput

Vendor-reported: 66 ms median TTCT at launch (June 2025), against 74 ms for Deepgram Nova3 Streaming, 70 ms for Fireworks Whisper Streaming, and 737 ms for AssemblyAI Universal Streaming.

Third-party evaluation: 266 ms median time to final segment in the public Pipecat benchmark as of June 2026.

Throughput and concurrency figures are not publicly disclosed.
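As a sketch of how a median latency figure of this kind can be computed, the example below assumes TTCT is measured per utterance from the moment the speaker stops talking to the moment the complete transcript arrives. That definition, the function names, and the sample timestamps are assumptions for illustration. They are not Cartesia's or Pipecat's measurement harness, and the numbers are not measured results.

```python
from statistics import median
from typing import Iterable, Tuple


def ttct_ms(speech_end_ms: int, final_transcript_ms: int) -> int:
    """Latency for one utterance, in milliseconds."""
    if final_transcript_ms < speech_end_ms:
        raise ValueError("transcript cannot arrive before speech ends")
    return final_transcript_ms - speech_end_ms


def median_ttct_ms(events: Iterable[Tuple[int, int]]) -> float:
    """Median latency over (speech_end_ms, final_transcript_ms) pairs."""
    return median(ttct_ms(end, final) for end, final in events)


if __name__ == "__main__":
    # Hypothetical timestamps, in ms since the start of a session.
    events = [(1000, 1058), (2500, 2591), (4000, 4070), (6200, 6264)]
    print(median_ttct_ms(events))
```

The median is less sensitive than the mean to occasional slow responses, which is one reason latency comparisons are often reported this way.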

Deployment and integrations

Hosted API endpoints: batch transcription on /stt and realtime streaming on /stt/websocket.
Self-hosting in customer cloud or on-prem environments; "Ink Whisper" is listed as supported on Kubernetes; some self-hosted scenarios can run air-gapped with an offline license.
Enterprise customers can opt into Zero Data Retention (ZDR) for STT and TTS inference, under which audio input and transcript output are not retained, although operational metadata still is.
Integrations at or after launch: Vapi (same-week availability), Pipecat, and LiveKit; Cartesia's docs maintain partner pages for LiveKit, Pipecat, Tencent RTC, and Thoughtly; ServiceNow later publicized use of Ink-Whisper in its AI Voice Agents stack.
By August 2025, Cartesia said tens of thousands of developers had built voice agents with Sonic and Ink over the prior year, before launching Line as a code-first voice-agent platform.

Documentation inconsistency: Cartesia's endpoint-comparison page says /stt/turns/websocket supports ink-2 only, while the pricing page lists a realtime-turns price for ink-whisper. Taken together, the documentation indicates that Ink-Whisper is supported on the manual realtime and batch endpoints, while its auto-turn endpoint support was either transitional or inconsistently documented at the time of review.
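One cautious way to encode this support picture in client code is a small lookup that refuses the inconsistently documented endpoint unless the caller opts in. The helper name, the mode labels, and the status strings below are hypothetical. Only the three endpoint paths come from the documentation described above.

```python
# Support status of ink-whisper per endpoint, as summarized in the text above.
INK_WHISPER_ENDPOINTS = {
    "batch": ("/stt", "documented"),
    "realtime": ("/stt/websocket", "documented"),
    # Comparison page lists ink-2 only; pricing page lists an ink-whisper price.
    "realtime-turns": ("/stt/turns/websocket", "inconsistent"),
}


def ink_whisper_path(mode: str, allow_inconsistent: bool = False) -> str:
    """Return the endpoint path for a mode, guarding the uncertain one."""
    try:
        path, status = INK_WHISPER_ENDPOINTS[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    if status != "documented" and not allow_inconsistent:
        raise ValueError(
            f"{path} support for ink-whisper is inconsistently documented"
        )
    return path
```

Failing closed by default keeps an application from depending on behavior that the vendor documentation does not clearly promise.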

Pricing

Pricing has two layers: credits and subscription tiers.

Item	Rate
Realtime STT (/stt/websocket and /stt/turns/websocket)	1 credit per second
Batch STT (/stt)	1 credit per 2 seconds

Subscription plans on the public pricing page: Free (20K credits), Pro (100K credits plus a commercial-use license), Startup (1.25M credits), Scale (8M credits), and Enterprise. At launch, Cartesia marketed Ink-Whisper as "just 1 credit per second," or about $0.13/hour on the Scale plan.
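The credit arithmetic can be worked through directly. The sketch below uses only the rates and plan allowances quoted above. The function names are illustrative, and plan prices are deliberately left out because they are not stated here.

```python
# Credits consumed per second of audio, from the rate table above.
CREDITS_PER_SECOND = {"realtime": 1.0, "batch": 0.5}

# Credit allowances quoted for each public plan.
PLAN_CREDITS = {
    "Free": 20_000,
    "Pro": 100_000,
    "Startup": 1_250_000,
    "Scale": 8_000_000,
}


def credits_for(audio_seconds: float, mode: str) -> float:
    """Credits consumed by a given amount of audio."""
    return audio_seconds * CREDITS_PER_SECOND[mode]


def hours_covered(plan: str, mode: str) -> float:
    """Hours of audio a plan's credit allowance would cover."""
    return PLAN_CREDITS[plan] / CREDITS_PER_SECOND[mode] / 3600


def implied_usd_per_credit(usd_per_realtime_hour: float) -> float:
    """Per-credit price implied by an hourly realtime price."""
    return usd_per_realtime_hour / credits_for(3600, "realtime")


if __name__ == "__main__":
    print(round(hours_covered("Scale", "realtime"), 1))  # hours of realtime audio
    print(round(hours_covered("Scale", "batch"), 1))     # hours of batch audio
    print(implied_usd_per_credit(0.13))                  # USD per credit
```

One hour of realtime audio costs 3,600 credits, so the Scale allowance covers about 2,222 hours of realtime transcription or twice that in batch mode, and the quoted $0.13/hour works out to roughly $0.000036 per credit.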

Default data-rights terms: unless otherwise agreed, Cartesia's terms allow the company to use inputs, outputs, and user interactions to train and improve its models. ZDR changes retention behavior but is restricted to enterprise plans.

Development and ownership

Ink-Whisper was developed by Cartesia, founded in 2023 out of the Stanford AI Lab. The company states its mission as "architecting AI that learns and interacts like humans" and positions state space models (SSMs) as its architectural approach; its research lineage includes HiPPO, Efficiently Modeling Long Sequences with Structured State Spaces, and Mamba. Cartesia's product line comprises Sonic (text-to-speech), Ink (speech-to-text), and Line (voice agents), with additional capabilities including voice cloning, voice changing, and self-hosted deployments.

Person	Public role	Relevant connection to Ink-Whisper	Affiliation context
Karan Goel	CEO & founder	Leads Cartesia strategy and product direction	Stanford AI Lab background; coauthor on Cartesia audio/SSM research
Albert Gu	Chief Scientist & co-founder	Scientific leadership across Cartesia's model architecture work	Key figure in Mamba and earlier SSM research
Arjun Desai	Co-founder	Authored Ink launch post and public launch announcement	Stanford PhD; affiliated with Stanford AI Lab, CRFM, AIMI via personal site
Brandon Yang	Co-founder	Founding leadership across product/company evolution	Publicly presented as part of founding team in company/investor materials
Chris Ré	Founding team member	Research and Stanford lab lineage behind company formation	Stanford professor; associated with Cartesia's academic roots

Cartesia does not publicly identify a named "Ink-Whisper lab" or publish a model-specific author list; the launch post and public launch announcement are attributed to Arjun Desai.

Funding: TechCrunch reported a $22 million round led by Index Ventures in December 2024, bringing total raised to $27 million. Fortune reported a $64 million Series A in March 2025, bringing total funding to $91 million, and said Cartesia counted Quora, Cresta, and Rasa among its customers at that stage.

Release history

Release item	Public date	Status as of 2026-06-15	Publicly documented notes
Ink family announcement	2025-06-10	Historical milestone	Ink introduced as Cartesia's STT family; Ink-Whisper was the debut model
ink-whisper-2025-06-04	2025-06-04	Stable, older model	Docs list one stable snapshot; 100 languages; positioned as most affordable Ink model
ink-whisper alias	2025-06-10 onward	Active model family alias in APIs/docs	Used in batch and manual realtime docs
ink-2	2026-05-22	Stable flagship	Successor marketed as faster and more accurate, with native turn detection

The stable snapshot ink-whisper-2025-06-04 predates the June 10, 2025 announcement by six days. After the May 22, 2026 release of ink-2, Cartesia's documentation lists Ink-Whisper under "Older Models" while keeping it marked stable.

Sources

Cartesia, "Introducing Ink: speech-to-text models for real-time conversation" - https://cartesia.ai/blog/introducing-ink-speech-to-text/
Cartesia Docs, "Pricing" - https://docs.cartesia.ai/pricing
Cartesia, "Announcing Sonic: a low-latency voice model for lifelike..." - https://cartesia.ai/blog/sonic/
Cartesia, "Research" - https://cartesia.ai/research/
Cartesia Docs, "Batch Speech-to-Text" - https://docs.cartesia.ai/api-reference/stt/transcribe
Soniox, "Speech-to-text benchmarks" - https://soniox.com/benchmarks
Cartesia Docs, "Older Models" - https://docs.cartesia.ai/build-with-cartesia/stt-models/older-models
Cartesia, "State of voice AI 2024" - https://cartesia.ai/blog/state-of-voice-ai-2024/
Cartesia, "Company" - https://cartesia.ai/company/
Fortune, "Exclusive: Cartesia, voice AI startup, raises $64 million Series A" - https://fortune.com/2025/03/11/exclusive-cartesia-voice-ai-startup-raises-64-million-series-a/
TechCrunch, "Cartesia claims its AI is efficient enough to run pretty much anywhere" - https://techcrunch.com/2024/12/12/cartesia-claims-its-ai-is-efficient-enough-to-run-pretty-much-anywhere/
Cartesia, "Terms of Service" - https://cartesia.ai/legal/terms/
Cartesia Docs, "Introduction" (self-hosted) - https://docs.cartesia.ai/self-hosted/introduction
Cartesia Docs, "Compare STT Endpoints" - https://docs.cartesia.ai/use-the-api/compare-stt-endpoints
Arjun Desai, Ink launch announcement, LinkedIn - https://www.linkedin.com/posts/arjun-desai-4a731ba4_at-cartesia-we-are-reimagining-the-fundamental-activity-7338301026401251329-qHi6
Reddit, r/AI_Agents, "Need Suggestions & Advice - Best Stack for Cost Effective Voice Agent" - https://www.reddit.com/r/AI_Agents/comments/1n5uk8l/need_suggestions_advice_best_stack_for_cost/