Google Cloud Chirp 3: capabilities, costs, and where it actually wins

The first thing to get straight about Google Cloud Chirp 3 is that it isn't one product. It's a brand stretched across three surfaces: Chirp 3 Transcription inside Speech-to-Text V2, Chirp 3 HD voices inside Cloud Text-to-Speech, and Chirp 3 Instant Custom Voice for rapid voice cloning. Google is aiming all three at the same two enterprise jobs: better multilingual transcription, and more realistic, lower-latency synthetic speech for agents, support operations, and media workflows. If you're evaluating "Chirp 3," you're really evaluating one of those three things, and the deployment math differs for each.

Where Chirp came from

Chirp doesn't have a founder story the way a startup does. Its public lineage starts with Google Research's Universal Speech Model work, which Google described in March 2023 as a major step toward its 1,000 Languages Initiative. Yu Zhang and James Qin presented USM on the research blog as a family of large speech models; the paper behind it lists 27 authors and describes a 2B-parameter multilingual ASR system trained on 12 million hours of speech and 28 billion sentences of text spanning 300+ languages. That origin matters. Chirp was built to push recognition quality beyond high-resource languages, not to shave a point off English dictation error rates.

Google Cloud productized the research in May 2023, when Cloud Speech product manager Calum Barnes introduced Chirp as a new foundation model for the Google Cloud Speech API: a 2B-parameter model built through self-supervised training on millions of hours of audio and 28 billion sentences across 100+ languages. The stated goal was to bring large-speech-model quality to enterprise APIs, especially for languages and accents that never had enough labeled data.

The Chirp 3 story proper is mostly a 2025 refresh. Per Google's Text-to-Speech release notes, Journey voices were rebranded as Chirp HD voices in February 2025, Chirp HD expanded to 31 locales in March 2025, and on April 2, 2025, Chirp 3 HD voices went GA with eight speakers and both real-time and batch support. On the transcription side, the Speech-to-Text release notes introduced chirp_3 in public preview with 85+ languages and locales, alongside StreamingRecognize, Recognize, speaker diarization, and language-agnostic transcription. At Google Cloud Next 2025, Google also announced Instant Custom Voice, which builds a custom voice from 10 seconds of audio.

One brand, three products

On the transcription side, Google describes Chirp 3: Transcription as the latest generation of its multilingual ASR-specific generative models. It's available only in Speech-to-Text API V2, under the model identifier chirp_3. The product page says the model was trained on millions of hours of audio and billions of text sentences to improve recognition across more languages and accents.

On the synthesis side, the docs describe Chirp 3 HD voices as the latest generation of Text-to-Speech, powered by Google's newest generative models for greater realism and emotional resonance. Instant Custom Voice sits alongside the HD voices as a cloning capability inside Cloud TTS rather than a separate product.

This split matters more than it sounds. A team evaluating Chirp 3 for contact-center transcription is looking at a different service surface than a team evaluating it for voice output or brand voice cloning. Same research lineage, same branding, but different APIs, different limits, different prices, and different regional footprints.

What the transcription model can do

Google maps Chirp 3 transcription to three modes. Speech.Recognize handles audio under one minute. Speech.BatchRecognize handles long audio, documented as generally good for 1 minute to 1 hour, though quotas allow longer. Speech.StreamingRecognize handles real-time streams. The quota page adds that a BatchRecognize request can contain up to 15 files, each up to 8 hours long, and that long-audio batch requests only accept input as a Cloud Storage URI. Note the gap here: the model page describes the recommended sweet spot, while the quota page describes the hard envelope. They are not the same numbers.
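
To make the routing concrete, here is a minimal Python sketch of how an application might pick a method per audio source and sanity-check a batch before submitting it. The function names and the problem-list return shape are hypothetical; only the numeric limits come from the figures quoted above, and they should be re-checked against the current quota page.

```python
# Limits quoted in the text above; re-verify against the current docs.
SYNC_MAX_SECONDS = 60             # Recognize: audio under one minute
BATCH_MAX_FILES = 15              # BatchRecognize: files per request
BATCH_MAX_SECONDS = 8 * 60 * 60   # BatchRecognize: hard cap per file


def pick_method(duration_seconds: float, live: bool = False) -> str:
    """Return the Speech-to-Text V2 method suited to one audio source."""
    if live:
        return "StreamingRecognize"
    if duration_seconds < SYNC_MAX_SECONDS:
        return "Recognize"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "BatchRecognize"
    raise ValueError("audio exceeds the 8-hour per-file limit; split it first")


def batch_problems(files: list[tuple[str, float]]) -> list[str]:
    """List reasons a batch of (uri, duration_seconds) pairs would be rejected."""
    problems = []
    if len(files) > BATCH_MAX_FILES:
        problems.append(f"{len(files)} files exceeds the limit of {BATCH_MAX_FILES}")
    for uri, duration in files:
        if not uri.startswith("gs://"):
            problems.append(f"{uri}: batch input must be a Cloud Storage URI")
        if duration > BATCH_MAX_SECONDS:
            problems.append(f"{uri}: longer than 8 hours")
    return problems
```

Returning a list of problems rather than raising on the first one is a design choice for batch jobs, where you usually want to report every bad file in a single pass.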

Feature coverage is solid. Chirp 3 supports automatic punctuation, automatic capitalization, utterance-level timestamps in streaming, speaker diarization in batch, speech adaptation for vocabulary biasing, language-agnostic audio transcription, and a custom prompt feature in preview. Two caveats deserve as much attention as the feature list. Word-level timestamps are supported only in Recognize and BatchRecognize and can degrade transcription quality. And the word-level confidence scores the API returns are not true confidence scores in the conventional sense. If you're building QA pipelines or compliance workflows on top of those numbers, read that fine print twice.

For live systems, the documentation puts unusual weight on endpointing and voice activity. Chirp 3 exposes three endpointing_sensitivity levels (standard, short, and supershort) that trade accuracy against latency. Google explicitly warns that cranking up sensitivity for snappier bot responses can cause the system to "cut off" users who pause briefly mid-sentence. Separately, Speech-to-Text V2 can emit voice activity begin and end events, and can auto-close streams based on speech begin or speech end timeouts. Those timeouts must be greater than 500 ms and less than 60 s.
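
Because the timeout window is exclusive at both ends, it is worth validating values before they reach the API. The sketch below assumes a hypothetical VoiceActivityTimeouts container of our own; it is not the client library's type, and the only facts taken from the text are the 500 ms and 60 s bounds.

```python
from dataclasses import dataclass
from datetime import timedelta

# Bounds quoted above: strictly greater than 500 ms, strictly less than 60 s.
MIN_TIMEOUT = timedelta(milliseconds=500)
MAX_TIMEOUT = timedelta(seconds=60)


def is_valid_timeout(value: timedelta) -> bool:
    """True when the value sits strictly inside the documented window."""
    return MIN_TIMEOUT < value < MAX_TIMEOUT


@dataclass(frozen=True)
class VoiceActivityTimeouts:
    """Hypothetical holder for the speech-begin and speech-end timeouts."""
    speech_start: timedelta
    speech_end: timedelta

    def __post_init__(self) -> None:
        # Reject bad values at construction time, before any stream opens.
        for name in ("speech_start", "speech_end"):
            if not is_valid_timeout(getattr(self, name)):
                raise ValueError(f"{name} must be > 500 ms and < 60 s")
```

Failing at construction time keeps a misconfigured timeout from surfacing as a rejected stream in the middle of a live call.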

The streaming limits shape your architecture whether you like it or not. Each streaming request chunk is capped at 25 KB of audio, and a streaming session stays open for at most 5 minutes, after which the application has to rotate to a new stream or adopt Google's documented endless-streaming pattern. The voice-activity timeout page adds an engineering wrinkle: if you depend on precise timeout behavior, very large chunks reduce measurement accuracy, and it cites a 15,360-byte per-request chunk size in that context. So a production voice stack cannot hold one socket open forever and dump arbitrary buffers into it. Plan for stream rotation from day one.
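
A rough sketch of what "plan for rotation" means in code: slice the audio into chunks that respect the per-request cap, then group the chunks into sessions that each stay under the five-minute limit. Everything here is illustrative. The helper names and the 10-second safety margin are assumptions, 25 KB is read conservatively as 25,000 bytes, and the grouping assumes audio is sent in real time so that audio duration tracks wall-clock session time.

```python
from typing import Iterator

MAX_CHUNK_BYTES = 25_000          # "25 KB" cap, read conservatively
TIMEOUT_FRIENDLY_CHUNK = 15_360   # chunk size cited on the timeout page
MAX_STREAM_SECONDS = 5 * 60       # session limit quoted above


def chunk_audio(pcm: bytes, chunk_bytes: int = TIMEOUT_FRIENDLY_CHUNK) -> Iterator[bytes]:
    """Yield consecutive slices of raw audio no larger than chunk_bytes."""
    if not 0 < chunk_bytes <= MAX_CHUNK_BYTES:
        raise ValueError("chunk_bytes must be between 1 and MAX_CHUNK_BYTES")
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def plan_streams(
    pcm: bytes,
    bytes_per_second: int,
    chunk_bytes: int = TIMEOUT_FRIENDLY_CHUNK,
    safety_margin_seconds: float = 10.0,  # assumed headroom before the limit
) -> list[list[bytes]]:
    """Group chunks into sessions that each carry under five minutes of audio."""
    budget = (MAX_STREAM_SECONDS - safety_margin_seconds) * bytes_per_second
    sessions: list[list[bytes]] = []
    current: list[bytes] = []
    used = 0
    for chunk in chunk_audio(pcm, chunk_bytes):
        # Rotate to a new session before this chunk would exceed the budget.
        if current and used + len(chunk) > budget:
            sessions.append(current)
            current, used = [], 0
        current.append(chunk)
        used += len(chunk)
    if current:
        sessions.append(current)
    return sessions
```

In a live system the same rotation decision would be driven by a clock rather than a byte count, and you would likely cut at a detected pause so an utterance is not split across two sessions.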

One more thing to verify before you commit: regions. The Chirp 3 model page currently lists only the us and eu multi-regions as GA, while Google's broader STT V2 migration material talks about regionalized invocation in places like Belgium and Singapore. Google does regionalize, but model-specific availability and platform-wide residency claims are separate questions. Check the model page, not the marketing page.

HD voices and Instant Custom Voice

Google positions Chirp 3 HD voices as conversationally stronger than its older catalog. The HD voices page credits generative models for realism and emotional resonance, and the supported-voices page says they suit conversational agents, cover 30 distinct styles across many languages, and support low-latency real-time communication through text streaming. The bidirectional streaming quickstart goes further: streaming Cloud TTS is only compatible with Chirp 3 HD voices. That single line makes Chirp 3 HD the mandatory Google TTS surface for real-time voice agents, not just a nicer narrator.

Regional availability for HD voices currently spans global, us, eu, asia-southeast1, europe-west2, and asia-northeast1. The default output is LINEAR16, with ALAW, MULAW, OGG_OPUS, and PCM supported for streaming, and ALAW, MULAW, MP3, OGG_OPUS, and PCM for batch. The mu-law and A-law support means Chirp 3 HD works for telephony pipelines, not just high-quality app audio.

Instant Custom Voice is the most commercially interesting piece, because it collapses the traditional custom-voice project into something very light. Access is allowlist only. The feature creates personalized voice models and synthesizes speech through the Cloud TTS API for both streaming and long-form text. Google requires two recordings, a consent recording and a reference audio recording, each up to 10 seconds, and the consent statement must use Google's required script rather than a customer-written one. The output artifact is a voice cloning key, which Google says is stored client side and can be reused across multiple clients or devices.

Google's April 2025 blog frames the business intent. Instant Custom Voice became generally available through an allowlist, produces realistic custom voices from 10 seconds of input, and targets personalized call centers, accessible content, and brand voices. The same post says Google applies built-in safety features and a rigorous diligence process to verify voice-use permissions. For enterprise procurement, that's the signal: Google treats voice cloning as a governed, permissioned capability, not a self-serve commodity API. Whether that's a feature or friction depends on which side of the compliance desk you sit on.

Language coverage, latency, and controls

For transcription, the headline is simple: Chirp 3 launched publicly with 85+ languages and locales. Diarization covers a smaller set. The model page lists diarization support for en-US, en-GB, en-IN, fr-FR, fr-CA, es-ES, es-US, de-DE, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and Simplified Chinese. Anyone planning global transcription should track basic coverage and advanced-feature coverage as two separate columns in the spreadsheet.

For synthesis, Google's own numbers wobble between counting languages and counting locales. The April 2025 blog said Chirp 3 HD offered natural speech in over 35 languages with eight speaker options; the release notes described the GA catalog as 8 speakers and 31 locales. Both are probably accurate under their own counting rules. For procurement and QA, the actionable unit is the exact locale and voice ID, not the marketing count.

On latency, Google gives you controls rather than a published millisecond benchmark. On STT, endpointing sensitivity governs how quickly the model finalizes utterances, and V2 streaming can fire voice activity events before the corresponding transcription result arrives, which is useful for barge-in logic and end-of-turn orchestration. On TTS, Chirp 3 HD supports text streaming, so applications send text incrementally and receive audio incrementally. Compared with old batch-style TTS usage, that's a real step toward natural conversational loops, even if Google won't put a number on it.

The TTS control surface is deeper than a pick-a-voice API. Google documents pace control from 0.25x to 2.0x, pause control as an experimental feature, and custom pronunciations using IPA or X-SAMPA. SSML is supported in preview for synchronous Chirp 3 HD requests but not for streaming requests, which is an easy trap if you prototype synchronously and then move to streaming. For long content, asynchronous long-form synthesis handles up to 1 million bytes of input.
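
One way to avoid the synchronous-to-streaming trap is to check requests against these documented constraints before sending them. The following is a minimal sketch with a hypothetical SynthesisOptions container; it models only the pace range and the SSML restriction described above and does not mirror the real API's request message.

```python
from dataclasses import dataclass

MIN_PACE, MAX_PACE = 0.25, 2.0   # pace range quoted above


@dataclass(frozen=True)
class SynthesisOptions:
    """Hypothetical summary of the knobs a Chirp 3 HD request uses."""
    pace: float = 1.0
    uses_ssml: bool = False
    streaming: bool = False


def option_problems(options: SynthesisOptions) -> list[str]:
    """Return human-readable reasons the options conflict with the docs."""
    problems = []
    if not MIN_PACE <= options.pace <= MAX_PACE:
        problems.append(f"pace {options.pace} is outside {MIN_PACE}x to {MAX_PACE}x")
    if options.uses_ssml and options.streaming:
        problems.append("SSML is not supported for streaming requests")
    return problems
```

Running a check like this in CI against your prompt templates catches SSML that worked in a synchronous prototype before it reaches a streaming deployment.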

The quota sheet is worth memorizing before capacity planning. Standard synthesis caps content at 5,000 bytes per request. Projects get 200 Chirp 3 requests per minute, 100 concurrent streaming sessions, 30 Chirp voice cloning requests per minute, and 10 voice-cloning-key generations per minute. Instant Custom Voice also supports language transfer from en-US to de-DE, es-US, es-ES, fr-CA, fr-FR, and pt-BR, which is a quietly big deal for branded multilingual agents: one recorded voice, several markets.
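
The 5,000-byte cap is measured in bytes, not characters, which matters for accented and non-Latin text. Here is a small sketch, assuming UTF-8 encoding and splitting on whitespace, of how text might be divided into request-sized pieces and how the 200-requests-per-minute quota translates into a minimum wall-clock time. Both function names are hypothetical, and a production splitter would prefer sentence boundaries so prosody is not broken mid-phrase.

```python
import math

MAX_REQUEST_BYTES = 5_000     # standard synthesis cap quoted above
REQUESTS_PER_MINUTE = 200     # per-project Chirp 3 quota quoted above


def split_for_synthesis(text: str, limit: int = MAX_REQUEST_BYTES) -> list[str]:
    """Split text on whitespace into pieces whose UTF-8 size fits the limit."""
    pieces: list[str] = []
    current = ""
    for word in text.split():
        if len(word.encode("utf-8")) > limit:
            raise ValueError("a single word exceeds the byte limit")
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= limit:
            current = candidate
        else:
            # Adding this word would overflow, so close the current piece.
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def min_minutes_at_quota(request_count: int, per_minute: int = REQUESTS_PER_MINUTE) -> int:
    """Lower bound on minutes needed to send request_count requests."""
    return math.ceil(request_count / per_minute)
```

Note that this sketch collapses runs of whitespace into single spaces, which is usually harmless for speech but would not preserve formatting.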

What it costs

Google's base STT economics are aggressive. Speech-to-Text V2 standard recognition runs $0.016 per minute for the first 500,000 minutes per month, then tiers down to $0.01, $0.008, and $0.004 at higher volumes. Dynamic batch recognition for standard models, explicitly including chirp, drops to $0.003 per minute if you can tolerate lower-urgency processing. On the TTS side, Chirp 3 HD voices cost $30 per 1 million characters and Instant Custom Voice costs $60 per 1 million characters.

Here's how that plays out in four realistic configurations, at list price and excluding storage, networking, and any LLM costs layered on top.

Scenario	STT volume	TTS volume	Approx. monthly bill
Startup meeting copilot	10,000 min standard STT	1M chars Chirp 3 HD	~$190 ($160 STT + $30 TTS)
Real-time support center	250,000 min real-time STT	5M chars HD	~$4,150 ($4,000 STT + $150 TTS)
Content platform	100,000 min dynamic batch	50M chars narration	~$1,800 (~$300 STT + $1,500 TTS)
Large enterprise	1,000,000 min STT	20M HD chars + 5M custom-voice chars	~$13,900 (~$13,000 STT + $600 + $300)

Two patterns fall out of the math. For content-heavy businesses, speech generation can dominate the bill even when transcription volume looks large; the content platform pays five times as much for TTS ($1,500) as for STT ($300). And at enterprise scale, the STT tiering starts working in your favor: at 1,000,000 minutes, the second 500,000 minutes drop from $0.016 to $0.01 per minute, which is what brings that line to roughly $13,000. Branded synthetic voice stays a modest add-on unless you generate enormous volumes of speech.
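
The arithmetic behind the four scenarios can be reproduced with a short tiered-pricing sketch. The rates and the 500,000-minute first tier come from the figures above. The ceilings for the later tiers (1,000,000 and 2,000,000 minutes) are assumptions added for illustration, although the enterprise row implies the $0.01 tier extends to at least 1,000,000 minutes. Check the current pricing page before budgeting against any of this.

```python
from decimal import Decimal

# (tier ceiling in minutes, price per minute). Ceilings after the first
# are ASSUMED; the source text only states the 500,000-minute boundary.
STT_TIERS = [
    (500_000, Decimal("0.016")),
    (1_000_000, Decimal("0.010")),
    (2_000_000, Decimal("0.008")),
    (None, Decimal("0.004")),
]
DYNAMIC_BATCH_RATE = Decimal("0.003")   # per minute
HD_PER_MILLION = Decimal("30")          # Chirp 3 HD, per 1M characters
CUSTOM_PER_MILLION = Decimal("60")      # Instant Custom Voice, per 1M characters


def stt_cost(minutes: int) -> Decimal:
    """Monthly standard-recognition cost with graduated volume tiers."""
    total, floor = Decimal(0), 0
    for ceiling, rate in STT_TIERS:
        upper = minutes if ceiling is None else min(minutes, ceiling)
        if upper > floor:
            total += (upper - floor) * rate   # bill only this tier's slice
        if ceiling is None or minutes <= ceiling:
            break
        floor = ceiling
    return total


def tts_cost(characters: int, per_million: Decimal) -> Decimal:
    """Character-based synthesis cost at a flat per-million rate."""
    return Decimal(characters) * per_million / 1_000_000


startup = stt_cost(10_000) + tts_cost(1_000_000, HD_PER_MILLION)
support = stt_cost(250_000) + tts_cost(5_000_000, HD_PER_MILLION)
content = 100_000 * DYNAMIC_BATCH_RATE + tts_cost(50_000_000, HD_PER_MILLION)
enterprise = (
    stt_cost(1_000_000)
    + tts_cost(20_000_000, HD_PER_MILLION)
    + tts_cost(5_000_000, CUSTOM_PER_MILLION)
)
```

Decimal is used instead of float so the totals match the table exactly rather than drifting by fractions of a cent.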

The competitive picture

Independent evidence says Chirp 3 is genuinely strong, and also that nobody has solved speech. In a 2026 academic evaluation on noisy Dutch semi-spontaneous speech, Google Chirp 3 posted the best average WER of eight tested ASR systems at 11.2%, beating Whisper-large-v3 at 15.8%. But another 2026 study on spoken U.S. street names found that leading systems from Google, OpenAI, Deepgram, and Microsoft all produced a very high average error rate on that named-entity task. Broad ASR quality is competitive; names, jargon, and speaker overlap remain live failure modes for everyone.

Against OpenAI, the tradeoff is breadth versus integration. OpenAI's speech-to-text guide now lists gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize, priced at roughly $0.006/minute and $0.003/minute for the two request-based transcription models and $0.017/minute for GPT-Realtime-Whisper. Its TTS ships 13 built-in voices that OpenAI says are currently optimized for English. If you're building an English-heavy realtime agent already living inside OpenAI's APIs, that tight coupling with the realtime reasoning stack is compelling. Google is the better fit when you need broad multilingual TTS and mature cloud controls around separately managed STT and TTS services.

Against ElevenLabs, the tradeoff runs almost exactly in reverse. ElevenLabs offers thousands of voices through its library and cloning workflows, TTS across 32 languages on its latest low-latency tiers, roughly 75 ms latency for Flash and Turbo, and both instant and professional cloning. Its published API pricing is much higher than Google's for premium TTS: $0.05 per 1K characters for Flash/Turbo and $0.10 per 1K for Multilingual v2/v3, against Google's $30 per 1M for Chirp 3 HD. ElevenLabs prices STT at $0.22/hour for Scribe v1/v2 and $0.39/hour for realtime. It stays the stronger pick for creator, character, and branded-voice work; Google wins for cloud-native enterprise systems that want one vendor for managed STT and managed multilingual TTS.

Deepgram is the specialist. Its docs describe Flux as a streaming model with model-native turn detection for agents, and the Nova streaming stack supports language detection, interim results, and configurable endpointing. Aura-2 TTS runs $0.030 per 1K characters, which essentially matches Google's $30 per 1M. When the primary KPI is telephony-grade realtime turn-taking, Deepgram's agent ergonomics are hard to beat. Google's counter is multilingual breadth plus everything else in Google Cloud.

Azure Speech is the closest enterprise peer. Microsoft documents real-time, fast, batch, and custom speech transcription, a large language matrix, 30 HD voices and 500+ other voices, plus custom-voice training and hosting. Microsoft's pricing page shows standard transcription at about $1/hour and standard/neural TTS at about $15 per 1M characters, and its HD-voice docs emphasize a subset of SSML with high-fidelity conversational rendering. AWS, meanwhile, remains strong on durable building blocks such as streaming transcription, language identification, and dual-channel audio, with Polly priced at $16 per 1M characters for neural voices and $30 per 1M for generative voices. Google's advantage over AWS is a more unified modern-speech-model story.

AssemblyAI and Whisper-based systems occupy a different lane. AssemblyAI focuses on speech-to-text, speech understanding, and voice-agent infrastructure rather than first-party standalone TTS, with Universal-3 Pro and Universal-2 positioned for production ASR and streaming support at sub-300 ms latency in its newer streaming docs. Whisper still matters historically; OpenAI notes its Audio API was originally backed by whisper-1, and Google's 2023 USM research benchmarked against Whisper with strong multilingual gains. Self-hosted Whisper stacks remain relevant as a baseline and for teams that want control over model hosting, but that's a different purchase than a fully managed multilingual speech platform.

Running it in production

A production Chirp 3 stack starts with standard GCP plumbing: create or select a project, enable billing, enable the APIs. Enabling Speech-to-Text or Text-to-Speech requires the Service Usage Admin role for whoever flips the switch. For STT V2, Google recommends recognizers, reusable stored recognition configurations that let teams group traffic logically and standardize model, language, and feature settings. For long audio, BatchRecognize expects source audio in Cloud Storage and returns a long-running operation whose output comes back inline or lands in Cloud Storage.

On the TTS side, real-time replies should use streaming synthesize with Chirp 3 HD, while book-length or narration jobs should use long-form audio synthesis. The long-audio path handles up to 1 million bytes of input and requires the output bucket to grant the Storage Object Creator and Storage Object Viewer roles. In practice, real-time agents and long-form media pipelines should be built as different services even though both carry the Chirp 3 name.

Compliance controls exist but aren't perfectly uniform across the family. Speech-to-Text supports IAM and audit logging. Text-to-Speech offers regional endpoints, and Google says using the us or eu regional endpoints keeps TTS data at rest and in use within those boundaries. Google also states that Cloud TTS does not log customer text or audio data. But remember the earlier caveat: Google markets STT V2 data residency as fully regionalized, while Chirp 3's own model page lists only the us and eu multi-regions for the transcription model. Verify model-specific availability before promising residency to your legal team.

On the abuse-prevention front, Google is notably more restrictive than consumer-facing voice platforms. Instant Custom Voice is allowlist only, requires a recorded consent statement, and Google says it performs diligence to verify voice-use permissions. In the parallel Vertex AI generative media launch materials for Chirp 3, Google also said it does not use customer data to train its models and processes customer data according to customer instructions. The implementation lesson: treat voice cloning, regionalization, and long-running transcription as governed workflows with explicit policy checks and fallback paths, not just API calls.

Chirp 3 is not "Google's new speech model." It's a modular speech platform family with a credible ASR research lineage, strong multilingual transcription today, increasingly capable real-time HD synthesis, and a tightly governed custom-voice feature. For organizations already standardized on Google Cloud, it's one of the most complete managed speech stacks you can buy right now. Teams optimizing for creator tooling or ultra-specialized realtime agent ergonomics will still find competitors that beat it in their narrow slice.

Sources

https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/
https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api
https://docs.cloud.google.com/text-to-speech/docs/release-notes
https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd
https://docs.cloud.google.com/speech-to-text/docs/quotas
https://docs.cloud.google.com/text-to-speech/docs/chirp3-instant-custom-voice
https://cloud.google.com/blog/products/ai-machine-learning/expanding-generative-media-for-enterprise-on-vertex-ai
https://docs.cloud.google.com/speech-to-text/docs/release-notes
https://docs.cloud.google.com/text-to-speech/quotas
https://cloud.google.com/speech-to-text/pricing
https://arxiv.org/html/2603.09725v1
https://developers.openai.com/api/docs/guides/speech-to-text
https://elevenlabs.io/docs/overview/capabilities/text-to-speech
https://developers.deepgram.com/docs/models-languages-overview
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
https://www.assemblyai.com/pricing
https://docs.cloud.google.com/speech-to-text/docs/recognizers
https://docs.cloud.google.com/text-to-speech/docs/create-audio-text-streaming
https://docs.cloud.google.com/speech-to-text/docs/audit-logging