Google Cloud Chirp 3: model profile

Google Cloud Chirp 3 is a family of speech capabilities comprising Chirp 3 Transcription in Speech-to-Text V2, Chirp 3 HD voices in Cloud Text-to-Speech, and Chirp 3 Instant Custom Voice for voice cloning.

Specifications

Developer	Google Cloud
Released	chirp_3 transcription introduced in public preview per Speech-to-Text release notes; Chirp 3 HD voices generally available April 2, 2025; Instant Custom Voice announced at Google Cloud Next 2025
Model type	Multilingual ASR-specific generative model (transcription); generative TTS models (HD voices)
Parameters	Not publicly disclosed for Chirp 3. The predecessor Chirp model was described as a 2B-parameter speech model
Training data	"Millions of hours of audio and billions of text sentences" (vendor description)
Languages	85+ languages and locales for transcription; TTS described as over 35 languages, with the GA catalog listed as 8 speakers and 31 locales
Modes (batch / streaming)	Speech.Recognize (audio under one minute), Speech.BatchRecognize (long audio), Speech.StreamingRecognize (real-time)
Latency	No public millisecond benchmark. Developer controls include endpointing_sensitivity (standard, short, supershort) and voice activity events
Throughput / concurrency	STT streaming: 25 KB per chunk, 5-minute session limit. Batch: up to 15 files per request, each up to 8 hours. TTS: 200 Chirp 3 requests per minute per project, 100 concurrent streaming sessions per project
Deployment	Managed Google Cloud APIs (Speech-to-Text V2 and Cloud Text-to-Speech). Transcription model GA in us and eu multi-regions
Pricing	STT standard recognition $0.016/min (first 500,000 min/month), tiering to $0.004/min; dynamic batch $0.003/min; Chirp 3 HD $30 per 1M characters; Instant Custom Voice $60 per 1M characters
License	Not publicly disclosed. Offered as a managed cloud service; the source describes no self-hosted or open-license option

Known limitations

Word-level timestamps are supported only in Recognize and BatchRecognize and can degrade transcription quality.
Word-level confidence scores are returned but are not confidence scores in the conventional sense.
Increasing endpointing sensitivity for faster responses can cause the system to cut off users who pause briefly while speaking.
Streaming sessions are limited to 5 minutes and require stream rotation or the endless-streaming pattern for longer audio.
SSML tags (preview) are not supported for streaming Chirp 3 HD requests.
The Chirp 3 transcription model page lists only us and eu multi-regions as GA, narrower than platform-wide STT V2 regionalization claims.
Diarization supports a smaller language subset than base transcription.
Instant Custom Voice is allowlist-only and requires a consent recording using Google's required script.
A 2026 study on spoken U.S. street names found leading systems, including Google's, produced a very high average error rate on this named-entity task.
Chirp 3 parameter count, exact training data composition, and public latency benchmarks are not disclosed in the source material.

Full technical breakdown

Overview

Chirp 3 is a brand umbrella spanning automatic speech recognition, HD text-to-speech, and custom voice creation, each with different APIs, limits, prices, and regional footprints. Google describes Chirp 3: Transcription as the latest generation of its multilingual ASR-specific generative models, available only in Speech-to-Text API V2 under the model identifier chirp_3. Google's product page states that Speech-to-Text can use Chirp 3, trained on millions of hours of audio and billions of text sentences, to improve recognition across more languages and accents. Google positions Chirp 3 for multilingual transcription and for lower-latency synthetic speech in agents, support operations, and media workflows.

The transcription side and the synthesis side are separate service surfaces. Deployment characteristics, compliance controls, and operational limits differ between the STT and TTS products despite the shared branding.

Capabilities and features

Speech-to-Text (chirp_3)

Chirp 3 transcription maps to three API modes: Speech.Recognize for audio shorter than one minute, Speech.BatchRecognize for long audio (documented as generally suited for 1 minute to 1 hour, though quotas allow longer), and Speech.StreamingRecognize for streaming and real-time audio. A BatchRecognize request can contain up to 15 files, each up to 8 hours long; long-audio batch requests accept input only as a Cloud Storage URI.
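
The mode split described above can be sketched as a small routing helper. This is an illustration only: the function name and the returned strings are hypothetical, and the thresholds are simply the one-minute and 8-hour figures quoted in this profile, not values read from the API.

```python
# Limits quoted in the text above (not queried from the service).
SYNC_MAX_SECONDS = 60             # Recognize: audio shorter than one minute
BATCH_MAX_SECONDS = 8 * 60 * 60   # BatchRecognize: each file up to 8 hours


def pick_recognition_method(duration_seconds: float, realtime: bool = False) -> str:
    """Return the Speech-to-Text V2 method suited to one audio input.

    Hypothetical helper: it only encodes the documented duration bands.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if realtime:
        # Live audio goes to streaming regardless of total length.
        return "Speech.StreamingRecognize"
    if duration_seconds < SYNC_MAX_SECONDS:
        return "Speech.Recognize"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "Speech.BatchRecognize"
    raise ValueError("audio exceeds the 8-hour per-file batch limit; split it first")
```

A 30-second clip would route to Speech.Recognize, while a 45-minute recording would route to Speech.BatchRecognize and would need to be staged in Cloud Storage first.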

Documented features:

Automatic punctuation
Automatic capitalization
Utterance-level timestamps in streaming
Speaker diarization in batch
Speech adaptation for vocabulary biasing
Language-agnostic audio transcription
Custom prompt feature (preview)

Documented caveats: word-level timestamps are supported only in Recognize and BatchRecognize and can degrade transcription quality; word-level confidence scores are returned but are not confidence scores in the conventional sense.

For live systems, Chirp 3 exposes endpointing_sensitivity levels (standard, short, supershort) that trade accuracy against latency. Google's documentation warns that increasing sensitivity for faster bot responses can cause the system to cut off users who pause briefly while speaking. Speech-to-Text V2 can emit voice activity begin and end events and can auto-close streams based on speech begin or speech end timeouts; those timeouts must be greater than 500 ms and less than 60 s.
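
A client can check these settings before opening a stream. The sketch below assumes a hypothetical validation helper with made-up parameter names; only the three sensitivity levels and the 500 ms to 60 s bounds come from the text above.

```python
# Values taken from the paragraph above.
ENDPOINTING_LEVELS = ("standard", "short", "supershort")
MIN_TIMEOUT_MS = 500       # timeouts must be strictly greater than this
MAX_TIMEOUT_MS = 60_000    # and strictly less than this


def validate_streaming_settings(
    endpointing: str,
    speech_start_timeout_ms: int,
    speech_end_timeout_ms: int,
) -> dict:
    """Validate streaming settings locally; names here are illustrative only."""
    if endpointing not in ENDPOINTING_LEVELS:
        raise ValueError(f"endpointing must be one of {ENDPOINTING_LEVELS}")
    timeouts = {
        "speech_start_timeout_ms": speech_start_timeout_ms,
        "speech_end_timeout_ms": speech_end_timeout_ms,
    }
    for name, value in timeouts.items():
        # Both bounds are exclusive per the documented rule.
        if not MIN_TIMEOUT_MS < value < MAX_TIMEOUT_MS:
            raise ValueError(f"{name} must be > 500 ms and < 60 s, got {value}")
    return {"endpointing": endpointing, **timeouts}
```

Rejecting out-of-range values on the client avoids discovering the constraint only after a stream has been opened.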

Text-to-Speech (Chirp 3 HD voices)

Google's documentation says Chirp 3 HD voices are powered by generative models, are suitable for conversational agents, cover 30 distinct styles across many languages, and support low-latency real-time communication through text streaming. Streaming Cloud TTS is only compatible with Chirp 3 HD voices.

Output formats: LINEAR16 is the default; ALAW, MULAW, OGG_OPUS, and PCM are supported for streaming; ALAW, MULAW, MP3, OGG_OPUS, and PCM are supported for batch.

Control features:

Pace control from 0.25x to 2.0x
Pause control (experimental)
Custom pronunciations using IPA or X-SAMPA
SSML support in preview for synchronous Chirp 3 HD requests; those SSML tags are not currently supported for streaming requests
Asynchronous long-form synthesis of up to 1 million bytes of input
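
As a rough sketch of how an application might apply these controls, the helper below checks the pace range and picks a synthesis path by input size. The function name and return labels are hypothetical. The 5,000-byte standard limit is the figure from this profile's limits table, and measuring size as UTF-8 bytes is an assumption.

```python
# Figures quoted in this profile.
STANDARD_MAX_BYTES = 5_000         # standard synthesis input per request
LONG_FORM_MAX_BYTES = 1_000_000    # asynchronous long-form synthesis input
MIN_PACE, MAX_PACE = 0.25, 2.0     # pace control range


def plan_synthesis(text: str, pace: float = 1.0) -> str:
    """Return "standard" or "long_form" for a Chirp 3 HD request (illustrative)."""
    if not MIN_PACE <= pace <= MAX_PACE:
        raise ValueError("pace must be between 0.25 and 2.0")
    # Assumption: the byte limits apply to the UTF-8 encoding of the text.
    size = len(text.encode("utf-8"))
    if size == 0:
        raise ValueError("text must not be empty")
    if size <= STANDARD_MAX_BYTES:
        return "standard"
    if size <= LONG_FORM_MAX_BYTES:
        return "long_form"
    raise ValueError("text exceeds the 1 million byte long-form limit; split it first")
```

Counting bytes rather than characters matters for non-ASCII text, where one character can occupy several bytes.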

Instant Custom Voice

Instant Custom Voice creates personalized voice models and synthesizes speech through the Cloud TTS API for both streaming and long-form text. Access is allowlist-only. Google requires both a consent recording and a reference audio recording, each up to 10 seconds, with the consent statement using Google's required script rather than a customer-written alternative. The generated artifact is a voice cloning key, stored on the client side and reusable across multiple clients or devices. Google's April 2025 blog states the feature enables realistic custom voices from 10 seconds of input, targets use cases such as personalized call centers, accessible content, and brand voices, and that Google uses built-in safety features and a diligence process to verify voice-use permissions.

Instant Custom Voice supports language transfer from en-US to several other locales, including de-DE, es-US, es-ES, fr-CA, fr-FR, and pt-BR.

Language support

Transcription: Chirp 3 launched publicly with 85+ languages and locales.

Diarization: the supported subset is smaller. The Chirp 3 model page lists diarization support for specific languages including en-US, en-GB, en-IN, fr-FR, fr-CA, es-ES, es-US, de-DE, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and Simplified Chinese.

Synthesis: Google's April 2025 blog said Chirp 3 HD voices offered natural speech in over 35 languages with eight speaker options, while release notes described the GA catalog as 8 speakers and 31 locales. The two figures count languages and locale variants respectively.

Performance and benchmarks

Vendor-reported: Google does not publish WER figures for Chirp 3 in the source material. The 2023 USM research that preceded Chirp was benchmarked against Whisper and reported multilingual gains.

Third-party evaluation: a 2026 academic evaluation on noisy Dutch semi-spontaneous speech found Google Chirp 3 achieved the best average WER among eight tested ASR systems at 11.2%, compared with Whisper-large-v3 at 15.8%. A separate 2026 study on spoken U.S. street names found that leading systems from Google, OpenAI, Deepgram, and Microsoft produced a very high average error rate on this named-entity task.

Latency and throughput

Google publishes developer controls rather than a public millisecond latency benchmark.

Limit	Value
STT streaming chunk size	25 KB of audio per request chunk
STT streaming session duration	Up to 5 minutes; applications should then rotate to a new stream or use the documented endless-streaming pattern
STT chunk size for precise timeout behavior	15,360 bytes per request noted in the voice-activity timeout documentation
STT voice activity timeouts	Greater than 500 ms and less than 60 s
BatchRecognize	Up to 15 files per request, each up to 8 hours; input via Cloud Storage URI only
TTS standard synthesis input	5,000 bytes per request
TTS long-form synthesis input	Up to 1 million bytes
TTS Chirp 3 request rate	200 requests per minute per project
TTS concurrent streaming sessions	100 per project
Voice cloning requests	30 Chirp voice cloning requests per minute per project
Voice-cloning-key generations	10 per minute per project

On STT, endpointing sensitivity controls how quickly the model finalizes utterances, and voice activity events can fire before the corresponding transcription result. On TTS, Chirp 3 HD supports text streaming, letting applications send text incrementally and receive audio incrementally.
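
The streaming and batch limits in the table translate into a few small client-side helpers. This is a minimal sketch with hypothetical function names. It reads "25 KB" conservatively as 25,000 bytes, which is an assumption, and defaults to the 15,360-byte chunk size mentioned for precise timeout behavior.

```python
import math
from typing import Iterator, List, Sequence

MAX_CHUNK_BYTES = 25_000       # assumption: conservative reading of "25 KB"
DEFAULT_CHUNK_BYTES = 15_360   # size noted in the voice-activity timeout docs
SESSION_LIMIT_SECONDS = 5 * 60 # streaming session limit
MAX_FILES_PER_BATCH = 15       # BatchRecognize files per request


def chunk_audio(audio: bytes, chunk_bytes: int = DEFAULT_CHUNK_BYTES) -> Iterator[bytes]:
    """Yield consecutive slices of audio no larger than chunk_bytes."""
    if not 0 < chunk_bytes <= MAX_CHUNK_BYTES:
        raise ValueError("chunk_bytes must be between 1 and 25,000")
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


def sessions_needed(duration_seconds: float) -> int:
    """Number of 5-minute streaming sessions needed when rotating streams."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return math.ceil(duration_seconds / SESSION_LIMIT_SECONDS)


def group_batch_files(uris: Sequence[str]) -> List[List[str]]:
    """Split Cloud Storage URIs into groups of at most 15 per request."""
    return [
        list(uris[i:i + MAX_FILES_PER_BATCH])
        for i in range(0, len(uris), MAX_FILES_PER_BATCH)
    ]
```

For example, a 40,000-byte buffer splits into two full 15,360-byte chunks and one 9,280-byte remainder, and 31 files would need three BatchRecognize requests.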

Deployment and integrations

Chirp 3 transcription is available only in Speech-to-Text API V2 under the model identifier chirp_3. The Chirp 3 model page lists us and eu multi-regions as GA for the transcription model; Google's STT V2 migration material says the broader platform supports regionalized invocation in locations such as Belgium and Singapore, so model-specific regional availability must be checked separately from platform-wide regionalization.
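
Because model availability and platform regionalization differ, a client may want to check the location before building a request. The sketch below uses hypothetical helper names. The resource path follows the usual projects/locations/recognizers pattern of Speech-to-Text V2, the "_" default recognizer ID is an assumption, and the GA set holds only the two multi-regions named above.

```python
# GA locations for the chirp_3 transcription model, per the model page cited above.
CHIRP_3_GA_LOCATIONS = frozenset({"us", "eu"})


def is_chirp_3_ga(location: str) -> bool:
    """True if the location is one of the GA multi-regions listed in this profile."""
    return location in CHIRP_3_GA_LOCATIONS


def recognizer_name(project_id: str, location: str, recognizer_id: str = "_") -> str:
    """Build a recognizer resource name (illustrative).

    Assumption: "_" selects an implicit default recognizer.
    """
    if not project_id or not location or not recognizer_id:
        raise ValueError("project_id, location, and recognizer_id are required")
    return f"projects/{project_id}/locations/{location}/recognizers/{recognizer_id}"
```

A location such as asia-southeast1 would still produce a well-formed resource name, so the GA check is kept separate and should be rerun against the current model page rather than trusted as a fixed list.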

Chirp 3 HD voice documentation lists regional availability in global, us, eu, asia-southeast1, europe-west2, and asia-northeast1.

Setup requires a Google Cloud project with billing and the relevant APIs enabled; whoever enables the Speech-to-Text or Text-to-Speech API needs the Service Usage Admin role. For STT V2, Google recommends recognizers, which are reusable stored recognition configurations for grouping traffic and standardizing model, language, and feature settings. BatchRecognize expects source audio in Cloud Storage and returns a long-running operation whose output can be returned inline or written back to Cloud Storage. Long-form TTS requires the output bucket to grant Storage Object Creator and Storage Object Viewer roles.

Compliance controls: Speech-to-Text supports IAM and audit logging; Text-to-Speech offers regional endpoints, and Google says using us or eu regional endpoints keeps TTS data at rest and in use within those geographic boundaries. Google states that Cloud TTS does not log customer text or audio data. In Vertex AI generative media launch materials for Chirp 3, Google said it does not use customer data to train its models and that customer data is processed according to customer instructions.

Pricing

Item	Price
STT V2 standard recognition, first 500,000 min/month	$0.016 per minute
STT V2 standard recognition, higher volume tiers	$0.01, $0.008, and $0.004 per minute
STT dynamic batch recognition (standard models including chirp)	$0.003 per minute
Chirp 3 HD voices	$30 per 1 million characters
Instant Custom Voice	$60 per 1 million characters

Cost scenarios given in the source:

Scenario	Usage	Approximate monthly cost
Startup meeting copilot	10,000 min standard STT plus 1M characters Chirp 3 HD	About $190 ($160 STT plus $30 TTS), excluding storage, networking, and LLM costs
Real-time support center	250,000 min real-time STT plus 5M characters TTS	About $4,150 at list price ($4,000 STT plus $150 TTS)
Content platform	100,000 min dynamic batch STT plus 50M characters TTS	About $1,800 (around $300 STT and $1,500 TTS)
Large enterprise deployment	1,000,000 min STT plus 20M HD characters plus 5M Instant Custom Voice characters	About $13,900 (about $13,000 STT with tiering, $600 HD TTS, $300 custom voice)
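
The scenario totals above can be reproduced with a short list-price calculation. This is a sketch under stated assumptions: the helper names are hypothetical, only the 500,000-minute tier boundary is given in the source, and the 1,000,000-minute upper bound for the $0.01 tier is inferred from the enterprise scenario. Higher tiers are not modeled because their boundaries are not stated.

```python
from decimal import Decimal

# (upper bound in minutes, price per minute).
# Assumption: the $0.01 tier covers minutes 500,001 to 1,000,000.
STT_TIERS = [
    (500_000, Decimal("0.016")),
    (1_000_000, Decimal("0.010")),
]
DYNAMIC_BATCH_RATE = Decimal("0.003")   # per minute
HD_PER_MILLION = Decimal("30")          # Chirp 3 HD, per 1M characters
ICV_PER_MILLION = Decimal("60")         # Instant Custom Voice, per 1M characters


def stt_standard_cost(minutes: int) -> Decimal:
    """List-price cost of standard STT for one month, up to 1,000,000 minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    if minutes > STT_TIERS[-1][0]:
        raise ValueError("tiers above 1,000,000 minutes are not modeled here")
    cost = Decimal("0")
    lower = 0
    for upper, rate in STT_TIERS:
        billable = min(minutes, upper) - lower
        if billable <= 0:
            break
        cost += billable * rate
        lower = upper
    return cost


def tts_cost(characters: int, per_million: Decimal) -> Decimal:
    """List-price cost of synthesis for a character count."""
    return Decimal(characters) * per_million / Decimal(1_000_000)


# Startup meeting copilot: 10,000 min STT plus 1M HD characters.
startup = stt_standard_cost(10_000) + tts_cost(1_000_000, HD_PER_MILLION)
# Large enterprise: 1,000,000 min STT, 20M HD and 5M custom voice characters.
enterprise = (
    stt_standard_cost(1_000_000)
    + tts_cost(20_000_000, HD_PER_MILLION)
    + tts_cost(5_000_000, ICV_PER_MILLION)
)
print(startup, enterprise)
```

Under these assumptions the startup scenario comes to $190 and the enterprise scenario to $13,900, matching the table. Decimal is used instead of float so that per-minute rates such as 0.016 do not introduce rounding error.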

Comparative pricing reported in the source: OpenAI lists about $0.006/minute and $0.003/minute for its two request-based transcription models and $0.017/minute for GPT-Realtime-Whisper. ElevenLabs lists $0.05 per 1K characters for Flash/Turbo TTS, $0.10 per 1K for Multilingual v2/v3, $0.22/hour for Scribe v1/v2 STT, and $0.39/hour for realtime STT. Deepgram lists Aura-2 TTS at $0.030 per 1K characters. Microsoft pricing snippets show standard transcription at about $1/hour and standard/neural TTS at about $15 per 1M characters. AWS Polly is listed at $16 per 1M characters for neural voices and $30 per 1M characters for generative voices.

Development and ownership

Chirp 3 is developed by Google Cloud. Its lineage begins with Google Research's Universal Speech Model (USM) work, described in March 2023 as part of Google's 1,000 Languages Initiative. Yu Zhang and James Qin presented USM in the research blog; the corresponding paper lists 27 authors and describes a 2B-parameter multilingual ASR system trained on 12 million hours of speech and 28 billion sentences of text spanning 300+ languages.

Google Cloud productized that work in May 2023, when Cloud Speech product manager Calum Barnes introduced Chirp as a foundation model for the Google Cloud Speech API, described as a 2B-parameter speech model built through self-supervised training on millions of hours of audio and 28 billion sentences across 100+ languages.

Release history

Date	Event
March 2023	Google Research publishes USM work as part of the 1,000 Languages Initiative
May 2023	Chirp introduced as a foundation model for the Google Cloud Speech API
February 2025	Journey voices rebranded as Chirp HD voices per TTS release notes
March 2025	Chirp HD expanded to 31 locales
April 2, 2025	Chirp 3 HD voices GA with eight speakers and real-time plus batch support
2025 (Speech-to-Text release notes)	chirp_3 introduced in public preview with 85+ languages and locales, StreamingRecognize, Recognize, speaker diarization, and language-agnostic transcription
Google Cloud Next 2025	Instant Custom Voice announced, creating a custom voice from 10 seconds of audio

Sources

https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3

https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/

https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api

https://docs.cloud.google.com/text-to-speech/docs/release-notes

https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd

https://docs.cloud.google.com/speech-to-text/docs/quotas

https://docs.cloud.google.com/text-to-speech/docs/chirp3-instant-custom-voice

https://cloud.google.com/blog/products/ai-machine-learning/expanding-generative-media-for-enterprise-on-vertex-ai

https://docs.cloud.google.com/speech-to-text/docs/release-notes

https://docs.cloud.google.com/text-to-speech/quotas

https://cloud.google.com/speech-to-text/pricing

https://arxiv.org/html/2603.09725v1

https://developers.openai.com/api/docs/guides/speech-to-text

https://elevenlabs.io/docs/overview/capabilities/text-to-speech

https://developers.deepgram.com/docs/models-languages-overview

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text

https://www.assemblyai.com/pricing

https://docs.cloud.google.com/speech-to-text/docs/recognizers

https://docs.cloud.google.com/text-to-speech/docs/create-audio-text-streaming

https://docs.cloud.google.com/speech-to-text/docs/audit-logging