Google Cloud Chirp 3: model profile
Reference profile of Google Cloud Chirp 3: multilingual speech-to-text in Speech-to-Text V2, Chirp 3 HD voices, pricing, limits, and benchmarks.
Google Cloud Chirp 3 is a family of speech capabilities comprising Chirp 3 Transcription in Speech-to-Text V2, Chirp 3 HD voices in Cloud Text-to-Speech, and Chirp 3 Instant Custom Voice for voice cloning.
Specifications
| Field | Value |
|---|---|
| Developer | Google Cloud |
| Released | chirp_3 transcription introduced in public preview per Speech-to-Text release notes; Chirp 3 HD voices generally available April 2, 2025; Instant Custom Voice announced at Google Cloud Next 2025 |
| Model type | Multilingual ASR-specific generative model (transcription); generative TTS models (HD voices) |
| Parameters | Not publicly disclosed for Chirp 3. The predecessor Chirp model was described as a 2B-parameter speech model |
| Training data | "Millions of hours of audio and billions of text sentences" (vendor description) |
| Languages | 85+ languages and locales for transcription; TTS described as over 35 languages, with the GA catalog listed as 8 speakers and 31 locales |
| Modes (batch / streaming) | Speech.Recognize (audio under one minute), Speech.BatchRecognize (long audio), Speech.StreamingRecognize (real-time) |
| Latency | No public millisecond benchmark. Developer controls include endpointing_sensitivity (standard, short, supershort) and voice activity events |
| Throughput / concurrency | STT streaming: 25 KB per chunk, 5-minute session limit. Batch: up to 15 files per request, each up to 8 hours. TTS: 200 Chirp 3 requests per minute per project, 100 concurrent streaming sessions per project |
| Deployment | Managed Google Cloud APIs (Speech-to-Text V2 and Cloud Text-to-Speech). Transcription model GA in us and eu multi-regions |
| Pricing | STT standard recognition $0.016/min (first 500,000 min/month), tiering to $0.004/min; dynamic batch $0.003/min; Chirp 3 HD $30 per 1M characters; Instant Custom Voice $60 per 1M characters |
| License | Not publicly disclosed. Offered as a managed cloud service; the source describes no self-hosted or open-license option |
Overview
Chirp 3 is a brand umbrella spanning automatic speech recognition, HD text-to-speech, and custom voice creation, each with different APIs, limits, prices, and regional footprints. Google describes Chirp 3: Transcription as the latest generation of its multilingual ASR-specific generative models, available only in Speech-to-Text API V2 under the model identifier chirp_3. Google's product page states that Speech-to-Text can use Chirp 3, trained on millions of hours of audio and billions of text sentences, to improve recognition across more languages and accents. Google positions Chirp 3 for multilingual transcription and for lower-latency synthetic speech in agents, support operations, and media workflows.
The transcription side and the synthesis side are separate service surfaces. Deployment characteristics, compliance controls, and operational limits differ between the STT and TTS products despite the shared branding.
Capabilities and features
Speech-to-Text (chirp_3)
Chirp 3 transcription maps to three API modes: Speech.Recognize for audio shorter than one minute, Speech.BatchRecognize for long audio (documented as generally suited for 1 minute to 1 hour, though quotas allow longer), and Speech.StreamingRecognize for streaming and real-time audio. A BatchRecognize request can contain up to 15 files, each up to 8 hours long; long-audio batch requests accept input only as a Cloud Storage URI.
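The mapping from audio length to API mode can be sketched as a small routing helper. This is an illustrative sketch only: the function names `choose_method` and `validate_batch` are hypothetical, the limits are the ones quoted above, and nothing here calls the Speech-to-Text API.

```python
MAX_SYNC_SECONDS = 60               # Recognize: audio shorter than one minute
MAX_BATCH_FILES = 15                # BatchRecognize: files per request
MAX_BATCH_FILE_SECONDS = 8 * 3600   # BatchRecognize: 8 hours per file


def choose_method(duration_seconds: float, realtime: bool = False) -> str:
    """Return the Speech-to-Text V2 method name suited to the audio."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if realtime:
        return "StreamingRecognize"
    if duration_seconds < MAX_SYNC_SECONDS:
        return "Recognize"
    if duration_seconds <= MAX_BATCH_FILE_SECONDS:
        return "BatchRecognize"
    raise ValueError("audio exceeds the 8-hour per-file batch limit")


def validate_batch(files: dict[str, float]) -> list[str]:
    """Check a mapping of URI -> duration in seconds; return any problems."""
    problems = []
    if len(files) > MAX_BATCH_FILES:
        problems.append(f"{len(files)} files exceeds the limit of {MAX_BATCH_FILES}")
    for uri, seconds in files.items():
        if not uri.startswith("gs://"):
            problems.append(f"{uri}: not a Cloud Storage URI")
        if seconds > MAX_BATCH_FILE_SECONDS:
            problems.append(f"{uri}: longer than 8 hours")
    return problems
```

Running `validate_batch` before submitting a job surfaces limit violations locally instead of waiting for the service to reject the request.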
Documented features:
- Automatic punctuation
- Automatic capitalization
- Utterance-level timestamps in streaming
- Speaker diarization in batch
- Speech adaptation for vocabulary biasing
- Language-agnostic audio transcription
- Custom prompt feature (preview)
Documented caveats: word-level timestamps are supported only in Recognize and BatchRecognize and can degrade transcription quality; word-level confidence scores are returned but are not confidence scores in the conventional sense.
For live systems, Chirp 3 exposes endpointing_sensitivity levels (standard, short, supershort) that trade accuracy against latency. Google's documentation warns that increasing sensitivity for faster bot responses can cause the system to cut off users who pause briefly while speaking. Speech-to-Text V2 can emit voice activity begin and end events and can auto-close streams based on speech begin or speech end timeouts; those timeouts must be greater than 500 ms and less than 60 s.
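A client can check these live-audio settings before opening a stream. The sketch below is a hedged illustration: `build_live_settings` and the dictionary keys it returns are hypothetical names chosen to mirror the terms in the text, not the API's request schema; only the three sensitivity levels and the timeout bounds come from the documentation summarized above.

```python
from datetime import timedelta

ENDPOINTING_LEVELS = ("standard", "short", "supershort")
MIN_TIMEOUT = timedelta(milliseconds=500)  # exclusive lower bound
MAX_TIMEOUT = timedelta(seconds=60)        # exclusive upper bound


def check_timeout(name: str, value: timedelta) -> None:
    """Raise if a voice-activity timeout is outside the documented range."""
    if not (MIN_TIMEOUT < value < MAX_TIMEOUT):
        raise ValueError(f"{name} must be greater than 500 ms and less than 60 s")


def build_live_settings(sensitivity: str,
                        speech_start: timedelta,
                        speech_end: timedelta) -> dict:
    """Validate and bundle live-streaming settings (keys are illustrative)."""
    if sensitivity not in ENDPOINTING_LEVELS:
        raise ValueError(f"sensitivity must be one of {ENDPOINTING_LEVELS}")
    check_timeout("speech_start", speech_start)
    check_timeout("speech_end", speech_end)
    return {
        "endpointing_sensitivity": sensitivity,
        "speech_start_timeout_s": speech_start.total_seconds(),
        "speech_end_timeout_s": speech_end.total_seconds(),
    }
```

Both bounds are exclusive, so a timeout of exactly 500 ms or exactly 60 s is rejected.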
Text-to-Speech (Chirp 3 HD voices)
Google's documentation says Chirp 3 HD voices are powered by generative models, are suitable for conversational agents, cover 30 distinct styles across many languages, and support low-latency real-time communication through text streaming. Streaming Cloud TTS is only compatible with Chirp 3 HD voices.
Output formats: LINEAR16 is the default; ALAW, MULAW, OGG_OPUS, and PCM are supported for streaming; ALAW, MULAW, MP3, OGG_OPUS, and PCM are supported for batch.
Control features:
- Pace control from 0.25x to 2.0x
- Pause control (experimental)
- Custom pronunciations using IPA or X-SAMPA
- SSML support in preview for synchronous Chirp 3 HD requests; those SSML tags are not currently supported for streaming requests
- Asynchronous long-form synthesis of up to 1 million bytes of input
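Because the input limits are expressed in bytes rather than characters, non-ASCII text reaches them sooner than its character count suggests. The following sketch routes a request by encoded size, using the 1-million-byte long-form limit above and the 5,000-byte standard synthesis limit from the limits table. `plan_synthesis` is a hypothetical helper, and counting bytes as UTF-8 is an assumption rather than a documented rule.

```python
STANDARD_MAX_BYTES = 5_000       # standard synthesis input limit
LONG_FORM_MAX_BYTES = 1_000_000  # asynchronous long-form input limit
MIN_PACE, MAX_PACE = 0.25, 2.0   # documented pace control range


def plan_synthesis(text: str, pace: float = 1.0) -> str:
    """Return "standard" or "long_form" for the given text, or raise."""
    if not (MIN_PACE <= pace <= MAX_PACE):
        raise ValueError("pace must be between 0.25 and 2.0")
    size = len(text.encode("utf-8"))  # assumption: limits count UTF-8 bytes
    if size == 0:
        raise ValueError("text is empty")
    if size <= STANDARD_MAX_BYTES:
        return "standard"
    if size <= LONG_FORM_MAX_BYTES:
        return "long_form"
    raise ValueError("text exceeds the 1 million byte long-form limit")
```

For example, 2,501 copies of "é" are only 2,501 characters but 5,002 UTF-8 bytes, so under this assumption they would need the long-form path.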
Instant Custom Voice
Instant Custom Voice creates personalized voice models and synthesizes speech through the Cloud TTS API for both streaming and long-form text. Access is allowlist-only. Google requires both a consent recording and a reference audio recording, each up to 10 seconds, with the consent statement using Google's required script rather than a customer-written alternative. The generated artifact is a voice cloning key, stored on the client side and reusable across multiple clients or devices. Google's April 2025 blog states the feature enables realistic custom voices from 10 seconds of input, targets use cases such as personalized call centers, accessible content, and brand voices, and that Google uses built-in safety features and a diligence process to verify voice-use permissions.
Instant Custom Voice supports language transfer from en-US to several other locales, including de-DE, es-US, es-ES, fr-CA, fr-FR, and pt-BR.
Language support
Transcription: Chirp 3 launched publicly with 85+ languages and locales.
Diarization: the supported subset is smaller. The Chirp 3 model page lists diarization support for specific languages including en-US, en-GB, en-IN, fr-FR, fr-CA, es-ES, es-US, de-DE, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and Simplified Chinese.
Synthesis: Google's April 2025 blog said Chirp 3 HD voices offered natural speech in over 35 languages with eight speaker options, while release notes described the GA catalog as 8 speakers and 31 locales. The two figures are not directly comparable: the blog counts languages, while the release notes count locale variants.
Performance and benchmarks
Vendor-reported: the source material contains no Google-published WER figures for Chirp 3. The 2023 USM research that preceded Chirp reported multilingual gains in benchmarks against Whisper.
Third-party evaluation: a 2026 academic evaluation on noisy Dutch semi-spontaneous speech found Google Chirp 3 achieved the best average WER among eight tested ASR systems at 11.2%, compared with Whisper-large-v3 at 15.8%. A separate 2026 study on spoken U.S. street names found that leading systems from Google, OpenAI, Deepgram, and Microsoft produced a very high average error rate on this named-entity task.
Latency and throughput
Google publishes developer controls rather than a public millisecond latency benchmark.
| Limit | Value |
|---|---|
| STT streaming chunk size | 25 KB of audio per request chunk |
| STT streaming session duration | Up to 5 minutes; applications should then rotate to a new stream or use the documented endless-streaming pattern |
| STT chunk size for precise timeout behavior | 15,360 bytes per request noted in the voice-activity timeout documentation |
| STT voice activity timeouts | Greater than 500 ms and less than 60 s |
| BatchRecognize | Up to 15 files per request, each up to 8 hours; input via Cloud Storage URI only |
| TTS standard synthesis input | 5,000 bytes per request |
| TTS long-form synthesis input | Up to 1 million bytes |
| TTS Chirp 3 request rate | 200 requests per minute per project |
| TTS concurrent streaming sessions | 100 per project |
| Voice cloning requests | 30 Chirp voice cloning requests per minute per project |
| Voice-cloning-key generations | 10 per minute per project |
On STT, endpointing sensitivity controls how quickly the model finalizes utterances, and voice activity events can fire before the corresponding transcription result. On TTS, Chirp 3 HD supports text streaming, letting applications send text incrementally and receive audio incrementally.
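The streaming limits above imply two client-side tasks: slicing audio into request-sized chunks and rotating to a new stream before the 5-minute session limit. The sketch below shows one way to plan that, under several assumptions: "25 KB" is read conservatively as 25,000 bytes, audio is sent in real time so audio duration approximates session duration, and the default of 32,000 bytes per second corresponds to 16 kHz 16-bit mono PCM. `chunk_audio` and `plan_sessions` are hypothetical helper names.

```python
from typing import Iterator

MAX_CHUNK_BYTES = 25_000         # "25 KB" read as 25,000 bytes (assumption)
TIMEOUT_FRIENDLY_CHUNK = 15_360  # size noted in the timeout documentation
SESSION_LIMIT_SECONDS = 5 * 60   # streaming session limit


def chunk_audio(audio: bytes,
                chunk_bytes: int = TIMEOUT_FRIENDLY_CHUNK) -> Iterator[bytes]:
    """Yield consecutive slices of audio no larger than chunk_bytes."""
    if not 0 < chunk_bytes <= MAX_CHUNK_BYTES:
        raise ValueError("chunk_bytes must be between 1 and 25,000")
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


def plan_sessions(audio: bytes,
                  bytes_per_second: int = 32_000,
                  chunk_bytes: int = TIMEOUT_FRIENDLY_CHUNK) -> list[list[bytes]]:
    """Group chunks so each session carries at most five minutes of audio."""
    budget = SESSION_LIMIT_SECONDS * bytes_per_second
    sessions: list[list[bytes]] = []
    current: list[bytes] = []
    used = 0
    for chunk in chunk_audio(audio, chunk_bytes):
        # Start a new session when this chunk would exceed the budget.
        if current and used + len(chunk) > budget:
            sessions.append(current)
            current, used = [], 0
        current.append(chunk)
        used += len(chunk)
    if current:
        sessions.append(current)
    return sessions
```

Each inner list would be sent over its own stream. Cutting at arbitrary chunk boundaries can split a word across sessions, so a production client would more likely rotate at a detected pause.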
Deployment and integrations
Chirp 3 transcription is available only in Speech-to-Text API V2 under the model identifier chirp_3. The Chirp 3 model page lists us and eu multi-regions as GA for the transcription model; Google's STT V2 migration material says the broader platform supports regionalized invocation in locations such as Belgium and Singapore, so model-specific regional availability must be checked separately from platform-wide regionalization.
Chirp 3 HD voice documentation lists regional availability in global, us, eu, asia-southeast1, europe-west2, and asia-northeast1.
Setup requires a Google Cloud project with billing and the relevant APIs enabled; whoever enables the Speech-to-Text or Text-to-Speech API needs the Service Usage Admin role. For STT V2, Google recommends recognizers, which are reusable stored recognition configurations for grouping traffic and standardizing model, language, and feature settings. BatchRecognize expects source audio in Cloud Storage and returns a long-running operation whose output can be returned inline or written back to Cloud Storage. Long-form TTS requires the Storage Object Creator and Storage Object Viewer roles on the output bucket.
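As a rough illustration of how a recognizer and a location fit together, the sketch below builds a recognizer resource name and a location-specific endpoint host. The `projects/.../locations/.../recognizers/...` pattern, the `_` placeholder for an implicit recognizer, and the `{location}-speech.googleapis.com` host format reflect common Speech-to-Text V2 usage but are stated here as assumptions to verify against current documentation. The function names and the `CHIRP3_GA_LOCATIONS` constant are hypothetical.

```python
CHIRP3_GA_LOCATIONS = ("us", "eu")  # multi-regions listed as GA for chirp_3


def recognizer_path(project: str, location: str, recognizer: str = "_") -> str:
    """Build a recognizer resource name (format assumed, see lead-in)."""
    return f"projects/{project}/locations/{location}/recognizers/{recognizer}"


def speech_endpoint(location: str) -> str:
    """Return the API host for a location (format assumed, see lead-in)."""
    if location == "global":
        return "speech.googleapis.com"
    return f"{location}-speech.googleapis.com"


def chirp3_target(project: str, location: str) -> tuple[str, str]:
    """Return (endpoint, recognizer name) for chirp_3, or raise."""
    if location not in CHIRP3_GA_LOCATIONS:
        raise ValueError(f"chirp_3 is not listed as GA in {location!r}")
    return speech_endpoint(location), recognizer_path(project, location)
```

Keeping the availability list in one constant makes it easy to update when the model's regional footprint changes.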
Compliance controls: Speech-to-Text supports IAM and audit logging; Text-to-Speech offers regional endpoints, and Google says using us or eu regional endpoints keeps TTS data at rest and in use within those geographic boundaries. Google states that Cloud TTS does not log customer text or audio data. In Vertex AI generative media launch materials for Chirp 3, Google said it does not use customer data to train its models and that customer data is processed according to customer instructions.
Pricing
| Item | Price |
|---|---|
| STT V2 standard recognition, first 500,000 min/month | $0.016 per minute |
| STT V2 standard recognition, higher volume tiers | $0.01, $0.008, and $0.004 per minute |
| STT dynamic batch recognition (standard models including chirp) | $0.003 per minute |
| Chirp 3 HD voices | $30 per 1 million characters |
| Instant Custom Voice | $60 per 1 million characters |
Cost scenarios given in the source:
| Scenario | Usage | Approximate monthly cost |
|---|---|---|
| Startup meeting copilot | 10,000 min standard STT plus 1M characters Chirp 3 HD | About $190 ($160 STT plus $30 TTS), excluding storage, networking, and LLM costs |
| Real-time support center | 250,000 min real-time STT plus 5M characters TTS | About $4,150 at list price ($4,000 STT plus $150 TTS) |
| Content platform | 100,000 min dynamic batch STT plus 50M characters TTS | About $1,800 (around $300 STT and $1,500 TTS) |
| Large enterprise deployment | 1,000,000 min STT plus 20M HD characters plus 5M Instant Custom Voice characters | About $13,900 (about $13,000 STT with tiering, $600 HD TTS, $300 custom voice) |
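The scenario totals above can be reproduced with a short calculator. This is a sketch under one stated assumption: the source gives the tier prices but only the first tier's size, so the higher boundaries used here (the next 500,000 minutes at $0.01, the next 1,000,000 at $0.008, and $0.004 beyond that) are assumed. The function names are hypothetical.

```python
from decimal import Decimal

# (minutes in tier, price per minute); sizes after the first tier are assumed
STT_TIERS = [
    (500_000, Decimal("0.016")),
    (500_000, Decimal("0.010")),
    (1_000_000, Decimal("0.008")),
    (None, Decimal("0.004")),  # all remaining minutes
]
DYNAMIC_BATCH_PER_MIN = Decimal("0.003")
HD_PER_MILLION = Decimal("30")
CUSTOM_VOICE_PER_MILLION = Decimal("60")


def stt_standard_cost(minutes: int) -> Decimal:
    """Apply the tiered per-minute prices to a monthly minute count."""
    remaining, total = minutes, Decimal(0)
    for size, price in STT_TIERS:
        used = remaining if size is None else min(remaining, size)
        total += used * price
        remaining -= used
        if remaining == 0:
            break
    return total


def tts_cost(characters: int, per_million: Decimal) -> Decimal:
    """Price a character count at a per-million-character rate."""
    return Decimal(characters) * per_million / 1_000_000


enterprise = (stt_standard_cost(1_000_000)
              + tts_cost(20_000_000, HD_PER_MILLION)
              + tts_cost(5_000_000, CUSTOM_VOICE_PER_MILLION))
print(enterprise)
```

The enterprise total comes to 13,900 dollars: 8,000 for the first 500,000 minutes, 5,000 for the next 500,000, 600 for HD voices, and 300 for Instant Custom Voice.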
Comparative pricing reported in the source: OpenAI lists about $0.006/minute and $0.003/minute for its two request-based transcription models and $0.017/minute for GPT-Realtime-Whisper. ElevenLabs lists $0.05 per 1K characters for Flash/Turbo TTS, $0.10 per 1K for Multilingual v2/v3, $0.22/hour for Scribe v1/v2 STT, and $0.39/hour for realtime STT. Deepgram lists Aura-2 TTS at $0.030 per 1K characters. Microsoft pricing snippets show standard transcription at about $1/hour and standard/neural TTS at about $15 per 1M characters. AWS Polly ranges from $16 per 1M characters for neural voices to $30 per 1M for generative voices.
Development and ownership
Chirp 3 is developed by Google Cloud. Its lineage begins with Google Research's Universal Speech Model (USM) work, described in March 2023 as part of Google's 1,000 Languages Initiative. Yu Zhang and James Qin presented USM in the research blog; the corresponding paper lists 27 authors and describes a 2B-parameter multilingual ASR system trained on 12 million hours of speech and 28 billion sentences of text spanning 300+ languages.
Google Cloud productized that work in May 2023, when Cloud Speech product manager Calum Barnes introduced Chirp as a foundation model for the Google Cloud Speech API, described as a 2B-parameter speech model built through self-supervised training on millions of hours of audio and 28 billion sentences across 100+ languages.
Release history
| Date | Event |
|---|---|
| March 2023 | Google Research publishes USM work as part of the 1,000 Languages Initiative |
| May 2023 | Chirp introduced as a foundation model for the Google Cloud Speech API |
| February 2025 | Journey voices rebranded as Chirp HD voices per TTS release notes |
| March 2025 | Chirp HD expanded to 31 locales |
| April 2, 2025 | Chirp 3 HD voices GA with eight speakers and real-time plus batch support |
| 2025 (Speech-to-Text release notes) | chirp_3 introduced in public preview with 85+ languages and locales, StreamingRecognize, Recognize, speaker diarization, and language-agnostic transcription |
| Google Cloud Next 2025 | Instant Custom Voice announced, creating a custom voice from 10 seconds of audio |
Sources
https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
https://docs.cloud.google.com/text-to-speech/docs/release-notes
https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd
https://docs.cloud.google.com/speech-to-text/docs/quotas
https://docs.cloud.google.com/text-to-speech/docs/chirp3-instant-custom-voice
https://docs.cloud.google.com/speech-to-text/docs/release-notes
https://docs.cloud.google.com/text-to-speech/quotas
https://cloud.google.com/speech-to-text/pricing
https://arxiv.org/html/2603.09725v1
https://developers.openai.com/api/docs/guides/speech-to-text
https://elevenlabs.io/docs/overview/capabilities/text-to-speech
https://developers.deepgram.com/docs/models-languages-overview
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
https://www.assemblyai.com/pricing
https://docs.cloud.google.com/speech-to-text/docs/recognizers
https://docs.cloud.google.com/text-to-speech/docs/create-audio-text-streaming
https://docs.cloud.google.com/speech-to-text/docs/audit-logging