Chirp 3: inside Google Cloud's 2025 speech stack, from HD voices to transcription
What Google Cloud Chirp 3 actually is: release timeline, WER and Elo benchmarks, pricing, known issues, and how it stacks up against Azure and ElevenLabs.

Ask around about "Chirp 3" and you get two very different answers. One is a castable fish-finder. The other is Google Cloud's current speech model family, and that second one is what this report covers. Google uses the exact name in official documentation for three related cloud services: Chirp 3: HD voices for text-to-speech, Chirp 3: Instant Custom Voice for voice cloning, and Chirp 3: Transcription for speech-to-text. The hardware near-match, the Deeper Smart Sonar CHIRP+ 3, is officially branded with a plus sign and lives in fishing and sonar contexts, not speech AI. The exact naming, the breadth of primary documentation, and the current ecosystem evidence all point to the Google Cloud interpretation as the strongest fit.
The family launched in phases rather than on a single date. Chirp 3: HD voices reached GA on April 2, 2025 after March 2025 rollout changes. Instant Custom Voice was announced as GA through an allowlist on April 9, 2025. On the recognition side, Chirp 3: Transcription entered Private Preview on April 11, 2025, Public Preview on August 29, 2025, and GA on October 13, 2025.
One framing matters before anything else: Chirp 3 is a cloud-managed speech stack, not a firmware-based device. There is no public firmware image and no consumer software versioning in the usual device sense. Capability changes show up through release notes, region rollouts, model identifiers, and documentation updates. Its strongest differentiators are multilingual speech recognition, native diarization and built-in denoising for STT, streaming HD TTS, and fast custom voice creation from short audio samples. Its main frictions are allowlist gating for custom voice, documentation drift across pages, regional and data-residency caveats, and several user-reported quality or latency issues in specific locales and workflows.
Third-party evidence puts Chirp 3 in the competitive tier, not the dominant one. Artificial Analysis' streaming STT benchmark found that Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while noting that no single model led across all tested datasets. In TTS, the same firm's selected-voice leaderboard listed Chirp 3: HD at Elo 1,056 and a price of $30 per 1M characters, ranking it below Azure HD 2.5 and Eleven v3 in its naturalness snapshot. Google's own customer case studies show real adoption, including Il Foglio choosing Chirp 3 for natural Italian editorial audio and HBX Group using it for more natural voice-channel customer experiences.
There is also a strategic wrinkle. Chirp 3 is a serious enterprise speech stack, especially for teams already on Google Cloud, but it is no longer the endpoint of Google's voice roadmap. Google now describes Gemini-TTS as the latest evolution of Cloud TTS, with broader prompt-based control and native multi-speaker options while reusing voice identities similar to Chirp 3 HD. For greenfield TTS projects the useful question is no longer "Chirp 3 or not?" but "Chirp 3 vs. Gemini-TTS vs. external speech suites."
Which Chirp 3 do you mean?
The name is ambiguous enough that a careful report should not assume a single product without checking. The two strongest matches:
| Possible match | Why it fits | Why it is less likely here | Source |
|---|---|---|---|
| Google Cloud Chirp 3 | Exact product name appears in official Google Cloud docs for STT and TTS. Google uses "Chirp 3" as a named speech model family. | None of significance; this is the best exact-name match. | |
| Deeper Smart Sonar CHIRP+ 3 | Prominent commercial hardware product; web search often surfaces it when users mean a device. | Official product name is CHIRP+ 3, not plain "Chirp 3," and its category is castable fish-finder sonar, not software or speech AI. | |
The Google Cloud reading wins for three reasons: the name alone does not say whether the product is a device, software, or something else, and only Google's product spans that ground as a full platform; Google uses the exact term in official documentation; and the dimensions worth researching (release notes, model variants, support channels, pricing, benchmarks, update history) fit a cloud speech platform unusually well. The Deeper product stays on the table as the most important alternate interpretation if a hardware device was actually meant.
What the family actually contains
The Chirp lineage predates Chirp 3. Google introduced Chirp as a speech foundation model in 2023 and described that earlier generation as a 2B-parameter speech model delivering 98% English speech recognition accuracy and large relative gains in some low-resource languages. Current Chirp 3 materials read differently: they emphasize product capabilities, rollout stages, and API behavior rather than architectural disclosures such as parameter count. That shift matters. The Chirp 3 product story is practical and API-centric, not model-card transparent.
| Variant | Official identity | Model ID or naming | Core function | Current technical snapshot | Source |
|---|---|---|---|---|---|
| Chirp 3: HD voices | Google Cloud Text-to-Speech | Voice names such as en-US-Chirp3-HD-Charon | High-fidelity TTS for real-time and batch synthesis | Current dedicated docs list 30 named voices, 53 supported languages/locales, GA endpoints in global, us, eu, asia-southeast1, europe-west2, and asia-northeast1, streaming and batch output formats, and text streaming support. Launch GA milestone was 8 speakers / 31 locales on April 2, 2025. | |
| Chirp 3: Instant Custom Voice | Google Cloud Text-to-Speech | Voice cloning key generated per project/request | Fast voice cloning for custom branded or personal voices | Restricted to allowlisted users; supports streaming and batch synthesis, supports LINEAR16, PCM, MP3, and M4A input encodings, pace control from 0.25x to 2x, experimental pause tags and custom pronunciations, and multilingual transfer from en-US to six listed locales. | |
| Chirp 3: Transcription | Google Cloud Speech-to-Text V2 | chirp_3 | Multilingual automatic speech recognition | Available only in Speech-to-Text V2; supports StreamingRecognize, Recognize, and BatchRecognize; documentation lists 29 GA transcription locales, many additional Preview locales, 14 diarization locales, built-in denoiser, language-agnostic transcription, and speech adaptation. | |
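To make the voice-naming convention in the table concrete, here is a minimal Python sketch that splits a Chirp 3: HD voice name into locale and speaker and assembles a JSON body shaped like a Cloud Text-to-Speech synthesize request. The helper names (`parse_chirp3_hd_voice`, `tts_host`, `build_synthesize_body`) are hypothetical. The `<locale>-Chirp3-HD-<speaker>` pattern is inferred from the single example `en-US-Chirp3-HD-Charon`, and the regional host pattern is an assumption to check against the current endpoint documentation. Nothing is sent over the network.

```python
import re

# Pattern inferred from the example name en-US-Chirp3-HD-Charon (assumption).
_VOICE_RE = re.compile(
    r"^(?P<locale>[A-Za-z]{2,3}-[A-Za-z0-9]{2,4})-Chirp3-HD-(?P<speaker>[A-Za-z]+)$"
)

# Endpoints the dedicated Chirp 3: HD docs list as GA, per the table above.
GA_ENDPOINTS = {
    "global", "us", "eu", "asia-southeast1", "europe-west2", "asia-northeast1",
}


def parse_chirp3_hd_voice(name: str) -> tuple[str, str]:
    """Return (locale, speaker) for a Chirp 3: HD voice name."""
    match = _VOICE_RE.match(name)
    if match is None:
        raise ValueError(f"not a Chirp 3: HD voice name: {name!r}")
    return match["locale"], match["speaker"]


def tts_host(endpoint: str) -> str:
    """Assumed host pattern: no prefix for global, '<endpoint>-' otherwise."""
    if endpoint not in GA_ENDPOINTS:
        raise ValueError(f"endpoint not in the documented GA list: {endpoint!r}")
    if endpoint == "global":
        return "texttospeech.googleapis.com"
    return f"{endpoint}-texttospeech.googleapis.com"


def build_synthesize_body(text: str, voice_name: str,
                          audio_encoding: str = "MP3") -> dict:
    """Build a request body; field names follow the public REST reference
    as best understood here and should be verified before use."""
    locale, _speaker = parse_chirp3_hd_voice(voice_name)
    return {
        "input": {"text": text},
        "voice": {"languageCode": locale, "name": voice_name},
        "audioConfig": {"audioEncoding": audio_encoding},
    }
```

For example, `parse_chirp3_hd_voice("en-US-Chirp3-HD-Charon")` returns `("en-US", "Charon")`. Deriving the language code from the voice name keeps the two fields from drifting apart when voices are swapped.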
A few technical details carry outsized weight in practice. For TTS, text streaming is exclusive to Chirp 3 HD voices in Google's Cloud TTS stack, which matters if you are building low-latency voice agents. For custom voice, Google requires a spoken consent statement and recommends clean 10-second recordings; the resulting voice-cloning key is stored client-side by the caller, and Google permits 10 new keys per minute per project with no stated absolute limit on total keys. For STT, Chirp 3 supports speaker diarization, automatic punctuation, automatic capitalization, speech adaptation with up to 1,000 phrases, a custom prompt feature in Preview, and a built-in denoiser that can reduce music, rain, and street noise but not background human voices.
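The STT details above map onto a request configuration roughly as follows. This is a sketch under stated assumptions, not Google's reference code: `build_recognize_request` is a hypothetical helper, the JSON field names and the regional URL pattern reflect one reading of the Speech-to-Text V2 REST reference and should be verified, and whether diarization and adaptation can be combined for a given locale is not confirmed by the sources used here. The function only builds a dictionary and makes no network call.

```python
MAX_ADAPTATION_PHRASES = 1000  # phrase cap cited in the paragraph above


def build_recognize_request(project_id: str, location: str, audio_b64: str,
                            language_codes: list[str],
                            phrases: tuple[str, ...] = (),
                            diarize: bool = False) -> dict:
    """Assemble a URL and JSON body for a V2 Recognize call using chirp_3.

    Field names are assumptions based on the public V2 REST reference.
    """
    phrase_list = list(phrases)
    if len(phrase_list) > MAX_ADAPTATION_PHRASES:
        raise ValueError(
            f"{len(phrase_list)} phrases exceeds the {MAX_ADAPTATION_PHRASES} cap"
        )

    features = {"enableAutomaticPunctuation": True}
    if diarize:
        # Empty config: let the service pick speaker counts (assumption).
        features["diarizationConfig"] = {}

    config = {
        "autoDecodingConfig": {},
        "model": "chirp_3",
        "languageCodes": list(language_codes),
        "features": features,
    }
    if phrase_list:
        config["adaptation"] = {
            "phraseSets": [
                {"inlinePhraseSet": {"phrases": [{"value": p} for p in phrase_list]}}
            ]
        }

    recognizer = f"projects/{project_id}/locations/{location}/recognizers/_"
    url = f"https://{location}-speech.googleapis.com/v2/{recognizer}:recognize"
    return {"url": url, "body": {"config": config, "content": audio_b64}}
```

Checking the phrase count before building the request surfaces the 1,000-phrase limit as a local error instead of a rejected API call.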
Then there is documentation drift, which is subtle but bites integrators. Google's generic "supported voices and languages" page still says Chirp 3 HD doesn't support SSML input or certain pitch and rate controls, and that it is available only on global, eu, and us. The later dedicated Chirp 3 HD docs list more endpoints, and the release notes say limited SSML support was added on October 17, 2025. The dedicated product page and release notes are more current than the generic voice-list page, so when the docs disagree, treat the release notes as the source of truth.

Release history and official support
Because Chirp 3 is a managed cloud service, its "version history" is really a mix of model identifiers, feature rollouts, region rollouts, and service release notes. There is no public firmware track and no downloadable desktop or mobile package attached to the product itself.
The official history, compressed into one view:
| Date | Milestone | What changed | Source |
|---|---|---|---|
| Feb 10, 2025 | Pre-launch rename | Journey voices were rebranded as Chirp HD voices. | |
| Mar 6, 2025 | TTS rollout expansion | Chirp 3 HD added 8 speakers in 31 locales. | |
| Apr 2, 2025 | TTS GA | Chirp 3 HD voices became GA with 8 speakers / 31 locales, real-time streaming, batch support, and supported regional endpoints. | |
| Apr 9, 2025 | Instant Custom Voice GA announcement | Google announced Instant Custom Voice as GA through an allowlist and also announced transcription with diarization in preview/allowlist. | |
| Apr 11, 2025 | STT Private Preview | chirp_3 launched in private preview for Speech-to-Text V2. | |
| May 7, 2025 | TTS controls expansion | Pace control, pause control, and custom pronunciations were released for Chirp 3 HD voices. | |
| Jun 18, 2025 | ICV locale expansion | Instant Custom Voice added ja-JP, pushing support to more than 30 locales. | |
| Aug 21-27, 2025 | ICV and TTS endpoint upgrades | Instant Custom Voice added PCM, MP3, and M4A input encodings; Chirp 3 HD became available on europe-west2. | |
| Aug 29, 2025 | STT Public Preview | chirp_3 public preview launched with 85+ languages/locales in preview coverage and improved speed and accuracy messaging. | |
| Sep 15, 2025 | TTS endpoint expansion | Chirp 3 HD became available on asia-northeast1. | |
| Oct 13, 2025 | STT GA | Chirp 3: Transcription reached GA in Speech-to-Text V2. | |
| Oct 17, 2025 | Limited SSML support | Chirp 3 HD added support for <phoneme>, <p>, <s>, <sub>, and <say-as>. | |
| Nov-Dec 2025 | Regional and language expansion | STT preview regions expanded; TTS added a wide set of European languages, then Punjabi and Cantonese in preview. | |
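Because SSML support is limited to the tags in the October 17, 2025 entry, it can be useful to check markup locally before sending it. The following is a minimal Python sketch: `unsupported_ssml_tags` is a hypothetical helper, and allowing a `<speak>` root is an assumption based on standard SSML structure, not something the release note states.

```python
import xml.etree.ElementTree as ET

# Tags named in the October 17, 2025 release-note entry, plus the <speak>
# root element (the root is an assumption, see the note above).
ALLOWED_TAGS = {"speak", "phoneme", "p", "s", "sub", "say-as"}


def unsupported_ssml_tags(ssml: str) -> list[str]:
    """Return the sorted tag names in `ssml` that are not in ALLOWED_TAGS.

    Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    """
    root = ET.fromstring(ssml)
    found = set()
    for element in root.iter():
        # Drop any '{namespace}' prefix ElementTree adds to tag names.
        found.add(element.tag.rsplit("}", 1)[-1])
    return sorted(found - ALLOWED_TAGS)
```

For instance, `unsupported_ssml_tags('<speak><p>Hi <break time="1s"/> there</p></speak>')` returns `['break']`, while markup that uses only the listed tags returns an empty list.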
Official support paths are unusually strong for an API product. Google provides reference documentation, release notes with RSS support, Vertex AI Studio and console entry points, Colab and GitHub notebooks, community forums, Cloud support, system status, and sales-led access for allowlisted features like Instant Custom Voice. The support footprint is enterprise grade, but some access paths, especially for custom voice, still feel sales-driven rather than self-serve.
Benchmarks, adoption, and what users actually say
Google's public performance storytelling for Chirp 3 is noticeably lighter than it was for the original 2023 Chirp launch. Back then, Google published concrete claims: 98% English recognition accuracy and up to 300% relative improvement in some low-resource languages. Current Chirp 3 launch materials emphasize feature breadth, speed improvements, diarization, language detection, and voice realism, but there is no public benchmark sheet of equal granularity. That does not mean Chirp 3 is weak. It means the public evidence base leans on product marketing and case studies rather than benchmark transparency.
| Evidence area | What the evidence says | Interpretation |
|---|---|---|
| Independent STT benchmark | Artificial Analysis reported that in its streaming benchmark, Google's Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while also noting that no single model leads everywhere. | Chirp 3 looks strong in real-time multilingual settings, but not categorically dominant across all datasets or latency conditions. |
| Independent TTS benchmark | Artificial Analysis' selected-voice leaderboard snapshot showed Chirp 3: HD at Elo 1,056 and $30.0 / 1M characters, below Azure HD 2.5 at Elo 1,127 and Eleven v3 at Elo 1,179. | Chirp 3 HD is competitive, but the benchmark snapshot does not place it at the very top of TTS naturalness. |
| Official real-world media use | Il Foglio said Chirp 3 HD offered the most natural Italian intonation among tested options, turned editorials into audio in minutes, and helped the paper reach the top three of its podcast offerings. | Strong evidence that Chirp 3 performs well in editorial long-form audio, especially when language-specific naturalness matters. |
| Official enterprise localization use | Adya reported localization across 20+ Indian languages with low latency using Chirp 3. | Suggests practical multilingual deployment, especially in enterprise localization. |
| Official contact-center use | HBX Group said Chirp 3 voices created a more natural, less robotic caller experience. | Supports Google's positioning in customer-experience voice channels. |
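For readers less familiar with the metric in the STT row, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic textbook implementation in Python, not Artificial Analysis' scoring code; the lowercasing and whitespace tokenization are simplifying assumptions, since the benchmark's text normalization is not described in the sources used here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

On this definition, a 2.2% WER corresponds to about 22 word errors per 1,000 reference words.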
User sentiment is directionally positive without being glowing. On G2, users reviewing Google Cloud Speech-to-Text consistently praise ease of use, speed, and meeting-transcription productivity, while recurring negatives include cost sensitivity and the need for manual correction when accuracy is not perfect. For Google Cloud Text-to-Speech, review summaries emphasize natural voice quality and simple API integration, but some users still describe the output as robotic in certain scenarios or languages, and some reviewers complain about pricing opacity or cost escalation.
The more technical sentiment in Google's own forums is mixed. Early enthusiasm centered on new voice quality, then posts started surfacing UI confusion, long-audio latency regressions, markup and SSML limitations, locale-specific pronunciation bugs, and allowlist access friction. That pattern is a familiar enterprise-AI trajectory: high capability, uneven operational maturity.

Known issues and their current status
Chirp 3's most common issues are not fatal flaws so much as integration and maturity problems. Some were clearly fixed in later releases, some remain workflow constraints, and some are documentation or transparency issues rather than core model failures.
| Issue or limitation | Evidence | Current status or best-known fix |
|---|---|---|
| Instant Custom Voice is not self-serve | Google's docs state that access is restricted to allowlisted users, and forum users continued asking for a simpler access flow in 2026. | Still gated. Operationally, the "fix" is to work through sales/allowlisting. |
| SSML / markup support was incomplete at launch | Generic docs originally said SSML input was unsupported for Chirp 3 HD, while release notes later added limited SSML support in October 2025. | Partially improved, but only selected tags are supported. |
| Console UI confusion about missing legacy voices | Google staff clarified that only Chirp voices were intentionally shown in the Cloud Console UI even though other TTS voices still existed through the API. | Workaround: use the API rather than relying on the narrowed UI. |
| Long-audio performance regression | A September 2025 forum thread reported long-audio jobs stalling or taking much longer than before. | Public forum evidence shows a real complaint, but no official public postmortem or universal fix was captured in the source set. |
| Locale-specific pronunciation failures | Users reported mispronunciation of French contractions and related markup/pronunciation issues. | Workarounds include phoneme tags and pronunciation controls, where supported. |
| STT diarization and language coverage are not universal | Chirp 3 STT has broad language coverage, but diarization is listed for only a smaller subset of languages, and the model's regional rollouts occurred gradually. | The fix is a design decision rather than a toggle: choose a supported language/region combination and the appropriate recognition method up front. |
Pricing and availability
Chirp 3 is sold as a Google Cloud API service, not a boxed product, so there are no retailers in the traditional sense. Consumption happens through the Cloud TTS API, the Speech-to-Text V2 API, Vertex AI Studio, and Google's console and notebook ecosystem. Region support differs by sub-product: Chirp 3 HD currently lists six GA endpoints, while Chirp 3 Transcription documents GA in us and eu multi-regions plus release-note preview expansions into additional regions in late 2025. Instant Custom Voice also lists region availability beyond the initial TTS endpoints.
| Chirp 3 commercial surface | Publicly visible pricing | Availability notes | Source |
|---|---|---|---|
| Chirp 3: HD voices | $30 per 1M characters after the free tier; 1M characters free monthly on the pricing page. | GA; available via Cloud TTS and Vertex AI Studio. | |
| Chirp 3: Instant Custom Voice | No public price was visible in the pricing excerpt captured for this report; the feature is allowlisted. | Restricted access; requires sales/allowlist. | |
| Chirp 3: Transcription | Google's Speech-to-Text V2 page lists standard recognition at $0.016/min up to 500k min, then lower volume-tier rates, and dynamic batch at $0.003/min; the public pricing page still labels the included V2 speech model family as "chirp" rather than explicitly chirp_3. | Available only in Speech-to-Text V2; GA and preview regions differ. | |
One caveat worth flagging: Google's public pricing nomenclature lags the model nomenclature. The STT pricing page references "chirp (Speech-to-Text V2 only)" while the current model docs and release notes use chirp_3. The most reasonable reading is that Chirp-family Speech-to-Text V2 pricing applies, but the public pricing page is not as current or precise as the model documentation.
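The list prices above translate into a simple budgeting calculation. The Python sketch below uses only the figures quoted in this section; the function names are hypothetical, and it assumes the Chirp-family V2 rates apply to chirp_3, as discussed in the caveat. It deliberately refuses to price STT volumes above the first tier, because the lower volume-tier rates are not quoted here.

```python
TTS_RATE_PER_MILLION_CHARS = 30.00    # Chirp 3: HD voices, after free tier
TTS_FREE_CHARS_PER_MONTH = 1_000_000  # monthly free allowance
STT_STANDARD_RATE_PER_MIN = 0.016     # first tier of standard recognition
STT_FIRST_TIER_LIMIT_MIN = 500_000    # tier boundary quoted above
STT_DYNAMIC_BATCH_RATE_PER_MIN = 0.003


def tts_monthly_cost(characters: int) -> float:
    """Estimated monthly Chirp 3: HD cost in USD at list price."""
    billable = max(0, characters - TTS_FREE_CHARS_PER_MONTH)
    return billable / 1_000_000 * TTS_RATE_PER_MILLION_CHARS


def stt_monthly_cost(minutes: float, dynamic_batch: bool = False) -> float:
    """Estimated monthly Speech-to-Text V2 cost in USD at list price."""
    if dynamic_batch:
        return minutes * STT_DYNAMIC_BATCH_RATE_PER_MIN
    if minutes > STT_FIRST_TIER_LIMIT_MIN:
        # Lower volume-tier rates apply beyond this point; not modeled here.
        raise ValueError("volume above the first pricing tier is not modeled")
    return minutes * STT_STANDARD_RATE_PER_MIN
```

As a worked example, 5M characters in a month leaves 4M billable after the free tier, or about $120 at list price; 10,000 minutes of standard recognition comes to about $160, against about $30 for the same audio through dynamic batch.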

How it stacks up against the field
Chirp 3 competes less as a single model than as a speech suite. The most relevant alternatives are Microsoft Azure AI Speech, AWS Polly plus Amazon Transcribe, and ElevenLabs. Google's strongest competitive arguments are integrated STT and TTS within Google Cloud, good multilingual breadth, fast custom voice creation, and solid enterprise tooling. Its weakest points are operational complexity, allowlist friction, documentation inconsistency, and the fact that its own next-generation TTS roadmap now points toward Gemini-TTS for the most advanced controllable voice work.
| Platform | STT | TTS | Custom voice / cloning | Real-time / streaming | Diarization / language ID | Public pricing signal | Analytical reading |
|---|---|---|---|---|---|---|---|
| Google Chirp 3 | Yes, via chirp_3 in Speech-to-Text V2 | Yes, via Chirp 3 HD | Yes, via Instant Custom Voice from ~10 seconds and consent flow | Yes for STT and TTS; Chirp 3 HD uniquely supports text streaming in Cloud TTS | Yes; diarization, language-agnostic transcription, denoiser, adaptation | TTS $30 / 1M chars; STT standard $0.016/min list tier; dynamic batch $0.003/min | Best fit for Google Cloud-native teams that want one vendor for speech generation and transcription. |
| Microsoft Azure AI Speech | Yes | Yes, including HD voices | Yes, via Custom Voice and Personal Voice | Yes; docs include real-time diarization quickstart and broad speech workflows | Yes; official docs highlight language detection, custom speech, diarization | Official pricing page shows per-second STT and per-character TTS billing plus a free tier, but exact post-free rates were not reliably visible in the HTML capture | Strong enterprise alternative, especially where Microsoft identity/compliance stack matters. |
| AWS Polly + Amazon Transcribe | Yes | Yes | TTS customization exists via lexicons and voice families; no instant 10-second clone captured in the source set | Yes; Polly returns audio streams, Transcribe supports streaming | Yes; Transcribe supports diarization and automatic language identification in relevant workflows | Polly Generative $30 / 1M chars; Transcribe Tier 1 $0.03/min in us-east-1 example | Strong for AWS-native teams; TTS price for Polly Generative roughly matches Chirp 3 HD, but STT list price in the cited example is higher. |
| ElevenLabs | Yes, Scribe v2 / v2 Realtime | Yes | Yes, voice cloning front and center | Yes; realtime STT marketed at ~150 ms latency | Yes; diarization, word-level timestamps, multilingual handling | Scribe v2 $0.22/hour; Scribe v2 Realtime $0.39/hour; TTS pricing is model- and plan-dependent | Best fit when pure voice experience and rapid productization matter more than hyperscaler platform consolidation. |
Two competitive findings deserve emphasis. In TTS quality, the benchmark snapshot used here does not place Chirp 3 HD at the top of the field: Artificial Analysis showed it trailing both Azure HD 2.5 and Eleven v3 on the selected-voice naturalness leaderboard. And Google itself is nudging advanced TTS users toward Gemini-TTS, which offers prompt-based control, multi-speaker generation, and similar voice options to Chirp 3 HD. Chirp 3 is still important, but its nearest existential competitor may be Google's own next API generation rather than another vendor.
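To put the Elo gaps in perspective, the leaderboard snapshot reported Chirp 3: HD at 1,056, Azure HD 2.5 at 1,127, and Eleven v3 at 1,179. The Python sketch below converts a rating gap into an expected head-to-head preference rate using the standard logistic Elo formula with a 400-point scale. That formula is an assumption: the sources used here do not describe how Artificial Analysis computes or scales its ratings, so the output is an order-of-magnitude reading, not a published statistic.

```python
def elo_expected_score(rating_a: float, rating_b: float,
                       scale: float = 400.0) -> float:
    """Expected share of pairwise comparisons won by A over B under the
    standard Elo model (assumed, not confirmed for this leaderboard)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings from the leaderboard snapshot cited in this report.
CHIRP_3_HD = 1056
AZURE_HD_2_5 = 1127
ELEVEN_V3 = 1179

vs_azure = elo_expected_score(CHIRP_3_HD, AZURE_HD_2_5)
vs_eleven = elo_expected_score(CHIRP_3_HD, ELEVEN_V3)
```

Under that assumption, Chirp 3: HD would be preferred in roughly 40% of pairwise comparisons against Azure HD 2.5 and roughly 33% against Eleven v3: a real gap, but one where Chirp 3: HD still wins a substantial share of listener votes.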
The practical decision rule looks like this. If the goal is high-volume multilingual transcription inside Google Cloud, Chirp 3 remains compelling. If the goal is maximum expressive TTS control or multi-speaker dialogue synthesis, the analysis shifts: ElevenLabs and Gemini-TTS often look stronger at the frontier, while Azure AI Speech remains the broadest enterprise speech-suite peer to compare against.
Sourcing notes and open questions
All cited sources were accessed on June 15, 2026.
The most important primary sources were Google Cloud's Chirp 3 HD voices, Instant Custom Voice, and Chirp 3 Transcription documentation, the Text-to-Speech and Speech-to-Text release notes, the pricing pages, and the Gemini-TTS documentation. The strongest secondary and independent sources were Artificial Analysis for benchmark context, Google Cloud customer case studies (Il Foglio, Adya, HBX Group) for real-world usage, G2 for end-user sentiment, and Google Developer forums for operational issues and bug reports. The ambiguity check relied on official Google documentation and Deeper's CHIRP+ 3 materials.
Some gaps remain. Google does not publicly expose a Chirp 3 architectural model card comparable to the 2023 Chirp write-up, so parameter count and training-corpus detail for Chirp 3 are not available in the current source set. Public pricing visibility for Instant Custom Voice and some Azure post-free consumption rates was incomplete in the captured HTML. And benchmark coverage for speech models is moving fast enough that any leaderboard statement should be read as a time-stamped snapshot, not a permanent ranking.
Bottom line: if you meant Google Cloud Chirp 3, it is a high-capability, enterprise-grade speech family with a strong 2025 rollout, very good multilingual breadth, compelling custom-voice mechanics, and solid real-world traction. Its biggest risks are operational rough edges and product overlap with newer Google voice offerings. If you instead meant Deeper CHIRP+ 3, that is a separate hardware fish-finding product and would need a different report.
Sources
- Google Cloud, Chirp 3: Transcription model docs. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
- Google Cloud Text-to-Speech release notes. https://docs.cloud.google.com/text-to-speech/docs/release-notes
- Artificial Analysis, streaming speech-to-text benchmark (AA-WER Streaming). https://artificialanalysis.ai/articles/new-streaming-speech-to-text-benchmark-aa-wer-streaming
- Google Cloud, Gemini-TTS documentation. https://docs.cloud.google.com/text-to-speech/docs/gemini-tts
- Deeper Smart Sonar CHIRP+ 3 product page. https://deepersonar.com/en-all/products/deeper-chirp-3
- Google Cloud blog, "Bringing the power of large models to Google Cloud's Speech API" (2023 Chirp launch). https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api
- Google Cloud, Chirp 3: HD voices documentation. https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd
- Google Cloud, Chirp 3: Instant Custom Voice documentation. https://docs.cloud.google.com/text-to-speech/docs/chirp3-instant-custom-voice
- Google Cloud, text streaming audio synthesis documentation. https://docs.cloud.google.com/text-to-speech/docs/create-audio-text-streaming
- Google Cloud, supported voices and types list. https://docs.cloud.google.com/text-to-speech/docs/list-voices-and-types
- Google Cloud blog, "Expanding generative media for enterprise on Vertex AI." https://cloud.google.com/blog/products/ai-machine-learning/expanding-generative-media-for-enterprise-on-vertex-ai
- Google Cloud Speech-to-Text release notes. https://docs.cloud.google.com/speech-to-text/docs/release-notes
- Artificial Analysis, text-to-speech selected-voice leaderboard. https://artificialanalysis.ai/text-to-speech/leaderboard/selected-voice
- Google Cloud customer story: Il Foglio. https://cloud.google.com/customers/il-foglio
- Google Cloud customer story: Adya. https://cloud.google.com/customers/adya-ai
- Google Cloud customer story: HBX Group. https://cloud.google.com/customers/hbx-group
- G2 reviews, Google Cloud Speech-to-Text. https://www.g2.com/products/google-cloud-speech-to-text/reviews
- Google Developer forums, "Google Text to Speech only showing Chirp voices." https://discuss.google.dev/t/google-text-to-speech-only-showing-chirp-voices/184456
- Google Developer forums, "Severe latency regression with Chirp 3 HD long audio." https://discuss.google.dev/t/severe-latency-regression-with-chirp-3-hd-long-audio/262049
- Google Developer forums, "Incorrect pronunciation of French contractions with Chirp 3 HD voices." https://discuss.google.dev/t/incorrect-pronunciation-of-french-contractions-with-chirp-3-hd-voices/271804
- Google Cloud Text-to-Speech pricing. https://cloud.google.com/text-to-speech/pricing
- Google Cloud Speech-to-Text pricing. https://cloud.google.com/speech-to-text/pricing
- Microsoft Learn, Azure AI Speech speech-to-text documentation. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
- AWS Polly pricing. https://aws.amazon.com/polly/pricing/
- ElevenLabs API pricing. https://elevenlabs.io/pricing/api