Chirp 3: inside Google Cloud's 2025 speech stack, from HD voices to transcription

Ask around about "Chirp 3" and you get two very different answers. One is a castable fish-finder. The other is Google Cloud's current speech model family, and that second one is what this report covers. Google uses the exact name in official documentation for three related cloud services: Chirp 3: HD voices for text-to-speech, Chirp 3: Instant Custom Voice for voice cloning, and Chirp 3: Transcription for speech-to-text. The hardware near-match, the Deeper Smart Sonar CHIRP+ 3, is officially branded with a plus sign and lives in fishing and sonar contexts, not speech AI. The exact naming, the breadth of primary documentation, and the current ecosystem evidence all point to the Google Cloud interpretation as the strongest fit.

The family launched in phases rather than on a single date. Chirp 3: HD voices reached GA on April 2, 2025, following a rollout expansion on March 6, 2025. Instant Custom Voice was announced as GA through an allowlist on April 9, 2025. On the recognition side, Chirp 3: Transcription entered Private Preview on April 11, 2025, Public Preview on August 29, 2025, and GA on October 13, 2025.

One framing matters before anything else: Chirp 3 is a cloud-managed speech stack, not a firmware-based device. There is no public firmware image and no consumer software versioning in the usual device sense. Capability changes show up through release notes, region rollouts, model identifiers, and documentation updates. Its strongest differentiators are multilingual speech recognition, native diarization and built-in denoising for STT, streaming HD TTS, and fast custom voice creation from short audio samples. Its main frictions are allowlist gating for custom voice, documentation drift across pages, regional and data-residency caveats, and several user-reported quality or latency issues in specific locales and workflows.

Third-party evidence puts Chirp 3 in the competitive tier, not the dominant one. Artificial Analysis' streaming STT benchmark found that Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while noting that no single model led across all tested datasets. In TTS, the same firm's selected-voice leaderboard placed Chirp 3: HD at Elo 1,056 and $30 per 1M characters, below Azure HD 2.5 and Eleven v3 on its naturalness ranking snapshot. Google's own customer case studies show real adoption, including Il Foglio choosing Chirp 3 for natural Italian editorial audio and HBX Group using it for more natural voice-channel customer experiences.

There is also a strategic wrinkle. Chirp 3 is a serious enterprise speech stack, especially for teams already on Google Cloud, but it is no longer the endpoint of Google's voice roadmap. Google now describes Gemini-TTS as the latest evolution of Cloud TTS, with broader prompt-based control and native multi-speaker options while reusing voice identities similar to Chirp 3 HD. For greenfield TTS projects the useful question is no longer "Chirp 3 or not?" but "Chirp 3 vs. Gemini-TTS vs. external speech suites."

Which Chirp 3 do you mean?

The name is ambiguous enough that a careful report should not assume a single product without checking. The two strongest matches:

Possible match	Why it fits	Why it is less likely here
Google Cloud Chirp 3	Exact product name appears in official Google Cloud docs for STT and TTS. Google uses "Chirp 3" as a named speech model family.	None of significance; this is the best exact-name match.
Deeper Smart Sonar CHIRP+ 3	Prominent commercial hardware product; web search often surfaces it when users mean a device.	Official product name is CHIRP+ 3, not plain "Chirp 3," and its category is castable fish-finder sonar, not software or speech AI.

The Google Cloud reading wins for three reasons: it matches the broad framing of a product that could be a device, software, or something else; Google uses the exact term in official documentation; and the dimensions worth researching (release notes, model variants, support channels, pricing, benchmarks, update history) fit a cloud speech platform unusually well. The Deeper product stays on the table as the most important alternate interpretation if a hardware device was actually meant.

What the family actually contains

The Chirp lineage predates Chirp 3. Google introduced Chirp as a speech foundation model in 2023 and described that earlier generation as a 2B-parameter speech model delivering 98% English speech recognition accuracy and large relative gains in some low-resource languages. Current Chirp 3 materials read differently: they emphasize product capabilities, rollout stages, and API behavior rather than architectural disclosures such as parameter count. That shift matters. The Chirp 3 product story is practical and API-centric, not model-card transparent.

Variant	Official identity	Model ID or naming	Core function	Current technical snapshot
Chirp 3: HD voices	Google Cloud Text-to-Speech	Voice names such as en-US-Chirp3-HD-Charon	High-fidelity TTS for real-time and batch synthesis	Current dedicated docs list 30 named voices, 53 supported languages/locales, GA endpoints in global, us, eu, asia-southeast1, europe-west2, and asia-northeast1, streaming and batch output formats, and text streaming support. Launch GA milestone was 8 speakers / 31 locales on April 2, 2025.
Chirp 3: Instant Custom Voice	Google Cloud Text-to-Speech	Voice cloning key generated per project/request	Fast voice cloning for custom branded or personal voices	Restricted to allowlisted users; supports streaming and batch synthesis, supports LINEAR16, PCM, MP3, and M4A input encodings, pace control from 0.25x to 2x, experimental pause tags and custom pronunciations, and multilingual transfer from en-US to six listed locales.
Chirp 3: Transcription	Google Cloud Speech-to-Text V2	chirp_3	Multilingual automatic speech recognition	Available only in Speech-to-Text V2; supports StreamingRecognize, Recognize, and BatchRecognize; documentation lists 29 GA transcription locales, many additional Preview locales, 14 diarization locales, built-in denoiser, language-agnostic transcription, and speech adaptation.
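
To make the voice-naming convention concrete, here is a minimal sketch that splits a Chirp 3: HD voice name into locale and speaker and builds a JSON body for the Cloud TTS `text:synthesize` REST method. The regular expression is inferred from the single example name in the table and is an assumption; locales that do not follow a `xx-XX` shape would need a looser pattern. The helper names are hypothetical, and nothing is sent over the network.

```python
import json
import re

# Pattern inferred from the example "en-US-Chirp3-HD-Charon" (an assumption,
# not a documented grammar): <language>-<REGION>-Chirp3-HD-<Speaker>.
VOICE_NAME = re.compile(r"^(?P<locale>[a-z]{2,3}-[A-Z]{2})-Chirp3-HD-(?P<speaker>\w+)$")


def parse_voice_name(name: str) -> dict:
    """Split a Chirp 3: HD voice name into its locale and speaker parts."""
    match = VOICE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a Chirp 3 HD voice name: {name!r}")
    return match.groupdict()


def build_synthesize_request(text: str, voice_name: str, encoding: str = "MP3") -> dict:
    """Build a request body for text:synthesize; the caller still has to send it."""
    locale = parse_voice_name(voice_name)["locale"]
    return {
        "input": {"text": text},
        "voice": {"languageCode": locale, "name": voice_name},
        "audioConfig": {"audioEncoding": encoding},
    }


body = build_synthesize_request("Hello from Chirp 3.", "en-US-Chirp3-HD-Charon")
print(json.dumps(body, indent=2))
```

Deriving `languageCode` from the voice name keeps the two fields from drifting apart, which is an easy mistake when voices are swapped by configuration.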

A few technical details carry outsized weight in practice. For TTS, text streaming is exclusive to Chirp 3 HD voices in Google's Cloud TTS stack, which matters if you are building low-latency voice agents. For custom voice, Google requires a spoken consent statement, recommends clean 10-second recordings, and permits 10 new keys per minute per project with no stated absolute limit on total keys; the resulting voice-cloning key is stored client-side, so keeping it safe is the integrator's job. For STT, Chirp 3 supports speaker diarization, automatic punctuation, automatic capitalization, speech adaptation with up to 1,000 phrases, a custom prompt feature in Preview, and a built-in denoiser that can reduce music, rain, and street noise but not background human voices.
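
The numeric limits quoted in this section (pace from 0.25x to 2x, up to 1,000 adaptation phrases, 10 new voice-cloning keys per minute per project) are easy to enforce before a request ever leaves the client. The sketch below is one way to do that; every name in it is hypothetical, and the sliding-window throttle is a client-side convenience, not a description of how Google enforces its quota.

```python
import time
from collections import deque

PACE_RANGE = (0.25, 2.0)        # pace bounds quoted for Instant Custom Voice
MAX_ADAPTATION_PHRASES = 1000   # speech adaptation cap quoted for Chirp 3 STT
KEYS_PER_MINUTE = 10            # new voice-cloning keys per minute per project


def check_pace(pace: float) -> float:
    """Return the pace unchanged, or raise if it is outside the quoted range."""
    low, high = PACE_RANGE
    if not low <= pace <= high:
        raise ValueError(f"pace {pace} is outside {low}x to {high}x")
    return pace


def check_phrases(phrases: list[str]) -> list[str]:
    """Return the phrase list unchanged, or raise if it exceeds the quoted cap."""
    if len(phrases) > MAX_ADAPTATION_PHRASES:
        raise ValueError(f"{len(phrases)} phrases exceeds {MAX_ADAPTATION_PHRASES}")
    return phrases


class KeyCreationThrottle:
    """Sliding window allowing at most `limit` acquisitions per `window` seconds."""

    def __init__(self, limit=KEYS_PER_MINUTE, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.stamps = deque()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True
```

The injectable `clock` exists so the throttle can be tested without sleeping; production code would leave the default in place and call `try_acquire()` before each key-creation request.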

Then there is documentation drift, which is subtle but bites integrators. Google's generic "supported voices and languages" page still says Chirp 3 HD doesn't support SSML input or certain pitch and rate controls, and that it is available only on global, eu, and us. The later dedicated Chirp 3 HD docs list more endpoints, and the release notes say limited SSML support was added on October 17, 2025. The dedicated product page and release notes are more current than the generic voice-list page, so when the docs disagree, treat the release notes as the source of truth.

Release history and official support

Because Chirp 3 is a managed cloud service, its "version history" is really a mix of model identifiers, feature rollouts, region rollouts, and service release notes. There is no public firmware track and no downloadable desktop or mobile package attached to the product itself.

The official history, compressed into one view:

Date	Milestone	What changed
Feb 10, 2025	Pre-launch rename	Journey voices were rebranded as Chirp HD voices.
Mar 6, 2025	TTS rollout expansion	Chirp 3 HD added 8 speakers in 31 locales.
Apr 2, 2025	TTS GA	Chirp 3 HD voices became GA with 8 speakers / 31 locales, real-time streaming, batch support, and supported regional endpoints.
Apr 9, 2025	Instant Custom Voice GA announcement	Google announced Instant Custom Voice as GA through an allowlist and also announced transcription with diarization in preview/allowlist.
Apr 11, 2025	STT Private Preview	chirp_3 launched in private preview for Speech-to-Text V2.
May 7, 2025	TTS controls expansion	Pace control, pause control, and custom pronunciations were released for Chirp 3 HD voices.
Jun 18, 2025	ICV locale expansion	Instant Custom Voice added ja-JP, pushing support to more than 30 locales.
Aug 21-27, 2025	ICV and TTS endpoint upgrades	Instant Custom Voice added PCM, MP3, and M4A input encodings; Chirp 3 HD became available on europe-west2.
Aug 29, 2025	STT Public Preview	chirp_3 public preview launched with 85+ languages/locales in preview coverage and messaging about improved speed and accuracy.
Sep 15, 2025	TTS endpoint expansion	Chirp 3 HD became available on asia-northeast1.
Oct 13, 2025	STT GA	Chirp 3: Transcription reached GA in Speech-to-Text V2.
Oct 17, 2025	Limited SSML support	Chirp 3 HD added support for <phoneme>, <p>, <s>, <sub>, and <say-as>.
Nov-Dec 2025	Regional and language expansion	STT preview regions expanded; TTS added a wide set of European languages, then Punjabi and Cantonese in preview.
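
Because only a handful of SSML tags are supported, it can help to check markup before sending it. The following is a rough sketch that flags any element outside the five tags named in the October 17, 2025 entry. Treating `<speak>` as an allowed root is an assumption based on standard SSML, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Tags named in the October 17, 2025 release note for Chirp 3 HD.
SUPPORTED_TAGS = {"phoneme", "p", "s", "sub", "say-as"}
# Assumption: the usual SSML <speak> wrapper is accepted as the root element.
ROOT_TAG = "speak"


def unsupported_tags(ssml: str) -> set[str]:
    """Return the element names in `ssml` that are not on the supported list."""
    root = ET.fromstring(ssml)
    found = {element.tag for element in root.iter()}
    return found - SUPPORTED_TAGS - {ROOT_TAG}


ok = '<speak><p><s>Call <say-as interpret-as="characters">ABC</say-as> now.</s></p></speak>'
bad = '<speak>Wait <break time="1s"/> then <prosody rate="slow">go</prosody>.</speak>'

print(sorted(unsupported_tags(ok)))   # []
print(sorted(unsupported_tags(bad)))  # ['break', 'prosody']
```

A check like this only catches tag names. It says nothing about which attributes or attribute values the service accepts, so it complements the release notes without replacing them.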

Official support paths are unusually strong for an API product. Google provides reference documentation, release notes with RSS support, Vertex AI Studio and console entry points, Colab and GitHub notebooks, community forums, Cloud support, system status, and sales-led access for allowlisted features like Instant Custom Voice. The support footprint is enterprise grade, but some access paths, especially for custom voice, still feel sales-driven rather than self-serve.

Benchmarks, adoption, and what users actually say

Google's public performance storytelling for Chirp 3 is noticeably lighter than it was for the original 2023 Chirp launch. Back then, Google published concrete claims: 98% English recognition accuracy and 300% relative improvement in some low-resource languages. Current Chirp 3 launch materials emphasize feature breadth, speed improvements, diarization, language detection, and voice realism, but there is no public benchmark sheet of equal granularity. That does not mean Chirp 3 is weak. It means the public evidence base leans on product marketing and case studies rather than benchmark transparency.

Evidence area	What the evidence says	Interpretation
Independent STT benchmark	Artificial Analysis reported that in its streaming benchmark, Google's Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while also noting that no single model leads everywhere.	Chirp 3 looks strong in real-time multilingual settings, but not categorically dominant across all datasets or latency conditions.
Independent TTS benchmark	Artificial Analysis' selected-voice leaderboard snapshot showed Chirp 3: HD at Elo 1,056 and $30.0 / 1M characters, below Azure HD 2.5 at Elo 1,127 and Eleven v3 at Elo 1,179.	Chirp 3 HD is competitive, but the benchmark snapshot does not place it at the very top of TTS naturalness.
Official real-world media use	Il Foglio said Chirp 3 HD offered the most natural Italian intonation among tested options, turned editorials into audio in minutes, and helped the paper reach the top three of its podcast offerings.	Strong evidence that Chirp 3 performs well in editorial long-form audio, especially when language-specific naturalness matters.
Official enterprise localization use	Adya reported localization across 20+ Indian languages with low latency using Chirp 3.	Suggests practical multilingual deployment, especially in enterprise localization.
Official contact-center use	HBX Group said Chirp 3 voices created a more natural, less robotic caller experience.	Supports Google's positioning in customer-experience voice channels.
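
For readers less familiar with the metric behind the 2.2% figure, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a plain textbook implementation. It does not reproduce Artificial Analysis' text normalization or scoring pipeline, which this report did not capture.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

On this definition, a 2.2% WER corresponds to roughly 22 word errors per 1,000 reference words. Casing and punctuation handling can move the number noticeably, which is one reason WER figures from different benchmarks are hard to compare directly.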

User sentiment is directionally positive without being glowing. On G2, users reviewing Google Cloud Speech-to-Text consistently praise ease of use, speed, and meeting-transcription productivity, while recurring negatives include cost sensitivity and the need for manual correction when accuracy is not perfect. For Google Cloud Text-to-Speech, review summaries emphasize natural voice quality and simple API integration, but some users still describe the output as robotic in certain scenarios or languages, and some reviewers complain about pricing opacity or cost escalation.

The more technical sentiment in Google's own forums is mixed. Early enthusiasm centered on new voice quality, then posts started surfacing UI confusion, long-audio latency regressions, markup and SSML limitations, locale-specific pronunciation bugs, and allowlist access friction. That pattern is a familiar enterprise-AI trajectory: high capability, uneven operational maturity.

Known issues and their current status

Chirp 3's most common issues are not fatal flaws so much as integration and maturity problems. Some were clearly fixed in later releases, some remain workflow constraints, and some are documentation or transparency issues rather than core model failures.

Issue or limitation	Evidence	Current status or best-known fix
Instant Custom Voice is not self-serve	Google's docs state that access is restricted to allowlisted users, and forum users continued asking for a simpler access flow in 2026.	Still gated. Operationally, the "fix" is to work through sales/allowlisting.
SSML / markup support was incomplete at launch	Generic docs originally said SSML input was unsupported for Chirp 3 HD, while release notes later added limited SSML support in October 2025.	Partially improved, but only selected tags are supported.
Console UI confusion about missing legacy voices	Google staff clarified that only Chirp voices were intentionally shown in the Cloud Console UI even though other TTS voices still existed through the API.	Workaround: use the API rather than relying on the narrowed UI.
Long-audio performance regression	A September 2025 forum thread reported long-audio jobs stalling or taking much longer than before.	Public forum evidence shows a real complaint, but no official public postmortem or universal fix was captured in the source set.
Locale-specific pronunciation failures	Users reported mispronunciation of French contractions and related markup/pronunciation issues.	Workarounds include phoneme tags and pronunciation controls, where supported.
STT diarization and language coverage are not universal	Chirp 3 STT has broad language coverage, but diarization is listed for only a smaller subset of languages, and the model's regional rollouts occurred gradually.	Fix is usually architectural rather than toggled: choose supported language/region combinations and the right recognition method.
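
As an illustration of the last row, the sketch below builds a Speech-to-Text V2 recognize request body for `chirp_3` and requests diarization only when every requested locale is on a caller-supplied list. The field names follow the V2 `RecognitionConfig` shape as best understood here and should be checked against the current reference before use, `denoiserConfig` in particular. The function name and the one-locale example set are hypothetical placeholders, not the documented list of 14 diarization locales.

```python
import base64


def build_recognize_request(audio: bytes, language_codes: list[str],
                            diarization_locales: set[str],
                            denoise: bool = True) -> dict:
    """Build a recognize request body; nothing is sent over the network."""
    features = {"enableAutomaticPunctuation": True}
    # Ask for diarization only if all requested locales are known to support it.
    if language_codes and all(code in diarization_locales for code in language_codes):
        features["diarizationConfig"] = {}
    config = {
        "model": "chirp_3",
        "languageCodes": language_codes,
        "autoDecodingConfig": {},
        "features": features,
    }
    if denoise:
        # Assumed field name for the built-in denoiser; verify before relying on it.
        config["denoiserConfig"] = {"denoiseAudio": True}
    return {"config": config, "content": base64.b64encode(audio).decode("ascii")}


# Placeholder set: fill this from the diarization locale list in the model docs.
EXAMPLE_DIARIZATION_LOCALES = {"en-US"}
request = build_recognize_request(b"\x00\x01", ["en-US"], EXAMPLE_DIARIZATION_LOCALES)
```

Gating the feature in the request builder makes the language and region constraint explicit in code instead of leaving it to surface as a runtime API error.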

Pricing and availability

Chirp 3 is sold as a Google Cloud API service, not a boxed product, so there are no retailers in the traditional sense. Consumption happens through the Cloud TTS API, the Speech-to-Text V2 API, Vertex AI Studio, and Google's console and notebook ecosystem. Region support differs by sub-product: Chirp 3 HD currently lists six GA endpoints, while Chirp 3 Transcription documents GA in us and eu multi-regions plus release-note preview expansions into additional regions in late 2025. Instant Custom Voice also lists region availability beyond the initial TTS endpoints.
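
Since Chirp 3 HD is only served from specific endpoints, a deployment can fail fast on an unsupported location. The sketch below hard-codes the six GA endpoints listed earlier in this report and maps each to an API hostname. The `<location>-texttospeech.googleapis.com` pattern is an assumption based on how Google Cloud regional endpoints are usually named, so confirm it against the endpoint documentation; the function name is hypothetical.

```python
# The six GA endpoints this report lists for Chirp 3: HD voices.
CHIRP3_HD_ENDPOINTS = (
    "global", "us", "eu", "asia-southeast1", "europe-west2", "asia-northeast1",
)


def tts_api_host(location: str) -> str:
    """Return the assumed API hostname for a Chirp 3 HD endpoint location."""
    if location not in CHIRP3_HD_ENDPOINTS:
        raise ValueError(f"{location!r} is not a listed Chirp 3 HD endpoint")
    if location == "global":
        return "texttospeech.googleapis.com"
    # Assumed naming pattern for regional and multi-regional endpoints.
    return f"{location}-texttospeech.googleapis.com"


for location in CHIRP3_HD_ENDPOINTS:
    print(location, "->", tts_api_host(location))
```

Keeping the allowed list in one place also gives a single line to update when a release note adds a region.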

Chirp 3 commercial surface	Publicly visible pricing	Availability notes
Chirp 3: HD voices	$30 per 1M characters after the free tier; 1M characters free monthly on the pricing page.	GA; available via Cloud TTS and Vertex AI Studio.
Chirp 3: Instant Custom Voice	Public pricing was not fully surfaced in the captured pricing excerpt; the feature is allowlisted.	Restricted access; requires sales/allowlist.
Chirp 3: Transcription	Google's Speech-to-Text V2 page lists standard recognition at $0.016/min up to 500k min, then lower volume-tier rates, and dynamic batch at $0.003/min; the public pricing page still labels the included V2 speech model family as "chirp" rather than explicitly chirp_3.	Available only in Speech-to-Text V2; GA and preview regions differ.

One caveat worth flagging: Google's public pricing nomenclature lags the model nomenclature. The STT pricing page references "chirp (Speech-to-Text V2 only)" while the current model docs and release notes use chirp_3. The most reasonable reading is that Chirp-family Speech-to-Text V2 pricing applies, but the public pricing page is not as current or precise as the model documentation.
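
For budgeting, the list prices above reduce to simple arithmetic. The sketch below applies them as quoted in this report: $30 per 1M characters after 1M free characters per month for Chirp 3: HD, $0.016 per minute for standard Speech-to-Text V2 recognition up to 500k minutes, and $0.003 per minute for dynamic batch. It assumes, as discussed, that Chirp-family V2 pricing applies to `chirp_3`, and it deliberately refuses to guess the lower volume-tier rates, which were not captured.

```python
TTS_USD_PER_MILLION_CHARS = 30.0
TTS_FREE_CHARS_PER_MONTH = 1_000_000
STT_STANDARD_USD_PER_MIN = 0.016      # list rate for the first tier
STT_FIRST_TIER_MINUTES = 500_000
STT_DYNAMIC_BATCH_USD_PER_MIN = 0.003


def tts_monthly_cost(chars: int) -> float:
    """Chirp 3: HD cost in USD for one month of synthesized characters."""
    billable = max(0, chars - TTS_FREE_CHARS_PER_MONTH)
    return billable / 1_000_000 * TTS_USD_PER_MILLION_CHARS


def stt_monthly_cost(minutes: float, dynamic_batch: bool = False) -> float:
    """Speech-to-Text V2 cost in USD for one month of audio minutes."""
    if dynamic_batch:
        return minutes * STT_DYNAMIC_BATCH_USD_PER_MIN
    if minutes > STT_FIRST_TIER_MINUTES:
        raise ValueError("volume-tier rates beyond 500k minutes are not modeled here")
    return minutes * STT_STANDARD_USD_PER_MIN


print(f"TTS, 5M chars: ${tts_monthly_cost(5_000_000):.2f}")             # $120.00
print(f"STT, 10k min standard: ${stt_monthly_cost(10_000):.2f}")        # $160.00
print(f"STT, 10k min dynamic batch: ${stt_monthly_cost(10_000, True):.2f}")  # $30.00
```

The roughly fivefold gap between standard and dynamic batch recognition is the main lever for workloads that can tolerate delayed results.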

How it stacks up against the field

Chirp 3 competes less as a single model than as a speech suite. The most relevant alternatives are Microsoft Azure AI Speech, AWS Polly plus Amazon Transcribe, and ElevenLabs. Google's strongest competitive arguments are integrated STT and TTS within Google Cloud, good multilingual breadth, fast custom voice creation, and solid enterprise tooling. Its weakest points are operational complexity, allowlist friction, documentation inconsistency, and the fact that its own next-generation TTS roadmap now points toward Gemini-TTS for the most advanced controllable voice work.

Platform	STT	TTS	Custom voice / cloning	Real-time / streaming	Diarization / language ID	Public pricing signal	Analytical reading
Google Chirp 3	Yes, via chirp_3 in Speech-to-Text V2	Yes, via Chirp 3 HD	Yes, via Instant Custom Voice from ~10 seconds and consent flow	Yes for STT and TTS; Chirp 3 HD uniquely supports text streaming in Cloud TTS	Yes; diarization, language-agnostic transcription, denoiser, adaptation	TTS $30 / 1M chars; STT standard $0.016/min list tier; dynamic batch $0.003/min	Best fit for Google Cloud-native teams that want one vendor for speech generation and transcription.
Microsoft Azure AI Speech	Yes	Yes, including HD voices	Yes, via Custom Voice and Personal Voice	Yes; docs include real-time diarization quickstart and broad speech workflows	Yes; official docs highlight language detection, custom speech, diarization	Official pricing page shows per-second STT and per-character TTS billing plus a free tier, but exact post-free rates were not reliably visible in the HTML capture	Strong enterprise alternative, especially where Microsoft identity/compliance stack matters.
AWS Polly + Amazon Transcribe	Yes	Yes	TTS customization exists via lexicons and voice families; no instant 10-second clone captured in the source set	Yes; Polly returns audio streams, Transcribe supports streaming	Yes; Transcribe supports diarization and automatic language identification in relevant workflows	Polly Generative $30 / 1M chars; Transcribe Tier 1 $0.03/min in us-east-1 example	Strong for AWS-native teams; TTS price for Polly Generative roughly matches Chirp 3 HD, but STT list price in the cited example is higher.
ElevenLabs	Yes, Scribe v2 / v2 Realtime	Yes	Yes, voice cloning front and center	Yes; realtime STT marketed at ~150 ms latency	Yes; diarization, word-level timestamps, multilingual handling	Scribe v2 $0.22/hour; Scribe v2 Realtime $0.39/hour; TTS pricing is model- and plan-dependent	Best fit when pure voice experience and rapid productization matter more than hyperscaler platform consolidation.
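
The STT prices in the table mix per-minute and per-hour units, so a quick normalization helps when reading across rows. The sketch below converts the figures quoted above to USD per audio hour. These are list prices only: the rows are not like-for-like (batch versus realtime, first-tier versus flat rates), so the ordering is a reading aid, not a procurement verdict.

```python
# (price, unit) pairs copied from the comparison table above.
STT_LIST_PRICES = {
    "Google Chirp 3 (standard)": (0.016, "min"),
    "Google Chirp 3 (dynamic batch)": (0.003, "min"),
    "Amazon Transcribe (Tier 1, us-east-1)": (0.03, "min"),
    "ElevenLabs Scribe v2": (0.22, "hour"),
    "ElevenLabs Scribe v2 Realtime": (0.39, "hour"),
}


def usd_per_hour(price: float, unit: str) -> float:
    """Convert a per-minute or per-hour price to USD per audio hour."""
    if unit == "hour":
        return price
    if unit == "min":
        return price * 60
    raise ValueError(f"unknown unit: {unit!r}")


hourly = {name: usd_per_hour(*quote) for name, quote in STT_LIST_PRICES.items()}
for name, price in sorted(hourly.items(), key=lambda item: item[1]):
    print(f"{name}: ${price:.2f}/hour")
```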

Two competitive findings deserve emphasis. In TTS quality, the benchmark snapshot used here does not place Chirp 3 HD at the top of the field: Artificial Analysis showed it trailing both Azure HD 2.5 and Eleven v3 on the selected-voice naturalness leaderboard. And Google itself is nudging advanced TTS users toward Gemini-TTS, which offers prompt-based control, multi-speaker generation, and similar voice options to Chirp 3 HD. Chirp 3 is still important, but its nearest existential competitor may be Google's own next API generation rather than another vendor.
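
To give the Elo gaps from the leaderboard snapshot some intuition (1,056 for Chirp 3: HD, 1,127 for Azure HD 2.5, 1,179 for Eleven v3), the sketch below applies the conventional Elo expected-score formula. It assumes the standard logistic curve with a 400-point scale. The leaderboard's exact rating method was not captured for this report, so treat the results as a rough reading, not as the benchmark's own statistic.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head preferences won by A under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


CHIRP3_HD, AZURE_HD_25, ELEVEN_V3 = 1056, 1127, 1179

print(f"vs Azure HD 2.5: {elo_expected_score(CHIRP3_HD, AZURE_HD_25):.2f}")
print(f"vs Eleven v3:    {elo_expected_score(CHIRP3_HD, ELEVEN_V3):.2f}")
```

Under that assumption, Chirp 3: HD would be preferred in roughly 40% of pairings against Azure HD 2.5 and about a third against Eleven v3. That is a real gap, but it is far from a shutout.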

The practical decision rule looks like this. If the goal is high-volume multilingual transcription inside Google Cloud, Chirp 3 remains compelling. If the goal is maximum expressive TTS control or multi-speaker dialogue synthesis, the analysis shifts: ElevenLabs and Gemini-TTS often look stronger at the frontier, while Azure AI Speech remains the broadest enterprise speech-suite peer to compare against.

Sourcing notes and open questions

All cited sources were accessed on June 15, 2026.

The most important primary sources were Google Cloud's Chirp 3 HD voices, Instant Custom Voice, and Chirp 3 Transcription documentation, the Text-to-Speech and Speech-to-Text release notes, the pricing pages, and the Gemini-TTS documentation. The strongest secondary and independent sources were Artificial Analysis for benchmark context, Google Cloud customer case studies (Il Foglio, Adya, HBX Group) for real-world usage, G2 for end-user sentiment, and Google Developer forums for operational issues and bug reports. The ambiguity check relied on official Google documentation and Deeper's CHIRP+ 3 materials.

Some gaps remain. Google does not publicly expose a Chirp 3 architectural model card comparable to the 2023 Chirp write-up, so parameter count and training-corpus detail for Chirp 3 are not available in the current source set. Public pricing visibility for Instant Custom Voice and some Azure post-free consumption rates was incomplete in the captured HTML. And benchmark coverage for speech models is moving fast enough that any leaderboard statement should be read as a time-stamped snapshot, not a permanent ranking.

Bottom line: if you meant Google Cloud Chirp 3, it is a high-capability, enterprise-grade speech family with a strong 2025 rollout, very good multilingual breadth, compelling custom-voice mechanics, and solid real-world traction. Its biggest risks are operational rough edges and product overlap with newer Google voice offerings. If you instead meant Deeper CHIRP+ 3, that is a separate hardware fish-finding product and would need a different report.

Sources

Google Cloud, Chirp 3: Transcription model docs. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
Google Cloud Text-to-Speech release notes. https://docs.cloud.google.com/text-to-speech/docs/release-notes
Artificial Analysis, streaming speech-to-text benchmark (AA-WER Streaming). https://artificialanalysis.ai/articles/new-streaming-speech-to-text-benchmark-aa-wer-streaming
Google Cloud, Gemini-TTS documentation. https://docs.cloud.google.com/text-to-speech/docs/gemini-tts
Deeper Smart Sonar CHIRP+ 3 product page. https://deepersonar.com/en-all/products/deeper-chirp-3
Google Cloud blog, "Bringing the power of large models to Google Cloud's Speech API" (2023 Chirp launch). https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api
Google Cloud, Chirp 3: HD voices documentation. https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd
Google Cloud, Chirp 3: Instant Custom Voice documentation. https://docs.cloud.google.com/text-to-speech/docs/chirp3-instant-custom-voice
Google Cloud, text streaming audio synthesis documentation. https://docs.cloud.google.com/text-to-speech/docs/create-audio-text-streaming
Google Cloud, supported voices and types list. https://docs.cloud.google.com/text-to-speech/docs/list-voices-and-types
Google Cloud blog, "Expanding generative media for enterprise on Vertex AI." https://cloud.google.com/blog/products/ai-machine-learning/expanding-generative-media-for-enterprise-on-vertex-ai
Google Cloud Speech-to-Text release notes. https://docs.cloud.google.com/speech-to-text/docs/release-notes
Artificial Analysis, text-to-speech selected-voice leaderboard. https://artificialanalysis.ai/text-to-speech/leaderboard/selected-voice
Google Cloud customer story: Il Foglio. https://cloud.google.com/customers/il-foglio
Google Cloud customer story: Adya. https://cloud.google.com/customers/adya-ai
Google Cloud customer story: HBX Group. https://cloud.google.com/customers/hbx-group
G2 reviews, Google Cloud Speech-to-Text. https://www.g2.com/products/google-cloud-speech-to-text/reviews
Google Developer forums, "Google Text to Speech only showing Chirp voices." https://discuss.google.dev/t/google-text-to-speech-only-showing-chirp-voices/184456
Google Developer forums, "Severe latency regression with Chirp 3 HD long audio." https://discuss.google.dev/t/severe-latency-regression-with-chirp-3-hd-long-audio/262049
Google Developer forums, "Incorrect pronunciation of French contractions with Chirp 3 HD voices." https://discuss.google.dev/t/incorrect-pronunciation-of-french-contractions-with-chirp-3-hd-voices/271804
Google Cloud Text-to-Speech pricing. https://cloud.google.com/text-to-speech/pricing
Google Cloud Speech-to-Text pricing. https://cloud.google.com/speech-to-text/pricing
Microsoft Learn, Azure AI Speech speech-to-text documentation. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
AWS Polly pricing. https://aws.amazon.com/polly/pricing/
ElevenLabs API pricing. https://elevenlabs.io/pricing/api