Chirp 3: model profile

Chirp 3 is Google Cloud's speech model family, delivered as managed cloud services for speech-to-text (Chirp 3: Transcription), high-fidelity text-to-speech (Chirp 3: HD voices), and voice cloning (Chirp 3: Instant Custom Voice).

Specifications

Developer	Google (Google Cloud)
Released	Phased 2025 rollout. TTS (HD voices) GA April 2, 2025; Instant Custom Voice GA (allowlist) April 9, 2025; STT private preview April 11, 2025, public preview August 29, 2025, GA October 13, 2025.
Model type	Managed cloud speech model family: multilingual automatic speech recognition (model ID chirp_3), high-fidelity TTS, and instant custom voice.
Parameters	Not publicly disclosed. The 2023 predecessor Chirp was described by Google as a 2B-parameter speech model; no equivalent disclosure exists for Chirp 3.
Languages	STT: 29 GA transcription locales, many additional preview locales, 85+ languages/locales in preview coverage, 14 diarization locales. TTS: 30 named voices across 53 supported languages/locales.
Modes (batch / streaming)	STT: StreamingRecognize, Recognize, and BatchRecognize. TTS: streaming and batch output; text streaming is exclusive to Chirp 3 HD voices within Cloud TTS.
Throughput / concurrency	Not publicly disclosed. Instant Custom Voice permits 10 new voice cloning keys per minute per project, with no stated absolute limit on total keys.
Deployment	Google Cloud API service: Cloud Text-to-Speech API, Speech-to-Text V2 API, Vertex AI Studio, and Google's console and notebook ecosystem. No boxed or downloadable product.
Pricing	STT standard recognition $0.016/min up to 500k minutes, lower volume-tier rates beyond, dynamic batch $0.003/min. TTS $30 per 1M characters after a free tier of 1M characters monthly. Instant Custom Voice pricing not fully surfaced publicly.
License	Proprietary managed cloud service; consumed via API, not distributed as software.

Not disclosed: Training data · Latency

Known limitations

Issue or limitation	Evidence	Current status or best-known fix
Instant Custom Voice is not self-serve	Google's docs state that access is restricted to allowlisted users, and forum users continued asking for a simpler access flow in 2026.	Still gated. Operationally, the "fix" is to work through sales/allowlisting.
SSML / markup support was incomplete at launch	Generic docs originally said SSML input was unsupported for Chirp 3 HD, while release notes later added limited SSML support in October 2025.	Partially improved, but only selected tags are supported.
Console UI confusion about missing legacy voices	Google staff clarified that only Chirp voices were intentionally shown in the Cloud Console UI even though other TTS voices still existed through the API.	Workaround: use the API rather than relying on the narrowed UI.
Long-audio performance regression	A September 2025 forum thread reported long-audio jobs stalling or taking much longer than before.	Public forum evidence shows a real complaint, but no official public postmortem or universal fix was captured in the source set.
Locale-specific pronunciation failures	Users reported mispronunciation of French contractions and related markup/pronunciation issues.	Workarounds include phoneme tags and pronunciation controls, where supported.
STT diarization and language coverage are not universal	Chirp 3 STT has broad language coverage, but diarization is listed for only a smaller subset of languages, and the model's regional rollouts occurred gradually.	The fix is usually architectural rather than a setting to toggle: choose supported language/region combinations and the appropriate recognition method.

Additional documented gaps:

Google does not publicly expose a Chirp 3 architectural model card comparable to the 2023 Chirp write-up; parameter count and training-corpus detail are not available in the current source set.
Documentation drift exists across pages: the generic "supported voices and languages" page says Chirp 3 HD does not support SSML input or certain pitch/rate controls and is available only on global, eu, and us, while the dedicated Chirp 3 HD docs list more endpoints and the release notes record SSML support added October 17, 2025. The source treats release notes as the source of truth when the docs disagree.
The built-in STT denoiser can reduce music, rain, and street noise but not background human voices.
Public pricing visibility for Instant Custom Voice was incomplete in the captured HTML.
The STT pricing page references "chirp (Speech-to-Text V2 only)" rather than chirp_3.

Full technical breakdown

Overview

Google uses the name "Chirp 3" in official documentation for three related cloud services: Chirp 3: HD voices for text-to-speech, Chirp 3: Instant Custom Voice for voice cloning, and Chirp 3: Transcription for speech-to-text. The family launched in phases during 2025 rather than on a single date: Chirp 3: HD voices reached general availability on April 2, 2025, Instant Custom Voice was announced as GA through an allowlist on April 9, 2025, and Chirp 3: Transcription entered private preview on April 11, 2025, public preview on August 29, 2025, and GA on October 13, 2025.

Chirp 3 is a cloud-managed speech stack, not a firmware-based device. There is no public firmware image or consumer software versioning; capability changes are surfaced through release notes, region rollouts, model identifiers, and documentation updates.

The name "Chirp 3" can be confused with a similarly named hardware product. The source resolves the ambiguity as follows.

Possible match	Why it fits	Why it is less likely here
Google Cloud Chirp 3	Exact product name appears in official Google Cloud docs for STT and TTS. Google uses "Chirp 3" as a named speech model family.	None of significance; this is the best exact-name match.
Deeper Smart Sonar CHIRP+ 3	Prominent commercial hardware product; web search often surfaces it when users mean a device.	Official product name is CHIRP+ 3, not plain "Chirp 3," and its category is castable fish-finder sonar, not software or speech AI.

Capabilities and features

The family comprises three variants.

Variant	Official identity	Model ID or naming	Core function	Current technical snapshot
Chirp 3: HD voices	Google Cloud Text-to-Speech	Voice names such as en-US-Chirp3-HD-Charon	High-fidelity TTS for real-time and batch synthesis	Current dedicated docs list 30 named voices, 53 supported languages/locales, GA endpoints in global, us, eu, asia-southeast1, europe-west2, asia-northeast1, streaming and batch output formats, and text streaming support. Launch GA milestone was 8 speakers / 31 locales on April 2, 2025.
Chirp 3: Instant Custom Voice	Google Cloud Text-to-Speech	Voice cloning key generated per project/request	Fast voice cloning / custom branded or personal voices	Restricted to allowlisted users; supports streaming and batch synthesis, supports LINEAR16, PCM, MP3, M4A input encodings, pace control from 0.25x to 2x, experimental pause tags and custom pronunciations, and multilingual transfer from en-US to six listed locales.
Chirp 3: Transcription	Google Cloud Speech-to-Text V2	chirp_3	Multilingual automatic speech recognition	Available only in Speech-to-Text V2; supports StreamingRecognize, Recognize, and BatchRecognize; documentation lists 29 GA transcription locales, many additional Preview locales, 14 diarization locales, built-in denoiser, language-agnostic transcription, and speech adaptation.

For STT, Chirp 3 supports speaker diarization, automatic punctuation, automatic capitalization, speech adaptation with up to 1,000 phrases, a custom prompt feature in Preview, and a built-in denoiser that can reduce music, rain, and street noise but not background human voices.
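As a rough illustration of how these STT features map onto a request, the sketch below assembles the URL and JSON body for a Speech-to-Text V2 Recognize call without sending it. The helper name `build_recognize_request`, the project ID, and the regional host pattern are assumptions for illustration, and the JSON field names reflect the public V2 REST reference as best understood here; verify them against the current reference before relying on them.

```python
import base64


def build_recognize_request(project_id, audio_bytes, language_codes=("en-US",),
                            location="us", diarize=False, phrases=()):
    """Return (url, body) for a Speech-to-Text V2 Recognize call using chirp_3.

    Hypothetical helper: it only builds the request; it does not send it.
    """
    # The profile cites a limit of 1,000 speech adaptation phrases.
    if len(phrases) > 1000:
        raise ValueError("at most 1,000 adaptation phrases are supported")

    features = {"enableAutomaticPunctuation": True}
    if diarize:
        # An empty diarization config requests speaker labels with defaults
        # (assumed field name; check the locale supports diarization).
        features["diarizationConfig"] = {}

    config = {
        "model": "chirp_3",
        "languageCodes": list(language_codes),
        "autoDecodingConfig": {},  # let the service detect the audio encoding
        "features": features,
    }
    if phrases:
        # Inline phrase set for speech adaptation (assumed field layout).
        config["adaptation"] = {
            "phraseSets": [
                {"inlinePhraseSet": {"phrases": [{"value": p} for p in phrases]}}
            ]
        }

    # Assumed regional host pattern: "<location>-speech.googleapis.com".
    url = (
        f"https://{location}-speech.googleapis.com/v2/projects/{project_id}"
        f"/locations/{location}/recognizers/_:recognize"
    )
    body = {
        "config": config,
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return url, body


if __name__ == "__main__":
    url, body = build_recognize_request("my-project", b"\x00\x01", diarize=True)
    print(url)
    print(sorted(body["config"]))
```

Keeping request construction separate from transport makes the phrase limit and feature flags easy to check before any billable call is made.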

For TTS, text streaming is exclusive to Chirp 3 HD voices in Google's Cloud TTS stack. Limited SSML support was added on October 17, 2025, covering the phoneme, p, s, sub, and say-as tags.
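Because only a handful of SSML tags are supported, it can help to check input before sending it. The following is a minimal sketch using only the Python standard library; the function name `unsupported_ssml_tags` is hypothetical, and treating `speak` as an always-allowed root wrapper is an assumption, since the source lists only the five tags above.

```python
import xml.etree.ElementTree as ET

# Tags the profile lists as supported for Chirp 3 HD since October 17, 2025.
SUPPORTED_TAGS = {"phoneme", "p", "s", "sub", "say-as"}
ROOT_TAG = "speak"  # assumption: the usual SSML root element is accepted


def unsupported_ssml_tags(ssml):
    """Return a sorted list of tag names not in the supported set."""
    root = ET.fromstring(ssml)
    found = set()
    for element in root.iter():
        tag = element.tag.split("}")[-1]  # drop an XML namespace if present
        if element is root and tag == ROOT_TAG:
            continue
        if tag not in SUPPORTED_TAGS:
            found.add(tag)
    return sorted(found)


if __name__ == "__main__":
    sample = (
        '<speak><p>Hello <break time="1s"/>'
        '<sub alias="World Wide Web">WWW</sub></p></speak>'
    )
    print(unsupported_ssml_tags(sample))  # prints ['break']
```

A check like this only inspects tag names; it does not validate attributes or guarantee that the service will render a supported tag as intended.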

For custom voice, Google requires a spoken consent statement, recommends clean 10-second recordings, stores the resulting voice-cloning key client-side, and permits 10 new keys per minute per project with no stated absolute limit on total keys.
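One way a client could stay under the stated limit of 10 new keys per minute per project is a sliding-window limiter placed in front of whatever call creates the key. The sketch below is illustrative only: the class name `KeyCreationLimiter` is hypothetical, it runs purely client-side, and how Google actually measures the per-minute window is not described in the source.

```python
import time
from collections import deque


class KeyCreationLimiter:
    """Sliding-window limiter: at most max_calls within window_seconds."""

    def __init__(self, max_calls=10, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._stamps = deque()

    def try_acquire(self):
        """Return True and record the call if a slot is free, else False."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = KeyCreationLimiter()
    allowed = sum(limiter.try_acquire() for _ in range(12))
    print(f"{allowed} of 12 immediate requests allowed")
```

A caller that receives False can queue the request and retry once the oldest recorded call leaves the window.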

Language support

Chirp 3: Transcription documentation lists 29 GA transcription locales, many additional Preview locales, and 14 diarization locales, plus language-agnostic transcription.
At public preview (August 29, 2025), chirp_3 was announced with 85+ languages/locales in preview coverage.
Chirp 3: HD voices documentation lists 30 named voices and 53 supported languages/locales.
Instant Custom Voice added ja-JP on June 18, 2025, pushing support to more than 30 locales, and supports multilingual transfer from en-US to six listed locales.
In November and December 2025, STT preview regions expanded and TTS added a wide set of European languages, then Punjabi and Cantonese in preview.

Performance and benchmarks

Vendor-reported: Google's 2023 launch materials for the original Chirp claimed 98% English recognition accuracy and 300% relative improvement in some low-resource languages. For Chirp 3, Google's launch materials emphasize feature breadth, speed improvements, diarization, language detection, and voice realism, and do not provide a public benchmark sheet of equal granularity.

Third-party evaluation: Artificial Analysis' streaming STT benchmark found that Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while noting that no single model led across all tested datasets. Artificial Analysis' selected-voice TTS leaderboard snapshot placed Chirp 3: HD at Elo 1,056 and $30.0 per 1M characters, below Azure HD 2.5 at Elo 1,127 and Eleven v3 at Elo 1,179.
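To give the Elo gaps some intuition, the sketch below applies the conventional Elo expected-score formula with a 400-point scale. This is an assumption: the source does not state how Artificial Analysis computes or scales its ratings, so the resulting percentages are an illustrative reading, not a reported result.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected share of pairwise wins for A over B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    chirp3_hd = 1056
    for rival, rating in (("Azure HD 2.5", 1127), ("Eleven v3", 1179)):
        share = elo_expected_score(chirp3_hd, rating)
        print(f"Chirp 3: HD vs {rival}: {share:.2f}")
```

Under that assumption, a 71-point gap corresponds to Chirp 3: HD being preferred in roughly 40% of head-to-head comparisons against Azure HD 2.5, and a 123-point gap to roughly one third against Eleven v3.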

Evidence area	What the evidence says	Interpretation
Independent STT benchmark	Artificial Analysis reported that in its streaming benchmark, Google's Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while also noting that no single model leads everywhere.	Chirp 3 looks strong in real-time multilingual settings, but not categorically dominant across all datasets or latency conditions.
Independent TTS benchmark	Artificial Analysis' selected-voice leaderboard snapshot showed Chirp 3: HD at Elo 1,056 and $30.0 / 1M characters, below Azure HD 2.5 at Elo 1,127 and Eleven v3 at Elo 1,179.	Chirp 3 HD is competitive, but the benchmark snapshot does not place it at the very top of TTS naturalness.
Official real-world media use	Il Foglio said Chirp 3 HD offered the most natural Italian intonation among tested options, turned editorials into audio in minutes, and helped the paper reach the top three of its podcast offerings.	Evidence that Chirp 3 performs well in editorial long-form audio, especially when language-specific naturalness matters.
Official enterprise localization use	Adya reported localization across 20+ Indian languages with low latency using Chirp 3.	Suggests practical multilingual deployment, especially in enterprise localization.
Official contact-center use	HBX Group said Chirp 3 voices created a more natural, less robotic caller experience.	Supports Google's positioning in customer-experience voice channels.

User sentiment reported by the source: G2 reviewers of Google Cloud Speech-to-Text praise ease of use, speed, and meeting-transcription productivity; recurring negatives include cost sensitivity and the need for manual correction when accuracy is not perfect. For Google Cloud Text-to-Speech, review summaries emphasize natural voice quality and simple API integration, while some users describe output as robotic in some scenarios or languages and some complain about pricing opacity or cost escalation. Google developer forum threads surfaced UI confusion, long-audio latency regressions, markup/SSML limitations, locale-specific pronunciation bugs, and allowlist/access friction.

The source compares Chirp 3 against alternative platforms as follows.

Platform	STT	TTS	Custom voice / cloning	Real-time / streaming	Diarization / language ID	Public pricing signal	Analytical reading
Google Chirp 3	Yes, via chirp_3 in Speech-to-Text V2	Yes, via Chirp 3 HD	Yes, via Instant Custom Voice from ~10 seconds and consent flow	Yes for STT and TTS; Chirp 3 HD uniquely supports text streaming in Cloud TTS	Yes; diarization, language-agnostic transcription, denoiser, adaptation	TTS $30 / 1M chars; STT standard $0.016/min list tier; dynamic batch $0.003/min	Best fit for Google Cloud-native teams that want one vendor for speech generation and transcription.
Microsoft Azure AI Speech	Yes	Yes, including HD voices	Yes, via Custom Voice and Personal Voice	Yes; docs include real-time diarization quickstart and broad speech workflows	Yes; official docs highlight language detection, custom speech, diarization	Official pricing page shows per-second STT and per-character TTS billing plus a free tier, but exact post-free rates were not reliably visible in the HTML capture	Strong enterprise alternative, especially where Microsoft identity/compliance stack matters.
AWS Polly + Amazon Transcribe	Yes	Yes	TTS customization exists via lexicons and voice families; no instant 10-second clone captured in the source set	Yes; Polly returns audio streams, Transcribe supports streaming	Yes; Transcribe supports diarization and automatic language identification in relevant workflows	Polly Generative $30 / 1M chars; Transcribe Tier 1 $0.03/min in us-east-1 example	Strong for AWS-native teams; TTS price for Polly Generative roughly matches Chirp 3 HD, but STT list price in the cited example is higher.
ElevenLabs	Yes, Scribe v2 / v2 Realtime	Yes	Yes, voice cloning front-and-center	Yes; realtime STT marketed at ~150 ms latency	Yes; diarization, word-level timestamps, multilingual handling	Scribe v2 $0.22/hour; Scribe v2 Realtime $0.39/hour; TTS pricing is model- and plan-dependent	Best fit when pure voice experience and rapid productization matter more than hyperscaler platform consolidation.
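The STT list prices in the table use different billing units, so a like-for-like view needs a common unit. The sketch below converts the figures quoted above to US dollars per audio hour; the dictionary and function names are illustrative, Azure is omitted because its rates were not captured, and volume discounts, free tiers, and regional differences are ignored.

```python
# (amount in USD, billing unit) as quoted in the comparison table.
LIST_PRICES = {
    "Google Chirp 3 (standard)": (0.016, "minute"),
    "Google Chirp 3 (dynamic batch)": (0.003, "minute"),
    "Amazon Transcribe (Tier 1, us-east-1)": (0.03, "minute"),
    "ElevenLabs Scribe v2": (0.22, "hour"),
    "ElevenLabs Scribe v2 Realtime": (0.39, "hour"),
}


def price_per_hour(amount, unit):
    """Convert a per-minute or per-hour list price to USD per audio hour."""
    if unit == "hour":
        return amount
    if unit == "minute":
        return amount * 60
    raise ValueError(f"unknown billing unit: {unit}")


if __name__ == "__main__":
    for name, (amount, unit) in LIST_PRICES.items():
        print(f"{name}: ${round(price_per_hour(amount, unit), 2)} per hour")
```

On these list figures, standard Chirp 3 recognition works out to $0.96 per hour, dynamic batch to $0.18, the Amazon Transcribe example to $1.80, and the two ElevenLabs tiers to $0.22 and $0.39.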

Latency and throughput

Specific latency figures for Chirp 3 are not publicly disclosed in the source set. The source records the following latency-related facts:

Text streaming is exclusive to Chirp 3 HD voices in Google's Cloud TTS stack, which the source describes as relevant for low-latency voice agents.
The chirp_3 public preview announcement (August 29, 2025) included improved speed/accuracy messaging.
Adya reported localization across 20+ Indian languages with low latency using Chirp 3.
A September 2025 forum thread reported long-audio jobs stalling or taking much longer than before; no official public postmortem or universal fix was captured in the source set.
Instant Custom Voice permits 10 new voice cloning keys per minute per project with no stated absolute limit on total keys.

Deployment and integrations

Chirp 3 is sold as a Google Cloud API service, not a boxed consumer product. Consumption happens through the Cloud TTS API, Speech-to-Text V2 API, Vertex AI Studio, and Google's console and notebook ecosystem.

Region support differs by sub-product: Chirp 3 HD lists six GA endpoints (global, us, eu, asia-southeast1, europe-west2, asia-northeast1), while Chirp 3 Transcription documents GA in us and eu multi-regions with release-note preview expansions into additional regions in late 2025. Instant Custom Voice lists region availability beyond the initial TTS endpoints.
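The endpoint list above can be turned into a small helper that picks a host and builds a Cloud TTS synthesize request for a Chirp 3 HD voice. This is a sketch under assumptions: the function names are hypothetical, the `<endpoint>-texttospeech.googleapis.com` host pattern should be confirmed in the Chirp 3 HD docs, and the request is built but not sent.

```python
# GA endpoints listed in the dedicated Chirp 3 HD documentation.
TTS_GA_ENDPOINTS = (
    "global", "us", "eu", "asia-southeast1", "europe-west2", "asia-northeast1",
)


def tts_host(endpoint="global"):
    """Return the API host for an endpoint (assumed naming pattern)."""
    if endpoint not in TTS_GA_ENDPOINTS:
        raise ValueError(f"not a listed Chirp 3 HD GA endpoint: {endpoint}")
    if endpoint == "global":
        return "texttospeech.googleapis.com"
    return f"{endpoint}-texttospeech.googleapis.com"


def build_synthesize_request(text, voice_name="en-US-Chirp3-HD-Charon",
                             endpoint="global", audio_encoding="LINEAR16"):
    """Return (url, body) for a text:synthesize call; nothing is sent."""
    # Voice names start with the locale, e.g. "en-US" in en-US-Chirp3-HD-Charon.
    language_code = "-".join(voice_name.split("-")[:2])
    url = f"https://{tts_host(endpoint)}/v1/text:synthesize"
    body = {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": audio_encoding},
    }
    return url, body


if __name__ == "__main__":
    url, body = build_synthesize_request("Hello from Chirp 3.", endpoint="eu")
    print(url)
    print(body["voice"])
```

Rejecting unlisted endpoints early reflects the documentation drift noted elsewhere in this profile: the set of supported endpoints has changed over time, so the tuple should be refreshed from the release notes.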

Chirp 3: Transcription is available only in Speech-to-Text V2 and supports the StreamingRecognize, Recognize, and BatchRecognize methods.
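Choosing among the three methods usually comes down to whether audio is live and how long it is. The sketch below shows one possible selection rule; the function name is hypothetical, and the 60-second cutoff for synchronous recognition is an assumed default that is not stated in this profile, so check the current Speech-to-Text V2 quotas before using it.

```python
def choose_recognition_method(duration_seconds=None, live=False,
                              sync_limit_seconds=60.0):
    """Pick a Speech-to-Text V2 method name for a transcription job.

    sync_limit_seconds is an assumed cutoff, not a documented Chirp 3 figure.
    """
    if live:
        return "StreamingRecognize"  # audio arrives in real time
    if duration_seconds is None:
        raise ValueError("duration_seconds is required for recorded audio")
    if duration_seconds <= sync_limit_seconds:
        return "Recognize"  # short clip, single synchronous request
    return "BatchRecognize"  # longer recordings go to batch processing


if __name__ == "__main__":
    print(choose_recognition_method(live=True))
    print(choose_recognition_method(duration_seconds=20))
    print(choose_recognition_method(duration_seconds=1800))
```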

Official support paths include reference documentation, release notes with RSS support, Vertex AI Studio and console entry points, Colab and GitHub notebooks, community forums, Cloud support, system status, and sales-led access for allowlisted features such as Instant Custom Voice.

Google describes Gemini-TTS as the latest evolution of Cloud TTS, with broader prompt-based control and native multi-speaker options while reusing voice identities similar to Chirp 3 HD.

Pricing

Chirp 3 commercial surface	Publicly visible pricing	Availability notes
Chirp 3: HD voices	$30 per 1M characters after the free tier; 1M characters free monthly on the pricing page.	GA; available via Cloud TTS and Vertex AI Studio.
Chirp 3: Instant Custom Voice	Public pricing was not fully surfaced in the captured pricing excerpt; the feature is allowlisted.	Restricted access; requires sales/allowlist.
Chirp 3: Transcription	Google's Speech-to-Text V2 page lists standard recognition at $0.016/min up to 500k min, then lower volume-tier rates, and dynamic batch at $0.003/min; the public pricing page still labels the included V2 speech model family as "chirp" rather than explicitly chirp_3.	Available only in Speech-to-Text V2; GA and preview regions differ.

The source notes that Google's public pricing nomenclature lags the model nomenclature: the STT pricing page references "chirp (Speech-to-Text V2 only)" while the current model docs and release notes use chirp_3. The source's reading is that Chirp-family Speech-to-Text V2 pricing applies, but the public pricing page is not as current or precise as the model documentation.
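The list prices above translate into a simple monthly estimate. The sketch below is a back-of-the-envelope calculator, not a billing tool: the function names are illustrative, it assumes Chirp-family V2 pricing applies to chirp_3 as the source reads it, and it refuses to price standard recognition above 500k minutes because the lower volume-tier rates are not given in this profile.

```python
TTS_FREE_CHARS_PER_MONTH = 1_000_000
TTS_USD_PER_MILLION_CHARS = 30.0
STT_STANDARD_USD_PER_MIN = 0.016
STT_FIRST_TIER_MINUTES = 500_000
STT_DYNAMIC_BATCH_USD_PER_MIN = 0.003


def tts_monthly_cost(characters):
    """Chirp 3: HD voices cost after the 1M-character monthly free tier."""
    billable = max(0, characters - TTS_FREE_CHARS_PER_MONTH)
    return billable / 1_000_000 * TTS_USD_PER_MILLION_CHARS


def stt_monthly_cost(minutes, dynamic_batch=False):
    """Speech-to-Text V2 cost at the list rates quoted in this profile."""
    if dynamic_batch:
        return minutes * STT_DYNAMIC_BATCH_USD_PER_MIN
    if minutes > STT_FIRST_TIER_MINUTES:
        raise ValueError("rates above 500k minutes are not given in this profile")
    return minutes * STT_STANDARD_USD_PER_MIN


if __name__ == "__main__":
    print(f"TTS, 3.5M characters: ${tts_monthly_cost(3_500_000):.2f}")
    print(f"STT, 10,000 standard minutes: ${stt_monthly_cost(10_000):.2f}")
    print(f"STT, 10,000 dynamic batch minutes: ${stt_monthly_cost(10_000, True):.2f}")
```

For example, 3.5M synthesized characters leave 2.5M billable characters, or $75.00, while 10,000 minutes of audio cost about $160.00 at the standard rate and about $30.00 with dynamic batch.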

Development and ownership

Chirp 3 is developed and operated by Google as part of Google Cloud. The Chirp family predates Chirp 3: Google introduced Chirp as a speech foundation model in 2023 and described that generation as a 2B-parameter speech model delivering 98% English speech recognition accuracy and large relative gains in some low-resource languages. Current Chirp 3 materials emphasize product capabilities, rollout stages, and API behavior rather than architectural disclosures such as parameter count.

Release history

Date	Milestone	What changed
Feb 10, 2025	Pre-launch rename	Journey voices were rebranded as Chirp HD voices.
Mar 6, 2025	TTS rollout expansion	Chirp 3 HD added 8 speakers in 31 locales.
Apr 2, 2025	TTS GA	Chirp 3 HD voices became GA with 8 speakers / 31 locales, real-time streaming, batch support, and supported regional endpoints.
Apr 9, 2025	Instant Custom Voice GA announcement	Google announced Instant Custom Voice as GA through an allowlist and also announced transcription with diarization in preview/allowlist.
Apr 11, 2025	STT Private Preview	chirp_3 launched in private preview for Speech-to-Text V2.
May 7, 2025	TTS controls expansion	Pace control, pause control, and custom pronunciations were released for Chirp 3 HD voices.
Jun 18, 2025	ICV locale expansion	Instant Custom Voice added ja-JP, pushing support to more than 30 locales.
Aug 21 to 27, 2025	ICV and TTS endpoint upgrades	Instant Custom Voice added PCM, MP3, and M4A input encodings; Chirp 3 HD became available on europe-west2.
Aug 29, 2025	STT Public Preview	chirp_3 public preview launched with 85+ languages/locales in preview coverage and improved speed/accuracy messaging.
Sep 15, 2025	TTS endpoint expansion	Chirp 3 HD became available on asia-northeast1.
Oct 13, 2025	STT GA	Chirp 3: Transcription reached GA in Speech-to-Text V2.
Oct 17, 2025	Limited SSML support	Chirp 3 HD added support for phoneme, p, s, sub, and say-as tags.
Nov to Dec 2025	Regional and language expansion	STT preview regions expanded; TTS added a wide set of European languages, then Punjabi and Cantonese in preview.

Sources

All cited sources were accessed on June 15, 2026.