OpenTranscription / Blog
2026-07-03 · MODEL PROFILE

Chirp 3: model profile

Reference profile of Google Cloud Chirp 3, a managed speech model family covering multilingual transcription, HD text-to-speech, and instant custom voice.


Chirp 3 is Google Cloud's speech model family, delivered as managed cloud services for speech-to-text (Chirp 3: Transcription), high-fidelity text-to-speech (Chirp 3: HD voices), and voice cloning (Chirp 3: Instant Custom Voice).

Specifications

Developer: Google (Google Cloud)
Released: Phased 2025 rollout. TTS (HD voices) GA April 2, 2025; Instant Custom Voice GA (allowlist) April 9, 2025; STT private preview April 11, 2025, public preview August 29, 2025, GA October 13, 2025.
Model type: Managed cloud speech model family: multilingual automatic speech recognition (model ID chirp_3), high-fidelity TTS, and instant custom voice.
Parameters: Not publicly disclosed. The 2023 predecessor Chirp was described by Google as a 2B-parameter speech model; no equivalent disclosure exists for Chirp 3.
Languages: STT: 29 GA transcription locales, many additional preview locales, 85+ languages/locales in preview coverage, 14 diarization locales. TTS: 30 named voices across 53 supported languages/locales.
Modes (batch / streaming): STT: StreamingRecognize, Recognize, and BatchRecognize. TTS: streaming and batch output; text streaming is exclusive to Chirp 3 HD voices within Cloud TTS.
Throughput / concurrency: Not publicly disclosed. Instant Custom Voice permits 10 new voice cloning keys per minute per project, with no stated absolute limit on total keys.
Deployment: Google Cloud API service: Cloud Text-to-Speech API, Speech-to-Text V2 API, Vertex AI Studio, and Google's console and notebook ecosystem. No boxed or downloadable product.
Pricing: STT standard recognition $0.016/min up to 500k minutes, with lower volume-tier rates beyond that; dynamic batch $0.003/min. TTS $30 per 1M characters after a free tier of 1M characters monthly. Instant Custom Voice pricing not fully surfaced publicly.
License: Proprietary managed cloud service; consumed via API, not distributed as software.

Not disclosed: Training data · Latency


Overview

Google uses the name "Chirp 3" in official documentation for three related cloud services: Chirp 3: HD voices for text-to-speech, Chirp 3: Instant Custom Voice for voice cloning, and Chirp 3: Transcription for speech-to-text. The family launched in phases during 2025 rather than on a single date: Chirp 3: HD voices reached general availability on April 2, 2025, Instant Custom Voice was announced as GA through an allowlist on April 9, 2025, and Chirp 3: Transcription entered private preview on April 11, 2025, public preview on August 29, 2025, and GA on October 13, 2025.

Chirp 3 is a cloud-managed speech stack, not a firmware-based device. There is no public firmware image or consumer software versioning; capability changes are surfaced through release notes, region rollouts, model identifiers, and documentation updates.

The name "Chirp 3" can be confused with a similarly named hardware product. The source resolves the ambiguity as follows.

| Possible match | Why it fits | Why it is less likely here |
| --- | --- | --- |
| Google Cloud Chirp 3 | Exact product name appears in official Google Cloud docs for STT and TTS. Google uses "Chirp 3" as a named speech model family. | None of significance; this is the best exact-name match. |
| Deeper Smart Sonar CHIRP+ 3 | Prominent commercial hardware product; web search often surfaces it when users mean a device. | Official product name is CHIRP+ 3, not plain "Chirp 3," and its category is castable fish-finder sonar, not software or speech AI. |

Capabilities and features

The family comprises three variants.

| Variant | Official identity | Model ID or naming | Core function | Current technical snapshot |
| --- | --- | --- | --- | --- |
| Chirp 3: HD voices | Google Cloud Text-to-Speech | Voice names such as en-US-Chirp3-HD-Charon | High-fidelity TTS for real-time and batch synthesis | Current dedicated docs list 30 named voices, 53 supported languages/locales, GA endpoints in global, us, eu, asia-southeast1, europe-west2, asia-northeast1, streaming and batch output formats, and text streaming support. Launch GA milestone was 8 speakers / 31 locales on April 2, 2025. |
| Chirp 3: Instant Custom Voice | Google Cloud Text-to-Speech | Voice cloning key generated per project/request | Fast voice cloning / custom branded or personal voices | Restricted to allowlisted users; supports streaming and batch synthesis; accepts LINEAR16, PCM, MP3, and M4A input encodings; offers pace control from 0.25x to 2x, experimental pause tags and custom pronunciations, and multilingual transfer from en-US to six listed locales. |
| Chirp 3: Transcription | Google Cloud Speech-to-Text V2 | chirp_3 | Multilingual automatic speech recognition | Available only in Speech-to-Text V2; supports StreamingRecognize, Recognize, and BatchRecognize; documentation lists 29 GA transcription locales, many additional Preview locales, 14 diarization locales, built-in denoiser, language-agnostic transcription, and speech adaptation. |

For STT, Chirp 3 supports speaker diarization, automatic punctuation, automatic capitalization, speech adaptation with up to 1,000 phrases, a custom prompt feature in Preview, and a built-in denoiser that can reduce music, rain, and street noise but not background human voices.
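As a rough illustration of the 1,000-phrase adaptation limit, the sketch below deduplicates a phrase list and refuses to build a payload that exceeds the cap. The function name `build_adaptation` is hypothetical, and the dictionary layout (`phraseSets`, `inlinePhraseSet`, `phrases`, `value`, `boost`) is an assumption modeled on the Speech-to-Text V2 JSON request format; verify it against the current API reference before relying on it.

```python
MAX_ADAPTATION_PHRASES = 1000  # limit stated in the profile text


def build_adaptation(phrases, boost=None):
    """Deduplicate phrases and wrap them in an inline phrase set.

    The returned layout is an assumption based on the Speech-to-Text V2
    JSON request shape, not a confirmed schema.
    """
    seen = set()
    unique = []
    for phrase in phrases:
        cleaned = phrase.strip()
        key = cleaned.casefold()  # case-insensitive duplicate detection
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)

    if len(unique) > MAX_ADAPTATION_PHRASES:
        raise ValueError(
            f"{len(unique)} phrases exceeds the limit of {MAX_ADAPTATION_PHRASES}"
        )

    entries = []
    for value in unique:
        entry = {"value": value}
        if boost is not None:
            entry["boost"] = boost
        entries.append(entry)
    return {"phraseSets": [{"inlinePhraseSet": {"phrases": entries}}]}
```

Checking the limit on the client keeps an oversized list from reaching the API at all, which is usually easier to debug than a rejected request.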

For TTS, text streaming is exclusive to Chirp 3 HD voices in Google's Cloud TTS stack. Limited SSML support was added on October 17, 2025, covering the phoneme, p, s, sub, and say-as tags.
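Because SSML support is limited to five tags, it can help to check markup before sending it. The sketch below lists any tags outside that set using only the Python standard library. The helper name is hypothetical, and treating `<speak>` as an always-allowed root element is an assumption, since the profile text only names the five inner tags.

```python
import xml.etree.ElementTree as ET

# Tags the profile lists as supported for Chirp 3 HD voices.
SUPPORTED_TAGS = {"phoneme", "p", "s", "sub", "say-as"}
ROOT_TAG = "speak"  # assumption: the usual SSML root is accepted


def unsupported_ssml_tags(ssml):
    """Return a sorted list of tag names that fall outside the supported set."""
    root = ET.fromstring(ssml)
    found = set()
    for element in root.iter():
        if element is root and element.tag == ROOT_TAG:
            continue  # skip the wrapping <speak> element
        if element.tag not in SUPPORTED_TAGS:
            found.add(element.tag)
    return sorted(found)
```

For example, a document that uses `<break>` inside `<p>` would report `break` as unsupported, while one that only uses `<sub>` and `<say-as>` would return an empty list.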

For custom voice, Google requires a spoken consent statement, recommends clean 10-second recordings, stores the resulting voice-cloning key client-side, and permits 10 new keys per minute per project with no stated absolute limit on total keys.
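The quota of 10 new keys per minute per project can be respected on the client with a simple sliding-window counter. The sketch below is one possible approach, not Google's mechanism: the class name and its methods are hypothetical, and it assumes the quota behaves like a rolling 60-second window, which the source does not confirm.

```python
import time
from collections import deque


class KeyCreationLimiter:
    """Client-side sliding-window guard for voice cloning key creation."""

    def __init__(self, max_per_minute=10, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock  # injectable so the limiter can be tested
        self._stamps = deque()

    def seconds_until_allowed(self):
        """Return 0.0 if a key may be created now, else the wait in seconds."""
        now = self.clock()
        # Drop creation timestamps that have left the 60-second window.
        while self._stamps and now - self._stamps[0] >= 60.0:
            self._stamps.popleft()
        if len(self._stamps) < self.max_per_minute:
            return 0.0
        return 60.0 - (now - self._stamps[0])

    def record(self):
        """Call after each successful key creation."""
        self._stamps.append(self.clock())
```

A caller would check `seconds_until_allowed()`, sleep for that long if it is non-zero, create the key, and then call `record()`.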

Language support

  • Chirp 3: Transcription documentation lists 29 GA transcription locales, many additional Preview locales, and 14 diarization locales, plus language-agnostic transcription.
  • At public preview (August 29, 2025), chirp_3 was announced with 85+ languages/locales in preview coverage.
  • Chirp 3: HD voices documentation lists 30 named voices and 53 supported languages/locales.
  • Instant Custom Voice added ja-JP on June 18, 2025, pushing support to more than 30 locales, and supports multilingual transfer from en-US to six listed locales.
  • In November and December 2025, STT preview regions expanded and TTS added a wide set of European languages, then Punjabi and Cantonese in preview.

Performance and benchmarks

Vendor-reported: Google's 2023 launch materials for the original Chirp claimed 98% English recognition accuracy and 300% relative improvement in some low-resource languages. For Chirp 3, Google's launch materials emphasize feature breadth, speed improvements, diarization, language detection, and voice realism, and do not provide a public benchmark sheet of equal granularity.

Third-party evaluation: Artificial Analysis' streaming STT benchmark found that Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while noting that no single model led across all tested datasets. Artificial Analysis' selected-voice TTS leaderboard snapshot placed Chirp 3: HD at Elo 1,056 and $30.0 per 1M characters, below Azure HD 2.5 at Elo 1,127 and Eleven v3 at Elo 1,179.
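For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below shows the textbook calculation; it is not Artificial Analysis' methodology, which the source does not describe, and it skips the text normalization (casing, punctuation) that real benchmarks usually apply.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the words of ref processed
    # so far and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

A 2.2% WER therefore corresponds to roughly 22 word errors per 1,000 reference words.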

| Evidence area | What the evidence says | Interpretation |
| --- | --- | --- |
| Independent STT benchmark | Artificial Analysis reported that in its streaming benchmark, Google's Chirp 3 Streaming led partial-transcript performance on VoxPopuli at 2.2% WER, while also noting that no single model leads everywhere. | Chirp 3 looks strong in real-time multilingual settings, but not categorically dominant across all datasets or latency conditions. |
| Independent TTS benchmark | Artificial Analysis' selected-voice leaderboard snapshot showed Chirp 3: HD at Elo 1,056 and $30.0 / 1M characters, below Azure HD 2.5 at Elo 1,127 and Eleven v3 at Elo 1,179. | Chirp 3 HD is competitive, but the benchmark snapshot does not place it at the very top of TTS naturalness. |
| Official real-world media use | Il Foglio said Chirp 3 HD offered the most natural Italian intonation among tested options, turned editorials into audio in minutes, and helped the paper reach the top three of its podcast offerings. | Evidence that Chirp 3 performs well in editorial long-form audio, especially when language-specific naturalness matters. |
| Official enterprise localization use | Adya reported localization across 20+ Indian languages with low latency using Chirp 3. | Suggests practical multilingual deployment, especially in enterprise localization. |
| Official contact-center use | HBX Group said Chirp 3 voices created a more natural, less robotic caller experience. | Supports Google's positioning in customer-experience voice channels. |
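To give the Elo gaps quoted above some intuition, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the leaderboard follows the standard logistic Elo formula with a 400-point scale; the source does not state how Artificial Analysis computes its ratings, so treat the result as indicative only.

```python
def expected_preference(rating_a, rating_b):
    """Expected share of pairwise comparisons won by A under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the leaderboard snapshot above.
CHIRP3_HD = 1056
AZURE_HD_2_5 = 1127
ELEVEN_V3 = 1179

vs_azure = expected_preference(CHIRP3_HD, AZURE_HD_2_5)
vs_eleven = expected_preference(CHIRP3_HD, ELEVEN_V3)
print(f"Chirp 3 HD vs Azure HD 2.5: {vs_azure:.2f}")
print(f"Chirp 3 HD vs Eleven v3:    {vs_eleven:.2f}")
```

Under that assumption, Chirp 3: HD would be preferred in roughly 40% of comparisons against Azure HD 2.5 and roughly a third of comparisons against Eleven v3, which matches the reading that it is competitive but not at the top.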

User sentiment reported by the source: G2 reviewers of Google Cloud Speech-to-Text praise ease of use, speed, and meeting-transcription productivity; recurring negatives include cost sensitivity and the need for manual correction when accuracy is not perfect. For Google Cloud Text-to-Speech, review summaries emphasize natural voice quality and simple API integration, while some users describe output as robotic in some scenarios or languages and some complain about pricing opacity or cost escalation. Google developer forum threads surfaced UI confusion, long-audio latency regressions, markup/SSML limitations, locale-specific pronunciation bugs, and allowlist/access friction.

The source compares Chirp 3 against alternative platforms as follows.

| Platform | STT | TTS | Custom voice / cloning | Real-time / streaming | Diarization / language ID | Public pricing signal | Analytical reading |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Google Chirp 3 | Yes, via chirp_3 in Speech-to-Text V2 | Yes, via Chirp 3 HD | Yes, via Instant Custom Voice from ~10 seconds and consent flow | Yes for STT and TTS; Chirp 3 HD uniquely supports text streaming in Cloud TTS | Yes; diarization, language-agnostic transcription, denoiser, adaptation | TTS $30 / 1M chars; STT standard $0.016/min list tier; dynamic batch $0.003/min | Best fit for Google Cloud-native teams that want one vendor for speech generation and transcription. |
| Microsoft Azure AI Speech | Yes | Yes, including HD voices | Yes, via Custom Voice and Personal Voice | Yes; docs include real-time diarization quickstart and broad speech workflows | Yes; official docs highlight language detection, custom speech, diarization | Official pricing page shows per-second STT and per-character TTS billing plus a free tier, but exact post-free rates were not reliably visible in the HTML capture | Strong enterprise alternative, especially where Microsoft identity/compliance stack matters. |
| AWS Polly + Amazon Transcribe | Yes | Yes | TTS customization exists via lexicons and voice families; no instant 10-second clone captured in the source set | Yes; Polly returns audio streams, Transcribe supports streaming | Yes; Transcribe supports diarization and automatic language identification in relevant workflows | Polly Generative $30 / 1M chars; Transcribe Tier 1 $0.03/min in us-east-1 example | Strong for AWS-native teams; TTS price for Polly Generative roughly matches Chirp 3 HD, but STT list price in the cited example is higher. |
| ElevenLabs | Yes, Scribe v2 / v2 Realtime | Yes | Yes, voice cloning front-and-center | Yes; realtime STT marketed at ~150 ms latency | Yes; diarization, word-level timestamps, multilingual handling | Scribe v2 $0.22/hour; Scribe v2 Realtime $0.39/hour; TTS pricing is model- and plan-dependent | Best fit when pure voice experience and rapid productization matter more than hyperscaler platform consolidation. |
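The STT list prices in the comparison above mix per-minute and per-hour units, which makes them hard to compare at a glance. The sketch below normalizes them to USD per audio hour. The dictionary labels are informal names chosen for this example, and the figures are only the list prices quoted above; volume tiers, free tiers, and plan-dependent pricing are ignored.

```python
# (price in USD, billing unit) as quoted in the comparison above.
STT_LIST_PRICES = {
    "Google STT V2 standard": (0.016, "minute"),
    "Google STT V2 dynamic batch": (0.003, "minute"),
    "Amazon Transcribe Tier 1 (us-east-1)": (0.03, "minute"),
    "ElevenLabs Scribe v2": (0.22, "hour"),
    "ElevenLabs Scribe v2 Realtime": (0.39, "hour"),
}

MINUTES_PER_HOUR = 60


def usd_per_hour(price, unit):
    """Convert a per-minute or per-hour list price to USD per audio hour."""
    if unit == "hour":
        return price
    if unit == "minute":
        return price * MINUTES_PER_HOUR
    raise ValueError(f"unknown billing unit: {unit}")


hourly = {name: usd_per_hour(*quote) for name, quote in STT_LIST_PRICES.items()}
for name, price in sorted(hourly.items(), key=lambda item: item[1]):
    print(f"{name}: ${price:.2f}/hour")
```

On these list prices alone, Google's standard rate works out to about $0.96 per audio hour and Amazon Transcribe's cited example to about $1.80, against $0.22 and $0.39 for the two ElevenLabs Scribe tiers.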

Latency and throughput

Specific latency figures for Chirp 3 are not publicly disclosed in the source set. The source records the following latency-related facts:

  • Text streaming is exclusive to Chirp 3 HD voices in Google's Cloud TTS stack, which the source describes as relevant for low-latency voice agents.
  • The chirp_3 public preview announcement (August 29, 2025) included improved speed/accuracy messaging.
  • Adya reported localization across 20+ Indian languages with low latency using Chirp 3.
  • A September 2025 forum thread reported long-audio jobs stalling or taking much longer than before; no official public postmortem or universal fix was captured in the source set.
  • Instant Custom Voice permits 10 new voice cloning keys per minute per project with no stated absolute limit on total keys.

Deployment and integrations

Chirp 3 is sold as a Google Cloud API service, not a boxed consumer product. Consumption happens through the Cloud TTS API, Speech-to-Text V2 API, Vertex AI Studio, and Google's console and notebook ecosystem.

Region support differs by sub-product: Chirp 3 HD lists six GA endpoints (global, us, eu, asia-southeast1, europe-west2, asia-northeast1), while Chirp 3 Transcription documents GA in us and eu multi-regions with release-note preview expansions into additional regions in late 2025. Instant Custom Voice lists region availability beyond the initial TTS endpoints.
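A small lookup can keep client code from targeting a location that Chirp 3 HD does not list as GA. The sketch below validates a location against the six endpoints named above and assembles a voice name in the `en-US-Chirp3-HD-Charon` style. Both function names are hypothetical, and the host pattern (`<location>-texttospeech.googleapis.com`, with the bare host for `global`) is an assumption that should be checked against Google's endpoint documentation.

```python
# GA locations for Chirp 3 HD as listed in the profile text.
CHIRP3_HD_GA_LOCATIONS = (
    "global",
    "us",
    "eu",
    "asia-southeast1",
    "europe-west2",
    "asia-northeast1",
)


def tts_endpoint(location):
    """Return the assumed API host for a GA Chirp 3 HD location."""
    if location not in CHIRP3_HD_GA_LOCATIONS:
        raise ValueError(f"{location!r} is not a listed GA location for Chirp 3 HD")
    if location == "global":
        return "texttospeech.googleapis.com"
    return f"{location}-texttospeech.googleapis.com"


def chirp3_hd_voice(locale, speaker):
    """Build a voice name following the en-US-Chirp3-HD-Charon pattern."""
    return f"{locale}-Chirp3-HD-{speaker}"
```

For instance, `tts_endpoint("eu")` yields `eu-texttospeech.googleapis.com` under the assumed pattern, and `chirp3_hd_voice("en-US", "Charon")` reproduces the voice name quoted in the variants table.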

Chirp 3: Transcription is available only in Speech-to-Text V2 and supports the StreamingRecognize, Recognize, and BatchRecognize methods.
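To make the Recognize path concrete, the sketch below assembles a request URL and JSON body that select the `chirp_3` model, using only the standard library. It builds the request but does not send it. The URL pattern, the `recognizers/_` default recognizer path, and the field names (`autoDecodingConfig`, `languageCodes`, `content`) are assumptions modeled on the Speech-to-Text V2 REST format, and `my-project` is a placeholder; confirm all of them against the official reference, and note that authentication is omitted entirely.

```python
import base64


def build_recognize_request(project_id, location, audio_bytes, language_codes):
    """Return (url, body) for a synchronous V2 Recognize call with chirp_3.

    URL and field names are assumptions based on the V2 REST format.
    """
    url = (
        f"https://{location}-speech.googleapis.com/v2/"
        f"projects/{project_id}/locations/{location}/recognizers/_:recognize"
    )
    body = {
        "config": {
            "autoDecodingConfig": {},  # let the service detect the encoding
            "languageCodes": list(language_codes),
            "model": "chirp_3",
        },
        # JSON requests carry raw audio as base64 text.
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return url, body


url, body = build_recognize_request("my-project", "us", b"\x00\x01", ["en-US"])
```

The resulting pair could be passed to any HTTP client along with an OAuth bearer token; StreamingRecognize and BatchRecognize use different request shapes and are not covered here.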

Official support paths include reference documentation, release notes with RSS support, Vertex AI Studio and console entry points, Colab and GitHub notebooks, community forums, Cloud support, system status, and sales-led access for allowlisted features such as Instant Custom Voice.

Google describes Gemini-TTS as the latest evolution of Cloud TTS, with broader prompt-based control and native multi-speaker options while reusing voice identities similar to Chirp 3 HD.

Pricing

| Chirp 3 commercial surface | Publicly visible pricing | Availability notes |
| --- | --- | --- |
| Chirp 3: HD voices | $30 per 1M characters after the free tier; 1M characters free monthly on the pricing page. | GA; available via Cloud TTS and Vertex AI Studio. |
| Chirp 3: Instant Custom Voice | Public pricing was not fully surfaced in the captured pricing excerpt; the feature is allowlisted. | Restricted access; requires sales/allowlist. |
| Chirp 3: Transcription | Google's Speech-to-Text V2 page lists standard recognition at $0.016/min up to 500k min, then lower volume-tier rates, and dynamic batch at $0.003/min; the public pricing page still labels the included V2 speech model family as "chirp" rather than explicitly chirp_3. | Available only in Speech-to-Text V2; GA and preview regions differ. |

The source notes that Google's public pricing nomenclature lags the model nomenclature: the STT pricing page references "chirp (Speech-to-Text V2 only)" while the current model docs and release notes use chirp_3. The source's reading is that Chirp-family Speech-to-Text V2 pricing applies, but the public pricing page is not as current or precise as the model documentation.
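The list prices above translate into a simple monthly estimate. The sketch below is a back-of-the-envelope calculator, assuming the Chirp-family Speech-to-Text V2 rates apply to chirp_3 as the source reads them. Function and constant names are hypothetical, and because the rates beyond 500k minutes are not given here, the STT function refuses to guess above that threshold.

```python
STT_STANDARD_USD_PER_MIN = 0.016
STT_FIRST_TIER_MINUTES = 500_000
STT_DYNAMIC_BATCH_USD_PER_MIN = 0.003
TTS_USD_PER_MILLION_CHARS = 30.0
TTS_FREE_CHARS_PER_MONTH = 1_000_000


def stt_monthly_cost(minutes, dynamic_batch=False):
    """Estimate monthly STT cost in USD from the quoted list prices."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    if dynamic_batch:
        return minutes * STT_DYNAMIC_BATCH_USD_PER_MIN
    if minutes > STT_FIRST_TIER_MINUTES:
        # Higher-volume rates are not quoted in this profile.
        raise ValueError("rates beyond 500k minutes are not given in the source")
    return minutes * STT_STANDARD_USD_PER_MIN


def tts_monthly_cost(characters):
    """Estimate monthly Chirp 3 HD cost in USD after the free tier."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    billable = max(0, characters - TTS_FREE_CHARS_PER_MONTH)
    return billable / 1_000_000 * TTS_USD_PER_MILLION_CHARS
```

As a worked example, 10,000 minutes of standard recognition comes to about $160, the same volume through dynamic batch to about $30, and 3.5M synthesized characters to $75 once the first 1M free characters are deducted.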

Development and ownership

Chirp 3 is developed and operated by Google as part of Google Cloud. The Chirp family predates Chirp 3: Google introduced Chirp as a speech foundation model in 2023 and described that generation as a 2B-parameter speech model delivering 98% English speech recognition accuracy and large relative gains in some low-resource languages. Current Chirp 3 materials emphasize product capabilities, rollout stages, and API behavior rather than architectural disclosures such as parameter count.

Release history

| Date | Milestone | What changed |
| --- | --- | --- |
| Feb 10, 2025 | Pre-launch rename | Journey voices were rebranded as Chirp HD voices. |
| Mar 6, 2025 | TTS rollout expansion | Chirp 3 HD added 8 speakers in 31 locales. |
| Apr 2, 2025 | TTS GA | Chirp 3 HD voices became GA with 8 speakers / 31 locales, real-time streaming, batch support, and supported regional endpoints. |
| Apr 9, 2025 | Instant Custom Voice GA announcement | Google announced Instant Custom Voice as GA through an allowlist and also announced transcription with diarization in preview/allowlist. |
| Apr 11, 2025 | STT Private Preview | chirp_3 launched in private preview for Speech-to-Text V2. |
| May 7, 2025 | TTS controls expansion | Pace control, pause control, and custom pronunciations were released for Chirp 3 HD voices. |
| Jun 18, 2025 | ICV locale expansion | Instant Custom Voice added ja-JP, pushing support to more than 30 locales. |
| Aug 21 to 27, 2025 | ICV and TTS endpoint upgrades | Instant Custom Voice added PCM, MP3, and M4A input encodings; Chirp 3 HD became available on europe-west2. |
| Aug 29, 2025 | STT Public Preview | chirp_3 public preview launched with 85+ languages/locales in preview coverage and improved speed/accuracy messaging. |
| Sep 15, 2025 | TTS endpoint expansion | Chirp 3 HD became available on asia-northeast1. |
| Oct 13, 2025 | STT GA | Chirp 3: Transcription reached GA in Speech-to-Text V2. |
| Oct 17, 2025 | Limited SSML support | Chirp 3 HD added support for phoneme, p, s, sub, and say-as tags. |
| Nov to Dec 2025 | Regional and language expansion | STT preview regions expanded; TTS added a wide set of European languages, then Punjabi and Cantonese in preview. |

Sources

All cited sources were accessed on June 15, 2026.
