Google Cloud Speech-to-Text default: model profile
Reference profile of Google Cloud Speech-to-Text's default model, a general-purpose legacy baseline retained for backwards compatibility.
Google Cloud Speech-to-Text default is the general-purpose baseline model tag in Google Cloud's Speech-to-Text service, documented for transcribing audio that does not fit the service's more specialized models.
Specifications
| Field | Value |
|---|---|
| Developer | Google (Google Cloud) |
| Released | Not publicly disclosed as a distinct date. Cloud Speech-to-Text V1 launched in April 2017; model selection entered beta in 2018 and became generally available in February 2019. |
| Model type | Speech-to-text model tag; described by Google as "mostly based on classic non-conformer architectures" |
| Languages | All available languages in the V1 API, shared with command_and_search |
| Modes (batch / streaming) | Synchronous recognition for clips under 60 seconds; batch recognition for long audio; streaming with a recommended 100 ms frame size |
| Throughput / concurrency | V2 batch requests accept up to 15 files and 8 hours per file |
| Deployment | Google Cloud Speech-to-Text API (V1 and V2); grouped under V2 "Standard" models on the pricing page |
| Pricing | V2 standard recognition: $0.016 per minute for the first 500,000 minutes per month (about $0.96 per hour); dynamic batch recognition: $0.003 per minute (about $0.18 per hour) |
Not disclosed: Parameters · Training data · Latency · License
Overview
The default model is a model tag within Google Cloud Speech-to-Text rather than a separately branded model architecture. Google's documentation states that default "can be used to transcribe any audio type," supports all available languages in the V1 API together with command_and_search, and is best for audio that "does not fit the other audio models, like long-form audio or dictation."
Google's model-selection page states that default, command_and_search, phone_call, and video are "mostly based on classic non-conformer architectures" and are "primarily kept for legacy and backwards-compatibility reasons."
Selecting the default model is distinct from leaving the model unspecified. Google's documentation says that if no model is specified, Cloud Speech-to-Text "attempts to select the model that best fits the settings" in the RecognitionConfig. Auto-selection and the literal default model tag are separate behaviors.
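The difference between pinning the default tag and leaving the model unspecified shows up in the request body. The following is a minimal sketch, assuming the V1 REST `speech:recognize` JSON shape (`config.languageCode`, `config.model`, `audio.uri`). The helper name `build_recognize_request` and the bucket path are hypothetical, and nothing is sent over the network.

```python
import json
from typing import Optional


def build_recognize_request(gcs_uri: str, language_code: str,
                            model: Optional[str] = None) -> dict:
    """Build a V1-style recognize request body (hypothetical helper)."""
    config = {"languageCode": language_code}
    # Include the "model" field only when a tag is given explicitly.
    # Omitting it leaves model selection to the service.
    if model is not None:
        config["model"] = model
    return {"config": config, "audio": {"uri": gcs_uri}}


# Explicitly pinned to the literal "default" model tag.
pinned = build_recognize_request(
    "gs://example-bucket/audio.flac", "en-US", model="default")

# No model field: the service attempts to pick the best-fitting model.
auto = build_recognize_request("gs://example-bucket/audio.flac", "en-US")

print(json.dumps(pinned, indent=2))
print(json.dumps(auto, indent=2))
```

Keeping the field out of the body entirely, rather than sending an empty string, makes the two behaviors easy to tell apart when requests are logged.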
Google documents default functionally rather than architecturally: the documentation describes what it is good for, what kinds of audio it accepts, and when other models will likely do better. Google does not publicly describe default the way it describes Conformer, USM, or Chirp; the "latest" models are explicitly described as based on Conformer, and Google Research and Google Cloud publish training scale and technical lineage for USM and Chirp.
Google's V2 model-comparison page foregrounds chirp_3, chirp_2, and telephony, and the V2 supported-languages matrix prominently lists chirp_3, chirp_2, long, short, and telephony variants. Google's pricing page still includes default among V2 "Standard" models.
Capabilities and features
Google's documentation states that default can be used to transcribe any audio type and is best for audio that does not fit the other audio models, such as long-form audio or dictation.
The documentation also states that video audio will likely be transcribed at lower quality with default than with the video model, that telephony models produce more accurate results on phone audio than latest_short or latest_long, and that enhanced phone and video models historically beat default when the audio source matches their training domain.
Google's V1 documentation says latest_long can be used "in place of the default model" and describes latest_long as appropriate for long-form content, spontaneous speech, and conversations. Google directs new users to the V2 API.
Google's model lineup, as summarized from the V1 and V2 model-selection docs, latest-model docs, Chirp docs, and supported-language pages:
| Model family | What Google says it is for | What that implies for default |
|---|---|---|
| default | Audio that does not fit other model types; long-form audio or dictation; any audio type | The broad fallback / baseline |
| command_and_search | Short commands and voice search | Separate short-utterance legacy path |
| phone_call / telephony / telephony_short | Phone-originated audio, often 8 kHz | Specialization beats generic fallback for telephony |
| video | Video, podcasts, multi-speaker, noisy or high-quality mic audio | Stronger than default for media-like audio |
| medical_* | Medical dictation and patient-professional conversation | Premium, regulated-domain specialization |
| latest_long / latest_short or V2 long / short | Newer general models built on Conformer | Modern replacements for much of default's former job |
| chirp_2 / chirp_3 | Large multilingual ASR-specific generative models | Google's research-forward multilingual direction |
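The fallback role of default in the table above can be expressed as a simple lookup. This is a sketch only: the audio-kind labels (`voice_command`, `phone`, `video`) and the function name `pick_legacy_model` are hypothetical, and the mapping covers only the legacy V1 tags named in the table.

```python
# Hypothetical audio-kind labels mapped to the legacy model tags in the table.
MODEL_BY_AUDIO_KIND = {
    "voice_command": "command_and_search",  # short commands and voice search
    "phone": "phone_call",                  # phone-originated audio
    "video": "video",                       # video, podcasts, multi-speaker
}


def pick_legacy_model(audio_kind: str) -> str:
    """Return a specialized tag when one matches, else the broad fallback."""
    return MODEL_BY_AUDIO_KIND.get(audio_kind, "default")


print(pick_legacy_model("phone"))
print(pick_legacy_model("dictation"))
```

Anything without a specialized entry, such as dictation or long-form audio, falls through to default, which mirrors how Google describes the tag.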
Language support
In the V1 API, default and command_and_search support all available languages, whereas enhanced or specialized models are more selective.
The V2 supported-languages matrix prominently lists chirp_3, chirp_2, long, short, and telephony variants.
Performance and benchmarks
Google does not publish exact public latency or benchmark data for the default model itself.
Vendor-reported guidance in the documentation: default can transcribe any audio type, but video audio will likely be transcribed at lower quality than with the video model; telephony models produce more accurate results on phone audio than latest_short or latest_long; enhanced phone and video models historically beat default when the audio source matches their training domain.
Google's "latest models" documentation states that the latest models are designed to expose newer machine learning research directly to Cloud users and can often provide higher accuracy than older available models, although some features still lag older models.
Latency and throughput
Google does not publish exact latency figures for default. Documented service constraints are synchronous recognition for clips under 60 seconds, batch recognition for long audio, and up to 15 files and 8 hours per file in V2 batch requests. Best-practice guidance recommends audio sampled at 16 kHz or higher where possible, lossless codecs, proper model selection, and clean audio capture. For streaming, Google recommends 100 ms frame sizes as a balance between latency and efficiency.
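These constraints translate into a few lines of arithmetic. The sketch below assumes 16 kHz, 16-bit, mono linear PCM (the sample width and channel count are assumptions, not stated in the guidance), and all function names are hypothetical.

```python
from typing import Iterator, List

SAMPLE_RATE_HZ = 16_000          # 16 kHz, per the best-practice guidance
BYTES_PER_SAMPLE = 2             # assumed: 16-bit linear PCM, mono
FRAME_MS = 100                   # recommended streaming frame size
MAX_BATCH_FILES = 15             # V2 batch request limit
MAX_FILE_SECONDS = 8 * 60 * 60   # 8 hours per file


def frame_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     bytes_per_sample: int = BYTES_PER_SAMPLE,
                     frame_ms: int = FRAME_MS) -> int:
    """Bytes of raw PCM in one streaming frame."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


def iter_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def pick_mode(duration_s: float) -> str:
    """Synchronous for clips under 60 seconds, batch otherwise."""
    return "synchronous" if duration_s < 60 else "batch"


def batch_request_ok(durations_s: List[float]) -> bool:
    """Check a batch against the documented file-count and length limits."""
    return (len(durations_s) <= MAX_BATCH_FILES
            and all(d <= MAX_FILE_SECONDS for d in durations_s))


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # 1 s of silence
frames = list(iter_frames(one_second, frame_size_bytes()))
print(frame_size_bytes(), len(frames))  # 3200 10
```

Under these assumptions a 100 ms frame is 3,200 bytes, so one second of audio is sent as ten frames.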
Deployment and integrations
The default model is available through the Google Cloud Speech-to-Text API. Google's pricing page includes default among V2 "Standard" models, alongside latest_short, latest_long, phone_call, video, and Chirp.
Google directs new users to the V2 API.
The service is part of an integrated cloud offering with regionalization, reusable recognizers, audit logging, and encryption options. Google states that Speech-to-Text processes more than 1 billion voice minutes per month for enterprise customers and that V2 was built in part to meet regionalization, security, and regulatory requirements.
Pricing
| Recognition mode | Price | Approximate hourly rate |
|---|---|---|
| V2 standard recognition | $0.016 per minute for the first 500,000 minutes per month | About $0.96 per hour |
| V2 dynamic batch recognition | $0.003 per minute | About $0.18 per hour |
Google groups default with other "Standard" models such as latest_short, latest_long, phone_call, video, and Chirp on the pricing page.
Google's documentation states that latest_long and latest_short have the same usage costs as default and command_and_search.
For comparison, the source cites a tier-1 Amazon Transcribe example price of $0.024 per minute in US East, or about $1.44 per hour; the exact comparison depends on region, tier, and feature set. Azure pricing is mode- and region-dependent and is presented differently across its pricing surfaces.
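The hourly figures follow from multiplying each per-minute rate by 60. The sketch below reproduces that arithmetic with the rates quoted in this section; the dictionary keys and function names are hypothetical, and because this section gives no V2 standard rate beyond the first 500,000 minutes per month, the estimate refuses larger volumes instead of guessing.

```python
from decimal import Decimal

# Per-minute rates as quoted in this section (USD).
RATES_PER_MINUTE = {
    "v2_standard": Decimal("0.016"),
    "v2_dynamic_batch": Decimal("0.003"),
    "amazon_transcribe_tier1_us_east": Decimal("0.024"),
}
V2_STANDARD_FIRST_TIER_MINUTES = 500_000


def hourly_rate(mode: str) -> Decimal:
    """Per-minute rate times 60 minutes."""
    return RATES_PER_MINUTE[mode] * 60


def estimate_cost(mode: str, minutes: int) -> Decimal:
    """Monthly cost at the quoted rate; first tier only for V2 standard."""
    if mode == "v2_standard" and minutes > V2_STANDARD_FIRST_TIER_MINUTES:
        raise ValueError("rate beyond the first 500,000 minutes is not given here")
    return RATES_PER_MINUTE[mode] * minutes


for mode in RATES_PER_MINUTE:
    print(mode, hourly_rate(mode))
```

`Decimal` is used instead of floats so that small per-minute rates multiply out exactly.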
Development and ownership
The model is developed and operated by Google as part of Google Cloud Speech-to-Text.
Google's speech systems moved from Gaussian Mixture Model acoustic modeling into DNNs and then LSTM-based systems in the early 2010s. Google Research's 2015 explanation of Google Voice transcription says the service had used GMMs since 2009, that DNNs "revolutionized" speech recognition around 2012, and that Google rebuilt its older voicemail transcription around LSTM RNNs.
Google's later model families fall into three groups. The "latest" models, introduced in 2022, are based on Conformer, which Google Research described as a convolution-augmented Transformer architecture that achieved state-of-the-art ASR accuracy in its publication. USM, described by Google Research in 2023, is a 2B-parameter family trained on 12 million hours of speech and 28 billion sentences of text spanning 300+ languages, using self-supervised and multilingual pretraining. Chirp, packaged by Google Cloud as a foundation model for Speech AI, continued with Chirp 2 and Chirp 3 in Speech-to-Text V2 in 2025 and 2026.
Release history
| Date | Event |
|---|---|
| 2009 | Google Voice transcription used GMM acoustic modeling |
| Around 2012 | DNNs adopted in Google speech recognition |
| 2015 | Google Research describes rebuilding voicemail transcription around LSTM RNNs |
| April 2017 | Cloud Speech-to-Text V1 launches |
| 2018 | Beta support for choosing different speech recognition models, including a model optimized for video |
| February 2019 | Model selection and enhanced models become generally available |
| 2022 | "Latest" Conformer-based models introduced |
| 2023 | Google Research describes USM; Google Cloud packages the research direction as Chirp |
| 2025 and 2026 | Chirp 2 and Chirp 3 released in Speech-to-Text V2 |
Sources
Cloud Speech-to-Text V1 supported languages | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/v1/speech-to-text-supported-languages
Select a transcription model | Cloud Speech-to-Text | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/v1/transcription-model
Introduction to Latest Models | Cloud Speech-to-Text | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/v1/latest-models
Compare transcription models | Cloud Speech-to-Text | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/transcription-model
The neural networks behind Google Voice transcription https://research.google/blog/the-neural-networks-behind-google-voice-transcription/
Speech-to-Text release notes | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/release-notes
Conformer: Convolution-augmented Transformer for Speech Recognition https://research.google/pubs/conformer-convolution-augmented-transformer-for-speech-recognition/
Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/
Speech-to-Text API Pricing | Google Cloud https://cloud.google.com/speech-to-text/pricing
What is Amazon Transcribe? - Amazon Transcribe https://docs.aws.amazon.com/transcribe/latest/dg/what-is.html
Models & Languages Overview https://developers.deepgram.com/docs/models-languages-overview
Google Cloud Chirp model for Speech AI | Google Cloud Blog https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api
Speech-to-Text. Unexpected Transcribing Numbers as Digits - AI APIs - Google Developer forums https://discuss.google.dev/t/speech-to-text-unexpected-transcribing-numbers-as-digits/181656