Google Cloud Speech-to-Text default: model profile

Google Cloud Speech-to-Text default is the general-purpose baseline model tag in Google Cloud's Speech-to-Text service, documented for transcribing audio that does not fit the service's more specialized models.

Specifications

Developer	Google (Google Cloud)
Released	Not publicly disclosed as a distinct date. Cloud Speech-to-Text V1 launched in April 2017; model selection entered beta in 2018 and became generally available in February 2019.
Model type	Speech-to-text model tag; described by Google as "mostly based on classic non-conformer architectures"
Languages	All available languages in the V1 API, shared with command_and_search
Modes (batch / streaming)	Synchronous recognition for clips under 60 seconds; batch recognition for long audio; streaming with a recommended 100 ms frame size
Throughput / concurrency	V2 batch requests accept up to 15 files and 8 hours per file
Deployment	Google Cloud Speech-to-Text API (V1 and V2); grouped under V2 "Standard" models on the pricing page
Pricing	V2 standard recognition: $0.016 per minute for the first 500,000 minutes per month (about $0.96 per hour); dynamic batch recognition: $0.003 per minute (about $0.18 per hour)

Not disclosed: Parameters · Training data · Latency · License

Known limitations

Google states that default, command_and_search, phone_call, and video are "mostly based on classic non-conformer architectures" and are "primarily kept for legacy and backwards-compatibility reasons."
Google does not publish exact latency or benchmark data for default.
Parameter count, training data, and architecture details for default are not publicly disclosed; Google documents default as a product setting rather than as a model with a published technical description.
Google's documentation says video audio will likely be transcribed at lower quality with default than with the video model, and that telephony models produce more accurate results on phone audio.
Third-party and community reports: Google's own forums include threads where support suggests experimenting between default and latest_long for longer audio, and where developers report mismatches between documented language/model support and observed behavior in V2. A GitHub issue against Google's samples repository in late 2023 reported that V2 documentation was overly Python-centric and thin for migration. These sources are anecdotal.

Full technical breakdown

Overview

The default model is a model tag within Google Cloud Speech-to-Text rather than a separately branded model architecture. Google's documentation states that default "can be used to transcribe any audio type," supports all available languages in the V1 API together with command_and_search, and is best for audio that "does not fit the other audio models, like long-form audio or dictation."

Google's model-selection page states that default, command_and_search, phone_call, and video are "mostly based on classic non-conformer architectures" and are "primarily kept for legacy and backwards-compatibility reasons."

Selecting the default model is distinct from leaving the model unspecified. Google's documentation says that if no model is specified, Cloud Speech-to-Text "attempts to select the model that best fits the settings" in the RecognitionConfig. Auto-selection and the literal default model tag are separate behaviors.
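
The difference between the two behaviors can be sketched as a request body for the V1 REST speech:recognize method. This is a minimal sketch under stated assumptions, not Google's reference code: build_recognize_request and the gs:// URI are hypothetical, and the encoding, sample rate, and language values are illustrative. The only point it makes is that the model field is either set to the literal string "default" or left out entirely.

```python
import json
from typing import Optional


def build_recognize_request(gcs_uri: str, model: Optional[str] = None) -> dict:
    """Build a JSON-serializable body for a V1 speech:recognize call.

    Hypothetical helper for illustration only. When `model` is None the
    field is omitted, which leaves model choice to the service.
    """
    config = {
        "encoding": "LINEAR16",    # assumed audio encoding
        "sampleRateHertz": 16000,  # assumed sample rate
        "languageCode": "en-US",   # assumed language
    }
    if model is not None:
        config["model"] = model
    return {"config": config, "audio": {"uri": gcs_uri}}


# Explicitly request the literal `default` model tag.
explicit = build_recognize_request("gs://example-bucket/clip.wav", model="default")

# Leave the model unspecified so the service attempts auto-selection.
auto = build_recognize_request("gs://example-bucket/clip.wav")

print(json.dumps(explicit, indent=2))
print(json.dumps(auto, indent=2))
```

Omitting the key, rather than sending an empty string, keeps the two cases unambiguous in logs and tests.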

Google documents default functionally rather than architecturally: the documentation describes what it is good for, what kinds of audio it accepts, and when other models will likely do better. Google does not publicly describe default the way it describes Conformer, USM, or Chirp; the "latest" models are explicitly described as based on Conformer, and Google Research and Google Cloud publish training scale and technical lineage for USM and Chirp.

Google's V2 model-comparison page foregrounds chirp_3, chirp_2, and telephony, and the V2 supported-languages matrix prominently lists chirp_3, chirp_2, long, short, and telephony variants. Google's pricing page still includes default among V2 "Standard" models.

Capabilities and features

Google's documentation states that default can be used to transcribe any audio type and is best for audio that does not fit the other audio models, such as long-form audio or dictation.

The documentation also states that video audio will likely be transcribed at lower quality with default than with the video model, that telephony models produce more accurate results on phone audio than latest_short or latest_long, and that enhanced phone and video models historically beat default when the audio source matches their training domain.

Google's V1 documentation says latest_long can be used "in place of the default model" and describes latest_long as appropriate for long-form content, spontaneous speech, and conversations. Google directs new users to the V2 API.

Google's model lineup, as summarized from the V1 and V2 model-selection docs, latest-model docs, Chirp docs, and supported-language pages:

Model family	What Google says it is for	What that implies for default
default	Audio that does not fit other model types; long-form audio or dictation; any audio type	The broad fallback / baseline
command_and_search	Short commands and voice search	Separate short-utterance legacy path
phone_call / telephony / telephony_short	Phone-originated audio, often 8 kHz	Specialization beats generic fallback for telephony
video	Video, podcasts, multi-speaker, noisy or high-quality mic audio	Stronger than default for media-like audio
medical_*	Medical dictation and patient-professional conversation	Premium, regulated-domain specialization
latest_long / latest_short or V2 long / short	Newer general models built on Conformer	Modern replacements for much of default's former job
chirp_2 / chirp_3	Large multilingual ASR-specific generative models	Google's research-forward multilingual direction
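
One way to read the table is as a lookup with default as the fallback. The sketch below is an assumption-laden illustration, not a Google-provided routine: the audio category names and the suggest_model helper are hypothetical, and only a few V1 model tags named in this profile are included.

```python
# Hypothetical mapping from a coarse audio category to a V1 model tag,
# loosely following the guidance summarized in the table above.
MODEL_BY_AUDIO_KIND = {
    "voice_command": "command_and_search",
    "phone": "phone_call",
    "video": "video",
}


def suggest_model(audio_kind: str) -> str:
    """Return a specialized model tag when one matches, else `default`."""
    return MODEL_BY_AUDIO_KIND.get(audio_kind, "default")


for kind in ("phone", "video", "dictation"):
    print(kind, "->", suggest_model(kind))
```

Anything the mapping does not recognize falls through to default, mirroring its documented role as the model for audio that does not fit the others.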

Language support

In the V1 API, default and command_and_search support all available languages, whereas enhanced or specialized models are more selective.

The V2 supported-languages matrix prominently lists chirp_3, chirp_2, long, short, and telephony variants.

Performance and benchmarks

Google does not publish latency or benchmark figures for the default model itself.

Vendor-reported guidance in the documentation: default can transcribe any audio type, but video audio will likely be transcribed at lower quality than with the video model; telephony models produce more accurate results on phone audio than latest_short or latest_long; enhanced phone and video models historically beat default when the audio source matches their training domain.

Google's "latest models" documentation states that the latest models are designed to expose newer machine learning research directly to Cloud users and can often provide higher accuracy than older available models, although some features still lag older models.

Latency and throughput

Google does not publish exact latency figures for default. Documented service constraints and tuning guidance include: synchronous recognition for clips under 60 seconds; batch recognition for long audio; up to 15 files and 8 hours per file in V2 batch requests; best-practice guidance recommending 16 kHz or higher audio where possible, lossless codecs, proper model selection, and clean audio capture. Google recommends 100 ms frame sizes for streaming as a balance between latency and efficiency.
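
The documented limits and the 100 ms frame recommendation translate into simple arithmetic. The sketch below assumes 16-bit mono LINEAR16 PCM; all function names are hypothetical, and the checks encode only the constraints quoted above, not the service's full validation rules.

```python
from typing import Iterator, List

SYNC_LIMIT_SECONDS = 60          # synchronous recognition: clips under 60 s
BATCH_MAX_FILES = 15             # V2 batch: up to 15 files per request
BATCH_MAX_SECONDS = 8 * 60 * 60  # V2 batch: up to 8 hours per file
FRAME_MS = 100                   # recommended streaming frame size


def frame_size_bytes(sample_rate_hz: int, bytes_per_sample: int = 2,
                     channels: int = 1) -> int:
    """Bytes in one 100 ms frame of raw PCM (16-bit mono assumed by default)."""
    return sample_rate_hz * FRAME_MS // 1000 * bytes_per_sample * channels


def iter_frames(pcm: bytes, sample_rate_hz: int) -> Iterator[bytes]:
    """Split raw PCM into 100 ms frames; the last frame may be shorter."""
    size = frame_size_bytes(sample_rate_hz)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def fits_synchronous(duration_seconds: float) -> bool:
    """True when a clip is short enough for synchronous recognition."""
    return duration_seconds < SYNC_LIMIT_SECONDS


def check_batch(durations_seconds: List[float]) -> None:
    """Raise ValueError if a batch exceeds the documented V2 limits."""
    if len(durations_seconds) > BATCH_MAX_FILES:
        raise ValueError("too many files for one batch request")
    for duration in durations_seconds:
        if duration > BATCH_MAX_SECONDS:
            raise ValueError("file longer than 8 hours")


# One second of 16 kHz, 16-bit mono silence, split into streaming frames.
frames = list(iter_frames(bytes(32000), 16000))
print(len(frames), "frames of", frame_size_bytes(16000), "bytes")
```

At 16 kHz with 16-bit mono samples, each 100 ms frame carries 1,600 samples, or 3,200 bytes.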

Deployment and integrations

The default model is available through the Google Cloud Speech-to-Text API. Google's pricing page includes default among V2 "Standard" models, alongside latest_short, latest_long, phone_call, video, and Chirp.

Google directs new users to the V2 API.

The service is part of an integrated cloud offering with regionalization, reusable recognizers, audit logging, and encryption options. Google states that Speech-to-Text processes more than 1 billion voice minutes per month for enterprise customers and that V2 was built in part to meet regionalization, security, and regulatory requirements.

Pricing

Recognition mode	Price	Approximate hourly rate
V2 standard recognition	$0.016 per minute for the first 500,000 minutes per month	About $0.96 per hour
V2 dynamic batch recognition	$0.003 per minute	About $0.18 per hour

Google groups default with other "Standard" models such as latest_short, latest_long, phone_call, video, and Chirp on the pricing page.

Google's documentation states that latest_long and latest_short have the same usage costs as default and command_and_search.

For comparison, one cited example price for Amazon Transcribe's first pricing tier is $0.024 per minute in US East, or about $1.44 per hour; the exact comparison depends on region, tier, and feature set. Azure pricing is mode- and region-dependent and is presented differently across its pricing surfaces.
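
The hourly figures in this section are the per-minute rates multiplied by 60. The sketch below reproduces that arithmetic with Decimal to avoid floating-point drift. The rate keys and helper names are hypothetical, and because this profile gives no rate beyond the first 500,000 standard minutes per month, the estimator refuses to extrapolate past that tier.

```python
from decimal import Decimal

# Per-minute list prices quoted in this profile (USD).
RATES_PER_MINUTE = {
    "v2_standard": Decimal("0.016"),
    "v2_dynamic_batch": Decimal("0.003"),
    "amazon_transcribe_tier1_us_east": Decimal("0.024"),
}
STANDARD_TIER_MINUTES = 500_000  # first-tier ceiling for V2 standard


def hourly_rate(rate_key: str) -> Decimal:
    """Convert a per-minute rate to a per-hour rate."""
    return RATES_PER_MINUTE[rate_key] * 60


def estimate_cost(minutes: int, rate_key: str) -> Decimal:
    """Estimate monthly cost within the first pricing tier only."""
    if rate_key == "v2_standard" and minutes > STANDARD_TIER_MINUTES:
        raise ValueError("rate beyond the first 500,000 minutes is not given here")
    return RATES_PER_MINUTE[rate_key] * minutes


for key in RATES_PER_MINUTE:
    print(key, hourly_rate(key))
```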

Development and ownership

The model is developed and operated by Google as part of Google Cloud Speech-to-Text.

Google's speech systems moved from Gaussian Mixture Model acoustic modeling into DNNs and then LSTM-based systems in the early 2010s. Google Research's 2015 explanation of Google Voice transcription says the service had used GMMs since 2009, that DNNs "revolutionized" speech recognition around 2012, and that Google rebuilt its older voicemail transcription around LSTM RNNs.

Google's later model families include three lines of work. The "latest" models, introduced in 2022, are based on Conformer, which Google Research described as a convolution-augmented Transformer architecture that achieved state-of-the-art ASR accuracy in its publication. USM, described by Google Research in 2023, is a 2B-parameter family trained on 12 million hours of speech and 28 billion sentences of text spanning 300+ languages, using self-supervised and multilingual pretraining. Chirp, packaged by Google Cloud as a foundation model for Speech AI, continued with Chirp 2 and Chirp 3 in Speech-to-Text V2 in 2025 and 2026.

Release history

Date	Event
2009	Google Voice transcription used GMM acoustic modeling
Around 2012	DNNs adopted in Google speech recognition
2015	Google Research describes rebuilding voicemail transcription around LSTM RNNs
April 2017	Cloud Speech-to-Text V1 launches
2018	Beta support for choosing different speech recognition models, including a model optimized for video
February 2019	Model selection and enhanced models become generally available
2022	"Latest" Conformer-based models introduced
2023	Google Research describes USM; Google Cloud packages the research direction as Chirp
2025 and 2026	Chirp 2 and Chirp 3 released in Speech-to-Text V2

Sources

Cloud Speech-to-Text V1 supported languages | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/v1/speech-to-text-supported-languages

Select a transcription model | Cloud Speech-to-Text | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/v1/transcription-model

Introduction to Latest Models | Cloud Speech-to-Text | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/v1/latest-models

Compare transcription models | Cloud Speech-to-Text | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/transcription-model

The neural networks behind Google Voice transcription https://research.google/blog/the-neural-networks-behind-google-voice-transcription/

Speech-to-Text release notes | Google Cloud Documentation https://docs.cloud.google.com/speech-to-text/docs/release-notes

Conformer: Convolution-augmented Transformer for Speech Recognition https://research.google/pubs/conformer-convolution-augmented-transformer-for-speech-recognition/

Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/

Speech-to-Text API Pricing | Google Cloud https://cloud.google.com/speech-to-text/pricing

What is Amazon Transcribe? - Amazon Transcribe https://docs.aws.amazon.com/transcribe/latest/dg/what-is.html

Models & Languages Overview https://developers.deepgram.com/docs/models-languages-overview

Google Cloud Chirp model for Speech AI | Google Cloud Blog https://cloud.google.com/blog/products/ai-machine-learning/bringing-power-large-models-google-clouds-speech-api

Speech-to-Text. Unexpected Transcribing Numbers as Digits - AI APIs - Google Developer forums https://discuss.google.dev/t/speech-to-text-unexpected-transcribing-numbers-as-digits/181656