Google Cloud Speech-to-Text latest_long: model profile

latest_long is Google Cloud Speech-to-Text's general-purpose long-form automatic speech recognition model for media, conversations, and other extended audio.

Specifications

Developer	Google (Google Cloud)
Released	April 21, 2022: release notes announced "Latest" models available in more than 20 languages, using "new end-to-end machine learning techniques"
Model type	Long-form, end-to-end ASR model based on Google's Conformer speech-model technology
Training data	Google says Cloud STT models are trained by analyzing millions of examples of human speech; specialized models are trained on audio from specific sources. Corpus breakdown not publicly disclosed.
Languages	"Latest" family available in more than 20 languages and more than 50 variants; feature support varies by language
Modes (batch / streaming)	Synchronous recognition, asynchronous long-running recognition, and streaming recognition
Latency	No formal P50/P95 benchmark published. Streaming returns results in real time; long-running recognition often completes in about half the source audio length
Throughput / concurrency	V1 quotas: 900 recognition requests per 60 seconds; 480 hours of audio processing per day
Deployment	Google Cloud Speech-to-Text API (V1); US and EU regional endpoints in V1, broader regionalization in V2
Pricing	V1 standard model: $0.016/minute above 60 monthly minutes with data logging, $0.024/minute without; first 60 minutes per month free
License	Not publicly disclosed. Offered as a GA model under the Speech-to-Text SLA.

Parameters	Not disclosed

Known limitations

Confidence scores: Google states that the latest models do not support true confidence scores. The API may return a value, but Google says it is "not truly a confidence score."
Domain mismatch: Google says telephony models outperform latest_long on phone audio, and describes the legacy video model as often the better choice for multi-speaker media, especially with high-quality microphones and heavy background noise.
Feature variability by language: Google notes that support for punctuation, diarization, and other features varies by language; the supported-language matrix is the current source of truth.
Model opacity: Google documents the Conformer lineage and exposed API parameters, but not the exact parameter count, data mix, decoder design, training-corpus breakdown, or internal serving architecture for latest_long.
Rolling updates: Google says "Latest" models may be updated or refreshed more frequently than other models, with changes that can affect accuracy or latency; there is no frozen, versioned checkpoint.
Underlying technology: Google warns that the Conformer-based underlying technology "may change in the future."
Legacy feature lag: Google notes that some legacy features in the "Latest" family may lag behind their counterparts in older models.
No published benchmarks: Google does not publish WER benchmarks, a dedicated robustness benchmark, or P50/P95 latency figures for latest_long in the sources reviewed.
No native PII redaction: transcript redaction requires a separate step using Sensitive Data Protection.

Full technical breakdown

Overview

Google defines latest_long as the model to use for "any kind of long form content such as media or spontaneous speech and conversations," and says it can be used in place of the video model when video is unavailable in the target language, or in place of the legacy default model. Google describes the "Latest" family as a way to access the "latest speech technology and machine learning research" from Google, with the caveat that some legacy features may lag behind older models.

Google positions latest_long as the recommended replacement for many video and default workloads in Cloud Speech-to-Text V1, and states that the "Latest" family is based on Google's Conformer speech-model technology. Google's documentation states that "Latest" models may be updated or refreshed more frequently than other models, with changes that can affect accuracy or latency; latest_long is a product label for a moving production model, not a frozen research checkpoint.

latest_long sits in the classic Speech-to-Text API line. Google now recommends that new users use the V2 API, whose long-form generic model is typically shown with the identifier long, while the versioned Chirp 2 and Chirp 3 models are Google's newer, explicitly documented multilingual ASR models.

Capabilities and features

latest_long can be used with all three classic Speech-to-Text recognition modes: synchronous recognition, asynchronous long-running recognition, and streaming recognition. Model selection applies to the speech:recognize and speech:longrunningrecognize methods and to streaming recognition.

Documented features in the V1 RecognitionConfig include adaptation, speechContexts, transcriptNormalization, word timestamps, word confidence, punctuation, diarization, profanity filtering, and spoken punctuation and spoken emoji controls. When adaptation is set, it supersedes speechContexts.

Automatic punctuation is optional and, when enabled, inserts periods, commas, and question marks; Google says it automatically capitalizes the first letter after each period and question mark. Spoken punctuation and spoken emojis are separate toggles. Feature availability depends on language support.
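
As a rough illustration of the options described above, the following sketch assembles a V1 request body that selects latest_long and enables automatic punctuation and word timestamps. The field names follow the V1 RecognitionConfig REST reference as understood here; the helper name build_recognition_config and the Cloud Storage URI are hypothetical, and whether each flag is honored depends on the language.

```python
import json


def build_recognition_config(language_code: str = "en-US") -> dict:
    """Return a V1-style RecognitionConfig dict that selects latest_long."""
    return {
        "languageCode": language_code,
        "model": "latest_long",
        # Optional: inserts periods, commas, and question marks.
        "enableAutomaticPunctuation": True,
        # Spoken punctuation and spoken emojis are separate toggles.
        "enableSpokenPunctuation": False,
        "enableSpokenEmojis": False,
        # Word-level timestamps.
        "enableWordTimeOffsets": True,
        "profanityFilter": False,
    }


# Hypothetical request body; the bucket and object names are placeholders.
request_body = {
    "config": build_recognition_config(),
    "audio": {"uri": "gs://example-bucket/interview.flac"},
}

print(json.dumps(request_body, indent=2))
```

The dict is plain JSON, so the same body could be posted to speech:recognize or speech:longrunningrecognize depending on audio length.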

Speaker diarization is supported by the API. Google's release notes identify October 2022 as the point when diarization became available for "Latest" models in en-US. In streaming mode, diarization causes the service to resend the accumulating word history on consecutive responses so that speaker tags can improve over time. For non-streaming requests, diarization is emitted on the final result.
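
A minimal sketch of how a client might turn a diarized response into speaker turns. It assumes the response has been parsed into a dict shaped like the V1 JSON (results, alternatives, words, speakerTag) and that the last result carries the full speaker-tagged word list, as the paragraph above describes for non-streaming requests. The function name speaker_turns and the sample payload are illustrative, not taken from Google's documentation.

```python
def speaker_turns(response: dict) -> list[tuple[int, str]]:
    """Group speaker-tagged words from the final result into consecutive turns."""
    results = response.get("results", [])
    if not results:
        return []
    # Assumption: the final result holds every word with its speaker tag.
    words = results[-1]["alternatives"][0].get("words", [])
    turns: list[tuple[int, list[str]]] = []
    for info in words:
        tag = info.get("speakerTag", 0)
        if turns and turns[-1][0] == tag:
            turns[-1][1].append(info["word"])  # same speaker keeps talking
        else:
            turns.append((tag, [info["word"]]))  # speaker change starts a new turn
    return [(tag, " ".join(spoken)) for tag, spoken in turns]


# Hand-written sample payload for illustration only.
sample_response = {
    "results": [
        {"alternatives": [{"transcript": "hello there hi"}]},
        {
            "alternatives": [
                {
                    "words": [
                        {"word": "hello", "speakerTag": 1},
                        {"word": "there", "speakerTag": 1},
                        {"word": "hi", "speakerTag": 2},
                    ]
                }
            ]
        },
    ]
}

print(speaker_turns(sample_response))
```

In streaming mode the same grouping could be re-run on each response, since the service resends the accumulated word history with revised tags.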

Model adaptation is Google's domain-biasing mechanism: phrase sets, class tokens, and boost values bias recognition toward names, jargon, addresses, or other rare or domain-specific content. Google says adaptation is helpful when the audio contains frequent specialized terms, rare words, or noise or unclear speech. For latest_long, adaptation became available in 13 languages in January 2024.
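
The biasing mechanism can be sketched as an inline phrase set attached to the request config. The structure below (adaptation, phraseSets, phrases, value, boost) follows the V1 REST shape as understood here; the helper with_adaptation, the example phrases, and the boost values are hypothetical. Because adaptation supersedes speechContexts, the sketch drops any speechContexts entry rather than sending a field that would be ignored, which is a design choice of this example rather than an API requirement.

```python
def with_adaptation(config: dict, phrases: dict[str, float]) -> dict:
    """Return a copy of config carrying an inline phrase set for biasing."""
    # adaptation supersedes speechContexts, so the older field is left out.
    updated = {key: value for key, value in config.items() if key != "speechContexts"}
    updated["adaptation"] = {
        "phraseSets": [
            {
                "phrases": [
                    {"value": phrase, "boost": boost}
                    for phrase, boost in phrases.items()
                ]
            }
        ]
    }
    return updated


base_config = {
    "languageCode": "en-US",
    "model": "latest_long",
    "speechContexts": [{"phrases": ["Conformer"]}],
}

# Illustrative domain terms and boost values; tune against real audio.
biased_config = with_adaptation(base_config, {"Conformer": 10.0, "diarization": 5.0})
print(biased_config["adaptation"])
```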

For deeper customization, Google's Custom Speech-to-Text models in V2 use a pre-trained, Conformer-based architecture and are fine-tuned from a base model. Google's current docs list latest_long as the base model for supported custom-model locales such as de-DE, en-AU, en-GB, en-IN, en-US, es-US, and es-ES. Training requires at least 100 audio hours of training data and 10 audio hours of validation data.
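
A small pre-flight check for the data minimums quoted above, written as a sketch. The thresholds come from the paragraph; the function name and message wording are hypothetical, and the check says nothing about audio quality or locale eligibility.

```python
MIN_TRAINING_HOURS = 100.0
MIN_VALIDATION_HOURS = 10.0


def custom_model_data_issues(training_hours: float, validation_hours: float) -> list[str]:
    """List the documented minimums that a proposed dataset fails to meet."""
    issues = []
    if training_hours < MIN_TRAINING_HOURS:
        issues.append(
            f"training data: {training_hours:g} h, need at least {MIN_TRAINING_HOURS:g} h"
        )
    if validation_hours < MIN_VALIDATION_HOURS:
        issues.append(
            f"validation data: {validation_hours:g} h, need at least {MIN_VALIDATION_HOURS:g} h"
        )
    return issues


print(custom_model_data_issues(training_hours=80, validation_hours=12))
```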

Google's Speech-to-Text docs do not advertise native latest_long PII redaction as a model feature. Google documents Sensitive Data Protection as the product for classification, redaction, and de-identification of text content.

Language support

Google says the "Latest" family is available in more than 20 languages and more than 50 variants, with feature support varying by language. Google's supported-language pages are the source of truth for the current matrix.

For multilingual behavior, the V1 API allows a primary languageCode plus up to three alternativeLanguageCodes, with analogous multiple-language recognition in V2. Google notes that this is best kept to the minimum necessary list, because too many alternatives reduce the odds of selecting the correct language. In V2, Google states that alternative-language recognition works with the long, short, and telephony models.
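
The three-alternative limit can be enforced client-side before a request is sent. The sketch below builds a V1-style config with languageCode and alternativeLanguageCodes; the helper name is hypothetical, and whether a particular model and language combination accepts alternative languages is an assumption that should be checked against the supported-language pages.

```python
MAX_ALTERNATIVE_LANGUAGES = 3


def multilingual_config(primary: str, alternatives: list[str], model: str = "latest_long") -> dict:
    """Build a config with a primary language and at most three alternatives."""
    # Drop duplicates and the primary code itself, keeping the caller's order.
    unique = [code for code in dict.fromkeys(alternatives) if code != primary]
    if len(unique) > MAX_ALTERNATIVE_LANGUAGES:
        raise ValueError(
            f"at most {MAX_ALTERNATIVE_LANGUAGES} alternative languages, got {len(unique)}"
        )
    config = {"languageCode": primary, "model": model}
    if unique:
        config["alternativeLanguageCodes"] = unique
    return config


print(multilingual_config("en-US", ["es-US", "en-US", "es-US"]))
```

Keeping the list short matches Google's guidance that too many alternatives reduce the odds of selecting the correct language.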

Model adaptation for latest_long is available in 13 languages as of January 9, 2024.

Performance and benchmarks

Google does not publish WER benchmarks, robustness benchmarks, or formal latency benchmarks for latest_long in the sources reviewed.

Vendor-reported relative performance statements from Google's documentation:

For telephony audio, Google says the newer telephony or telephony_short models will outperform latest_long and latest_short on phone audio.
For noisy, multi-speaker video-style media, Google describes the legacy video model as often the best choice, especially for high-quality microphones and heavy background noise, while recommending latest_long as a replacement in many cases.
Google warns that lossy codecs plus background noise can reduce accuracy, and recommends lossless audio such as FLAC or LINEAR16.
Google says model adaptation can help when audio is noisy or unclear.
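
To illustrate the lossless-audio recommendation, this sketch reads a WAV header with Python's standard wave module and derives a LINEAR16 config from it. The encoding, sampleRateHertz, and audioChannelCount field names follow the V1 RecognitionConfig reference as understood here; both helper functions are hypothetical, and the silent test clip exists only so the example runs without an audio file.

```python
import io
import wave


def linear16_config_from_wav(wav_bytes: bytes, language_code: str = "en-US") -> dict:
    """Derive a LINEAR16 RecognitionConfig dict from a 16-bit PCM WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("LINEAR16 expects 16-bit (2-byte) samples")
        return {
            "encoding": "LINEAR16",
            "sampleRateHertz": wav.getframerate(),
            "audioChannelCount": wav.getnchannels(),
            "languageCode": language_code,
            "model": "latest_long",
        }


def make_silent_wav(seconds: int = 1, rate: int = 16000) -> bytes:
    """Create a mono 16-bit WAV of silence, used here only as test input."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    return buffer.getvalue()


print(linear16_config_from_wav(make_silent_wav()))
```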

Google's Conformer research paper reports parameter efficiency and state-of-the-art results at publication time, including a small 10M-parameter variant on LibriSpeech. Google has not stated that those paper numbers describe the production latest_long model.

Google states that the latest models do not support true confidence scores. The API may return a value, but Google says it is "not truly a confidence score."

Latency and throughput

Google does not publish a formal P50/P95 latency benchmark for latest_long. Google states that streaming returns results in real time as the audio is processed, and that long-running recognition often completes in about half the source audio length, depending on the source. These are practical service expectations, not an SLA-level latency commitment.

Content limits:

Limit	V1	V2
Synchronous request	~1 minute	1 minute / 10 MB
Asynchronous request	~480 minutes	Batch files up to 8 hours each, up to 15 files per BatchRecognizeRequest
Streaming request	~5 minutes per request	5-minute streaming session limit
Inline local content	10 MB request limit	Not stated separately in the source.

Streaming audio must be sent at roughly real-time speed.
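
The V1 limits in the table above suggest a simple routing rule, sketched below. The thresholds are the approximate figures from the table, treated here as hard cut-offs for illustration; the function name and the decision to raise an error for over-length audio are assumptions of this example.

```python
SYNC_LIMIT_SECONDS = 60            # about 1 minute per synchronous request
ASYNC_LIMIT_SECONDS = 480 * 60     # about 480 minutes per asynchronous request
STREAM_LIMIT_SECONDS = 5 * 60      # about 5 minutes per streaming request


def pick_v1_method(duration_seconds: float, live: bool = False) -> str:
    """Choose a V1 recognition method from the audio length."""
    if live:
        # Live audio is streamed at roughly real-time speed; sessions longer
        # than the streaming limit have to be split across several requests.
        return "streaming"
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "speech:recognize"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "speech:longrunningrecognize"
    raise ValueError("audio exceeds the asynchronous limit; split it first")


for seconds in (45, 1800):
    print(seconds, pick_v1_method(seconds))
```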

V1 quotas: 900 recognition requests per 60 seconds and 480 hours of audio processing per day. Multi-channel audio is billed per channel, while quota accounting is based on file duration rather than multiplied channel time.
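
The difference between billing and quota accounting for multi-channel audio comes down to a short calculation. This sketch uses the figures quoted above; the function name is hypothetical, and it assumes every channel of the file is transcribed.

```python
DAILY_QUOTA_SECONDS = 480 * 3600   # 480 hours of audio processing per day


def billed_and_quota_seconds(duration_seconds: int, channels: int) -> tuple[int, int]:
    """Return (billed seconds, quota seconds) for one multi-channel file."""
    billed = duration_seconds * channels   # billing multiplies by channel count
    quota = duration_seconds               # quota counts file duration only
    return billed, quota


# A 30-minute stereo recording.
billed, quota = billed_and_quota_seconds(30 * 60, channels=2)
print(f"billed: {billed / 60:.0f} min, quota used: {quota / 60:.0f} min")
print(f"files of this length per day under quota: {DAILY_QUOTA_SECONDS // quota}")
```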

Deployment and integrations

latest_long is accessed through the Google Cloud Speech-to-Text V1 API using the model identifier latest_long. In V2 migration examples, Google's client code typically uses the model identifier long alongside recognizer resources such as projects/{project}/locations/{location}/recognizers/_. Google's V2 multi-language docs refer to long, short, and telephony models. In the pages reviewed, Google's docs do not state explicitly that latest_long equals long; the mapping is implied by the migration examples rather than stated in any single place.
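
A sketch of how the V2 addressing described above might be assembled during a migration. The recognizer path template is the one quoted in the paragraph; the latest_long to long mapping is the inferred, not formally documented, correspondence; and the project ID, location value, and helper names are hypothetical.

```python
# Assumed mapping, inferred from migration examples rather than stated by Google.
V1_TO_V2_MODEL = {"latest_long": "long"}


def recognizer_path(project: str, location: str, recognizer: str = "_") -> str:
    """Format a V2 recognizer resource name; "_" is the value used in the template above."""
    return f"projects/{project}/locations/{location}/recognizers/{recognizer}"


def v2_request_fields(project: str, location: str, v1_model: str) -> dict:
    """Translate a V1 model choice into the recognizer and model fields used by V2."""
    return {
        "recognizer": recognizer_path(project, location),
        "config": {"model": V1_TO_V2_MODEL[v1_model]},
    }


# "my-project" and "global" are placeholder values for illustration.
print(v2_request_fields("my-project", "global", "latest_long"))
```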

Google offers US and EU regional endpoints in V1 and broader regionalization in V2. Google supports customer-managed encryption keys for Speech-to-Text resources through Cloud KMS.

V2 introduces recognizers as reusable configuration resources and broader regionalization, and is the exclusive home of the Chirp family. Google says new Cloud Speech-to-Text users should use the V2 API.

Data handling: by default, Cloud Speech-to-Text does not log customer audio data or transcripts. If a customer opts into data logging, Google may use the logged data to improve the service, and the customer may receive discounted pricing. Without opt-in, Google says it does not use customer content except to provide the service, and it does not claim ownership of the audio or returned transcript. For streaming and synchronous endpoints, audio is processed in memory and customer data is not stored, though some request metadata is temporarily logged for abuse prevention and service improvement. For asynchronous requests, the resulting transcript is stored for approximately five days for retrieval; the input audio is not stored by the API service.

Related models

Model	API surface	Best fit	Key notes
latest_long	V1 identifier latest_long.	Long-form media, spontaneous speech, conversations.	Conformer-based "Latest" family; standard pricing; intended replacement for many default / video cases; confidence score not truly calibrated.
latest_short	V1 identifier latest_short.	Short utterances, commands, one-shot directed speech.	Intended replacement for command_and_search; quality improved in 2024 release notes.
video	V1 legacy premium model.	Video clips, podcasts, multi-speaker media, high-quality mics, background noise.	Often still best for noisy or multi-speaker media, but Google recommends latest_long as a replacement in many languages and workloads.
default	V1 legacy general model.	General audio not covered by specialized models.	Google says legacy models are "mostly based on classic non-conformer architectures" and kept mainly for backward compatibility.
phone_call	V1 legacy phone model.	8 kHz phone audio.	Superseded in practice by telephony / telephony_short.
telephony / telephony_short	Newer phone models documented in V1/V2 materials.	Phone-call audio, especially 8 kHz.	Google says these correspond to the most recent versions of phone_call.
chirp / Chirp family	V2-only family in pricing and blog context; current docs emphasize chirp_2 and chirp_3.	Multilingual, newer foundation-style ASR.	Original Chirp was presented as a 2B-parameter large speech model; Chirp 2 and Chirp 3 are the currently documented versioned models.
chirp_2	V2 only.	Multilingual transcription and translation.	USM-derived; supports timestamps, adaptation, and translation; docs say no diarization and no language detection.
chirp_3	V2 only.	Google's newest documented multilingual ASR model.	Adds diarization and automatic language detection; supports streaming, sync, and batch.

Pricing

Google bills latest_long and latest_short as Standard models, not premium media models.

Speech-to-Text V1, standard models including latest_long:

Plan	Price
With data logging opt-in	$0.016/minute above 60 monthly minutes
Without data logging	$0.024/minute above 60 monthly minutes
Free tier	First 60 minutes per month per account, in both cases

Speech-to-Text V2: standard recognition is priced by monthly tier, starting at $0.016/minute and falling with volume; dynamic batch recognition is listed at $0.003/minute. Google's pricing page places latest_long, latest_short, and chirp in the standard model set for billing purposes.

On November 11, 2022, pricing changed so enhanced models were no longer priced differently, and requests moved to one-second rounding.
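
The V1 figures above can be combined into a rough monthly estimate. This sketch assumes that one-second rounding means each request is rounded up to the next whole second and that the 60 free minutes are deducted from the monthly total; both are interpretations rather than statements from Google's pricing page, and the function name is hypothetical.

```python
import math

FREE_SECONDS_PER_MONTH = 60 * 60        # first 60 minutes are free
RATE_WITH_LOGGING = 0.016               # USD per minute, data logging opt-in
RATE_WITHOUT_LOGGING = 0.024            # USD per minute, no data logging


def v1_monthly_cost(request_durations_seconds: list[float], data_logging: bool) -> float:
    """Estimate the monthly V1 standard-model cost in USD."""
    # Assumption: each request is rounded up to a whole second before summing.
    billable = sum(math.ceil(duration) for duration in request_durations_seconds)
    paid_seconds = max(0, billable - FREE_SECONDS_PER_MONTH)
    rate = RATE_WITH_LOGGING if data_logging else RATE_WITHOUT_LOGGING
    return round(paid_seconds / 60 * rate, 2)


# 100 requests of 90.4 s each: 9,100 billable seconds, 5,500 after the free tier.
durations = [90.4] * 100
print(v1_monthly_cost(durations, data_logging=False))
print(v1_monthly_cost(durations, data_logging=True))
```

The estimate covers V1 standard models only; V2 tiered pricing and dynamic batch rates would need a separate calculation.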

Development and ownership

latest_long is developed and operated by Google as part of Google Cloud Speech-to-Text. In the sources reviewed, Google does not publish a page or statement naming the engineers who developed or launched this specific model. Public attribution is indirect: product managers who publicly announced adjacent Speech-to-Text platform changes, and Google researchers tied to the Conformer and modern ASR work that Google says underlies the "Latest" models.

Person or group	Public role or affiliation	Publicly documented relevance
Calum Barnes	Product Manager, Cloud Speech.	Co-authored Google Cloud's V2/Chirp GA announcement, indicating product-side ownership of the Cloud Speech platform.
Haris Ioannou	Product Manager, Cloud Speech.	Co-authored the same Google Cloud announcement.
Conformer author group: Anmol Gulati, Chung-Cheng Chiu, James Qin, Jiahui Yu, Niki Parmar, Ruoming Pang, Shibo Wang, Wei Han, Yonghui Wu, Yu Zhang, Zhengdong Zhang	Google Research publication authors.	Google's latest_long docs say the "Latest" family is currently based on Conformer technology.
Yu Zhang	Research Scientist, Google Research.	Co-authored the Google Research USM blog and is also a Conformer author.
James Qin	Software Engineer, Google Research.	Co-authored the Google Research USM blog and is also a Conformer author.
Tara N. Sainath	Principal Research Scientist at Google, working on deep neural networks for ASR.	Senior public Google ASR researcher appearing on related papers such as Dual-mode ASR and Modular Domain Adaptation.
Rohit Prabhavalkar	Senior Staff Research Scientist in Google's Speech Technologies group.	Publicly associated with end-to-end ASR research and long-form ASR work.
Ian McGraw	Staff Software Engineer in Google's Speech group, managing on-device speech-recognition teams.	Publicly tied to speech-recognition engineering and confidence-related ASR work.

Release history

Date	Public milestone	Context
February 20, 2019	Google made transcription model selection generally available in Cloud Speech-to-Text, including specialized models such as video and phone_call.	Pre-latest_long baseline: model selection existed before the "Latest" family.
March 15, 2021	Model Adaptation launched for Speech-to-Text.	Adaptation later became relevant to latest_long.
April 21, 2022	Release notes announced that "Latest" models were available in more than 20 languages and used "new end-to-end machine learning techniques."	The clearest official public availability milestone for latest_long.
October 3, 2022	Speaker diarization became available for "Latest" models in en-US.	Feature expansion for multi-speaker transcription.
November 11, 2022	Pricing changed so enhanced models were no longer priced differently; requests moved to one-second rounding.	Pricing simplification around the modern model family.
August 9, 2023	Speech-to-Text V2 became GA; Google said V2 migrated existing functionality and introduced Chirp, its latest 2B-parameter large speech model.	Split between the classic latest_long line and the newer Chirp family.
November 6, 2023	Google launched telephony and telephony_short as the most recent versions of phone_call.	Narrows the domain where latest_long applies.
January 9, 2024	Model adaptation became available for latest_long in 13 languages.	Adds domain-term and vocabulary biasing to latest_long.
January 27, 2025	Google announced broader availability for Chirp 2.	Newer multilingual branch of Google's speech lineup.
October 13, 2025	Google announced Chirp 3 GA, adding diarization and automatic language detection in the versioned Chirp line.	Google's newest ASR emphasis moved from "Latest" branding to versioned multilingual foundation models.

Google does not publish a distinct "first preview" date for latest_long.

Sources

Introduction to Latest Models, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/latest-models
Speech-to-Text release notes, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/release-notes
Best practices, Cloud Speech-to-Text. https://docs.cloud.google.com/speech-to-text/docs/v1/best-practices
Select a transcription model, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/transcription-model
Google Cloud Speech-to-Text V2 API, Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-speech-to-text-v2-api
Conformer: Convolution-augmented Transformer for Speech Recognition, Google Research. https://research.google/pubs/conformer-convolution-augmented-transformer-for-speech-recognition/
Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages, Google Research blog. https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/
Tara N. Sainath, Google Research. https://research.google/people/tarasainath/
Rohit Prabhavalkar, Google Research. https://research.google/people/rohitprabhavalkar/
Ian McGraw, Google Research. https://research.google/people/106845/
Compare transcription models, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/transcription-model
Quotas and limits, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/quotas
Transcribe audio from streaming input, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/streaming-recognize
RecognitionConfig reference, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
Improve transcription results with model adaptation, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/adaptation-model
Custom Speech-to-Text models. https://docs.cloud.google.com/speech-to-text/docs/custom-speech-models
Speech-to-Text pricing. https://cloud.google.com/speech-to-text/pricing
Chirp 2: Enhanced multilingual accuracy, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-2
Chirp 3 Transcription: Enhanced multilingual accuracy, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
Migration guide. https://docs.cloud.google.com/speech-to-text/docs/migration
Data logging, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/data-logging
Data usage FAQ, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/data-usage-faq
Endpoints, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/endpoints
Sensitive Data Protection documentation. https://docs.cloud.google.com/sensitive-data-protection/docs
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling, Google Research. https://research.google/pubs/dual-mode-asr-unify-and-improve-streaming-asr-with-full-context-modeling/
Modular Domain Adaptation for Conformer-Based Streaming ASR, Google Research. https://research.google/pubs/modular-domain-adaptation-for-conformer-based-streaming-asr/
Patent US12136415B2, Mixture model attention for flexible streaming and non-streaming ASR. https://patents.google.com/patent/US12136415B2/en
Patent WO2019209569A1, Speaker diarization using an end-to-end model. https://patents.google.com/patent/WO2019209569A1/en
Patent US20220122586A1, Fast Emit low-latency streaming ASR with sequence-level emission regularization. https://patents.google.com/patent/US20220122586A1/en