Google Cloud Speech-to-Text latest_long: model profile
Reference profile of Google Cloud Speech-to-Text latest_long, a Conformer-based long-form transcription model: features, pricing, limits, history.
latest_long is Google Cloud Speech-to-Text's general-purpose long-form automatic speech recognition model for media, conversations, and other extended audio.
Specifications
| Field | Value |
|---|---|
| Developer | Google (Google Cloud) |
| Released | April 21, 2022: release notes announced "Latest" models available in more than 20 languages, using "new end-to-end machine learning techniques" |
| Model type | Long-form, end-to-end ASR model based on Google's Conformer speech-model technology |
| Training data | Google says Cloud STT models are trained by analyzing millions of examples of human speech; specialized models are trained on audio from specific sources. Corpus breakdown not publicly disclosed. |
| Languages | "Latest" family available in more than 20 languages and more than 50 variants; feature support varies by language |
| Modes (batch / streaming) | Synchronous recognition, asynchronous long-running recognition, and streaming recognition |
| Latency | No formal P50/P95 benchmark published. Streaming returns results in real time; long-running recognition often completes in about half the duration of the source audio |
| Throughput / concurrency | V1 quotas: 900 recognition requests per 60 seconds; 480 hours of audio processing per day |
| Deployment | Google Cloud Speech-to-Text API (V1); US and EU regional endpoints in V1, broader regionalization in V2 |
| Pricing | V1 standard model: $0.016/minute above 60 monthly minutes with data logging, $0.024/minute without; first 60 minutes per month free |
| License | Not publicly disclosed. Offered as a GA model under the Speech-to-Text SLA. |
| Parameters | Not disclosed |
Overview
Google defines latest_long as the model to use for "any kind of long form content such as media or spontaneous speech and conversations," and says it can be used in place of the video model when video is unavailable in the target language, or in place of the legacy default model. Google describes the "Latest" family as a way to access the "latest speech technology and machine learning research" from Google, with the caveat that some legacy features may lag behind older models.
Google positions latest_long as the recommended replacement for many video and default workloads in Cloud Speech-to-Text V1, and states that the "Latest" family is based on Google's Conformer speech-model technology. Google's documentation states that "Latest" models may be updated or refreshed more frequently than other models, with changes that can affect accuracy or latency; latest_long is a product label for a moving production model, not a frozen research checkpoint.
latest_long sits in the classic Speech-to-Text API line. Google now recommends that new users use the V2 API, whose long-form generic model is typically shown with the identifier long, while the versioned Chirp 2 and Chirp 3 models are Google's newer, explicitly documented multilingual ASR models.
Capabilities and features
latest_long can be used with all three classic Speech-to-Text recognition modes: synchronous recognition, asynchronous long-running recognition, and streaming recognition. Model selection applies to the speech:recognize and speech:longrunningrecognize methods and to streaming requests.
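As a rough illustration of model selection in the two non-streaming methods, the sketch below builds a V1 REST request body that names latest_long. The base URL, the bucket path, and the helper name `build_v1_request` are assumptions made for this example; it only constructs the request and does not authenticate or send anything.

```python
import json

# Assumed global V1 REST base URL; regional endpoints use a different host.
V1_BASE = "https://speech.googleapis.com/v1"


def build_v1_request(gcs_uri: str, language_code: str = "en-US",
                     long_running: bool = False) -> tuple[str, str]:
    """Return (url, json_body) for a V1 request that selects latest_long.

    build_v1_request is a hypothetical helper, not part of any Google library.
    """
    method = "speech:longrunningrecognize" if long_running else "speech:recognize"
    body = {
        "config": {
            "languageCode": language_code,
            "model": "latest_long",  # model identifier discussed in this profile
        },
        # Hypothetical Cloud Storage object; FLAC carries its own header,
        # so encoding and sample rate are left out of this sketch.
        "audio": {"uri": gcs_uri},
    }
    return f"{V1_BASE}/{method}", json.dumps(body, indent=2)


url, payload = build_v1_request("gs://example-bucket/interview.flac",
                                long_running=True)
print(url)
print(payload)
```

The same `config` object would be reused for streaming, where it is sent in the first message of the stream rather than alongside the audio.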
Documented features in the V1 RecognitionConfig include adaptation, speechContexts, transcriptNormalization, word timestamps, word confidence, punctuation, diarization, profanity filtering, and spoken punctuation and spoken emoji controls. When adaptation is set, it supersedes speechContexts.
Automatic punctuation is optional and, when enabled, inserts periods, commas, and question marks; Google says it automatically capitalizes the first letter after each period and question mark. Spoken punctuation and spoken emojis are separate toggles. Feature availability depends on language support.
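The toggles described above can be sketched as boolean fields on a V1 RecognitionConfig. The field names follow the V1 REST reference as far as we know it; the default values and the `feature_config` helper are choices made for this example, and availability of each flag still depends on the language.

```python
# Illustrative defaults only; these are not Google's defaults.
DEFAULT_FEATURES = {
    "enableAutomaticPunctuation": True,   # periods, commas, question marks
    "enableSpokenPunctuation": False,     # separate toggle from automatic punctuation
    "enableSpokenEmojis": False,          # separate toggle as well
    "enableWordTimeOffsets": True,        # word timestamps
    "enableWordConfidence": True,         # per-word value, see the confidence caveat
    "profanityFilter": False,
}


def feature_config(language_code: str = "en-US", **overrides: bool) -> dict:
    """Build a config dict for latest_long, rejecting flags not listed above.

    feature_config is a hypothetical helper for this sketch.
    """
    unknown = set(overrides) - set(DEFAULT_FEATURES)
    if unknown:
        raise KeyError(f"unsupported feature flag(s): {sorted(unknown)}")
    return {
        "languageCode": language_code,
        "model": "latest_long",
        **DEFAULT_FEATURES,
        **overrides,
    }


config = feature_config(enableSpokenPunctuation=True)
print(config)
```

Rejecting unknown keys early is a design choice here: a misspelled flag would otherwise be dropped silently or refused by the service.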
Speaker diarization is supported by the API. Google's release notes identify October 2022 as the point when diarization became available for "Latest" models in en-US. In streaming mode, diarization causes the service to resend the accumulating word history on consecutive responses so that speaker tags can improve over time. For non-streaming requests, diarization is emitted on the final result.
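Because speaker tags arrive on the final result (and, in streaming, on the accumulating word history), a client usually reads words from the last result only. The sketch below assumes a V1-style response dictionary in which each word carries a `speakerTag`; the sample response is invented for illustration and `speaker_turns` is a hypothetical helper.

```python
from itertools import groupby


def speaker_turns(response: dict) -> list[tuple[int, str]]:
    """Collapse the speaker-tagged words of the last result into turns.

    Assumes the last result's first alternative holds the full tagged word list.
    """
    results = response.get("results", [])
    if not results:
        return []
    words = results[-1]["alternatives"][0].get("words", [])
    turns = []
    # Consecutive words with the same tag form one speaker turn.
    for tag, group in groupby(words, key=lambda w: w.get("speakerTag", 0)):
        turns.append((tag, " ".join(w["word"] for w in group)))
    return turns


# Invented sample shaped like a diarized V1 response.
sample = {
    "results": [
        {"alternatives": [{"transcript": "hello there hi"}]},
        {"alternatives": [{"words": [
            {"word": "hello", "speakerTag": 1},
            {"word": "there", "speakerTag": 1},
            {"word": "hi", "speakerTag": 2},
        ]}]},
    ]
}
print(speaker_turns(sample))
```

In streaming mode the same function could be applied to each new response, replacing the previous turns rather than appending to them, since tags may be revised as more audio arrives.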
Model adaptation is Google's domain-biasing mechanism: phrase sets, class tokens, and boost values bias recognition toward names, jargon, addresses, or other rare or domain-specific content. Google says adaptation is helpful when the audio contains frequent specialized terms or rare words, or when the audio is noisy or the speech is unclear. For latest_long, adaptation became available in 13 languages in January 2024.
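A minimal sketch of an inline phrase set follows, assuming the V1 `adaptation` field accepts inline `phraseSets` whose `phrases` carry a `value` and a `boost`. The phrases, the boost values, the use of `$MONTH` as a class token, and the helper name are all illustrative assumptions; supported tokens and sensible boost ranges should be checked against Google's adaptation documentation.

```python
def adaptation_config(phrases: dict[str, float]) -> dict:
    """Build an inline adaptation block from {phrase: boost} pairs.

    adaptation_config is a hypothetical helper for this sketch.
    """
    for phrase, boost in phrases.items():
        if boost <= 0:
            raise ValueError(f"boost for {phrase!r} must be positive")
    return {
        "phraseSets": [
            {"phrases": [{"value": p, "boost": b} for p, b in phrases.items()]}
        ]
    }


config = {
    "languageCode": "en-US",
    "model": "latest_long",
    # When adaptation is set it supersedes speechContexts, so only one is used.
    "adaptation": adaptation_config({
        "Conformer": 10.0,                       # rare technical term
        "my appointment is in $MONTH": 5.0,      # example class token usage
    }),
}
print(config["adaptation"])
```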
For deeper customization, Google's Custom Speech-to-Text models in V2 use a pre-trained, Conformer-based architecture and are fine-tuned from a base model. Google's current docs list latest_long as the base model for supported custom-model locales such as de-DE, en-AU, en-GB, en-IN, en-US, es-US, and es-ES. Training requires at least 100 audio hours of training data and 10 audio hours of validation data.
Google's Speech-to-Text docs do not advertise native latest_long PII redaction as a model feature. Google documents Sensitive Data Protection as the product for classification, redaction, and de-identification of text content.
Language support
Google says the "Latest" family is available in more than 20 languages and more than 50 variants, with feature support varying by language. Google's supported-language pages are the source of truth for the current matrix.
For multilingual behavior, the V1 API allows a primary languageCode plus up to three alternativeLanguageCodes, with analogous multiple-language recognition in V2. Google recommends keeping this list to the minimum necessary, because too many alternatives reduce the odds of selecting the correct language. In V2, Google states that alternative-language recognition works with the long, short, and telephony models.
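The three-alternative cap can be enforced on the client before a request is built. The sketch below is one way to do that; `language_fields` is a hypothetical helper, and the deduplication step is a convenience added here rather than documented API behavior.

```python
MAX_ALTERNATIVES = 3  # V1 limit on alternativeLanguageCodes described above


def language_fields(primary: str, alternatives: tuple[str, ...] = ()) -> dict:
    """Return the language part of a V1 config, keeping alternatives minimal."""
    # Drop duplicates (order preserved) and any repeat of the primary code.
    alts = [code for code in dict.fromkeys(alternatives) if code != primary]
    if len(alts) > MAX_ALTERNATIVES:
        raise ValueError(
            f"at most {MAX_ALTERNATIVES} alternative language codes are allowed"
        )
    fields = {"languageCode": primary}
    if alts:
        fields["alternativeLanguageCodes"] = alts
    return fields


print(language_fields("en-US", ("es-US", "en-US", "es-US")))
```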
Model adaptation for latest_long is available in 13 languages as of January 9, 2024.
Performance and benchmarks
Google does not publish WER benchmarks, robustness benchmarks, or formal latency benchmarks for latest_long in the sources reviewed.
Vendor-reported relative performance statements from Google's documentation:
- For telephony audio, Google says the newer telephony or telephony_short models will outperform latest_long and latest_short on phone audio.
- For noisy, multi-speaker video-style media, Google describes the legacy video model as often the best choice, especially for audio recorded with high-quality microphones or with heavy background noise, while recommending latest_long as a replacement in many cases.
- Google warns that lossy codecs plus background noise can reduce accuracy, and recommends lossless audio such as FLAC or LINEAR16.
- Google says model adaptation can help when audio is noisy or unclear.
Google's Conformer research paper reports parameter efficiency and state-of-the-art results at publication time, including a small 10M-parameter variant on LibriSpeech. Google has not stated that those paper numbers describe the production latest_long model.
Google states that the latest models do not support true confidence scores. The API may return a value, but Google says it is "not truly a confidence score."
Latency and throughput
Google does not publish a formal P50/P95 latency benchmark for latest_long. Google states that streaming returns results in real time as the audio is processed, and that long-running recognition often completes in about half the duration of the source audio, depending on the source. These are practical service expectations, not an SLA-level latency commitment.
Content limits:
| Limit | V1 | V2 |
|---|---|---|
| Synchronous request | ~1 minute | 1 minute / 10 MB |
| Asynchronous request | ~480 minutes | Batch files up to 8 hours each, up to 15 files per BatchRecognizeRequest |
| Streaming request | ~5 minutes per request | 5-minute streaming session limit |
| Inline local content | 10 MB request limit | Not stated separately in the source. |
Streaming audio must be sent at roughly real-time speed.
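The V1 content limits above suggest a simple routing rule by audio duration. The sketch below encodes the approximate V1 figures from the table as constants; the function name and the returned labels are assumptions for illustration, and the real limits should be read from the quotas page because they are stated only approximately here.

```python
SYNC_MAX_S = 60            # ~1 minute synchronous limit
STREAM_MAX_S = 5 * 60      # ~5 minutes per streaming request
ASYNC_MAX_S = 480 * 60     # ~480 minutes asynchronous limit


def pick_mode(duration_s: float, live: bool = False) -> str:
    """Suggest a V1 recognition mode for audio of the given duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if live:
        # Live audio longer than one request allows needs several requests.
        if duration_s <= STREAM_MAX_S:
            return "streaming"
        return "streaming_multiple_requests"
    if duration_s <= SYNC_MAX_S:
        return "synchronous"
    if duration_s <= ASYNC_MAX_S:
        return "asynchronous"
    raise ValueError("audio exceeds the asynchronous limit; split the file")


for seconds in (30, 3600):
    print(seconds, pick_mode(seconds))
print(120, pick_mode(120, live=True))
```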
V1 quotas: 900 recognition requests per 60 seconds and 480 hours of audio processing per day. Multi-channel audio is billed per channel, while quota accounting is based on file duration rather than multiplied channel time.
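The difference between billing and quota accounting for multi-channel audio can be shown with a little arithmetic. This sketch assumes every channel is billed for the full file duration and that the daily quota is consumed by file duration alone, as described above; `usage_for_file` is a hypothetical helper.

```python
from fractions import Fraction

DAILY_QUOTA_S = 480 * 3600  # 480 hours of audio processing per day (V1)


def usage_for_file(duration_s: int, channels: int) -> dict:
    """Return billed seconds, quota seconds, and share of the daily quota."""
    if duration_s < 0 or channels < 1:
        raise ValueError("duration must be >= 0 and channels >= 1")
    return {
        "billed_seconds": duration_s * channels,  # billed per channel
        "quota_seconds": duration_s,              # quota uses file duration
        "quota_share": Fraction(duration_s, DAILY_QUOTA_S),
    }


# A one-hour stereo recording: two billed hours, one quota hour.
print(usage_for_file(3600, channels=2))
```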
Deployment and integrations
latest_long is accessed through the Google Cloud Speech-to-Text V1 API using the model identifier latest_long. In V2 migration examples, Google's client code typically uses the model identifier long alongside recognizer resources such as projects/{project}/locations/{location}/recognizers/_. Google's V2 multi-language docs refer to long, short, and telephony models. Google's docs do not, in the pages reviewed, state explicitly that latest_long equals long; the mapping is operationally apparent rather than formally named in one place.
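To make the V1-to-V2 difference concrete, the sketch below builds a V2-style request that uses the default recognizer path quoted above and the model identifier long. The JSON field names (`autoDecodingConfig`, `languageCodes`, `uri`) reflect our reading of the V2 REST reference and should be verified against it; the project ID, bucket path, and helper name are hypothetical.

```python
def v2_recognize_request(project: str, gcs_uri: str,
                         location: str = "global",
                         language_code: str = "en-US") -> tuple[str, dict]:
    """Return (recognizer_path, body) for a V2 recognize call.

    Uses "long" where a V1 request would have used "latest_long"; this
    mapping is inferred from migration examples, not stated by Google.
    """
    recognizer = f"projects/{project}/locations/{location}/recognizers/_"
    body = {
        "config": {
            "autoDecodingConfig": {},          # let the service detect encoding
            "languageCodes": [language_code],  # V2 takes a list, not a single code
            "model": "long",
        },
        "uri": gcs_uri,
    }
    return recognizer, body


path, body = v2_recognize_request("example-project",
                                  "gs://example-bucket/interview.flac")
print(path)
print(body)
```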
Google offers US and EU regional endpoints in V1 and broader regionalization in V2. Google supports customer-managed encryption keys for Speech-to-Text resources through Cloud KMS.
V2 introduces recognizers as reusable configuration resources and broader regionalization, and is the exclusive home of the Chirp family. Google says new Cloud Speech-to-Text users should use the V2 API.
Data handling: by default, Cloud Speech-to-Text does not log customer audio data or transcripts. If a customer opts into data logging, Google may use the logged data to improve the service, and the customer may receive discounted pricing. Without opt-in, Google says it does not use customer content except to provide the service, and it does not claim ownership of the audio or returned transcript. For streaming and synchronous endpoints, audio is processed in memory and customer data is not stored, though some request metadata is temporarily logged for abuse prevention and service improvement. For asynchronous requests, the resulting transcript is stored for approximately five days for retrieval; the input audio is not stored by the API service.
Related models
| Model | API surface | Best fit | Key notes |
|---|---|---|---|
| latest_long | V1 identifier latest_long. | Long-form media, spontaneous speech, conversations. | Conformer-based "Latest" family; standard pricing; intended replacement for many default / video cases; confidence score not truly calibrated. |
| latest_short | V1 identifier latest_short. | Short utterances, commands, one-shot directed speech. | Intended replacement for command_and_search; quality improved in 2024 release notes. |
| video | V1 legacy premium model. | Video clips, podcasts, multi-speaker media, high-quality mics, background noise. | Often still best for noisy or multi-speaker media, but Google recommends latest_long as a replacement in many languages and workloads. |
| default | V1 legacy general model. | General audio not covered by specialized models. | Google says legacy models are "mostly based on classic non-conformer architectures" and kept mainly for backward compatibility. |
| phone_call | V1 legacy phone model. | 8 kHz phone audio. | Superseded in practice by telephony / telephony_short. |
| telephony / telephony_short | Newer phone models documented in V1/V2 materials. | Phone-call audio, especially 8 kHz. | Google says these correspond to the most recent versions of phone_call. |
| chirp / Chirp family | V2-only family in pricing and blog context; current docs emphasize chirp_2 and chirp_3. | Multilingual, newer foundation-style ASR. | Original Chirp was presented as a 2B-parameter large speech model; Chirp 2 and Chirp 3 are the currently documented versioned models. |
| chirp_2 | V2 only. | Multilingual transcription and translation. | USM-derived; supports timestamps, adaptation, and translation; docs say no diarization and no language detection. |
| chirp_3 | V2 only. | Google's newest documented multilingual ASR model. | Adds diarization and automatic language detection; supports streaming, sync, and batch. |
Pricing
Google bills latest_long and latest_short as Standard models, not premium media models.
Speech-to-Text V1, standard models including latest_long:
| Plan | Price |
|---|---|
| With data logging opt-in | $0.016/minute above 60 monthly minutes |
| Without data logging | $0.024/minute above 60 monthly minutes |
| Free tier | First 60 minutes per month per account, in both cases |
Speech-to-Text V2: standard recognition is priced by monthly tier, starting at $0.016/minute and falling with volume; dynamic batch recognition is listed at $0.003/minute. Google's pricing page places latest_long, latest_short, and chirp in the standard model set for billing purposes.
On November 11, 2022, pricing changed so enhanced models were no longer priced differently, and requests moved to one-second rounding.
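The V1 standard-model prices can be turned into a rough monthly estimate. The sketch below assumes each request is rounded up to the next whole second, that the 60 free minutes apply once per month to the total, and that all audio is single-channel; `monthly_cost` is a hypothetical helper and the result is an estimate, not an invoice.

```python
import math
from decimal import Decimal

FREE_SECONDS = 60 * 60  # first 60 minutes per month are free
RATE_PER_MINUTE = {
    True: Decimal("0.016"),   # with data logging opt-in
    False: Decimal("0.024"),  # without data logging
}


def monthly_cost(request_durations_s: list[float], data_logging: bool) -> Decimal:
    """Estimate the monthly V1 charge in USD for a list of request durations."""
    # Assumed one-second rounding, applied per request and rounded up.
    billable_s = sum(math.ceil(d) for d in request_durations_s)
    paid_s = max(0, billable_s - FREE_SECONDS)
    cost = Decimal(paid_s) / 60 * RATE_PER_MINUTE[data_logging]
    return cost.quantize(Decimal("0.0001"))


# 100 minutes in one request: 40 paid minutes at each rate.
print(monthly_cost([6000], data_logging=True))
print(monthly_cost([6000], data_logging=False))
```

Decimal arithmetic is used so that small per-minute rates do not pick up binary floating-point error when summed over many requests.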
Development and ownership
latest_long is developed and operated by Google as part of Google Cloud Speech-to-Text. Google has not, in the sources reviewed, published a page or statement naming the engineers who developed latest_long itself. Public attribution is indirect: product managers who publicly announced adjacent Speech-to-Text platform changes, and Google researchers tied to the Conformer and modern ASR work that Google says underlies the "Latest" models.
| Person or group | Public role or affiliation | Publicly documented relevance |
|---|---|---|
| Calum Barnes | Product Manager, Cloud Speech. | Co-authored Google Cloud's V2/Chirp GA announcement, indicating product-side ownership of the Cloud Speech platform. |
| Haris Ioannou | Product Manager, Cloud Speech. | Co-authored the same Google Cloud announcement. |
| Conformer author group: Anmol Gulati, Chung-Cheng Chiu, James Qin, Jiahui Yu, Niki Parmar, Ruoming Pang, Shibo Wang, Wei Han, Yonghui Wu, Yu Zhang, Zhengdong Zhang | Google Research publication authors. | Google's latest_long docs say the "Latest" family is currently based on Conformer technology. |
| Yu Zhang | Research Scientist, Google Research. | Co-authored the Google Research USM blog and is also a Conformer author. |
| James Qin | Software Engineer, Google Research. | Co-authored the Google Research USM blog and is also a Conformer author. |
| Tara N. Sainath | Principal Research Scientist at Google, working on deep neural networks for ASR. | Senior public Google ASR researcher appearing on related papers such as Dual-mode ASR and Modular Domain Adaptation. |
| Rohit Prabhavalkar | Senior Staff Research Scientist in Google's Speech Technologies group. | Publicly associated with end-to-end ASR research and long-form ASR work. |
| Ian McGraw | Staff Software Engineer in Google's Speech group, managing on-device speech-recognition teams. | Publicly tied to speech-recognition engineering and confidence-related ASR work. |
Release history
| Date | Public milestone | Context |
|---|---|---|
| February 20, 2019 | Google made transcription model selection generally available in Cloud Speech-to-Text, including specialized models such as video and phone_call. | Pre-latest_long baseline: model selection existed before the "Latest" family. |
| March 15, 2021 | Model Adaptation launched for Speech-to-Text. | Adaptation later became relevant to latest_long. |
| April 21, 2022 | Release notes announced that "Latest" models were available in more than 20 languages and used "new end-to-end machine learning techniques." | The clearest official public availability milestone for latest_long. |
| October 3, 2022 | Speaker diarization became available for "Latest" models in en-US. | Feature expansion for multi-speaker transcription. |
| November 11, 2022 | Pricing changed so enhanced models were no longer priced differently; requests moved to one-second rounding. | Pricing simplification around the modern model family. |
| August 9, 2023 | Speech-to-Text V2 became GA; Google said V2 migrated existing functionality and introduced Chirp, its latest 2B-parameter large speech model. | Split between the classic latest_long line and the newer Chirp family. |
| November 6, 2023 | Google launched telephony and telephony_short as the most recent versions of phone_call. | Narrows the domain where latest_long applies. |
| January 9, 2024 | Model adaptation became available for latest_long in 13 languages. | Adds domain-term and vocabulary biasing to latest_long. |
| January 27, 2025 | Google announced broader availability for Chirp 2. | Newer multilingual branch of Google's speech lineup. |
| October 13, 2025 | Google announced Chirp 3 GA, adding diarization and automatic language detection in the versioned Chirp line. | Google's newest ASR emphasis moved from "Latest" branding to versioned multilingual foundation models. |
Google does not publish a distinct "first preview" date for latest_long.
Sources
- Introduction to Latest Models, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/latest-models
- Speech-to-Text release notes, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/release-notes
- Best practices, Cloud Speech-to-Text. https://docs.cloud.google.com/speech-to-text/docs/v1/best-practices
- Select a transcription model, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/transcription-model
- Google Cloud Speech-to-Text V2 API, Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-speech-to-text-v2-api
- Conformer: Convolution-augmented Transformer for Speech Recognition, Google Research. https://research.google/pubs/conformer-convolution-augmented-transformer-for-speech-recognition/
- Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages, Google Research blog. https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/
- Tara N. Sainath, Google Research. https://research.google/people/tarasainath/
- Rohit Prabhavalkar, Google Research. https://research.google/people/rohitprabhavalkar/
- Ian McGraw, Google Research. https://research.google/people/106845/
- Compare transcription models, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/transcription-model
- Quotas and limits, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/quotas
- Transcribe audio from streaming input, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/streaming-recognize
- RecognitionConfig reference, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
- Improve transcription results with model adaptation, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/adaptation-model
- Custom Speech-to-Text models. https://docs.cloud.google.com/speech-to-text/docs/custom-speech-models
- Speech-to-Text pricing. https://cloud.google.com/speech-to-text/pricing
- Chirp 2: Enhanced multilingual accuracy, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-2
- Chirp 3 Transcription: Enhanced multilingual accuracy, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
- Migration guide. https://docs.cloud.google.com/speech-to-text/docs/migration
- Data logging, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/data-logging
- Data usage FAQ, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/data-usage-faq
- Endpoints, Cloud Speech-to-Text V1. https://docs.cloud.google.com/speech-to-text/docs/v1/endpoints
- Sensitive Data Protection documentation. https://docs.cloud.google.com/sensitive-data-protection/docs
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling, Google Research. https://research.google/pubs/dual-mode-asr-unify-and-improve-streaming-asr-with-full-context-modeling/
- Modular Domain Adaptation for Conformer-Based Streaming ASR, Google Research. https://research.google/pubs/modular-domain-adaptation-for-conformer-based-streaming-asr/
- Patent US12136415B2, Mixture model attention for flexible streaming and non-streaming ASR. https://patents.google.com/patent/US12136415B2/en
- Patent WO2019209569A1, Speaker diarization using an end-to-end model. https://patents.google.com/patent/WO2019209569A1/en
- Patent US20220122586A1, Fast Emit low-latency streaming ASR with sequence-level emission regularization. https://patents.google.com/patent/US20220122586A1/en