Google Cloud latest_short: model profile
Reference profile of Google Cloud Speech-to-Text latest_short, a rolling Conformer-based model tag for short utterances and command-style speech.
Google Cloud latest_short is a model tag in the Cloud Speech-to-Text API that routes requests to Google's current Conformer-based recognition models for short utterances such as commands and other single shot directed speech.
Specifications
| Field | Value |
|---|---|
| Developer | Google (Google Cloud) |
| Released | The latest model tag was introduced in April 2022 |
| Model type | Rolling model tag over Conformer-based ASR models; short-form, command-oriented recognition |
| Languages | Not publicly disclosed. Feature support varies by language. |
| Modes (batch / streaming) | Selectable across speech:recognize (synchronous), speech:longrunningrecognize (batch), and streaming in V1 |
| Latency | Not publicly disclosed as a numeric figure. Tuned for low-latency single-utterance endpointing |
| Throughput / concurrency | V2 batch requests accept up to 15 files per request; files up to 8 hours |
| Deployment | Google Cloud Speech-to-Text API (V1 and V2 managed cloud service) |
| Pricing | Billed as a standard model; V2 standard recognition from $0.016 per minute down to $0.004 at the highest published tier; standard dynamic batch $0.003 per minute |
Not disclosed: Parameters · Training data · License
Overview
latest_short is presented in Google's documentation as a rolling model tag rather than a pinned model release. Google's docs call latest a "model tag," and the 2022 launch post states that specifying latest_short gives access to "our latest conformer models as we continue to update them". The docs state that the current Latest models are based on Conformer technology but "may change in the future".
The model tag is intended for utterances "a few seconds in length," especially commands or other "single shot directed speech". Google positions latest_short as the successor to command_and_search: the V1 transcription-model docs say latest_short should be considered instead of command_and_search, which is classified among models "mostly based on classic non-conformer architectures" retained mainly for backward compatibility.
Google's documentation distinguishes two boundaries that are often conflated. The latest_short model docs describe a speech-behavior specialization (utterances a few seconds long), while the request-mode docs describe an API mechanics threshold: synchronous recognition for audio under one minute, asynchronous batch recognition for audio longer than 60 seconds. The source document identifies the under-60-second boundary as a synchronous request limit rather than part of the model definition.
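The request-mode boundary described above can be sketched as a small routing helper. This is an illustration only: the function name and constant are hypothetical, and the handling of audio that is exactly 60 seconds long is an assumption, since the quoted wording ("under one minute" versus "longer than 60 seconds") does not settle that case.

```python
# Request-mode threshold taken from the request-mode docs quoted above.
# It describes API mechanics, not the latest_short model itself.
SYNC_MAX_SECONDS = 60


def choose_request_mode(duration_seconds: float) -> str:
    """Pick a request mode from audio duration alone (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if duration_seconds < SYNC_MAX_SECONDS:
        return "synchronous"
    # Assumption: audio of exactly 60 seconds is routed to batch.
    return "batch"
```

A three-second command therefore routes to synchronous recognition regardless of which model is selected, which is the distinction the paragraph above draws.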
Capabilities and features
- Short-utterance recognition: intended for utterances a few seconds in length, especially commands or other single shot directed speech.
- Single-utterance behavior: when a recognizer uses latest_short, Cloud Speech-to-Text stops recognition once it detects the utterance has finished, returns an END_OF_SINGLE_UTTERANCE event, and in streaming mode closes the stream automatically after the utterance ends.
- Silence sensitivity: short-form models such as latest_short and command_and_search are more suited to short audio and prompts and are likely to return results once they detect a period of silence.
- Multi-mode selection: in V1, model selection applies across speech:recognize, speech:longrunningrecognize, and streaming, so latest_short can be selected for long-running (batch) recognition as well as synchronous and streaming requests.
- Confidence values: confidence values for Latest models are returned but are "not truly a confidence score".
- Channel restriction: in the V2 client reference, separate-per-channel recognition cannot be selected when the model is latest_short.
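The single-utterance behavior listed above can be illustrated with a client-side loop over simulated streaming responses. This is a sketch under stated assumptions: the responses are plain dictionaries with a flattened, hypothetical shape (the real API nests transcripts inside alternatives), the function name is invented for illustration, and the ordering of events in `demo` is illustrative rather than documented behavior.

```python
from typing import Iterable, Tuple

END_EVENT = "END_OF_SINGLE_UTTERANCE"  # event name as given in the docs


def consume_single_utterance(responses: Iterable[dict]) -> Tuple[str, bool]:
    """Return (final transcript, whether the end-of-utterance event was seen).

    Hypothetical helper operating on simplified response dictionaries.
    """
    finals = []
    utterance_ended = False
    for response in responses:
        if response.get("speech_event_type") == END_EVENT:
            # A real client would stop sending audio at this point.
            utterance_ended = True
        # Keep reading: results can still be present in later responses.
        for result in response.get("results", []):
            if result.get("is_final"):
                finals.append(result["transcript"])
    return " ".join(finals), utterance_ended


# Simulated stream: an interim result, the end event, then a final result.
demo = [
    {"results": [{"transcript": "turn on", "is_final": False}]},
    {"speech_event_type": END_EVENT},
    {"results": [{"transcript": "turn on the lights", "is_final": True}]},
]
transcript, ended = consume_single_utterance(demo)
```

The loop keeps only results marked final and records the event as a signal to stop sending audio, rather than as a reason to stop reading.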
Language support
The source does not state a language count for latest_short. Feature support varies by language.
For context within the same portfolio, Google's USM paper describes a 2B-parameter speech model pre-trained on 12 million hours across 300+ languages, with reported performance across 100+ languages; Chirp 2 is documented as USM-based in V2 materials. These figures apply to the USM and Chirp line, not to latest_short.
Performance and benchmarks
- Vendor-reported: Google's release notes state that on January 9, 2024, quality was "substantially improved" for latest_short; the public model name did not change.
- Vendor-reported (architecture family): the original Conformer paper reported state-of-the-art results on LibriSpeech at the time of publication. The paper describes the architecture family, not the deployed latest_short model.
- Vendor guidance: for phone audio, Google recommends telephony or telephony_short and notes that phone-specific models can outperform latest_short or latest_long on telephony content.
- Third-party evaluation: none cited in the source. The source states there is no current apples-to-apples public benchmark from primary sources for the short-command use case across vendors.
- Customer benchmarking: Google offers V2 benchmarking tools so customers can compare word-error-rate across models using their own ground-truth data.
No word error rate figures for latest_short are disclosed in the source.
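Because no figures are published, any comparison has to come from a customer's own ground-truth data. The sketch below computes word error rate with the standard word-level edit-distance definition; it is not Google's benchmarking tool, and the lowercase, whitespace-split normalization is an assumption that a real evaluation would likely need to refine.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "turn on the light" against the reference "turn on the lights" counts one substitution in four reference words, a word error rate of 0.25.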
Latency and throughput
- Google's best-practices page recommends StreamingRecognize with single_utterance=true for short queries or commands, to optimize for short utterances and minimize latency.
- Synchronous recognition is documented as the simplest option for audio under one minute; asynchronous batch recognition is intended for audio longer than 60 seconds.
- In V2, synchronous requests are limited to 10 MB or one minute of audio. Batch requests accept only Cloud Storage URIs, can include up to 15 files per request, and can process files up to 8 hours.
- Dynamic batching is described as lower cost in exchange for higher latency.
- The long-audio page states that for audio shorter than 60 seconds, synchronous recognition is faster and simpler than asynchronous recognition.
- Google warns that Latest model refreshes can alter accuracy or latency.
No numeric latency figures for latest_short are disclosed in the source.
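The V2 request limits listed above can be expressed as a pre-flight check. This is a minimal sketch: the class and function names are hypothetical, "10 MB" is interpreted as decimal megabytes (an assumption), and both the size and the duration limit are treated as applying to a synchronous request.

```python
from dataclasses import dataclass
from typing import List

SYNC_MAX_BYTES = 10 * 1000 * 1000   # "10 MB"; decimal megabytes is an assumption
SYNC_MAX_SECONDS = 60               # one minute of audio
BATCH_MAX_FILES = 15                # files per batch request
BATCH_MAX_SECONDS = 8 * 60 * 60     # eight hours per file


@dataclass
class AudioFile:
    uri: str
    size_bytes: int
    duration_seconds: float


def sync_problems(audio: AudioFile) -> List[str]:
    """List the reasons a file would not fit a V2 synchronous request."""
    problems = []
    if audio.size_bytes > SYNC_MAX_BYTES:
        problems.append("larger than 10 MB")
    if audio.duration_seconds > SYNC_MAX_SECONDS:
        problems.append("longer than one minute")
    return problems


def batch_problems(files: List[AudioFile]) -> List[str]:
    """List the reasons a set of files would not fit one V2 batch request."""
    problems = []
    if len(files) > BATCH_MAX_FILES:
        problems.append("more than 15 files in one request")
    for audio in files:
        if not audio.uri.startswith("gs://"):
            problems.append(f"{audio.uri}: not a Cloud Storage URI")
        if audio.duration_seconds > BATCH_MAX_SECONDS:
            problems.append(f"{audio.uri}: longer than 8 hours")
    return problems
```

An empty list from either function means the request fits the documented limits; it says nothing about latency, which remains undisclosed.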
Deployment and integrations
- latest_short is available through the Cloud Speech-to-Text API. In V1, it can be selected for synchronous (speech:recognize), asynchronous (speech:longrunningrecognize), and streaming recognition.
- V2 batch recognition is built for processing N audio files in one long-running operation, with dynamic batching available for lower-urgency processing.
- Current V1 pages direct new Cloud Speech-to-Text users to use the V2 API. V2 documentation promotes Chirp 2 and Chirp 3; Chirp 3 is described as Google's latest multilingual ASR-specific generative model, available only in V2, with streaming, synchronous, and batch support plus diarization and automatic language detection.
- Vendor guidance for interactive command paths points toward streaming or synchronous recognition with utterance-oriented endpointing rather than batch.
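Selecting the model in a V1 synchronous request comes down to the `model` field of the recognition config. The sketch below only builds the JSON body for speech:recognize and sends nothing; the encoding, sample rate, and language code are placeholder assumptions, the helper name is hypothetical, and authentication is omitted.

```python
import base64
import json

# V1 synchronous recognition endpoint (speech:recognize).
V1_RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_body(
    audio_bytes: bytes,
    language_code: str = "en-US",      # placeholder; language support varies
    sample_rate_hertz: int = 16000,    # placeholder; must match the audio
) -> dict:
    """Build a V1 recognize request body that selects latest_short."""
    return {
        "config": {
            "encoding": "LINEAR16",    # placeholder: raw 16-bit PCM
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "model": "latest_short",
        },
        # Inline audio is sent as base64 text in the JSON body.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


# 10 ms of silence at 16 kHz, used here only to produce a printable body.
body = build_recognize_body(b"\x00\x00" * 160)
print(json.dumps(body["config"], indent=2))
```

Building the body separately from the HTTP call keeps the model selection easy to inspect and makes it simple to swap in a different model for comparison.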
Pricing
The pricing page says latest_short is billed as a standard model.
| Item | Price |
|---|---|
| V2 standard recognition, first 500,000 minutes per month | $0.016 per minute |
| V2 standard recognition, highest published tier | $0.004 per minute |
| V2 standard dynamic batch | $0.003 per minute |
Derived figures stated in the source: a 30-second clip works out to about $0.008 at standard V2 pricing and about $0.0015 with dynamic batch, before storage and networking costs. Each channel is billed separately, so multi-channel audio can increase actual spend.
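The derived figures can be reproduced with straightforward per-minute arithmetic. This sketch uses the two rates from the table above, ignores volume tiers and any billing-increment rounding (both assumptions), and models per-channel billing as a simple multiplier; the function name is hypothetical.

```python
from decimal import Decimal

STANDARD_PER_MINUTE = Decimal("0.016")       # V2 standard, first pricing tier
DYNAMIC_BATCH_PER_MINUTE = Decimal("0.003")  # V2 standard dynamic batch


def clip_cost(seconds: int, per_minute: Decimal, channels: int = 1) -> Decimal:
    """Cost in USD for one clip, before storage and networking costs."""
    if seconds < 0 or channels < 1:
        raise ValueError("seconds must be >= 0 and channels >= 1")
    # Each channel is billed separately, so cost scales with channel count.
    return Decimal(seconds) / Decimal(60) * per_minute * channels


standard = clip_cost(30, STANDARD_PER_MINUTE)
dynamic = clip_cost(30, DYNAMIC_BATCH_PER_MINUTE)
stereo = clip_cost(30, STANDARD_PER_MINUTE, channels=2)
```

For a 30-second mono clip this yields 0.008 at the standard rate and 0.0015 with dynamic batch, matching the figures above; the same clip in stereo doubles to 0.016.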
Development and ownership
latest_short is developed and operated by Google as part of Google Cloud Speech-to-Text. Google's docs state the Latest models are based on Conformer technology. The 2022 launch blog describes the speech architecture as a single neural network that augments a transformer with convolution layers instead of relying on separately trained acoustic, pronunciation, and language models.
The original Conformer paper describes a convolution-augmented Transformer that combines local feature extraction with global sequence modeling, using stacked Conformer blocks that sandwich self-attention and convolution between feed-forward modules.
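The "sandwich" structure can be written out as the block equations given in the Conformer paper, where $x_i$ is the input to block $i$ and $y_i$ its output. These describe the published architecture family only; nothing here is confirmed as the configuration of the deployed latest_short model.

```latex
\begin{aligned}
\tilde{x}_i &= x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i) \\
x'_i        &= \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i) \\
x''_i       &= x'_i + \mathrm{Conv}(x'_i) \\
y_i         &= \mathrm{LayerNorm}\!\left(x''_i + \tfrac{1}{2}\,\mathrm{FFN}(x''_i)\right)
\end{aligned}
```

Here FFN is a feed-forward module, MHSA is multi-headed self-attention, and Conv is the convolution module; the two half-weighted feed-forward residuals are the outer layers of the sandwich.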
Related Google Research publications include work on streaming Conformer transducers, FastEmit (a latency-regularization method applied to transducer models including Conformer-Transducer, with reported latency reductions and better streaming behavior), and a 2023 paper on modular domain adaptation for Conformer-based streaming ASR that discusses a Conformer transducer trained on video-caption data and adapted to domains such as voice search and dictation. The source states these papers do not establish that latest_short is exactly a Conformer-Transducer with FastEmit.
Release history
| Date | Event |
|---|---|
| April 2022 | Google introduced the latest model tag, described as access to newer Conformer-based models |
| January 9, 2024 | Release notes state quality was "substantially improved" for latest_short; the public model name did not change |
Google warns that Latest models may be refreshed more frequently than other models and that updates can make slight changes to accuracy or latency. There is no public evidence of a customer-facing version pin for latest_short.
Sources
Introduction to Latest Models | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/latest-models
Best practices | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/best-practices
Select a transcription model | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/transcription-model
Speech-to-Text release notes | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/release-notes
Single utterance behavior | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/single-utterance
Troubleshooting | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/troubleshooting
Transcribe long audio files into text | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/batch-recognize
Method: projects.locations.recognizers.batchRecognize | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/reference/rest/v2/projects.locations.recognizers/batchRecognize
Conformer: Convolution-augmented Transformer for Speech Recognition. https://www.isca-archive.org/interspeech_2020/gulati20_interspeech.pdf
FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization. https://research.google/pubs/fastemit-low-latency-streaming-asr-with-sequence-level-emission-regularization/
Google Cloud updates Speech API models for improved accuracy | Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-updates-speech-api-models-for-improved-accuracy
How to recognize speech - Speech service - Foundry Tools | Microsoft Learn. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech
Speech-to-Text API Pricing | Google Cloud. https://cloud.google.com/speech-to-text/pricing