OpenTranscription/ Blog
2026-07-03 · MODEL PROFILE

Google Cloud latest_short: model profile

Reference profile of Google Cloud Speech-to-Text latest_short, a rolling Conformer-based model tag for short utterances and command-style speech.

Google Cloud latest_short is a model tag in the Cloud Speech-to-Text API that routes requests to Google's current Conformer-based recognition models for short utterances such as commands and other single shot directed speech.

Specifications

Developer: Google (Google Cloud)
Released: The latest model tag was introduced in April 2022
Model type: Rolling model tag over Conformer-based ASR models; short-form, command-oriented recognition
Languages: Not publicly disclosed. Feature support varies by language.
Modes (batch / streaming): Selectable across speech:recognize (synchronous), speech:longrunningrecognize (batch), and streaming in V1
Latency: Not publicly disclosed as a numeric figure. Tuned for low-latency single-utterance endpointing
Throughput / concurrency: V2 batch requests accept up to 15 files per request; files up to 8 hours
Deployment: Google Cloud Speech-to-Text API (V1 and V2 managed cloud service)
Pricing: Billed as a standard model; V2 standard recognition from $0.016 per minute down to $0.004 at the highest published tier; standard dynamic batch $0.003 per minute

Not disclosed: Parameters · Training data · License

Overview

latest_short is presented in Google's documentation as a rolling model tag rather than a pinned model release. Google's docs call latest a "model tag," and the 2022 launch post states that specifying latest_short gives access to "our latest conformer models as we continue to update them". The docs state that the current Latest models are based on Conformer technology but "may change in the future".

The model tag is intended for utterances "a few seconds in length," especially commands or other "single shot directed speech". Google positions latest_short as the successor to command_and_search: the V1 transcription-model docs say latest_short should be considered instead of command_and_search, which is classified among models "mostly based on classic non-conformer architectures" retained mainly for backward compatibility.

Google's documentation distinguishes two boundaries that are often conflated. The latest_short model docs describe a speech-behavior specialization (utterances a few seconds long), while the request-mode docs describe an API mechanics threshold: synchronous recognition for audio under one minute, asynchronous batch recognition for audio longer than 60 seconds. The source document identifies the under-60-second boundary as a synchronous request limit rather than part of the model definition.
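The request-mode threshold can be kept separate from the model choice in client code. The following is a minimal sketch, not Google's logic: the function name is hypothetical, and treating audio of exactly 60 seconds as synchronous is an assumption, since the docs only say "under one minute" and "longer than 60 seconds".

```python
# Documented API-mechanics threshold; it is not part of the latest_short model definition.
SYNC_LIMIT_SECONDS = 60.0


def choose_request_mode(duration_seconds: float) -> str:
    """Pick a request mode from audio duration alone (hypothetical helper)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Assumption: exactly 60 seconds is still treated as synchronous.
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    return "batch"


print(choose_request_mode(3.5))    # a short command
print(choose_request_mode(600.0))  # a ten-minute recording
```

Note that the function never inspects the model name: a few-second command and a 50-second clip both fall under the synchronous threshold, but only the former matches the speech behavior latest_short is specialized for.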

Capabilities and features

  • Short-utterance recognition: intended for utterances a few seconds in length, especially commands or other single shot directed speech.
  • Single-utterance behavior: when a recognizer uses latest_short, Cloud Speech-to-Text stops recognition once it detects the utterance has finished, returns an END_OF_SINGLE_UTTERANCE event, and in streaming mode closes the stream automatically after the utterance ends.
  • Silence sensitivity: short-form models such as latest_short and command_and_search are more suited to short audio and prompts and are likely to return results once they detect a period of silence.
  • Multi-mode selection: in V1, model selection applies across speech:recognize, speech:longrunningrecognize, and streaming, so latest_short can be selected for long-running (batch) recognition as well as synchronous and streaming requests.
  • Confidence values: confidence values for Latest models are returned but are "not truly a confidence score".
  • Channel restriction: in the V2 client reference, separate-per-channel recognition cannot be selected when the model is latest_short.
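The single-utterance behavior described above can be illustrated with a small simulation. This sketch does not call the Speech-to-Text client library: the dict-shaped events and the helper name are hypothetical stand-ins for streaming responses, and only the END_OF_SINGLE_UTTERANCE event name comes from the documentation.

```python
from typing import Iterable, List, Tuple

END_OF_SINGLE_UTTERANCE = "END_OF_SINGLE_UTTERANCE"


def collect_single_utterance(events: Iterable[dict]) -> Tuple[List[str], bool]:
    """Gather final transcripts and report whether the end-of-utterance event was seen."""
    transcripts: List[str] = []
    utterance_ended = False
    for event in events:
        if event.get("speech_event_type") == END_OF_SINGLE_UTTERANCE:
            # The service has stopped listening; a real client would stop sending audio here.
            utterance_ended = True
            continue
        if event.get("is_final"):
            transcripts.append(event["transcript"])
    return transcripts, utterance_ended


# Hypothetical event sequence: an interim result, the event, then the final result.
simulated = [
    {"transcript": "turn on", "is_final": False},
    {"speech_event_type": END_OF_SINGLE_UTTERANCE},
    {"transcript": "turn on the lights", "is_final": True},
]
print(collect_single_utterance(simulated))
```

The loop ends when the iterator is exhausted, which mirrors the documented behavior that the stream closes on its own after the utterance ends.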

Language support

The source does not state a language count for latest_short. Feature support varies by language.

For context within the same portfolio, Google's USM paper describes a 2B-parameter speech model pre-trained on 12 million hours across 300+ languages, with reported performance across 100+ languages; Chirp 2 is documented as USM-based in V2 materials. These figures apply to the USM and Chirp line, not to latest_short.

Performance and benchmarks

  • Vendor-reported: Google's release notes state that on January 9, 2024, quality was "substantially improved" for latest_short; the public model name did not change.
  • Vendor-reported (architecture family): the original Conformer paper reported state-of-the-art results on LibriSpeech at the time of publication. The paper describes the architecture family, not the deployed latest_short model.
  • Vendor guidance: for phone audio, Google recommends telephony or telephony_short and notes that phone-specific models can outperform latest_short or latest_long on telephony content.
  • Third-party evaluation: none cited in the source. The source states there is no current apples-to-apples public benchmark from primary sources for the short-command use case across vendors.
  • Customer benchmarking: Google offers V2 benchmarking tools so customers can compare word-error-rate across models using their own ground-truth data.

No word error rate figures for latest_short are disclosed in the source.
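For readers who want to run their own comparison on ground-truth data, word error rate is the word-level edit distance divided by the number of reference words. The sketch below is a generic implementation, not Google's benchmarking tool; lowercasing and whitespace tokenization are simplifying assumptions, and real evaluations usually apply stricter text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the lights", "turn on the light"))
```

For short commands, a single wrong word moves WER substantially (one error in four words is 0.25), so a command-level exact-match rate may be a useful companion metric.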

Latency and throughput

  • Google's best-practices page recommends StreamingRecognize with single_utterance=true for short queries or commands, to optimize for short utterances and minimize latency.
  • Synchronous recognition is documented as the simplest option for audio under one minute; asynchronous batch recognition is intended for audio longer than 60 seconds.
  • In V2, synchronous requests are limited to 10 MB or one minute of audio. Batch requests accept only Cloud Storage URIs, can include up to 15 files per request, and can process files up to 8 hours.
  • Dynamic batching is described as lower cost in exchange for higher latency.
  • The long-audio page states that for audio shorter than 60 seconds, synchronous recognition is faster and simpler than asynchronous recognition.
  • Google warns that Latest model refreshes can alter accuracy or latency.

No numeric latency figures for latest_short are disclosed in the source.
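The V2 request limits listed above lend themselves to a pre-flight check before any request is sent. This is a sketch under stated assumptions: the helper names are hypothetical, "10 MB" is read as 10 × 1024 × 1024 bytes, and both synchronous limits are assumed to apply together.

```python
from typing import List

MAX_SYNC_BYTES = 10 * 1024 * 1024   # assumption: 10 MB interpreted as 10 MiB
MAX_SYNC_SECONDS = 60
MAX_FILES_PER_BATCH = 15
MAX_BATCH_FILE_SECONDS = 8 * 60 * 60


def fits_synchronous(size_bytes: int, duration_seconds: float) -> bool:
    """True if the audio stays within both documented V2 synchronous limits."""
    return size_bytes <= MAX_SYNC_BYTES and duration_seconds <= MAX_SYNC_SECONDS


def fits_batch_file(duration_seconds: float) -> bool:
    """True if a single file is within the documented 8-hour batch limit."""
    return duration_seconds <= MAX_BATCH_FILE_SECONDS


def plan_batches(uris: List[str]) -> List[List[str]]:
    """Split Cloud Storage URIs into groups of at most 15 files per batch request."""
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"batch input must be a Cloud Storage URI: {uri}")
    return [
        uris[i:i + MAX_FILES_PER_BATCH]
        for i in range(0, len(uris), MAX_FILES_PER_BATCH)
    ]


# Hypothetical bucket and object names, for illustration only.
batches = plan_batches([f"gs://example-bucket/clip-{n}.wav" for n in range(32)])
print([len(batch) for batch in batches])
```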

Deployment and integrations

  • latest_short is available through the Cloud Speech-to-Text API. In V1, it can be selected for synchronous (speech:recognize), asynchronous (speech:longrunningrecognize), and streaming recognition.
  • V2 batch recognition is built for processing N audio files in one long-running operation, with dynamic batching available for lower-urgency processing.
  • Current V1 pages direct new Cloud Speech-to-Text users to use the V2 API. V2 documentation promotes Chirp 2 and Chirp 3; Chirp 3 is described as Google's latest multilingual ASR-specific generative model, available only in V2, with streaming, synchronous, and batch support plus diarization and automatic language detection.
  • Vendor guidance for interactive command paths points toward streaming or synchronous recognition with utterance-oriented endpointing rather than batch.
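Selecting the tag in a V1 synchronous request amounts to setting the model field in the recognition config. The sketch below only builds the JSON body for the speech:recognize REST method and does not send it; authentication is omitted, and the encoding, sample rate, and language code are illustrative assumptions that should be checked against the audio and against per-language feature support.

```python
import base64
import json

V1_RECOGNIZE_URL = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_body(
    audio_bytes: bytes,
    language_code: str = "en-US",       # assumption: illustrative language
    sample_rate_hertz: int = 16000,     # assumption: 16 kHz audio
) -> dict:
    """Build a V1 speech:recognize request body that selects latest_short."""
    return {
        "config": {
            "encoding": "LINEAR16",     # assumption: raw 16-bit PCM input
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "model": "latest_short",
        },
        "audio": {
            # Inline audio is sent as base64 text in the REST API.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


# 0.1 seconds of 16-bit silence at 16 kHz, used as placeholder audio.
body = build_recognize_body(b"\x00\x00" * 1600)
print(V1_RECOGNIZE_URL)
print(json.dumps(body["config"], indent=2))
```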

Pricing

The pricing page says latest_short is billed as a standard model.

Item | Price
V2 standard recognition, first 500,000 minutes per month | $0.016 per minute
V2 standard recognition, highest published tier | $0.004 per minute
V2 standard dynamic batch | $0.003 per minute

Derived figures stated in the source: a 30-second clip works out to about $0.008 at standard V2 pricing and about $0.0015 with dynamic batch, before storage and networking costs. Each channel is billed separately, so multi-channel audio can increase actual spend.
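The derived figures follow from simple per-minute arithmetic. The sketch below reproduces them; the function name is hypothetical, the rates are the first-tier figures quoted in this profile, and any billing-increment rounding, storage, or networking cost is ignored.

```python
STANDARD_RATE_PER_MINUTE = 0.016       # V2 standard recognition, first tier
DYNAMIC_BATCH_RATE_PER_MINUTE = 0.003  # V2 standard dynamic batch


def estimate_cost(duration_seconds: float, rate_per_minute: float, channels: int = 1) -> float:
    """Estimate recognition cost in USD; each channel is billed separately."""
    if duration_seconds < 0 or channels < 1:
        raise ValueError("duration must be non-negative and channels at least 1")
    return duration_seconds / 60 * rate_per_minute * channels


print(round(estimate_cost(30, STANDARD_RATE_PER_MINUTE), 4))               # 30 s, mono, standard
print(round(estimate_cost(30, DYNAMIC_BATCH_RATE_PER_MINUTE), 4))          # 30 s, mono, dynamic batch
print(round(estimate_cost(30, STANDARD_RATE_PER_MINUTE, channels=2), 4))   # 30 s, stereo, standard
```

The stereo case doubles the mono estimate, which is the per-channel billing effect noted above.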

Development and ownership

latest_short is developed and operated by Google as part of Google Cloud Speech-to-Text. Google's docs state the Latest models are based on Conformer technology. The 2022 launch blog describes the speech architecture as a single neural network that augments a transformer with convolution layers instead of relying on separately trained acoustic, pronunciation, and language models.

The original Conformer paper describes a convolution-augmented Transformer that combines local feature extraction with global sequence modeling, using stacked Conformer blocks that sandwich self-attention and convolution between feed-forward modules.
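The sandwich structure can be written compactly. The following restates the block equations from the original Conformer paper for an input x, with FFN a feed-forward module, MHSA multi-head self-attention, and Conv the convolution module. It describes the architecture family only and makes no claim about the internals of the deployed latest_short model.

```latex
\begin{aligned}
\tilde{x} &= x + \tfrac{1}{2}\,\mathrm{FFN}(x) \\
x' &= \tilde{x} + \mathrm{MHSA}(\tilde{x}) \\
x'' &= x' + \mathrm{Conv}(x') \\
y &= \mathrm{LayerNorm}\!\left(x'' + \tfrac{1}{2}\,\mathrm{FFN}(x'')\right)
\end{aligned}
```

The half-weighted residuals on the two feed-forward modules are what place self-attention and convolution in the middle of the block.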

Related Google Research publications include work on streaming Conformer transducers, FastEmit (a latency-regularization method applied to transducer models including Conformer-Transducer, with reported latency reductions and better streaming behavior), and a 2023 paper on modular domain adaptation for Conformer-based streaming ASR that discusses a Conformer transducer trained on video-caption data and adapted to domains such as voice search and dictation. The source states these papers do not establish that latest_short is exactly a Conformer-Transducer with FastEmit.

Release history

Date | Event
April 2022 | Google introduced the latest model tag, described as access to newer Conformer-based models
January 9, 2024 | Release notes state quality was "substantially improved" for latest_short; the public model name did not change

Google warns that Latest models may be refreshed more frequently than other models and that updates can make slight changes to accuracy or latency. There is no public evidence of a customer-facing version pin for latest_short.

Sources

Introduction to Latest Models | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/latest-models

Best practices | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/best-practices

Select a transcription model | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/transcription-model

Speech-to-Text release notes | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/release-notes

Single utterance behavior | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/single-utterance

Troubleshooting | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/troubleshooting

Transcribe long audio files into text | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/batch-recognize

Method: projects.locations.recognizers.batchRecognize | Cloud Speech-to-Text | Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/reference/rest/v2/projects.locations.recognizers/batchRecognize

Conformer: Convolution-augmented Transformer for Speech Recognition. https://www.isca-archive.org/interspeech_2020/gulati20_interspeech.pdf

FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization. https://research.google/pubs/fastemit-low-latency-streaming-asr-with-sequence-level-emission-regularization/

Google Cloud updates Speech API models for improved accuracy | Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-updates-speech-api-models-for-improved-accuracy

How to recognize speech - Speech service - Foundry Tools | Microsoft Learn. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech

Speech-to-Text API Pricing | Google Cloud. https://cloud.google.com/speech-to-text/pricing
