Google Cloud's latest_long model: what it is, what it costs, and when to pick something else

If you send Google Cloud Speech-to-Text an hour of podcast audio, a recorded meeting, or an interview, latest_long is the model Google wants you to reach for. It is the general-purpose long-form recognizer in the classic Speech-to-Text V1 API, positioned as the replacement for many workloads that used to run on the video and default models, and Google says the whole "latest" family is built on its Conformer speech-model technology.

The public record around it is thinner than you might expect for a production Google model, but the milestones are clear enough. On April 21, 2022, Google's release notes announced that the "Latest" models were available in more than 20 languages and used "new end-to-end machine learning techniques." Speaker diarization arrived for Latest models in en-US on October 3, 2022. Model adaptation for latest_long landed in 13 languages on January 9, 2024. Meanwhile Google shipped Speech-to-Text V2 in August 2023, which kept existing functionality but added recognizers, regionalization, and the separate Chirp family of large speech models.

That timeline puts latest_long in an awkward but useful spot. It is a modernized, Conformer-based production ASR model in the classic API line, but it is not Google's newest large multilingual speech family. Google now tells new users to start with the V2 API, where the generic long-form model usually appears under the identifier long, and the versioned Chirp 2 and Chirp 3 models carry Google's newer multilingual work. Google discloses only a high-level architecture for latest_long; the documentation cited here publishes no parameter count, no decoder design, and no training-corpus breakdown.

What Google says it is for

Google defines latest_long as the model for "any kind of long form content such as media or spontaneous speech and conversations." It can stand in for the video model when video is unavailable in the target language, or replace the legacy default model. Google frames the latest family as the way to get its "latest speech technology and machine learning research," with the caveat that some legacy features may lag behind older models.

In workload terms that means meeting notes, interviews, podcasts, webinars, lectures, media archives, long-form captioning, and general conversation transcription. To be precise about the sourcing: that list is an inference from Google's own intended categories of "media," "spontaneous speech," and "conversations," not a popularity ranking Google has published.

The docs also draw the boundaries plainly. If your source audio is telephony, Google says the newer telephony or telephony_short models will beat latest_long and latest_short on phone audio. For noisy, multi-speaker, video-style media, Google still describes the legacy video model as often the best choice, especially with high-quality microphones and heavy background noise, even while recommending latest_long as a replacement in many cases. That distinction matters day to day. latest_long is broad and modern, but it is not Google's best model for every domain, and Google says so itself.

One more positioning detail that changes the math: Google bills latest_long and latest_short as Standard models, not premium media models. Against the legacy video model, that is a real cost difference, not a rounding error.

The public timeline

The timeline below combines official release notes, Google Cloud blog posts, and model documentation. Google does not publish a distinct "first preview" date for latest_long, so none appears here. Every row is an event Google publicly documented.

Date	Public milestone	Why it matters
February 20, 2019	Google made transcription model selection generally available in Cloud Speech-to-Text, including specialized models such as video and phone_call.	The pre-latest_long baseline: model selection existed before the "Latest" family.
March 15, 2021	Model Adaptation launched for Speech-to-Text.	Adaptation later became relevant to latest_long, but not at its first public milestone.
April 21, 2022	Release notes announced that "Latest" models were available in more than 20 languages and used "new end-to-end machine learning techniques."	The clearest official public availability milestone for latest_long.
October 3, 2022	Speaker diarization became available for "Latest" models in en-US.	Important feature expansion for multi-speaker transcription.
November 11, 2022	Pricing changed so enhanced models were no longer priced differently; requests also moved to one-second rounding.	Simplified pricing context around the modern model family.
August 9, 2023	Speech-to-Text V2 became GA; Google said V2 migrated existing functionality and introduced Chirp, its latest 2B-parameter large speech model.	Marks the split between the classic latest_long line and the newer Chirp family.
November 6, 2023	Google launched telephony and telephony_short as the most recent versions of phone_call.	Narrows the domain where latest_long should be used.
January 9, 2024	Model adaptation became available for latest_long in 13 languages.	Major quality and control improvement for domain terms and vocabulary biasing.
January 27, 2025	Google announced broader availability for Chirp 2.	The newer multilingual branch of Google's speech lineup.
October 13, 2025	Google announced Chirp 3 GA, adding diarization and automatic language detection in the versioned Chirp line.	Google's newest ASR emphasis moved from "Latest" branding to explicit, versioned multilingual foundation models.

There is a wrinkle worth flagging for anyone who runs regression tests against transcription output. Google's docs say the "Latest" models may be updated or refreshed more frequently than other models, with changes that can affect accuracy or latency. latest_long is a product label pointing at a moving production model, not a frozen research checkpoint. If your quality gates assume a stable model underneath, they are assuming something Google explicitly declines to promise.
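
One way to notice a refresh is to re-score a fixed audio set on a schedule and compare word error rate against a stored baseline. The sketch below is illustrative only: `word_error_rate`, `drifted`, and the 0.02 tolerance are hypothetical choices, not anything Google publishes, and the normalization is deliberately naive (lowercasing and whitespace splitting).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def drifted(baseline_wer: float, new_wer: float, tolerance: float = 0.02) -> bool:
    """True when the new run is worse than the baseline by more than the tolerance."""
    return new_wer - baseline_wer > tolerance
```

A gate built this way tolerates small movement in either direction and only fails when quality drops past the margin you chose, which suits a model that is allowed to change underneath you.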

Who built it, as far as the public record goes

Google does not publish a launch engineering roster for latest_long. The strongest public attribution is indirect: product managers who announced adjacent Speech-to-Text platform changes, plus the researchers and engineers behind the Conformer and modern ASR work that Google says underlies the "Latest" models.

Person or group	Public role or affiliation	Publicly documented relevance
Calum Barnes	Product Manager, Cloud Speech.	Co-authored Google Cloud's V2/Chirp GA announcement, indicating product-side ownership of the Cloud Speech platform.
Haris Ioannou	Product Manager, Cloud Speech.	Co-authored the same Google Cloud announcement.
Conformer author group: Anmol Gulati, Chung-Cheng Chiu, James Qin, Jiahui Yu, Niki Parmar, Ruoming Pang, Shibo Wang, Wei Han, Yonghui Wu, Yu Zhang, Zhengdong Zhang	Google Research publication authors.	Google's latest_long docs say the "Latest" family is currently based on Conformer technology.
Yu Zhang	Research Scientist, Google Research.	Co-authored the Google Research USM blog and is also a Conformer author.
James Qin	Software Engineer, Google Research.	Co-authored the Google Research USM blog and is also a Conformer author.
Tara N. Sainath	Principal Research Scientist at Google, working on deep neural networks for ASR.	Senior public Google ASR researcher appearing on related papers such as Dual-mode ASR and Modular Domain Adaptation.
Rohit Prabhavalkar	Senior Staff Research Scientist in Google's Speech Technologies group.	Publicly associated with end-to-end ASR research and long-form ASR work.
Ian McGraw	Staff Software Engineer in Google's Speech group, managing on-device speech-recognition teams.	Publicly tied to speech-recognition engineering and confidence-related ASR work.

Read that table conservatively. These are publicly documented contributors to Google's speech stack and its model lineage, not a verified latest_long team list. Google has never published a "this exact model was developed by these named engineers" statement for latest_long in the sources reviewed here.

What we know about the architecture

Google's most explicit architectural statement is one line, and it carries a lot of weight: the "Latest" models are currently based on the Conformer Speech Model technology from Google and use "new end-to-end machine learning techniques." Google adds that this underlying technology "may change in the future," which again says latest_long is a rolling production family rather than a fixed, versioned checkpoint.

At the research level, Conformer combines transformer-style global context modeling with convolutional local feature modeling for ASR. Google's Conformer paper reports strong parameter efficiency and state-of-the-art results at publication time, including a small 10M-parameter variant on LibriSpeech. Do not read those paper numbers as the size of latest_long. Google has not stated the parameter count of the production model, and the paper predates the product.

So any diagram you see of latest_long internals is a simplified external view. Google documents the Conformer lineage and the API-visible features, but says nothing about the decoder stack, the serving graph, or the parameter count.

What is disclosed and what is not

Google documents latest_long as a Conformer-based member of the "Latest" family and says Cloud STT models are trained by analyzing millions of examples of human speech, with specialized models trained on audio from specific sources such as phone calls or video. For latest_long specifically, there is no published breakdown of supervised versus unsupervised data, no corpus sources, no labeling methodology, no sampling mix, and no parameter count.

The safe technical description: latest_long is a long-form, end-to-end, Conformer-based production ASR model whose public interface is far better documented than its training recipe. Compare that with USM and Chirp, where Google publishes high-level scale numbers like 2B parameters in its public materials. The newer the family, the more Google is willing to say about it.

Modes, limits, and latency

latest_long works with all three classic recognition modes: synchronous recognition, asynchronous long-running recognition, and streaming. Google's model-selection docs state that model selection applies to speech:recognize, speech:longrunningrecognize, and Streaming.

The V1 quotas set the practical envelope: roughly 1 minute for synchronous requests, roughly 480 minutes for asynchronous requests, and roughly 5 minutes per streaming request, with a 10 MB limit on inline local content. Streaming audio has to arrive at approximately real-time speed. In V2, the documented limits are 1 minute / 10 MB for synchronous requests, a 5-minute streaming session limit, and batch files up to 8 hours each, with up to 15 files per BatchRecognizeRequest.
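
The V1 limits above reduce to a small routing rule for file-based audio. This is a minimal sketch under stated assumptions: `pick_v1_method` and its constants are hypothetical names, not part of any Google client library, and "10 MB" is read conservatively as 10,000,000 bytes, which is worth checking against the quota page.

```python
SYNC_MAX_SECONDS = 60             # "roughly 1 minute" for synchronous requests
ASYNC_MAX_SECONDS = 480 * 60      # "roughly 480 minutes" for asynchronous requests
INLINE_MAX_BYTES = 10_000_000     # assumption: "10 MB" read as 10,000,000 bytes


def pick_v1_method(duration_seconds: float, size_bytes: int, in_cloud_storage: bool) -> str:
    """Return the V1 REST method name that fits a prerecorded file."""
    if duration_seconds > ASYNC_MAX_SECONDS:
        raise ValueError("audio exceeds the asynchronous limit; split it first")
    if not in_cloud_storage and size_bytes > INLINE_MAX_BYTES:
        raise ValueError("inline content too large; upload to Cloud Storage first")
    if duration_seconds <= SYNC_MAX_SECONDS:
        return "speech:recognize"
    return "speech:longrunningrecognize"
```

Streaming is left out on purpose, since the choice to stream usually depends on whether the audio is live rather than on its length.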

On latency, Google publishes no formal P50/P95 benchmark for latest_long. What it does say: streaming returns results in real time as audio is processed, and long-running recognition often completes in about half the source audio length, depending on the source. Treat those as practical service expectations, not an SLA.

Languages, diarization, punctuation, and noise

The "Latest" family covers more than 20 languages and more than 50 variants, with feature support varying by language. The supported-language pages are the source of truth for the current matrix.

For multilingual audio, V1 accepts a primary languageCode plus up to three alternativeLanguageCodes, and V2 has analogous multiple-language recognition. Google advises keeping the alternatives list as short as possible, because too many candidates reduce the odds of picking the correct language. In V2, alternative-language recognition works with the long, short, and telephony models.
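
The "keep the list short" advice is easy to enforce in code. The helper below is a hedged sketch: `language_fields` is a hypothetical name, and it only builds the two V1 RecognitionConfig fields named above as a plain dictionary, refusing more than three distinct alternatives.

```python
from typing import Dict, List, Optional, Sequence, Union


def language_fields(
    primary: str, alternatives: Optional[Sequence[str]] = None
) -> Dict[str, Union[str, List[str]]]:
    """Build languageCode / alternativeLanguageCodes for a V1 request body."""
    seen = {primary}
    unique: List[str] = []
    for code in alternatives or []:
        # Drop duplicates and any repeat of the primary language.
        if code not in seen:
            seen.add(code)
            unique.append(code)
    if len(unique) > 3:
        raise ValueError("V1 accepts at most three alternative language codes")
    fields: Dict[str, Union[str, List[str]]] = {"languageCode": primary}
    if unique:
        fields["alternativeLanguageCodes"] = unique
    return fields
```

Merging the result into the rest of the config (for example `{**language_fields("en-US", ["es-US"]), "model": "latest_long"}`) keeps the language choice in one place.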

Automatic punctuation is optional. When enabled it inserts periods, commas, and question marks, and Google says it capitalizes the first letter after each period and question mark. Spoken punctuation and spoken emojis are separate toggles, and availability depends on the language.

Speaker diarization is supported by the API, and for the "Latest" models the release notes date en-US availability to October 2022. In streaming mode, diarization causes the service to resend the accumulating word history on consecutive responses so speaker tags can improve over time. For non-streaming requests, diarization arrives on the final result.
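
Because speaker tags settle on the accumulated word list, downstream code typically reads only the most recent word history and folds it into turns. The sketch below assumes the REST JSON shape in which each word entry carries `word` and `speakerTag` keys; `speaker_turns` is a hypothetical helper, and words without a tag are grouped under 0.

```python
from typing import Dict, List, Tuple


def speaker_turns(words: List[Dict]) -> List[Tuple[int, str]]:
    """Collapse a tagged word list into consecutive (speaker, text) turns."""
    turns: List[Tuple[int, List[str]]] = []
    for info in words:
        tag = info.get("speakerTag", 0)
        if turns and turns[-1][0] == tag:
            # Same speaker keeps talking: extend the current turn.
            turns[-1][1].append(info["word"])
        else:
            turns.append((tag, [info["word"]]))
    return [(tag, " ".join(tokens)) for tag, tokens in turns]


sample = [
    {"word": "Hi", "speakerTag": 1},
    {"word": "there.", "speakerTag": 1},
    {"word": "Hello.", "speakerTag": 2},
    {"word": "Ready?", "speakerTag": 1},
]
for speaker, text in speaker_turns(sample):
    print(f"Speaker {speaker}: {text}")
```

In streaming mode, run this over the latest response only; earlier responses carry tags the service may since have revised.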

On noise, Google's stance is pragmatic rather than benchmark-driven. It recommends lossless audio such as FLAC or LINEAR16, warns that lossy codecs combined with background noise reduce accuracy, and describes the legacy video model as often best for high-quality microphones with heavy background noise. Model adaptation can help when audio is noisy or unclear. There is no dedicated robustness benchmark published for latest_long.

Adaptation and fine-tuning

The V1 RecognitionConfig exposes the accuracy controls around latest_long: adaptation, speechContexts, transcriptNormalization, word timestamps, word confidence, punctuation, diarization, profanity filtering, and the spoken punctuation and emoji toggles. When adaptation is set, it supersedes speechContexts.

Model adaptation is the lightweight domain-biasing mechanism. Phrase sets, class tokens, and boost values steer recognition toward names, jargon, addresses, and other rare or domain-specific content. Google says adaptation helps when audio contains frequent specialized terms, rare words, or noise and unclear speech. For latest_long it became available in 13 languages in January 2024.
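
An inline phrase set with per-phrase boosts can be assembled as plain JSON before it goes into the request. This is a sketch, not Google's reference code: `build_adaptation` is a hypothetical helper, the boost values are arbitrary, and the `$ADDRESSNUM` class token in the example should be checked against the class-token list for your language.

```python
from typing import Dict, List


def build_adaptation(phrase_boosts: Dict[str, float]) -> Dict[str, List[dict]]:
    """Build the "adaptation" object for a V1 RecognitionConfig."""
    phrases = []
    for value, boost in phrase_boosts.items():
        if boost <= 0:
            # Assumption: only positive boosts are meaningful for biasing.
            raise ValueError(f"boost for {value!r} must be positive")
        phrases.append({"value": value, "boost": boost})
    return {"phraseSets": [{"phrases": phrases}]}


adaptation = build_adaptation({
    "OpenTranscription": 10.0,        # product name from the request example
    "$ADDRESSNUM Main Street": 5.0,   # class token; verify support per language
})
```

Start with modest boosts and raise them only if the term is still missed, since aggressive biasing can cause false insertions of the boosted phrase.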

For deeper customization, Google's Custom Speech-to-Text models in V2 use a pre-trained, Conformer-based architecture fine-tuned from a base model, and the current docs list latest_long as the base model for supported locales including de-DE, en-AU, en-GB, en-IN, en-US, es-US, and es-ES. Training requires at least 100 audio hours of training data and 10 audio hours of validation data. This is the closest thing to true fine-tuning against the latest_long line that Google exposes publicly.

And one limitation that deserves more attention than it gets: Google explicitly says the latest models do not support true confidence scores. The API may return a value, but Google's own wording is that it is "not truly a confidence score." If your pipeline filters transcripts, gates quality, or routes segments to human review based on that number, you are building on a value the vendor has disclaimed.

Where it sits in Google's model lineup

Google's speech lineup is easiest to read as two overlapping layers. There is the classic Cloud STT model family (default, video, phone_call, latest_long, latest_short, and the newer telephony variants), and there is the V2 foundation-model family centered on Chirp 2 and Chirp 3. V2 also introduces recognizers as reusable configuration resources and broader regionalization.

Model	API surface	Best fit	Key notes
latest_long	V1 identifier latest_long.	Long-form media, spontaneous speech, conversations.	Conformer-based "Latest" family; standard pricing; intended replacement for many default / video cases; returned confidence value is "not truly a confidence score."
latest_short	V1 identifier latest_short.	Short utterances, commands, one-shot directed speech.	Intended replacement for command_and_search; quality improved in 2024 release notes.
video	V1 legacy premium model.	Video clips, podcasts, multi-speaker media, high-quality mics, background noise.	Often still best for noisy or multi-speaker media, but Google recommends latest_long as a replacement in many languages and workloads.
default	V1 legacy general model.	General audio not covered by specialized models.	Google says legacy models are "mostly based on classic non-conformer architectures" and kept mainly for backward compatibility.
phone_call	V1 legacy phone model.	8 kHz phone audio.	Superseded in practice by telephony / telephony_short.
telephony / telephony_short	Newer phone models documented in V1/V2 materials.	Phone-call audio, especially 8 kHz.	Google says these correspond to the most recent versions of phone_call.
chirp / Chirp family	V2-only family in pricing and blog context; current docs emphasize chirp_2 and chirp_3.	Multilingual, newer foundation-style ASR.	Original Chirp was presented as a 2B-parameter large speech model; Chirp 2 and Chirp 3 are the currently documented versioned models.
chirp_2	V2 only.	Multilingual transcription and translation.	USM-derived; supports timestamps, adaptation, and translation; docs say no diarization and no language detection.
chirp_3	V2 only.	Google's newest documented multilingual ASR model.	Adds diarization and automatic language detection; supports streaming, sync, and batch.

The V1 versus V2 question

For anyone evaluating latest_long today, the API generation matters almost as much as the model. Google now says new Cloud Speech-to-Text users should use the V2 API. V2 adds recognizers as reusable configuration objects, supports full regionalization per the V2 platform blog, and is the only home of the Chirp family.

Here is the part that trips people up. Google's V2 client examples use the model identifier long rather than latest_long, alongside recognizer resources like projects/{project}/locations/{location}/recognizers/_. The V2 multi-language docs likewise refer to long, short, and telephony models. That strongly suggests the generic long-form capability in V2 is exposed as long, even though Google's pricing tables still mention latest_long in the standard-model categories. But the docs reviewed here contain no single sentence saying "latest_long == long," so treat that mapping as operationally apparent rather than something Google has formally stated in one place.

If your organization is already standardized on V1 and happy with long-form English or supported-language workloads, latest_long remains a legitimate production choice under the GA Speech-to-Text SLA. For new systems, the strategic question usually is not "latest_long or not." It is "V2 long versus the Chirp family," especially where multilingual coverage, regionalization, or roadmap alignment matters.

Limits, privacy, and pricing

The limitations that actually bite

Four documented limitations are worth planning around.

The confidence warning comes first. For the latest models, the returned confidence value is not a true confidence score, in Google's own words. Vendors rarely disclaim their own API fields this directly, so take it seriously.

Second, domain mismatch. Telephony models outperform latest_long on phone audio, and Google continues to describe video as often superior for noisy, high-quality, multi-speaker media. Broad and modern does not mean optimal for every source condition.

Third, feature variability by language. Punctuation, diarization, and other features vary in support by language, and the supported-language matrix is the current source of truth.

Fourth, opacity. Google documents the Conformer lineage and the exposed API parameters, but says nothing about the parameter count, data mix, or internal serving architecture. Teams that need frozen weights, auditable training provenance, or benchmarked latency distributions will not find them here.

Privacy and data handling

Google's privacy stance is comparatively clear. By default, Cloud Speech-to-Text does not log customer audio data or transcripts. If you opt into data logging, Google may use that logged data to improve the service, and you may get discounted pricing in exchange.

Without the opt-in, Google says it does not use your content except to provide the service, and it does not claim ownership of the audio or the returned transcript.

Storage behavior differs by request type. For streaming and synchronous endpoints, audio is processed in memory and customer data is not stored, though some request metadata is temporarily logged for abuse prevention and service improvement. For asynchronous requests, the resulting transcript is stored for approximately five days so you can retrieve it; the API service does not store the input audio itself.

For regulated or region-sensitive workloads, Google offers US and EU regional endpoints in V1 and broader regionalization in V2, plus customer-managed encryption keys for Speech-to-Text resources through Cloud KMS.

One gap to plan for: the Speech-to-Text docs do not advertise native PII redaction as a latest_long feature. Google points to Sensitive Data Protection as the product for classification, redaction, and de-identification of text content. If you need transcript redaction, budget for a post-transcription DLP step rather than assuming the model handles it.

Pricing and quotas

For Speech-to-Text V1, Google's pricing page lists standard models, explicitly including latest_long, at $0.016/minute with data logging or $0.024/minute without it. Both rates apply only above the first 60 minutes each month, which are free per account.

For V2, standard recognition is priced by monthly tier, starting at $0.016/minute and falling with volume, and dynamic batch recognition is listed at $0.003/minute. The pricing page places latest_long, latest_short, and chirp in the standard model set for billing.

Current V1 quota docs list 900 recognition requests per 60 seconds, 480 hours of audio processing per day, and the content limits above (roughly 1 minute sync, 480 minutes async, 5 minutes streaming). Multi-channel audio is billed per channel, even though quota accounting is based on file duration rather than multiplied channel time. That per-channel billing detail surprises people on their first stereo invoice.
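
The V1 rates and the per-channel rule combine into simple arithmetic. The estimator below is a sketch under explicit assumptions: `estimate_v1_monthly_cost` is a hypothetical helper, each request is rounded up to the next whole second, and the 60 free minutes are subtracted from billed (per-channel) minutes. Confirm all three against a real invoice before relying on it.

```python
import math
from decimal import Decimal
from typing import Iterable, Tuple

FREE_MINUTES = Decimal(60)
RATE_WITH_LOGGING = Decimal("0.016")     # USD per minute, data logging enabled
RATE_WITHOUT_LOGGING = Decimal("0.024")  # USD per minute, data logging disabled


def estimate_v1_monthly_cost(
    requests: Iterable[Tuple[float, int]], data_logging: bool = False
) -> Decimal:
    """Estimate one month's cost from (duration_seconds, channels) pairs."""
    billed_seconds = sum(math.ceil(seconds) * channels for seconds, channels in requests)
    billed_minutes = Decimal(billed_seconds) / 60
    chargeable = max(billed_minutes - FREE_MINUTES, Decimal(0))
    rate = RATE_WITH_LOGGING if data_logging else RATE_WITHOUT_LOGGING
    return (chargeable * rate).quantize(Decimal("0.01"))


# 1,000 minutes of stereo audio bills as 2,000 minutes.
print(estimate_v1_monthly_cost([(60_000, 2)]))  # 46.56
```

Downmixing to mono before upload, where separate channels are not needed, halves the billed minutes in that example.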

How to deploy it well

Google's best-practices guidance lines up with how latest_long should be run in production. Use 16 kHz or higher where practical, avoid unnecessary resampling, and prefer lossless encodings like FLAC or LINEAR16. Lossy codecs plus noisy capture conditions reduce accuracy.
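
For WAV input, those recommendations can be checked before upload with nothing but the standard library. The sketch below is an assumption-laden preflight, not a Google tool: `wav_preflight` is a hypothetical helper, it handles only PCM WAV files readable by Python's `wave` module, and the warning texts are this article's paraphrase of the guidance.

```python
import wave
from typing import BinaryIO, List, Union


def wav_preflight(source: Union[str, BinaryIO]) -> List[str]:
    """Return human-readable warnings for a PCM WAV file or file object."""
    with wave.open(source, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample

    warnings: List[str] = []
    if sample_rate < 16000:
        warnings.append(
            f"sample rate is {sample_rate} Hz; 16 kHz or higher is recommended, "
            "but do not upsample telephony audio just to reach it"
        )
    if sample_width != 2:
        warnings.append(f"{sample_width * 8}-bit samples; LINEAR16 expects 16-bit")
    if channels > 1:
        warnings.append(f"{channels} channels; multi-channel audio is billed per channel")
    return warnings
```

An empty list means the file matches the basic recommendations; it says nothing about noise or microphone quality, which still need listening or benchmarking.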

For long-form media or conversation transcription, enable automatic punctuation, add word timestamps if you need them, and turn on diarization when speaker separation matters. Use model adaptation for names, product terms, or domain vocabulary. If the audio is actually phone audio, switch to telephony. If it is noisy, multi-speaker media such as studio, podcast, or video recordings, benchmark against video or the newer V2 models before assuming latest_long wins.
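
That guidance amounts to a short decision rule. The function below is only a sketch of it: `candidate_models` and its `source` labels are hypothetical, and the return value is a list of V1 models to benchmark, not a verdict on which one will win.

```python
from typing import List


def candidate_models(source: str, noisy_multispeaker: bool = False) -> List[str]:
    """Suggest V1 model identifiers to benchmark for a given audio source."""
    if source == "phone":
        # Google says the telephony models beat latest_long on phone audio.
        return ["telephony"]
    if noisy_multispeaker:
        # Google still describes video as often best here, so compare both.
        return ["video", "latest_long"]
    return ["latest_long"]
```

For example, `candidate_models("media", noisy_multispeaker=True)` returns both `video` and `latest_long`, a reminder to measure rather than assume.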

For new builds, prefer V2 when possible, especially if you need regionalization, recognizers, or a path to Chirp 2 and 3. For stable legacy deployments already on V1 request shapes with long-form English or media transcription, latest_long is still a valid GA model.

Two request patterns follow, grounded in Google's documented methods, parameters, and identifiers. The first uses V1 latest_long directly; the second shows the V2 long-form equivalent from Google's migration examples.

V1 long-running request with latest_long

curl -s -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  https://speech.googleapis.com/v1/speech:longrunningrecognize \
  --data '{
    "config": {
      "languageCode": "en-US",
      "model": "latest_long",
      "enableAutomaticPunctuation": true,
      "enableWordTimeOffsets": true,
      "diarizationConfig": {
        "enableSpeakerDiarization": true,
        "minSpeakerCount": 2,
        "maxSpeakerCount": 6
      },
      "adaptation": {
        "phraseSets": [
          {
            "phrases": [
              {"value": "OpenTranscription"},
              {"value": "speech adaptation"},
              {"value": "diarization"}
            ]
          }
        ]
      }
    },
    "audio": {
      "uri": "gs://YOUR_BUCKET/meeting_audio.flac"
    }
  }'

This reflects Google's documented V1 method names (speech:longrunningrecognize), the request configuration fields (model, punctuation, timestamps, diarization, adaptation), and the recommendation to use Cloud Storage URIs for longer audio.

V2 Python request using the long-form generic model

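import os

from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

# The project ID is read from the environment rather than hard-coded.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

client = SpeechClient()

# Let the service detect the audio encoding; "long" is the V2 long-form identifier.
config = cloud_speech.RecognitionConfig(
    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="long",
    features=cloud_speech.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
    ),
)

# The recognizers/_ path follows Google's V2 client examples.
request = cloud_speech.RecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/global/recognizers/_",
    config=config,
    uri="gs://YOUR_BUCKET/interview.wav",
)

# Synchronous call, so the 1 minute / 10 MB V2 limit applies to this file.
response = client.recognize(request=request)

for result in response.results:
    # alternatives is a list; the first entry is the top hypothesis.
    if result.alternatives:
        print(result.alternatives[0].transcript)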

Google's migration examples use model="long" with V2 recognizers, which is the documented long-form generic pattern in V2. If you need region-specific processing, Google also documents region-specific endpoints and recognizer paths.

Where to dig deeper

The sources below are the most useful starting set for primary-source diligence on latest_long and its lineage.

Source type	Source	Why it matters
Official doc	Introduction to Latest Models.	Canonical definition of latest_long, Conformer basis, pricing class, update cadence, and the confidence-score caveat.
Official doc	Select a transcription model.	Best single page for comparing latest_long to video, default, phone_call, and telephony.
Official doc	RecognitionConfig reference.	API surface for timestamps, adaptation, punctuation, diarization, channels, and model selection.
Official doc	Release notes.	The best source for the public timeline of latest_long feature rollouts.
Google Research paper	Conformer: Convolution-augmented Transformer for Speech Recognition.	Public research lineage behind the documented latest-model architecture.
Google Research paper	Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling.	Relevant to streaming and full-context co-training ideas in Google's ASR stack.
Google Research paper	Modular Domain Adaptation for Conformer-Based Streaming ASR.	Relevant to domain adaptation strategies around Conformer ASR.
Google Research blog	Universal Speech Model.	The newer multilingual speech-foundation direction that later fed Chirp.
Google Cloud blog	Speech-to-Text V2 API and Chirp are GA.	Best product-level explanation of V2, recognizers, regionalization, and Chirp's 2B scale.
Patent	Mixture model attention for flexible streaming and non-streaming ASR.	Representative Google patent on unified streaming and non-streaming ASR with conformer-based encoder blocks.
Patent	Speaker diarization using an end-to-end model.	Representative Google patent relevant to diarization functionality adjacent to latest_long.
Patent	Fast Emit: Low-latency Streaming ASR with Sequence-level Emission Regularization.	Representative Google patent on low-latency streaming transducer training relevant to production ASR behavior.

The short answer to "what is latest_long?" is: a Google Cloud Speech-to-Text long-form, Conformer-based, general-purpose production ASR model in the classic Speech API family. The longer, strategic answer is that it is a still-valid production model coexisting with V2's long path and the newer Chirp family, and Chirp is where Google's explicit multilingual foundation-model work now lives.

Sources

Introduction to Latest Models, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/latest-models
Speech-to-Text release notes, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/release-notes
Speech-to-Text best practices, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/best-practices
Select a transcription model, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/transcription-model
Google Cloud Speech-to-Text V2 API, Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/google-cloud-speech-to-text-v2-api
Conformer: Convolution-augmented Transformer for Speech Recognition, Google Research. https://research.google/pubs/conformer-convolution-augmented-transformer-for-speech-recognition/
Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages, Google Research blog. https://research.google/blog/universal-speech-model-usm-state-of-the-art-speech-ai-for-100-languages/
Tara N. Sainath, Google Research. https://research.google/people/tarasainath/
Rohit Prabhavalkar, Google Research. https://research.google/people/rohitprabhavalkar/
Ian McGraw, Google Research. https://research.google/people/106845/
Compare transcription models, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/transcription-model
Speech-to-Text quotas and limits, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/quotas
Transcribe audio from streaming input, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/streaming-recognize
RecognitionConfig reference, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig
Improve transcription results with model adaptation, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/adaptation-model
Custom Speech-to-Text models, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/custom-speech-models
Speech-to-Text pricing, Google Cloud. https://cloud.google.com/speech-to-text/pricing
Chirp 2: Enhanced multilingual accuracy, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-2
Chirp 3 Transcription: Enhanced multilingual accuracy, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
Migrate to Speech-to-Text V2, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/migration
Data logging, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/data-logging
Data usage FAQ, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/data-usage-faq
Regional endpoints, Cloud Speech-to-Text, Google Cloud Documentation. https://docs.cloud.google.com/speech-to-text/docs/v1/endpoints
Sensitive Data Protection documentation, Google Cloud. https://docs.cloud.google.com/sensitive-data-protection/docs
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling, Google Research. https://research.google/pubs/dual-mode-asr-unify-and-improve-streaming-asr-with-full-context-modeling/
Modular Domain Adaptation for Conformer-Based Streaming ASR, Google Research. https://research.google/pubs/modular-domain-adaptation-for-conformer-based-streaming-asr/
US12136415B2, Mixture model attention for flexible streaming and non-streaming ASR. https://patents.google.com/patent/US12136415B2/en
WO2019209569A1, Speaker diarization using an end-to-end model. https://patents.google.com/patent/WO2019209569A1/en
US20220122586A1, Fast Emit: Low-latency Streaming ASR with Sequence-level Emission Regularization. https://patents.google.com/patent/US20220122586A1/en