GPT-4o Transcribe: what OpenAI ships, claims, and still won't tell you
A practitioner's look at gpt-4o-transcribe: pricing, API surface, benchmark evidence, and why OpenAI now recommends the mini model over it.

gpt-4o-transcribe is OpenAI's dedicated speech-to-text model in the GPT-4o family. It arrived on March 20, 2025, alongside gpt-4o-mini-transcribe and gpt-4o-mini-tts, and it turned OpenAI from "the company that released Whisper" into a direct competitor in the premium managed transcription market. The model is aimed at developer workloads like customer call centers, meeting notes, and voice agents, not at ChatGPT users. And there is a twist worth knowing before you write a line of integration code: as of January 13, 2026, OpenAI's own changelog says it currently recommends gpt-4o-mini-transcribe over gpt-4o-transcribe for the best results.
That recommendation is strange on its face. The nominally bigger model is no longer the one OpenAI points you to. It also tells you something real about how this product line works, which is what this post is about: what OpenAI has actually disclosed, what the model costs and does at the API level, what the benchmark evidence says, and where the gaps are.
The short version on transparency: OpenAI says the model builds on GPT-4o, uses specialized audio-centric pretraining data, improved distillation, and an "RL-heavy" training paradigm for speech-to-text. It does not publish the parameter count, layer counts, tokenizer, corpus size, exact language coverage in training, or whether the deployed system is fully end-to-end or partly cascaded. Whisper's architecture, model sizes, and training corpus are all public. For this model, none of that is.
How the model got here
OpenAI's transcription stack evolved in stages rather than as one clean model line. Whisper shipped in September 2022 as an open-source general-purpose ASR system trained on 680,000 hours of multilingual, multitask supervision from the web. GPT-4o entered the API on May 13, 2024, as the flagship "omni" family anchor. The Realtime API beta followed on October 1, 2024, and gpt-4o-audio-preview on October 17, 2024, which exposed GPT-4o-family audio I/O for chat completions. The dedicated transcription SKUs, gpt-4o-transcribe and gpt-4o-mini-transcribe, landed on March 20, 2025 in the Audio API. By December 15, 2025, OpenAI had released a dated gpt-4o-mini-transcribe-2025-12-15 snapshot, and on January 13, 2026 it updated the moving slug and added the recommendation to prefer the mini model.
One product nuance that gets missed: public ChatGPT release notes from February 14, 2025 still referred to ChatGPT's voice-to-text dictation feature as Whisper, a month before the API launch of the GPT-4o transcription models. The consumer dictation path and the API transcription roadmap were apparently not unified at that point. OpenAI's Realtime API announcement also framed its speech stack as "similar to ChatGPT's Advanced Voice Mode," so consumer ChatGPT voice and the dedicated transcription API slugs are not the same product surface, and you should not reason about one from the other.
This timeline is synthesized from OpenAI's Whisper materials, GPT-4o model and changelog pages, the March 2025 audio-model launch post, ChatGPT release notes, and the 2025-2026 API changelog.
What it is for
OpenAI's launch framing was unusually direct: the new speech-to-text models exist to help developers build more accurate, more robust voice agents, with particular emphasis on customer call centers and meeting-note transcription. The launch post positions them as infrastructure, not as a general multimodal chat endpoint. It also says developers who want low-latency speech-to-speech should prefer the Realtime API speech models. gpt-4o-transcribe is the high-accuracy transcription member of a broader voice stack, and OpenAI has never pretended otherwise.
At the API level, the speech-to-text guide says OpenAI historically backed both transcription and translation endpoints with whisper-1, and now also supports gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize on the transcription side. The guide distinguishes three usage patterns: single-shot file transcription, streaming of completed recordings, and realtime transcription sessions for low-latency live audio. For gpt-4o-transcribe, OpenAI documents a language hint that improves both accuracy and latency, and a text prompt field that can guide style or continue a previous audio segment.
The promptability is the interesting part. Conventional ASR APIs mostly expose optional bias phrases. OpenAI exposes a GPT-like prompt channel directly on transcription requests. The API reference says the prompt should match the audio language, and the model can return token log probabilities when include=["logprobs"] is requested. This is much closer to a text-generating ASR system than a classic opaque recognizer, even though OpenAI markets it as a transcription model rather than a chat model.
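A minimal sketch of how those request options fit together, assuming the official `openai` Python SDK and its `client.audio.transcriptions.create` method. The helper names, the example threshold of -1.0, and the assumption that logprob entries have been converted to dicts with `token` and `logprob` keys are all illustrative, not something OpenAI prescribes.

```python
import math
from typing import Any


def build_transcription_request(language: str, prompt: str) -> dict[str, Any]:
    """Assemble keyword arguments for a transcription call (hypothetical helper)."""
    return {
        "model": "gpt-4o-transcribe",
        "language": language,       # language hint, e.g. "en"
        "prompt": prompt,           # should be written in the audio's language
        "response_format": "json",
        "include": ["logprobs"],    # request token log probabilities
    }


def low_confidence_tokens(
    logprobs: list[dict[str, Any]], threshold: float = -1.0
) -> list[tuple[str, float]]:
    """Return (token, probability) pairs whose logprob is below the threshold.

    Assumes each entry is a dict with "token" and "logprob" keys.
    """
    return [
        (entry["token"], math.exp(entry["logprob"]))
        for entry in logprobs
        if entry["logprob"] < threshold
    ]


def transcribe(path: str, language: str, prompt: str) -> Any:
    """Send the request; requires the openai package and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the helpers run without the SDK

    client = OpenAI()
    with open(path, "rb") as audio:
        return client.audio.transcriptions.create(
            file=audio, **build_transcription_request(language, prompt)
        )
```

Flagging low-probability tokens is one way to route uncertain spans to human review. Where the threshold belongs depends on your audio and should be tuned against labeled samples.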
One caveat for anyone building on this today. OpenAI's general speech guide says gpt-4o-transcribe and gpt-4o-mini-transcribe support JSON or plain-text outputs, while the typed API reference says those models support only json and reserves text and diarized_json for gpt-4o-transcribe-diarize. The docs contradict each other, so validate the actual behavior in your target SDK and API version before you depend on an output format.
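Until the docs agree, one defensive option is to normalize whatever comes back before the rest of your pipeline sees it. The sketch below is a hypothetical helper, not part of any SDK. It assumes the response is a plain string, a JSON string with a `text` field, a dict, or an object with a `.text` attribute.

```python
import json
from typing import Any


def transcript_text(response: Any) -> str:
    """Extract the transcript from whichever shape the API call returned."""
    if isinstance(response, str):
        stripped = response.strip()
        # A JSON body may arrive as a raw string; try to unwrap it.
        if stripped.startswith("{"):
            try:
                return str(json.loads(stripped)["text"])
            except (json.JSONDecodeError, KeyError):
                pass  # not a JSON transcript; treat it as plain text
        return response
    if isinstance(response, dict):
        return str(response["text"])
    # SDK-style response object exposing the transcript as an attribute
    return str(response.text)
```

Calling `transcript_text` at the boundary keeps a change in output format from propagating into downstream code.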

Not a speech mode of GPT-4o
The simplest way to think about it: gpt-4o-transcribe is a specialized ASR SKU, and GPT-4o proper is a general-purpose multimodal reasoning and chat SKU. The current GPT-4o model page describes the default model as accepting text and image inputs and producing text outputs, with a 128,000-token context window and 16,384 max output tokens. The gpt-4o-transcribe page describes a speech-to-text model with audio and text input, text output, a 16,000-token context window, and 2,000 max output tokens. Its knowledge cutoff is listed as June 1, 2024; default GPT-4o's page lists October 1, 2023. Cutoffs matter less for ASR than for chat, but the mismatch confirms these are separately packaged models on separate release cadences. Notably, the default GPT-4o page says audio is not supported on that SKU at all. Audio I/O lives in separate GPT-4o audio and realtime variants.
The economics reinforce the split. gpt-4o-transcribe is priced at $2.50 input and $10.00 output per 1M tokens, with an estimated $0.006 per minute. gpt-4o-mini-transcribe is half that, at an estimated $0.003 per minute. gpt-4o-audio-preview, by contrast, bills audio tokens at $40 per 1M input and $80 per 1M output. If your task is straight ASR, the dedicated transcription model is dramatically cheaper than routing audio through a general audio chat model.
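To make the per-minute estimates concrete, here is a small back-of-the-envelope calculator. The rates are the estimates quoted in this post, including the $0.017 per minute listed for gpt-realtime-whisper. The function name and the monthly volume are illustrative assumptions, and real bills are token-based, so treat the output as a rough guide.

```python
# Estimated per-minute prices quoted in this post (USD).
PER_MINUTE_USD = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-realtime-whisper": 0.017,
}


def monthly_cost(model: str, hours_per_month: float) -> float:
    """Approximate monthly spend for a given volume of audio, in USD."""
    minutes = hours_per_month * 60
    return round(PER_MINUTE_USD[model] * minutes, 2)


if __name__ == "__main__":
    for name in PER_MINUTE_USD:
        print(f"{name}: ${monthly_cost(name, 1000):,.2f} for 1,000 hours")
```

At 1,000 hours a month (60,000 minutes) that works out to about $360 for gpt-4o-transcribe, $180 for the mini model, and $1,020 for gpt-realtime-whisper.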
Latency posture differs too. The gpt-4o-transcribe model page labels speed as "Medium," while OpenAI's gpt-realtime-whisper page labels speed as "Very fast" and exposes an explicit delay-versus-accuracy tradeoff with five settings (minimal, low, medium, high, xhigh) for live transcription sessions. OpenAI's own lineup separates higher-accuracy offline transcription from ultra-low-latency live transcription rather than forcing one model to do both.
Customization is where the asymmetry is sharpest. GPT-4o proper supports documented fine-tuning, structured outputs, and function calling. For gpt-4o-transcribe, the public materials discuss internal midtraining, distillation, and RL-heavy post-training, but there is no customer-accessible fine-tuning documented anywhere. You can steer it through prompting and request options, and that is the extent of the surface.
The 4o family, side by side
| Model | Primary job | I/O and modality surface | Prompt / control surface | Context and output budget | Public cost signal |
|---|---|---|---|---|---|
| gpt-4o-transcribe | Dedicated speech-to-text | Audio + text in, text out; optimized for transcription. | Language hint improves accuracy/latency; prompt can guide style/continuation; logprobs available. | 16k context, 2k max output. | ~$0.006/min estimated. |
| gpt-4o-mini-transcribe | Lower-cost dedicated speech-to-text | Same broad role, smaller/faster/cheaper. OpenAI later recommends it over gpt-4o-transcribe. | Same prompting/logprob pattern in docs. | 16k context, 2k max output. | ~$0.003/min estimated. |
| Default gpt-4o | General-purpose flagship model | Text + image in, text out on the default model page. | Broad chat-style prompting, structured outputs, tools, fine-tuning. | 128k context, 16,384 max output. | Token-priced, not minute-priced. |
| gpt-4o-audio-preview | Audio-capable chat / speech-to-speech preview | Text + audio in, text + audio out. | Chat-completions style prompting; not positioned as the dedicated ASR choice. | 128k context, 16,384 max output. | Audio tokens: $40 input / $80 output per 1M audio tokens. |
| gpt-realtime-whisper | Low-latency live transcription | Audio + text in, text out; realtime sessions. | Explicit live delay-vs-accuracy knob; manual commit or server-side turn detection depending on model support. | 16k context, 2k max output. | $0.017/minute. |
Two takeaways follow from that table. First, if you only need transcription, gpt-4o-transcribe is not a "speech mode" of GPT-4o; it is a separate product optimized, budgeted, and priced differently. Second, OpenAI's 2026 guidance that the mini model is now the better performer is an unusual signal. It suggests post-training and data quality matter more here than raw base-family branding, and that "bigger 4o" does not automatically mean "better transcripts."
The architecture, as far as anyone outside OpenAI can see it
The March 2025 launch post says the new audio models "build upon the GPT-4o and GPT-4o-mini architectures," are "extensively pretrained on specialized audio-centric datasets," use enhanced distillation to transfer knowledge from larger audio models into smaller ones, and rely on an "RL-heavy" paradigm for speech-to-text. That is enough to establish these are not Whisper with a new wrapper. It is nowhere near enough to reconstruct the deployed system. OpenAI does not publish parameter counts, layer counts, encoder/decoder design, the feature-extraction recipe, any language-model subcomponents, or a declaration of whether the system is purely end-to-end or partly cascaded.
What is public is the API-facing pipeline. For batch and file transcription, you supply a file, optionally hint the language, optionally provide a text prompt, and optionally request token logprobs. For diarization, OpenAI exposes chunking controls: when chunking is set to auto, the server first normalizes loudness and then uses voice activity detection to choose boundaries, and it also supports manual server_vad configuration plus known-speaker reference clips, with up to four speaker references at 2 to 10 seconds each. For live transcription, OpenAI documents a separate realtime flow using 24 kHz mono PCM for gpt-realtime-whisper, with explicit delay settings and optional turn detection depending on model support. Everything past that boundary, the acoustic frontend, the encoder/decoder stack, the alignment mechanism, remains unspecified.
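The known-speaker limits are easy to check on the client before uploading anything. The following is a hypothetical pre-flight validator, assuming you already know each reference clip's duration in seconds. It encodes only the two documented constraints (at most four references, 2 to 10 seconds each) and is not an OpenAI-provided function.

```python
MAX_REFERENCES = 4
MIN_SECONDS, MAX_SECONDS = 2.0, 10.0


def check_speaker_references(durations: dict[str, float]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks valid.

    `durations` maps a speaker label to that reference clip's length in seconds.
    """
    problems = []
    if len(durations) > MAX_REFERENCES:
        problems.append(
            f"{len(durations)} references supplied; at most {MAX_REFERENCES} are allowed"
        )
    for speaker, seconds in durations.items():
        if not MIN_SECONDS <= seconds <= MAX_SECONDS:
            problems.append(
                f"{speaker}: {seconds:.1f}s is outside the "
                f"{MIN_SECONDS:.0f}-{MAX_SECONDS:.0f}s range"
            )
    return problems
```

Failing fast here avoids spending a request on a configuration the server would reject.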
Whisper is the clearest reference point for how much is missing. Whisper's model card and code show a Transformer sequence-to-sequence system trained jointly across ASR, speech translation, language identification, and VAD-style tasks using special tokens. The architecture is explicit: an AudioEncoder with two convolutional layers followed by Transformer blocks over log-mel spectrogram inputs, and a TextDecoder with cross-attention over encoded audio features. Model sizes range from 39M to 1.55B parameters in the official series, and the training corpus is 680,000 hours across 98 languages. Every one of those facts is undisclosed for gpt-4o-transcribe.
To be fair to OpenAI, the competition is not much more open. Google describes Chirp 3 as an ASR-specific generative multilingual model and now exposes preview "custom prompt" formatting instructions, but does not publish topology or training-set scale in the cited docs. Microsoft says Azure Speech uses a "Universal Language Model" base model trained on Microsoft-owned data with dialect and phonetic coverage, plus custom-speech adaptation, and stops there. AWS's service card is the most explicit about the logical stack: Transcribe identifies acoustic features, generates candidate word strings, then applies language modeling to rank candidates, which reads more like a cascaded or hybrid description than an end-to-end seq2seq disclosure.

The benchmark picture
OpenAI's official benchmark story is strong on direction and weak on reproducible numbers. The launch post says gpt-4o-transcribe achieves lower WER than the original Whisper models, outperforms Whisper v2 and v3 across all language evaluations shown, and on FLEURS "matches or outperforms other leading models across most major languages." What it does not include is a numeric benchmark table. The positioning is credible as an official claim, but it is not a paper-style appendix you can check.
The December 2025 update shows where the effort actually went after launch. OpenAI says gpt-4o-mini-transcribe-2025-12-15 delivered lower WER than prior models on Common Voice and FLEURS without language hints, and in an internal "hallucination-with-noise" evaluation produced roughly 90% fewer hallucinations than Whisper v2 and about 70% fewer than previous GPT-4o-transcribe models. Read that carefully: the roadmap after launch concentrated on reliability, short-utterance behavior, and hallucination control in noisy real-world conversations, not on headline WER. Hallucination is the failure mode that actually burns production transcription users, so this is the right priority, even if the evaluation is internal.
Independent academic work confirms the model became a serious frontier ASR baseline. The AHELM benchmark paper includes GPT-4o-transcribe family models and reports that gpt-4o-transcribe itself did not show statistically significant ASR bias conditioned on speaker sex in one analysis, while gpt-4o-mini-transcribe showed a male-speaker advantage. The WhisperKit paper treats gpt-4o-transcribe as a frontier cloud baseline and finds it stronger than base GPT-4o for transcription. The Qwen3-ASR report compares directly against it as one of three leading proprietary services. The Step-Audio 2 report says the team preferred GPT-4o Transcribe over GPT-4o Audio because the former gave stronger results, and calls it one of the specialized ASR systems with leading-edge performance.
Performance is still contingent on evaluation design, though, and the degradation numbers are not subtle. A 2026 "Back to Basics" ASR paper reports environmental degradation results for GPT-4o Transcribe on FLEURS: clipping increased Chinese CER from 6.4 to 17.4, English WER from 2.8 to 8.8, Japanese CER from 3.0 to 7.9, and Korean CER from 4.0 to 10.2. The HiKE code-switching evaluation reports GPT-4o-Transcribe as the only LLM-based model in that study to outperform Whisper-Large. And a phoneme-level study of phonologically complex Ukrainian notes that the model tended to output Cyrillic and that the researchers explicitly prompted for Cyrillic transcription, a useful reminder that prompt handling changes measured quality on this model in a way it does not for classic ASR APIs.
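The clipping figures are easier to compare as ratios. This short sketch recomputes absolute and relative degradation from the numbers quoted above. The data structure and function name are illustrative, and the values are copied from this post, not re-measured.

```python
# (clean, clipped) error rates in percent, as quoted above for FLEURS.
CLIPPING_RESULTS = {
    "Chinese (CER)": (6.4, 17.4),
    "English (WER)": (2.8, 8.8),
    "Japanese (CER)": (3.0, 7.9),
    "Korean (CER)": (4.0, 10.2),
}


def degradation(clean: float, degraded: float) -> tuple[float, float]:
    """Return (absolute increase in points, multiplicative factor)."""
    return degraded - clean, degraded / clean


if __name__ == "__main__":
    for name, (clean, clipped) in CLIPPING_RESULTS.items():
        delta, factor = degradation(clean, clipped)
        print(f"{name}: +{delta:.1f} points, {factor:.2f}x")
```

By this arithmetic, clipping roughly triples English WER and multiplies the three CER figures by about 2.5 to 2.7.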
Public benchmark evidence
| Source | What it says about GPT-4o-transcribe | Why it matters |
|---|---|---|
| OpenAI launch post | Lower WER than Whisper; outperforms Whisper v2/v3 across shown language evaluations; matches or outperforms other leading models across most major languages on FLEURS. | Strong official positioning, but mostly relative rather than numeric. |
| OpenAI Dec. 2025 audio update | New mini snapshot lowers WER on Common Voice/FLEURS and cuts hallucinations ~90% vs Whisper v2 and ~70% vs previous GPT-4o-transcribe models in an internal noise-heavy eval. | Suggests OpenAI's biggest later gains were reliability and hallucination control, not just raw WER. |
| AHELM 2025 | GPT-4o-transcribe used as a major evaluated model; no statistically significant sex-conditioned ASR bias reported there for GPT-4o-transcribe, while the mini variant showed a male-speaker effect in that analysis. | Evidence that the model is already central in academic audio-language benchmarking. |
| WhisperKit 2025 | Treats gpt-4o-transcribe as the frontier transcription baseline and stronger than base GPT-4o for this task. | Reinforces that gpt-4o-transcribe should be compared to ASR systems, not just to GPT chat models. |
| HiKE 2026 | GPT-4o-Transcribe was the only LLM-based model in that benchmark to outperform Whisper-Large. | Indicates strong code-switching performance in at least one setting. |
| Back to Basics 2026 | Strong clean FLEURS performance, but clipping/far-field/reverberation materially degrade results. | Shows robustness is improved, not solved. |
Against Google, Azure, and AWS
| System | Public architecture disclosure | Multilingual support in official sources | Noise / robustness claim | Diarization / customization | Public pricing signal | Public numeric benchmark transparency |
|---|---|---|---|---|---|---|
| OpenAI gpt-4o-transcribe | Builds on GPT-4o; exact topology unspecified. | OpenAI publishes a supported-language list for 57 languages meeting its support criterion; launch post references FLEURS and broader multilingual gains. | OpenAI directly claims gains in accents, noisy environments, and varying speech speeds. | Prompting, language hints, token logprobs; diarization available on related gpt-4o-transcribe-diarize. | ~$0.006/min. | Relative official WER claims, but few official numeric tables. |
| OpenAI Whisper | Fully public seq2seq encoder-decoder and sizes. | Training data covers 98 languages. | OpenAI says Whisper improved robustness to accents, noise, and technical language. | Open-source prompting/task tokens; not realtime out of the box. | Open-source; API whisper-1 pricing not in scope here. | Stronger research transparency than products, but older quality frontier. |
| Google Chirp 3 | Officially described as a multilingual ASR-specific generative model; topology unspecified. | 85+ languages and variants on the product page. | Google says it offers enhanced accuracy and speed and can handle noisy audio without extra noise cancellation. | Diarization in supported languages; speech adaptation and preview custom prompt support. | V2 standard recognition starts at $0.016/min; dynamic batch $0.003/min. | No current official, directly comparable WER/CER table found in the cited docs. |
| Azure Speech | Uses a Universal Language Model base model; topology unspecified. | Official language-support tables cover realtime, fast, and batch locales, but the cited overview does not summarize one headline count. | Microsoft positions fast transcription as faster than real time and exposes custom speech for domain adaptation. | Diarization up to 35 speakers; phrase lists and custom speech. | Search snippets show roughly $1.20/hr realtime and ~$0.225/hr batch, with region-dependent dynamic pricing tables. | No current official, directly comparable WER/CER table found in the cited docs. |
| Amazon Transcribe | AWS service card describes acoustic features, then candidate strings, then language-model ranking. | 100+ languages and locales. | AWS says the 2023 foundation-model update improved accuracy 20 to 50% across most languages and 30 to 70% on telephony; service card says it can perform well in noisy and multi-speaker settings. | Speaker diarization, custom vocabulary, custom language models, language identification. | Tier 1 in us-east-1 is $0.024/min for the first 250k minutes; one-second billing, 15-second minimum. | AWS publishes some F1 and diarization accuracy examples in service cards, but not a simple universally comparable WER/CER table. |
The positioning that emerges: OpenAI is now a serious proprietary transcription competitor, but it is less transparent than Whisper was and not clearly more transparent than the hyperscalers. Google, Azure, and AWS each offer stronger enterprise packaging around batch, streaming, and customization. OpenAI's differentiation is modern GPT-family prompting combined with strong multilingual and noise claims and increasingly competitive error behavior, traded against weaker official numeric benchmarking.
Developer anecdotes are mixed rather than one-sided, which matches the benchmark picture. OpenAI community posts report cases where GPT-4o-transcribe handled background noise or language recognition better than Whisper, but others report transcript truncation, odd outputs, prompt leakage in related models, language-enforcement issues, and perceived slowdown or instability that later turned out to be networking-specific. None of that outweighs controlled evaluation, but it fits the pattern: LLM-style ASR is stronger in messy real audio while behaving in less familiar, sometimes more generative ways than older speech APIs.
Enterprise adoption is easier to document for OpenAI's broader voice stack than for this specific slug. OpenAI's customer stories highlight Retell AI using GPT-4o-based voice systems and Parloa using OpenAI models to run enterprise voice-driven customer service. Those are real production signals, but the public stories never specify whether transcription runs on gpt-4o-transcribe, realtime models, or some internal routing mix. If you are evaluating traction of the exact model slug rather than the platform, that ambiguity matters.

Who built it
For the March 2025 audio launch, OpenAI publicly credited four research leads: Christina Kim, Junhua Mao, Yi Shen, and Yu Zhang. It also published broad contributor lists across research, engineering, and product. Product leads for the audio launch included Anubha Srivastava, Jackie Shannon, Jeff Harris, Reah Miyara, and Xiaolin Hao. Leadership sponsors spanned multimodal, product, and engineering: Kevin Weil, Mark Chen, Nick Turley, Olivier Godement, Prafulla Dhariwal, Shengjia Zhao, and Andrew Gibiansky. That is unusually specific attribution by industry standards, and it reads like a cross-functional initiative spanning research, inference, product, and applied engineering rather than a small isolated speech team.
The ChatGPT product org is adjacent but not the same thing. Official credits from OpenAI's March 2025 4o image-generation release identify Jackie Shannon as ChatGPT Product Lead and Mengchao Zhong and Wayne Chang as ChatGPT Engineering Leads, with a broader ChatGPT applied team listed by name. LinkedIn snippets further identify Nick Turley as VP, Head of ChatGPT and Sulman Choudhry as Head of Engineering, ChatGPT. Those credits are not transcription-specific, but they mark the organizational boundary: the ChatGPT app org and the GPT-4o audio effort overlap at the leadership layer while staying distinct at the project-team level.
The personnel overlap is visible if you look. Wayne Chang appears in both the GPT-4o audio launch engineering credits and the ChatGPT engineering leadership credits; Xiaolin Hao, Wanning Jiang, Ola Okelola, and Yilei Qian appear in 2025 ChatGPT release credits and audio-launch credits. The reasonable read is that OpenAI's "omni" strategy is organizationally integrated, with shared platform, inference, and applied-model talent working across chat, voice, and other multimodal launches.
What to take away
gpt-4o-transcribe matters for two reasons. It made OpenAI a direct competitor in premium managed transcription, with pricing, prompting, and reliability behaviors that look like a modern generative system rather than an older ASR API. And it shows how OpenAI productizes the GPT-4o family: not one monolithic omni model, but specialized surfaces for chat, speech-to-speech, realtime transcription, and high-accuracy offline transcription. The strongest evidence that the specialization is real is that OpenAI later recommended gpt-4o-mini-transcribe over gpt-4o-transcribe. That recommendation would make no sense if "4o" were one undifferentiated model line.
For developers, the practical implications: if the task is transcription, the dedicated transcribe model is cheaper and more operationally appropriate than funneling audio through general GPT-4o audio chat paths. Prompting and language hints are first-class levers. But build your evaluations around clipping, far-field audio, code-switching, prompt leakage, and minute-scale truncation edge cases, because both the academic papers and the user reports show those failure modes are still live.
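An evaluation harness for those edge cases needs little more than a reference transcript and an error-rate function. Below is a minimal sketch of word and character error rate using a standard Levenshtein distance. The lowercase-and-split normalization is a simplifying assumption, and production evaluations usually apply a more careful text normalizer.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces (useful for Chinese or Japanese)."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)
```

Running the same audio through clean, clipped, and far-field variants and comparing `wer` across them surfaces robustness regressions. A truncated transcript shows up as a spike in deletions against the reference.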
For researchers, the model is important and frustrating at the same time. It is already a major proprietary baseline in current ASR literature, yet OpenAI withholds the architectural and dataset details that would make rigorous comparison possible. Whisper remains far more valuable as a research object because its architecture, corpus scale, and model sizes are public. gpt-4o-transcribe is a deployed-system benchmark, not a transparent scientific artifact.
The open items, concretely. OpenAI has not specified parameter count, encoder/decoder topology, tokenizer, training-hours figure, or whether the system is end-to-end or partly cascaded; the disclosure stops at "builds on GPT-4o," specialized audio datasets, distillation, and RL-heavy training. There is no current OpenAI paper or docs page with a full reproducible numeric WER/CER table for gpt-4o-transcribe; the official messaging is relative. Direct vendor comparisons stay imperfect because Google, Azure, and AWS publish no common, directly comparable WER/CER/latency benchmark set in the cited primary materials. And the docs are internally inconsistent on whether the model supports plain-text output or only JSON, so validate current behavior in your SDK before shipping.
Sources
- Introducing next-generation audio models in the API | OpenAI - https://openai.com/index/introducing-our-next-generation-audio-models/
- GPT-4o Transcribe Model | OpenAI API - https://developers.openai.com/api/docs/models/gpt-4o-transcribe
- Whisper model card - https://github.com/openai/whisper/blob/main/model-card.md
- ChatGPT release notes - https://help.openai.com/en/articles/6825453-chatgpt-release-notes
- Transcriptions API reference - https://developers.openai.com/api/reference/python/resources/audio/subresources/transcriptions/methods/create/
- Speech to text | OpenAI API - https://developers.openai.com/api/docs/guides/speech-to-text
- GPT-4o Model | OpenAI API - https://developers.openai.com/api/docs/models/gpt-4o
- Pricing | OpenAI API - https://developers.openai.com/api/docs/pricing
- GPT-4o mini Transcribe Model | OpenAI API - https://developers.openai.com/api/docs/models/gpt-4o-mini-transcribe
- GPT-4o Audio Model | OpenAI API - https://developers.openai.com/api/docs/models/gpt-4o-audio-preview
- GPT-Realtime-Whisper Model | OpenAI API - https://developers.openai.com/api/docs/models/gpt-realtime-whisper
- Realtime transcription | OpenAI API - https://developers.openai.com/api/docs/guides/realtime-transcription
- Changelog | OpenAI API - https://developers.openai.com/api/docs/changelog
- Whisper repository - https://github.com/openai/whisper
- Google Chirp 3 documentation - https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
- Updates for developers building with voice | OpenAI Developers - https://developers.openai.com/blog/updates-audio-models
- AHELM benchmark paper - https://arxiv.org/pdf/2508.21376
- Back to Basics ASR paper - https://arxiv.org/html/2603.25727v1
- WhisperKit paper - https://arxiv.org/html/2507.10860v1
- HiKE code-switching evaluation - https://arxiv.org/html/2509.24613v4
- Radford et al., Whisper (ICML 2023) - https://proceedings.mlr.press/v202/radford23a.html
- Google Cloud Speech-to-Text - https://cloud.google.com/speech-to-text
- Google Speech-to-Text pricing - https://cloud.google.com/speech-to-text/pricing
- Azure Speech to text - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
- Azure Speech language support - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support
- Azure Speech pricing - https://azure.microsoft.com/en-us/pricing/details/speech/
- AWS Transcribe responsible AI overview - https://docs.aws.amazon.com/ai/responsible-ai/transcribe-speech-recognition/overview.html
- Amazon Transcribe 100+ languages announcement - https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-transcribe-over-100-languages/
- Amazon Transcribe foundation model blog - https://aws.amazon.com/blogs/machine-learning/amazon-transcribe-announces-a-new-speech-foundation-model-powered-asr-system-that-expands-support-to-over-100-languages/
- Amazon Transcribe diarization docs - https://docs.aws.amazon.com/transcribe/latest/dg/diarization.html
- Amazon Transcribe pricing - https://aws.amazon.com/transcribe/pricing/
- OpenAI community thread on transcribe vs Whisper - https://community.openai.com/t/gpt-4o-mini-transcribe-and-gpt-4o-transcribe-not-as-good-as-whisper/1153905
- Retell AI customer story - https://openai.com/index/retell-ai/
- Introducing 4o image generation - https://openai.com/index/introducing-4o-image-generation/