GPT-4o Transcribe: model profile
Reference profile of OpenAI's gpt-4o-transcribe speech-to-text model: release date, pricing, API features, benchmarks, and disclosed specifications.
gpt-4o-transcribe is OpenAI's dedicated speech-to-text model in the GPT-4o family, released through the Audio API on March 20, 2025.
Specifications
| Specification | Value |
|---|---|
| Developer | OpenAI |
| Released | March 20, 2025 |
| Model type | Dedicated speech-to-text (ASR) model; builds on the GPT-4o architecture |
| Training data | Specialized audio-centric pretraining datasets, enhanced distillation from larger audio models, RL-heavy post-training for speech-to-text; exact corpus size not publicly disclosed |
| Languages | 57 languages on OpenAI's supported-language list; training-language coverage not publicly disclosed |
| Modes (batch / streaming) | Single-shot file transcription and streaming of completed recordings; realtime low-latency transcription is served by separate realtime models |
| Latency | Speed labeled "Medium" on the model page |
| Deployment | OpenAI Audio API transcription endpoint (cloud) |
| Pricing | $2.50 input and $10.00 output per 1M tokens; estimated $0.006 per minute |
| License | Proprietary API service; described in third-party literature as a proprietary ASR system |
| Context window | 16,000 tokens |
| Max output tokens | 2,000 |
| Knowledge cutoff | June 1, 2024 |
Not disclosed: Parameters · Throughput / concurrency
Overview
gpt-4o-transcribe was introduced on March 20, 2025, alongside gpt-4o-mini-transcribe and gpt-4o-mini-tts. It followed OpenAI's earlier speech and audio releases: Whisper in 2022, GPT-4o in May 2024, the Realtime API beta in October 2024, and preview GPT-4o audio models later in 2024. OpenAI positioned the model for developer-facing transcription workloads such as customer call centers, meeting notes, and voice agents, not as a general-purpose ChatGPT model slug.
As of January 13, 2026, OpenAI's changelog recommends gpt-4o-mini-transcribe over gpt-4o-transcribe for the best results.
OpenAI says the model builds on GPT-4o, uses specialized audio-centric pretraining data, improved distillation, and an "RL-heavy" training paradigm for speech-to-text. It does not publicly disclose parameter count, layer counts, tokenizer, exact corpus size, exact language coverage in training, or whether the deployed system is fully end-to-end or partly cascaded.
The current model page describes gpt-4o-transcribe as audio-plus-text input with text output, a 16,000-token context window, 2,000 max output tokens, and estimated pricing of $0.006 per minute. The default GPT-4o SKU, by contrast, is a 128,000-context text-and-image model; audio I/O lives in separate GPT-4o audio and realtime variants.
Capabilities and features
- Input: audio plus text; output: text.
- A language hint on the request improves both accuracy and latency.
- A text prompt field can guide style or continue a previous audio segment. The API reference says the prompt should match the audio language.
- The model can return token log probabilities when include=["logprobs"] is requested.
- Output formats: OpenAI's general speech guide says gpt-4o-transcribe and gpt-4o-mini-transcribe support JSON or plain-text outputs, while the typed API reference says those models support only json and reserves text and diarized_json for gpt-4o-transcribe-diarize. The two documentation sources are inconsistent on this point.
- Diarization is available on the related model gpt-4o-transcribe-diarize. With chunking set to auto, the server first normalizes loudness and then uses voice activity detection to choose boundaries. Manual server_vad configuration is supported, along with known-speaker reference clips: up to four speaker references with 2 to 10 second samples.
- Customer-accessible fine-tuning is not publicly documented for gpt-4o-transcribe. GPT-4o proper supports documented fine-tuning, structured outputs, and function calling.
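The request-level controls listed above can be sketched in Python. This is a minimal sketch, not OpenAI's reference code: `build_transcription_kwargs`, `validate_speaker_references`, and `transcribe` are hypothetical helper names, and the network call assumes the official `openai` package is installed and an API key is configured. The parameter names (`language`, `prompt`, `include`, `response_format`) follow the transcription endpoint as described in this profile; check the current API reference before relying on them.

```python
from typing import Any, Dict, Optional, Sequence

MAX_SPEAKER_REFERENCES = 4                 # "up to four speaker references"
MIN_REF_SECONDS, MAX_REF_SECONDS = 2.0, 10.0  # "2 to 10 second samples"


def build_transcription_kwargs(
    model: str = "gpt-4o-transcribe",
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    want_logprobs: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a transcription request (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"model": model, "response_format": "json"}
    if language:
        kwargs["language"] = language      # language hint, e.g. "en"
    if prompt:
        kwargs["prompt"] = prompt          # should be in the same language as the audio
    if want_logprobs:
        kwargs["include"] = ["logprobs"]   # request token log probabilities
    return kwargs


def validate_speaker_references(durations_s: Sequence[float]) -> None:
    """Check known-speaker clips against the documented diarization limits."""
    if len(durations_s) > MAX_SPEAKER_REFERENCES:
        raise ValueError("at most four speaker references are supported")
    for seconds in durations_s:
        if not MIN_REF_SECONDS <= seconds <= MAX_REF_SECONDS:
            raise ValueError(f"reference clip of {seconds}s is outside 2-10 seconds")


def transcribe(path: str, **options: Any) -> Any:
    """Send the request with the official SDK; requires network access and an API key."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()
    with open(path, "rb") as audio:
        return client.audio.transcriptions.create(
            file=audio, **build_transcription_kwargs(**options)
        )
```

A call such as `transcribe("call.wav", language="en", want_logprobs=True)` would then send an English hint and ask for log probabilities; `"call.wav"` is a placeholder file name.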
OpenAI family comparison
| Model | Primary job | I/O and modality surface | Prompt / control surface | Context and output budget | Public cost signal |
|---|---|---|---|---|---|
| gpt-4o-transcribe | Dedicated speech-to-text | Audio + text in, text out; optimized for transcription. | Language hint improves accuracy/latency; prompt can guide style/continuation; logprobs available. | 16k context, 2k max output. | ~$0.006/min estimated. |
| gpt-4o-mini-transcribe | Lower-cost dedicated speech-to-text | Same broad role, smaller/faster/cheaper. OpenAI later recommends it over gpt-4o-transcribe. | Same prompting/logprob pattern in docs. | 16k context, 2k max output. | ~$0.003/min estimated. |
| Default gpt-4o | General-purpose flagship model | Text + image in, text out on the default model page. | Chat-style prompting, structured outputs, tools, fine-tuning. | 128k context, 16,384 max output. | Token-priced, not minute-priced. |
| gpt-4o-audio-preview | Audio-capable chat / speech-to-speech preview | Text + audio in, text + audio out. | Chat-completions style prompting; not positioned as the dedicated ASR choice. | 128k context, 16,384 max output. | Audio tokens: $40 input / $80 output per 1M audio tokens. |
| gpt-realtime-whisper | Low-latency live transcription | Audio + text in, text out; realtime sessions. | Explicit live delay-vs-accuracy setting; manual commit or server-side turn detection depending on model support. | 16k context, 2k max output. | $0.017/minute. |
Language support
OpenAI publishes a supported-language list of 57 languages meeting its support criterion.
The launch post says the model outperforms Whisper v2 and v3 on the multilingual FLEURS benchmark and matches or outperforms other leading models across most major languages.
Language coverage in the training data is not publicly disclosed.
Performance and benchmarks
Vendor-reported: the March 2025 launch post says gpt-4o-transcribe achieves lower word error rate (WER) than original Whisper models, outperforms Whisper v2 and v3 across all language evaluations shown, and on FLEURS matches or outperforms other leading models across most major languages. OpenAI also claims gains in accents, noisy environments, and varying speech speeds. The post does not include a numeric benchmark table for gpt-4o-transcribe.
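For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words; CER is the same calculation over characters. The sketch below is a generic implementation and not OpenAI's evaluation code. Text normalization (casing, punctuation) is omitted, and removing spaces before computing CER is an assumption of this sketch.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; spaces are dropped first (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion against six reference words, so `wer` returns 2/6.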
Vendor-reported: in its December 2025 audio-model update, OpenAI says the gpt-4o-mini-transcribe-2025-12-15 snapshot delivered lower WER than prior models on Common Voice and FLEURS without language hints, and in an internal hallucination-with-noise evaluation produced roughly 90% fewer hallucinations than Whisper v2 and about 70% fewer than previous GPT-4o-transcribe models.
Third-party evaluations:
- The AHELM benchmark paper includes GPT-4o-transcribe family models and reports that gpt-4o-transcribe did not show statistically significant ASR bias conditioned on speaker sex in one analysis, while gpt-4o-mini-transcribe showed a male-speaker advantage.
- The WhisperKit paper describes gpt-4o-transcribe as a frontier cloud baseline and stronger than base GPT-4o for transcription.
- The Qwen3-ASR report compares directly against GPT-4o-transcribe as one of three leading proprietary services.
- The Step-Audio 2 report says it preferred GPT-4o Transcribe over GPT-4o Audio because the former gave stronger results, and calls GPT-4o Transcribe one of the specialized ASR systems with leading-edge performance.
- A 2026 "Back to Basics" ASR paper reports environmental degradation on FLEURS: clipping increased Chinese character error rate (CER) from 6.4 to 17.4, English WER from 2.8 to 8.8, Japanese CER from 3.0 to 7.9, and Korean CER from 4.0 to 10.2.
- The HiKE code-switching evaluation reports GPT-4o-Transcribe as the only LLM-based model in that study to outperform Whisper-Large.
- A phoneme-level study of Ukrainian notes the model tended to output Cyrillic and that the researchers explicitly prompted for Cyrillic transcription, indicating that prompt handling affects measured quality.
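The clipping figures quoted from the "Back to Basics" paper are easier to compare as relative factors. The short sketch below only restates the four number pairs given above; it assumes the error rates are percentages, so absolute changes are expressed in percentage points.

```python
from typing import Dict, Tuple

# (clean, clipped) error rates on FLEURS as quoted in the list above.
CLEAN_VS_CLIPPED: Dict[str, Tuple[float, float]] = {
    "Chinese (CER)": (6.4, 17.4),
    "English (WER)": (2.8, 8.8),
    "Japanese (CER)": (3.0, 7.9),
    "Korean (CER)": (4.0, 10.2),
}


def degradation(clean: float, degraded: float) -> Tuple[float, float]:
    """Return (absolute increase, multiplicative factor) of the error rate."""
    return degraded - clean, degraded / clean


for name, (clean, clipped) in CLEAN_VS_CLIPPED.items():
    points, factor = degradation(clean, clipped)
    print(f"{name}: +{points:.1f} points, {factor:.2f}x the clean error rate")
```

On these figures the clipped error rate is roughly 2.5 to 3.1 times the clean one in every language listed, with English showing the largest relative increase.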
Public benchmark evidence
| Source | What it says about GPT-4o-transcribe | Why it matters |
|---|---|---|
| OpenAI launch post | Lower WER than Whisper; outperforms Whisper v2/v3 across shown language evaluations; matches or outperforms other leading models across most major languages on FLEURS. | Official positioning, mostly relative rather than numeric. |
| OpenAI Dec. 2025 audio update | New mini snapshot lowers WER on Common Voice/FLEURS and cuts hallucinations ~90% vs Whisper v2 and ~70% vs previous GPT-4o-transcribe models in an internal noise-heavy eval. | Later gains concentrated on reliability and hallucination control, not just raw WER. |
| AHELM 2025 | GPT-4o-transcribe used as a major evaluated model; no statistically significant sex-conditioned ASR bias reported there for GPT-4o-transcribe, while the mini variant showed a male-speaker effect in that analysis. | The model is used in academic audio-language benchmarking. |
| WhisperKit 2025 | Treats gpt-4o-transcribe as the frontier transcription baseline and stronger than base GPT-4o for this task. | gpt-4o-transcribe is compared to ASR systems, not just to GPT chat models. |
| HiKE 2026 | GPT-4o-Transcribe was the only LLM-based model in that benchmark to outperform Whisper-Large. | Code-switching performance in at least one setting. |
| Back to Basics 2026 | Strong clean FLEURS performance, but clipping/far-field/reverberation materially degrade results. | Robustness is improved, not solved. |
Developer reports on the OpenAI community forum are mixed: some report cases where GPT-4o-transcribe handled background noise or language recognition better than Whisper; others report transcript truncation, odd outputs, prompt leakage in related models, language-enforcement issues, and perceived slowdown or instability that later turned out to be networking-specific.
Comparison with major competitors
| System | Public architecture disclosure | Multilingual support in official sources | Noise / robustness claim | Diarization / customization | Public pricing signal | Public numeric benchmark transparency |
|---|---|---|---|---|---|---|
| OpenAI gpt-4o-transcribe | Builds on GPT-4o; exact topology unspecified. | OpenAI publishes a supported-language list of 57 languages meeting its support criterion; launch post references FLEURS and broader multilingual gains. | OpenAI directly claims gains in accents, noisy environments, and varying speech speeds. | Prompting, language hints, token logprobs; diarization available on related gpt-4o-transcribe-diarize. | ~$0.006/min. | Relative official WER claims, but few official numeric tables. |
| OpenAI Whisper | Fully public seq2seq encoder-decoder and sizes. | Training data covers 98 languages. | OpenAI says Whisper improved robustness to accents, noise, and technical language. | Open-source prompting/task tokens; not realtime out of the box. | Open-source; whisper-1 API pricing is not covered in the cited sources. | More research transparency than the commercial products, but an older quality frontier. |
| Google Chirp 3 | Officially described as a multilingual ASR-specific generative model; topology unspecified. | 85+ languages and variants on the product page. | Google says it offers enhanced accuracy and speed and can handle noisy audio without extra noise cancellation. | Diarization in supported languages; speech adaptation and preview custom prompt support. | V2 standard recognition starts at $0.016/min; dynamic batch $0.003/min. | No current official, directly comparable WER/CER table was identified in the cited docs. |
| Azure Speech | Uses a Universal Language Model base model; topology unspecified. | Official language-support tables cover realtime, fast, and batch locales, but the cited overview does not summarize one headline count. | Microsoft positions fast transcription as faster than real time and exposes custom speech for domain adaptation. | Diarization up to 35 speakers; phrase lists and custom speech. | Search snippets show roughly $1.20/hr realtime and ~$0.225/hr batch, with region-dependent dynamic pricing tables. | No current official, directly comparable WER/CER table was identified in the cited docs. |
| Amazon Transcribe | AWS service card describes acoustic features, then candidate word strings, then language-model ranking. | 100+ languages and locales. | AWS says the 2023 foundation-model update improved accuracy 20-50% across most languages and 30-70% on telephony; service card says it can perform well in noisy and multi-speaker settings. | Speaker diarization, custom vocabulary, custom language models, language identification. | Tier 1 in us-east-1 is $0.024/min for the first 250k minutes; one-second billing, 15-second minimum. | AWS publishes some F1 and diarization accuracy examples in service cards, but not a simple universally comparable WER/CER table. |
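The pricing signals in the table use different units (per minute and per hour), so a rough normalization helps when comparing them. The sketch below converts the quoted figures to US dollars per minute and models the Amazon billing rule. Reading "one-second billing, 15-second minimum" as rounding up to the next whole second with a 15-second floor per request is an assumption of this sketch, and the figures are the approximate list prices quoted above, not a quote for any real workload.

```python
import math
from typing import Dict

# Approximate list prices from the comparison table, normalized to USD per minute.
PER_MINUTE_USD: Dict[str, float] = {
    "OpenAI gpt-4o-transcribe (estimated)": 0.006,
    "Google V2 standard": 0.016,
    "Google dynamic batch": 0.003,
    "Azure realtime (~$1.20/hr)": 1.20 / 60,
    "Azure batch (~$0.225/hr)": 0.225 / 60,
    "Amazon Transcribe Tier 1 (us-east-1)": 0.024,
}


def amazon_billable_seconds(duration_s: float) -> int:
    """Round up to whole seconds and apply the 15-second minimum (assumed rule)."""
    return max(15, math.ceil(duration_s))


def amazon_cost_usd(duration_s: float, per_minute: float = 0.024) -> float:
    """Estimated Tier 1 cost of one request under the assumed billing rule."""
    return amazon_billable_seconds(duration_s) / 60 * per_minute


def cost_per_hour(per_minute: float) -> float:
    """Convert a per-minute rate to a per-hour rate."""
    return per_minute * 60


for name, rate in sorted(PER_MINUTE_USD.items(), key=lambda item: item[1]):
    print(f"{name}: ${rate:.5f}/min, ${cost_per_hour(rate):.3f}/hr")
```

Under that billing assumption, a 4-second clip sent to Amazon Transcribe is charged as 15 seconds, which matters for workloads made up of many short utterances.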
Latency and throughput
The gpt-4o-transcribe model page labels speed as "Medium."
A language hint on the transcription request improves latency as well as accuracy.
For low-latency live transcription, OpenAI's lineup uses separate realtime models: the gpt-realtime-whisper page labels speed as "Very fast" and exposes an explicit delay/accuracy tradeoff (minimal, low, medium, high, xhigh) for live transcription sessions. Realtime transcription with gpt-realtime-whisper uses 24 kHz mono PCM, with explicit delay settings and optional turn detection depending on model support. The launch post says developers seeking low-latency speech-to-speech experiences should prefer Realtime API speech models.
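To make the realtime audio format concrete, the sketch below computes how many bytes a chunk of 24 kHz mono PCM occupies and splits a buffer into fixed-duration chunks. The 16-bit sample width is an assumption (the text above only says 24 kHz mono PCM), and the helper names are hypothetical. The delay labels are the five settings named above; how they are passed to a session is not shown here.

```python
from typing import List

SAMPLE_RATE_HZ = 24_000   # from the text: 24 kHz
CHANNELS = 1              # from the text: mono
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM samples
DELAY_SETTINGS = ("minimal", "low", "medium", "high", "xhigh")


def bytes_per_chunk(chunk_ms: int) -> int:
    """Size in bytes of a chunk holding `chunk_ms` milliseconds of audio."""
    return SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * chunk_ms // 1000


def split_pcm(pcm: bytes, chunk_ms: int = 100) -> List[bytes]:
    """Split raw PCM into fixed-duration chunks; the last chunk may be shorter."""
    size = bytes_per_chunk(chunk_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


def check_delay(setting: str) -> str:
    """Validate a delay/accuracy label against the settings named in the text."""
    if setting not in DELAY_SETTINGS:
        raise ValueError(f"unknown delay setting: {setting!r}")
    return setting


one_second_of_silence = bytes(bytes_per_chunk(1000))
print(len(split_pcm(one_second_of_silence, chunk_ms=100)), "chunks per second")
```

Under the 16-bit assumption, one second of audio is 48,000 bytes, so 100 ms chunks are 4,800 bytes each.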
Throughput and concurrency limits: Not publicly disclosed.
Deployment and integrations
gpt-4o-transcribe is served through the OpenAI Audio API. The speech-to-text guide says OpenAI historically backed transcription and translation endpoints with whisper-1, and now also supports gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize on the transcription side. The guide distinguishes three usage patterns: single-shot file transcription, streaming of completed recordings, and realtime transcription sessions for low-latency live audio.
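The second usage pattern, streaming the transcript of a completed recording, can be sketched as an event loop that accumulates text deltas. This is a hedged sketch: `collect_transcript` and `stream_file` are hypothetical helpers, and the event type strings `transcript.text.delta` and `transcript.text.done` are believed to match what the API emits but should be treated as assumptions and checked against the API reference. The network call requires the `openai` package and an API key; the demo at the bottom uses fake events so it runs offline.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def collect_transcript(events: Iterable[Any]) -> str:
    """Accumulate streamed text; prefer the final text if a done event arrives."""
    parts: List[str] = []
    for event in events:
        kind = getattr(event, "type", None)
        if kind == "transcript.text.delta":    # assumed event name
            parts.append(event.delta)
        elif kind == "transcript.text.done":   # assumed event name
            return event.text
    return "".join(parts)


def stream_file(path: str, model: str = "gpt-4o-transcribe") -> str:
    """Stream a finished recording through the transcription endpoint."""
    from openai import OpenAI  # lazy import keeps the helper above usable offline

    client = OpenAI()
    with open(path, "rb") as audio:
        stream = client.audio.transcriptions.create(
            model=model, file=audio, stream=True
        )
        return collect_transcript(stream)


# Offline demonstration with stand-in events.
fake_events = [
    SimpleNamespace(type="transcript.text.delta", delta="Hel"),
    SimpleNamespace(type="transcript.text.delta", delta="lo"),
]
print(collect_transcript(fake_events))
```

Single-shot transcription is the same call without `stream=True`, while realtime sessions use a separate connection and are not covered by this sketch.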
The model is a developer API SKU, not a ChatGPT model slug. ChatGPT release notes from February 14, 2025 still referred to ChatGPT's voice-to-text dictation feature as Whisper, a month before the API launch of the GPT-4o transcription models. OpenAI's Realtime API announcement framed its speech stack as "similar to ChatGPT's Advanced Voice Mode," indicating that consumer ChatGPT voice and the dedicated transcription API slugs are separate product surfaces.
OpenAI customer stories highlight Retell AI using GPT-4o-based voice systems and Parloa using OpenAI models for enterprise voice-driven customer service. The public stories do not specify whether transcription in those deployments is powered by gpt-4o-transcribe, realtime models, or another routing mix.
Pricing
| Model | Pricing |
|---|---|
| gpt-4o-transcribe | $2.50 input and $10.00 output per 1M tokens; estimated $0.006 per minute |
| gpt-4o-mini-transcribe | Estimated $0.003 per minute |
| gpt-4o-audio-preview | $40 per 1M input audio tokens; $80 per 1M output audio tokens |
| gpt-realtime-whisper | $0.017 per minute |
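The two pricing views in the table (per minute and per token) can be turned into simple estimators. The sketch below uses only the figures listed above; the function names are hypothetical, and actual bills depend on how OpenAI counts audio and text tokens, which this sketch does not model.

```python
from typing import Dict

# Per-minute figures from the pricing table (estimates where the table says so).
PER_MINUTE_USD: Dict[str, float] = {
    "gpt-4o-transcribe": 0.006,
    "gpt-4o-mini-transcribe": 0.003,
    "gpt-realtime-whisper": 0.017,
}

# Token prices listed for gpt-4o-transcribe, in USD per 1M tokens.
INPUT_USD_PER_1M = 2.50
OUTPUT_USD_PER_1M = 10.00


def minutes_cost(model: str, minutes: float) -> float:
    """Estimated cost of transcribing `minutes` of audio with `model`."""
    return PER_MINUTE_USD[model] * minutes


def token_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a gpt-4o-transcribe request given its token counts."""
    return (input_tokens / 1_000_000 * INPUT_USD_PER_1M
            + output_tokens / 1_000_000 * OUTPUT_USD_PER_1M)


for model in PER_MINUTE_USD:
    print(f"{model}: about ${minutes_cost(model, 1000):.2f} per 1,000 minutes")
```

At these rates, 1,000 minutes of audio comes to about $6 on gpt-4o-transcribe and about $3 on gpt-4o-mini-transcribe.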
Development and ownership
gpt-4o-transcribe is developed and operated by OpenAI. For the March 2025 audio-model launch, OpenAI publicly credited four research leads: Christina Kim, Junhua Mao, Yi Shen, and Yu Zhang. Named product leads included Anubha Srivastava, Jackie Shannon, Jeff Harris, Reah Miyara, and Xiaolin Hao. Leadership sponsors included Kevin Weil, Mark Chen, Nick Turley, Olivier Godement, Prafulla Dhariwal, Shengjia Zhao, and Andrew Gibiansky. OpenAI also published broad contributor lists across research, engineering, and product.
Credits from OpenAI's March 2025 4o image-generation release identify Jackie Shannon as ChatGPT Product Lead and Mengchao Zhong and Wayne Chang as ChatGPT Engineering Leads. LinkedIn snippets identify Nick Turley as VP, Head of ChatGPT and Sulman Choudhry as Head of Engineering, ChatGPT. These credits are not transcription-specific.
Several names appear in both the audio launch credits and ChatGPT release credits, including Wayne Chang, Xiaolin Hao, Wanning Jiang, Ola Okelola, and Yilei Qian.
Release history
| Date | Event |
|---|---|
| September 2022 | Whisper released as an open-source ASR system trained on 680,000 hours of multilingual, multitask supervision from the web |
| May 13, 2024 | GPT-4o enters the API as the flagship "omni" family anchor |
| October 1, 2024 | Realtime API beta ships |
| October 17, 2024 | gpt-4o-audio-preview ships, exposing GPT-4o-family audio I/O for chat completions |
| February 14, 2025 | ChatGPT release notes still refer to the voice-to-text dictation feature as Whisper |
| March 20, 2025 | gpt-4o-transcribe and gpt-4o-mini-transcribe launch in the Audio API, alongside gpt-4o-mini-tts |
| December 15, 2025 | Dated snapshot gpt-4o-mini-transcribe-2025-12-15 released |
| January 13, 2026 | OpenAI updates the moving slug and recommends gpt-4o-mini-transcribe over gpt-4o-transcribe |
Sources
- Introducing next-generation audio models in the API | OpenAI. https://openai.com/index/introducing-our-next-generation-audio-models/
- GPT-4o Transcribe Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o-transcribe
- Whisper model card. https://github.com/openai/whisper/blob/main/model-card.md
- ChatGPT release notes. https://help.openai.com/en/articles/6825453-chatgpt-release-notes
- Audio transcriptions API reference. https://developers.openai.com/api/reference/python/resources/audio/subresources/transcriptions/methods/create/
- Speech to text | OpenAI API. https://developers.openai.com/api/docs/guides/speech-to-text
- GPT-4o Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o
- Pricing | OpenAI API. https://developers.openai.com/api/docs/pricing
- GPT-4o mini Transcribe Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o-mini-transcribe
- GPT-4o Audio Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o-audio-preview
- GPT-Realtime-Whisper Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-realtime-whisper
- Realtime transcription | OpenAI API. https://developers.openai.com/api/docs/guides/realtime-transcription
- Changelog | OpenAI API. https://developers.openai.com/api/docs/changelog
- Whisper repository. https://github.com/openai/whisper
- Chirp 3 documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
- Updates for developers building with voice | OpenAI Developers. https://developers.openai.com/blog/updates-audio-models
- AHELM benchmark paper. https://arxiv.org/pdf/2508.21376
- Back to Basics ASR paper. https://arxiv.org/html/2603.25727v1
- WhisperKit paper. https://arxiv.org/html/2507.10860v1
- HiKE code-switching evaluation. https://arxiv.org/html/2509.24613v4
- Robust speech recognition via large-scale weak supervision. https://proceedings.mlr.press/v202/radford23a.html
- Google Cloud Speech-to-Text. https://cloud.google.com/speech-to-text
- Google Cloud Speech-to-Text pricing. https://cloud.google.com/speech-to-text/pricing
- Azure Speech to text. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
- Azure Speech language support. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support
- Azure Speech pricing. https://azure.microsoft.com/en-us/pricing/details/speech/
- Amazon Transcribe service card. https://docs.aws.amazon.com/ai/responsible-ai/transcribe-speech-recognition/overview.html
- Amazon Transcribe language expansion announcement. https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-transcribe-over-100-languages/
- Amazon Transcribe foundation model blog post. https://aws.amazon.com/blogs/machine-learning/amazon-transcribe-announces-a-new-speech-foundation-model-powered-asr-system-that-expands-support-to-over-100-languages/
- Amazon Transcribe diarization documentation. https://docs.aws.amazon.com/transcribe/latest/dg/diarization.html
- Amazon Transcribe pricing. https://aws.amazon.com/transcribe/pricing/
- OpenAI community forum thread. https://community.openai.com/t/gpt-4o-mini-transcribe-and-gpt-4o-transcribe-not-as-good-as-whisper/1153905
- Retell AI customer story. https://openai.com/index/retell-ai/
- Introducing 4o image generation. https://openai.com/index/introducing-4o-image-generation/