GPT-4o Transcribe: model profile

gpt-4o-transcribe is OpenAI's dedicated speech-to-text model in the GPT-4o family, released through the Audio API on March 20, 2025.

Specifications

Developer	OpenAI
Released	March 20, 2025
Model type	Dedicated speech-to-text (ASR) model; builds on the GPT-4o architecture
Training data	Specialized audio-centric pretraining datasets, enhanced distillation from larger audio models, RL-heavy post-training for speech-to-text; exact corpus size not publicly disclosed
Languages	57 languages on OpenAI's supported-language list; training-language coverage not publicly disclosed
Modes (batch / streaming)	Single-shot file transcription and streaming of completed recordings; realtime low-latency transcription is served by separate realtime models
Latency	Speed labeled "Medium" on the model page
Deployment	OpenAI Audio API transcription endpoint (cloud)
Pricing	$2.50 input and $10.00 output per 1M tokens; estimated $0.006 per minute
License	Proprietary API service; described in third-party literature as a proprietary ASR system
Context window	16,000 tokens
Max output tokens	2,000
Knowledge cutoff	June 1, 2024

Not disclosed: Parameters · Throughput / concurrency

Known limitations

OpenAI has not publicly specified the parameter count, encoder/decoder topology, tokenizer, exact training-hours figure, or whether gpt-4o-transcribe is end-to-end or partly cascaded. The official disclosure stops at "builds on GPT-4o," specialized audio datasets, distillation, and RL-heavy training.
OpenAI's official benchmark messaging is mostly relative rather than numeric. The source review did not identify a current OpenAI paper or docs page with a full reproducible numeric WER/CER table for gpt-4o-transcribe.
OpenAI's docs are internally inconsistent on whether gpt-4o-transcribe supports plain-text output or only JSON.
Third-party evaluation shows performance is context-sensitive: clipping, far-field audio, and reverberation materially degrade FLEURS results (for example, clipping raised English WER from 2.8 to 8.8).
Developer reports include transcript truncation, language-enforcement issues, prompt leakage in related models, and odd outputs; some reported slowdowns were later attributed to networking.
Direct vendor comparisons remain imperfect because Google, Azure, and AWS do not publish a common, directly comparable WER/CER/latency benchmark set in the cited primary materials.
Customer-accessible fine-tuning is not publicly documented for gpt-4o-transcribe.
As of January 13, 2026, OpenAI recommends gpt-4o-mini-transcribe over gpt-4o-transcribe for the best results.
Public customer stories do not confirm whether specific enterprise deployments use the gpt-4o-transcribe slug rather than other models in the family.

Full technical breakdown

Overview

gpt-4o-transcribe was introduced on March 20, 2025, alongside gpt-4o-mini-transcribe and gpt-4o-mini-tts. It followed OpenAI's earlier speech and audio releases: Whisper in 2022, GPT-4o in May 2024, the Realtime API beta in October 2024, and preview GPT-4o audio models later in 2024. OpenAI positioned the model for developer-facing transcription workloads such as customer call centers, meeting notes, and voice agents, not as a general-purpose ChatGPT model slug.

As of January 13, 2026, OpenAI's changelog states that it currently recommends gpt-4o-mini-transcribe over gpt-4o-transcribe for the best results.

OpenAI says the model builds on GPT-4o, uses specialized audio-centric pretraining data, improved distillation, and an "RL-heavy" training paradigm for speech-to-text. It does not publicly disclose parameter count, layer counts, tokenizer, exact corpus size, exact language coverage in training, or whether the deployed system is fully end-to-end or partly cascaded.

The current model page describes gpt-4o-transcribe as taking audio plus text as input and producing text as output, with a 16,000-token context window, 2,000 max output tokens, and estimated pricing of $0.006 per minute. The default GPT-4o SKU, by contrast, is a text-and-image model with a 128,000-token context window; audio I/O lives in separate GPT-4o audio and realtime variants.

Capabilities and features

Input: audio plus text; output: text.
A language hint on the request improves both accuracy and latency.
A text prompt field can guide style or continue a previous audio segment. The API reference says the prompt should match the audio language.
The model can return token log probabilities when include=["logprobs"] is requested.
Output formats: OpenAI's general speech guide says gpt-4o-transcribe and gpt-4o-mini-transcribe support JSON or plain-text outputs, while the typed API reference says those models support only json and reserves text and diarized_json for gpt-4o-transcribe-diarize. The two documentation sources are inconsistent on this point.
Diarization is available on the related model gpt-4o-transcribe-diarize. With chunking set to auto, the server first normalizes loudness and then uses voice activity detection to choose boundaries. Manual server_vad configuration is supported, along with known-speaker reference clips: up to four speaker references with 2 to 10 second samples.
Customer-accessible fine-tuning is not publicly documented for gpt-4o-transcribe. GPT-4o proper supports documented fine-tuning, structured outputs, and function calling.
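As a minimal sketch of how the request-side controls listed above might be combined, the helpers below assemble keyword arguments for the OpenAI Python SDK's client.audio.transcriptions.create call and summarize returned token log probabilities as a single confidence figure. The helper names (build_transcription_kwargs, mean_token_confidence, transcribe) are illustrative assumptions, not part of OpenAI's documentation. The sketch pins response_format to json because that is the one format both documentation sources agree on.

```python
import math
from typing import Any, Dict, Iterable, Optional


def build_transcription_kwargs(
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    want_logprobs: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a gpt-4o-transcribe request (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "model": "gpt-4o-transcribe",
        # json is the only output format both OpenAI docs agree on for this model.
        "response_format": "json",
    }
    if language:
        kwargs["language"] = language  # language hint, for example "en"
    if prompt:
        kwargs["prompt"] = prompt  # should be written in the same language as the audio
    if want_logprobs:
        kwargs["include"] = ["logprobs"]
    return kwargs


def mean_token_confidence(logprobs: Iterable[float]) -> float:
    """Geometric-mean token probability: exp of the average log probability."""
    values = list(logprobs)
    if not values:
        raise ValueError("no log probabilities supplied")
    return math.exp(sum(values) / len(values))


def transcribe(path: str, **options: Any) -> str:
    """Send one file to the Audio API; needs the openai package and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            file=audio_file, **build_transcription_kwargs(**options)
        )
    return result.text
```

A call such as transcribe("meeting.wav", language="en", prompt="Quarterly planning call.") would exercise the language hint and the prompt field together; the file name and prompt text here are placeholders.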

OpenAI family comparison

Model	Primary job	I/O and modality surface	Prompt / control surface	Context and output budget	Public cost signal
gpt-4o-transcribe	Dedicated speech-to-text	Audio + text in, text out; optimized for transcription.	Language hint improves accuracy/latency; prompt can guide style/continuation; logprobs available.	16k context, 2k max output.	~$0.006/min estimated.
gpt-4o-mini-transcribe	Lower-cost dedicated speech-to-text	Same broad role, smaller/faster/cheaper. OpenAI later recommends it over gpt-4o-transcribe.	Same prompting/logprob pattern in docs.	16k context, 2k max output.	~$0.003/min estimated.
Default gpt-4o	General-purpose flagship model	Text + image in, text out on the default model page.	Chat-style prompting, structured outputs, tools, fine-tuning.	128k context, 16,384 max output.	Token-priced, not minute-priced.
gpt-4o-audio-preview	Audio-capable chat / speech-to-speech preview	Text + audio in, text + audio out.	Chat-completions style prompting; not positioned as the dedicated ASR choice.	128k context, 16,384 max output.	Audio tokens: $40 input / $80 output per 1M audio tokens.
gpt-realtime-whisper	Low-latency live transcription	Audio + text in, text out; realtime sessions.	Explicit live delay-vs-accuracy setting; manual commit or server-side turn detection depending on model support.	16k context, 2k max output.	$0.017/minute.

Language support

OpenAI publishes a supported-language list of 57 languages meeting its support criterion.

The launch post says the model outperforms Whisper v2 and v3 on the multilingual FLEURS benchmark and matches or outperforms other leading models across most major languages.

Language coverage in the training data is not publicly disclosed.

Performance and benchmarks

Vendor-reported: the March 2025 launch post says gpt-4o-transcribe achieves lower word error rate (WER) than original Whisper models, outperforms Whisper v2 and v3 across all language evaluations shown, and on FLEURS matches or outperforms other leading models across most major languages. OpenAI also claims gains in accents, noisy environments, and varying speech speeds. The post does not include a numeric benchmark table for gpt-4o-transcribe.
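For readers less familiar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the model's hypothesis (substitutions plus deletions plus insertions) divided by the number of reference words; CER is the same ratio over characters. The sketch below shows that standard definition only. It is not OpenAI's evaluation code, the function names are illustrative, and real benchmarks apply text normalization (casing, punctuation, number formatting) that is omitted here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference item missing from hypothesis
                    curr[j - 1] + 1,     # insertion: extra item in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate on whitespace-split tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, as used for Chinese, Japanese, and Korean results."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

With the reference "the cat sat on the mat" and the hypothesis "the cat sat on mat", one word is deleted out of six, so wer returns 1/6, or about 16.7%.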

Vendor-reported: in its December 2025 audio-model update, OpenAI says the gpt-4o-mini-transcribe-2025-12-15 snapshot delivered lower WER than prior models on Common Voice and FLEURS without language hints, and in an internal hallucination-with-noise evaluation produced roughly 90% fewer hallucinations than Whisper v2 and about 70% fewer than previous GPT-4o-transcribe models.

Third-party evaluations:

The AHELM benchmark paper includes GPT-4o-transcribe family models and reports that gpt-4o-transcribe did not show statistically significant ASR bias conditioned on speaker sex in one analysis, while gpt-4o-mini-transcribe showed a male-speaker advantage.
The WhisperKit paper describes gpt-4o-transcribe as a frontier cloud baseline and stronger than base GPT-4o for transcription.
The Qwen3-ASR report compares directly against GPT-4o-transcribe as one of three leading proprietary services.
The Step-Audio 2 report says it preferred GPT-4o Transcribe over GPT-4o Audio because the former gave stronger results, and calls GPT-4o Transcribe one of the specialized ASR systems with leading-edge performance.
A 2026 "Back to Basics" ASR paper reports environmental degradation on FLEURS: clipping increased Chinese character error rate (CER) from 6.4 to 17.4, English WER from 2.8 to 8.8, Japanese CER from 3.0 to 7.9, and Korean CER from 4.0 to 10.2.
The HiKE code-switching evaluation reports GPT-4o-Transcribe as the only LLM-based model in that study to outperform Whisper-Large.
A phoneme-level study of Ukrainian notes the model tended to output Cyrillic and that the researchers explicitly prompted for Cyrillic transcription, indicating that prompt handling affects measured quality.
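The clipping figures quoted above from the "Back to Basics" paper can be restated as absolute and multiplicative changes. The short calculation below is only arithmetic on those four published pairs; the helper name degradation is illustrative.

```python
from typing import Tuple

# (clean, clipped) error rates in percent, as quoted above from the "Back to Basics" paper.
CLIPPING_RESULTS = {
    "Chinese CER": (6.4, 17.4),
    "English WER": (2.8, 8.8),
    "Japanese CER": (3.0, 7.9),
    "Korean CER": (4.0, 10.2),
}


def degradation(clean: float, degraded: float) -> Tuple[float, float]:
    """Return (absolute increase in percentage points, multiplicative factor)."""
    return round(degraded - clean, 1), round(degraded / clean, 2)


for name, (clean, clipped) in CLIPPING_RESULTS.items():
    points, factor = degradation(clean, clipped)
    print(f"{name}: +{points} points, {factor}x the clean error rate")
```

On these numbers, clipping roughly triples English WER (about 3.14x), while the Chinese, Japanese, and Korean character error rates rise by factors of about 2.72, 2.63, and 2.55 respectively.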

Public benchmark evidence

Source	What it says about GPT-4o-transcribe	Why it matters
OpenAI launch post	Lower WER than Whisper; outperforms Whisper v2/v3 across shown language evaluations; matches or outperforms other leading models across most major languages on FLEURS.	Official positioning, mostly relative rather than numeric.
OpenAI Dec. 2025 audio update	New mini snapshot lowers WER on Common Voice/FLEURS and cuts hallucinations ~90% vs Whisper v2 and ~70% vs previous GPT-4o-transcribe models in an internal noise-heavy eval.	Later gains concentrated on reliability and hallucination control, not just raw WER.
AHELM 2025	GPT-4o-transcribe used as a major evaluated model; no statistically significant sex-conditioned ASR bias reported there for GPT-4o-transcribe, while the mini variant showed a male-speaker effect in that analysis.	The model is used in academic audio-language benchmarking.
WhisperKit 2025	Treats gpt-4o-transcribe as the frontier transcription baseline and stronger than base GPT-4o for this task.	gpt-4o-transcribe is compared to ASR systems, not just to GPT chat models.
HiKE 2026	GPT-4o-Transcribe was the only LLM-based model in that benchmark to outperform Whisper-Large.	Code-switching performance in at least one setting.
Back to Basics 2026	Strong clean FLEURS performance, but clipping/far-field/reverberation materially degrade results.	Robustness is improved, not solved.

Developer reports on the OpenAI community forum are mixed: some report cases where GPT-4o-transcribe handled background noise or language recognition better than Whisper; others report transcript truncation, odd outputs, prompt leakage in related models, language-enforcement issues, and perceived slowdown or instability that later turned out to be networking-specific.

Comparison with major competitors

System	Public architecture disclosure	Multilingual support in official sources	Noise / robustness claim	Diarization / customization	Public pricing signal	Public numeric benchmark transparency
OpenAI gpt-4o-transcribe	Builds on GPT-4o; exact topology unspecified.	OpenAI publishes a supported-language list of 57 languages meeting its support criterion; launch post references FLEURS and broader multilingual gains.	OpenAI directly claims gains in accents, noisy environments, and varying speech speeds.	Prompting, language hints, token logprobs; diarization available on related gpt-4o-transcribe-diarize.	~$0.006/min.	Relative official WER claims, but few official numeric tables.
OpenAI Whisper	Fully public seq2seq encoder-decoder and sizes.	Training data covers 98 languages.	OpenAI says Whisper improved robustness to accents, noise, and technical language.	Open-source prompting/task tokens; not realtime out of the box.	Open-source; API whisper-1 pricing not in scope in the source.	Stronger research transparency than products, but older quality frontier.
Google Chirp 3	Officially described as a multilingual ASR-specific generative model; topology unspecified.	85+ languages and variants on the product page.	Google says it offers enhanced accuracy and speed and can handle noisy audio without extra noise cancellation.	Diarization in supported languages; speech adaptation and preview custom prompt support.	V2 standard recognition starts at $0.016/min; dynamic batch $0.003/min.	The source did not identify a current official, directly comparable WER/CER table in the cited docs.
Azure Speech	Uses a Universal Language Model base model; topology unspecified.	Official language-support tables cover realtime, fast, and batch locales, but the cited overview does not summarize one headline count.	Microsoft positions fast transcription as faster than real time and exposes custom speech for domain adaptation.	Diarization up to 35 speakers; phrase lists and custom speech.	Search snippets show roughly $1.20/hr realtime and ~$0.225/hr batch, with region-dependent dynamic pricing tables.	The source did not identify a current official, directly comparable WER/CER table in the cited docs.
Amazon Transcribe	AWS service card describes acoustic features, then candidate word strings, then language-model ranking.	100+ languages and locales.	AWS says the 2023 foundation-model update improved accuracy 20-50% across most languages and 30-70% on telephony; service card says it can perform well in noisy and multi-speaker settings.	Speaker diarization, custom vocabulary, custom language models, language identification.	Tier 1 in us-east-1 is $0.024/min for the first 250k minutes; one-second billing, 15-second minimum.	AWS publishes some F1 and diarization accuracy examples in service cards, but not a simple universally comparable WER/CER table.
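Because the pricing signals in the table use different units, a rough normalization to USD per minute can make them easier to compare. The sketch below uses only the list prices quoted in the table and ignores volume tiers, regions, and free allowances. It also assumes that Amazon's one-second billing rounds partial seconds up, which the table does not state. Dictionary keys and function names are illustrative.

```python
import math
from decimal import Decimal

# List-price signals quoted in the table above, normalized to USD per minute.
PER_MINUTE = {
    "OpenAI gpt-4o-transcribe (estimated)": Decimal("0.006"),
    "Google V2 standard recognition": Decimal("0.016"),
    "Google dynamic batch": Decimal("0.003"),
    "Azure realtime (~$1.20/hr)": Decimal("1.20") / 60,
    "Azure batch (~$0.225/hr)": Decimal("0.225") / 60,
    "Amazon Transcribe tier 1": Decimal("0.024"),
}


def amazon_billed_seconds(duration_seconds: float) -> int:
    """One-second billing increments with a 15-second minimum per request.

    Assumption: partial seconds are rounded up.
    """
    return max(15, math.ceil(duration_seconds))


def amazon_cost(duration_seconds: float) -> Decimal:
    """Tier 1 list-price cost in USD for a single request of the given length."""
    rate = PER_MINUTE["Amazon Transcribe tier 1"]
    return rate * amazon_billed_seconds(duration_seconds) / 60


for name, rate in sorted(PER_MINUTE.items(), key=lambda item: item[1]):
    print(f"{name}: ${rate} per minute")
```

The 15-second minimum matters mainly for very short clips: under these assumptions a 10-second request is billed as 15 seconds.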

Latency and throughput

The gpt-4o-transcribe model page labels speed as "Medium."

A language hint on the transcription request improves latency as well as accuracy.

For low-latency live transcription, OpenAI's lineup uses separate realtime models: the gpt-realtime-whisper page labels speed as "Very fast" and exposes an explicit delay/accuracy tradeoff (minimal, low, medium, high, xhigh) for live transcription sessions. Realtime transcription with gpt-realtime-whisper uses 24 kHz mono PCM, with explicit delay settings and optional turn detection depending on model support. The launch post says developers seeking low-latency speech-to-speech experiences should prefer Realtime API speech models.
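To give a sense of what the 24 kHz mono PCM requirement implies for a client, the sketch below slices raw audio into fixed-duration chunks and base64-encodes each one for transport. It assumes 16-bit samples, which the text above does not state, and it does not reproduce any particular realtime event schema; the function names are illustrative.

```python
import base64
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # from the text above
CHANNELS = 1             # mono
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM; the bit depth is not stated above


def bytes_for(duration_ms: int) -> int:
    """Number of raw PCM bytes covering the given duration."""
    return SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * duration_ms // 1000


def iter_base64_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield base64 text for consecutive fixed-duration slices of raw PCM audio."""
    step = bytes_for(chunk_ms)
    if step <= 0:
        raise ValueError("chunk_ms must be positive")
    for start in range(0, len(pcm), step):
        yield base64.b64encode(pcm[start:start + step]).decode("ascii")
```

Under the 16-bit assumption, one second of audio is 48,000 bytes, so a 100 ms chunk carries 4,800 bytes before base64 encoding.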

Throughput and concurrency limits: Not publicly disclosed.

Deployment and integrations

gpt-4o-transcribe is served through the OpenAI Audio API. The speech-to-text guide says OpenAI historically backed transcription and translation endpoints with whisper-1, and now also supports gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize on the transcription side. The guide distinguishes three usage patterns: single-shot file transcription, streaming of completed recordings, and realtime transcription sessions for low-latency live audio.
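The second usage pattern, streaming a completed recording, can be sketched as follows. The sketch assumes that passing stream=True to the SDK's transcription call yields events whose type is "transcript.text.delta" (with a delta attribute) or "transcript.text.done" (with a text attribute); these names should be checked against the current API reference. The function names and the fake events used for the offline demonstration are illustrative.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def collect_transcript(events: Iterable[Any]) -> str:
    """Accumulate streamed deltas; prefer the final text if a done event arrives."""
    parts: List[str] = []
    for event in events:
        if event.type == "transcript.text.delta":
            parts.append(event.delta)
        elif event.type == "transcript.text.done":
            return event.text  # final text supersedes the accumulated deltas
    return "".join(parts)


def stream_file(path: str) -> str:
    """Stream one completed recording; needs the openai package and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so collect_transcript works offline

    client = OpenAI()
    with open(path, "rb") as audio_file:
        events = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio_file,
            response_format="json",
            stream=True,
        )
        return collect_transcript(events)


# Offline demonstration with stand-in events (not real API output).
fake_events = [
    SimpleNamespace(type="transcript.text.delta", delta="Hello "),
    SimpleNamespace(type="transcript.text.delta", delta="world"),
    SimpleNamespace(type="transcript.text.done", text="Hello world"),
]
print(collect_transcript(fake_events))
```

Keeping the accumulation logic separate from the network call makes it possible to test the event handling without an API key.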

The model is a developer API SKU, not a ChatGPT model slug. ChatGPT release notes from February 14, 2025 still referred to ChatGPT's voice-to-text dictation feature as Whisper, a month before the API launch of the GPT-4o transcription models. OpenAI's Realtime API announcement framed its speech stack as "similar to ChatGPT's Advanced Voice Mode," indicating that consumer ChatGPT voice and the dedicated transcription API slugs are separate product surfaces.

OpenAI customer stories highlight Retell AI using GPT-4o-based voice systems and Parloa using OpenAI models for enterprise voice-driven customer service. The public stories do not specify whether transcription in those deployments is powered by gpt-4o-transcribe, realtime models, or another routing mix.

Pricing

Model	Pricing
gpt-4o-transcribe	$2.50 input and $10.00 output per 1M tokens; estimated $0.006 per minute
gpt-4o-mini-transcribe	Estimated $0.003 per minute
gpt-4o-audio-preview	$40 per 1M input audio tokens; $80 per 1M output audio tokens
gpt-realtime-whisper	$0.017 per minute
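As a rough budgeting aid, the per-minute estimates and token prices in the table can be turned into simple cost functions. The sketch below uses only the figures listed above; it does not model how audio duration maps to tokens, which the listed sources do not specify, and the function names are illustrative.

```python
from decimal import Decimal

# Per-minute figures from the pricing table above (USD).
EST_PER_MINUTE_USD = {
    "gpt-4o-transcribe": Decimal("0.006"),
    "gpt-4o-mini-transcribe": Decimal("0.003"),
    "gpt-realtime-whisper": Decimal("0.017"),
}

# gpt-4o-transcribe token prices from the table above (USD per 1M tokens).
INPUT_PER_MILLION = Decimal("2.50")
OUTPUT_PER_MILLION = Decimal("10.00")


def estimate_cost(model: str, audio_minutes: float) -> Decimal:
    """Estimated cost in USD from the per-minute figure for the given model."""
    return EST_PER_MINUTE_USD[model] * Decimal(str(audio_minutes))


def token_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """gpt-4o-transcribe cost in USD computed from token counts."""
    return (
        Decimal(input_tokens) * INPUT_PER_MILLION
        + Decimal(output_tokens) * OUTPUT_PER_MILLION
    ) / Decimal(1_000_000)


for model in EST_PER_MINUTE_USD:
    print(f"{model}: ${estimate_cost(model, 60_000)} per 1,000 hours of audio")
```

At these rates, 1,000 hours of audio (60,000 minutes) comes to about $360 on gpt-4o-transcribe, $180 on gpt-4o-mini-transcribe, and $1,020 on gpt-realtime-whisper. A response that uses the full 2,000-token output budget costs $0.02 in output tokens.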

Development and ownership

gpt-4o-transcribe is developed and operated by OpenAI. For the March 2025 audio-model launch, OpenAI publicly credited four research leads: Christina Kim, Junhua Mao, Yi Shen, and Yu Zhang. Named product leads included Anubha Srivastava, Jackie Shannon, Jeff Harris, Reah Miyara, and Xiaolin Hao. Leadership sponsors included Kevin Weil, Mark Chen, Nick Turley, Olivier Godement, Prafulla Dhariwal, Shengjia Zhao, and Andrew Gibiansky. OpenAI also published broad contributor lists across research, engineering, and product.

Credits from OpenAI's March 2025 4o image-generation release identify Jackie Shannon as ChatGPT Product Lead and Mengchao Zhong and Wayne Chang as ChatGPT Engineering Leads. LinkedIn snippets identify Nick Turley as VP, Head of ChatGPT and Sulman Choudhry as Head of Engineering, ChatGPT. These credits are not transcription-specific.

Several names appear in both the audio launch credits and ChatGPT release credits, including Wayne Chang, Xiaolin Hao, Wanning Jiang, Ola Okelola, and Yilei Qian.

Release history

Date	Event
September 2022	Whisper released as an open-source ASR system trained on 680,000 hours of multilingual, multitask supervision from the web
May 13, 2024	GPT-4o enters the API as the flagship "omni" family anchor
October 1, 2024	Realtime API beta ships
October 17, 2024	gpt-4o-audio-preview ships, exposing GPT-4o-family audio I/O for chat completions
February 14, 2025	ChatGPT release notes still refer to the voice-to-text dictation feature as Whisper
March 20, 2025	gpt-4o-transcribe and gpt-4o-mini-transcribe launch in the Audio API, alongside gpt-4o-mini-tts
December 15, 2025	Dated snapshot gpt-4o-mini-transcribe-2025-12-15 released
January 13, 2026	OpenAI updates the moving slug and recommends gpt-4o-mini-transcribe over gpt-4o-transcribe

Sources

Introducing next-generation audio models in the API | OpenAI. https://openai.com/index/introducing-our-next-generation-audio-models/
GPT-4o Transcribe Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o-transcribe
Whisper model card. https://github.com/openai/whisper/blob/main/model-card.md
ChatGPT release notes. https://help.openai.com/en/articles/6825453-chatgpt-release-notes
Audio transcriptions API reference. https://developers.openai.com/api/reference/python/resources/audio/subresources/transcriptions/methods/create/
Speech to text | OpenAI API. https://developers.openai.com/api/docs/guides/speech-to-text
GPT-4o Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o
Pricing | OpenAI API. https://developers.openai.com/api/docs/pricing
GPT-4o mini Transcribe Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o-mini-transcribe
GPT-4o Audio Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-4o-audio-preview
GPT-Realtime-Whisper Model | OpenAI API. https://developers.openai.com/api/docs/models/gpt-realtime-whisper
Realtime transcription | OpenAI API. https://developers.openai.com/api/docs/guides/realtime-transcription
Changelog | OpenAI API. https://developers.openai.com/api/docs/changelog
Whisper repository. https://github.com/openai/whisper
Chirp 3 documentation. https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
Updates for developers building with voice | OpenAI Developers. https://developers.openai.com/blog/updates-audio-models
AHELM benchmark paper. https://arxiv.org/pdf/2508.21376
Back to Basics ASR paper. https://arxiv.org/html/2603.25727v1
WhisperKit paper. https://arxiv.org/html/2507.10860v1
HiKE code-switching evaluation. https://arxiv.org/html/2509.24613v4
Robust speech recognition via large-scale weak supervision. https://proceedings.mlr.press/v202/radford23a.html
Google Cloud Speech-to-Text. https://cloud.google.com/speech-to-text
Google Cloud Speech-to-Text pricing. https://cloud.google.com/speech-to-text/pricing
Azure Speech to text. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-to-text
Azure Speech language support. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support
Azure Speech pricing. https://azure.microsoft.com/en-us/pricing/details/speech/
Amazon Transcribe service card. https://docs.aws.amazon.com/ai/responsible-ai/transcribe-speech-recognition/overview.html
Amazon Transcribe language expansion announcement. https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-transcribe-over-100-languages/
Amazon Transcribe foundation model blog post. https://aws.amazon.com/blogs/machine-learning/amazon-transcribe-announces-a-new-speech-foundation-model-powered-asr-system-that-expands-support-to-over-100-languages/
Amazon Transcribe diarization documentation. https://docs.aws.amazon.com/transcribe/latest/dg/diarization.html
Amazon Transcribe pricing. https://aws.amazon.com/transcribe/pricing/
OpenAI community forum thread. https://community.openai.com/t/gpt-4o-mini-transcribe-and-gpt-4o-transcribe-not-as-good-as-whisper/1153905
Retell AI customer story. https://openai.com/index/retell-ai/
Introducing 4o image generation. https://openai.com/index/introducing-4o-image-generation/