OpenTranscription/ Blog
2026-07-03 · MODEL PROFILE

GPT-4o Transcribe: model profile

Reference profile of OpenAI's gpt-4o-transcribe speech-to-text model: release date, pricing, API features, benchmarks, and disclosed specifications.

model-profilespeech-to-textopenaigpt-4oasr
OpenAI
Model profile OpenAI

gpt-4o-transcribe is OpenAI's dedicated speech-to-text model in the GPT-4o family, released through the Audio API on March 20, 2025.

Specifications

DeveloperOpenAI
ReleasedMarch 20, 2025
Model typeDedicated speech-to-text (ASR) model; builds on the GPT-4o architecture
Training dataSpecialized audio-centric pretraining datasets, enhanced distillation from larger audio models, RL-heavy post-training for speech-to-text; exact corpus size not publicly disclosed
Languages57 languages on OpenAI's supported-language list; training-language coverage not publicly disclosed
Modes (batch / streaming)Single-shot file transcription and streaming of completed recordings; realtime low-latency transcription is served by separate realtime models
LatencySpeed labeled "Medium" on the model page
DeploymentOpenAI Audio API transcription endpoint (cloud)
Pricing$2.50 input and $10.00 output per 1M tokens; estimated $0.006 per minute
LicenseProprietary API service; described in third-party literature as a proprietary ASR system
Context window16,000 tokens
Max output tokens2,000
Knowledge cutoffJune 1, 2024

Not disclosedParameters · Throughput / concurrency

Full technical breakdown9 sections

Overview

gpt-4o-transcribe was introduced on March 20, 2025, alongside gpt-4o-mini-transcribe and gpt-4o-mini-tts. It followed OpenAI's earlier speech and audio releases: Whisper in 2022, GPT-4o in May 2024, the Realtime API beta in October 2024, and preview GPT-4o audio models later in 2024. OpenAI positioned the model for developer-facing transcription workloads such as customer call centers, meeting notes, and voice agents, not as a general-purpose ChatGPT model slug.

As of January 13, 2026, OpenAI's changelog states that it currently recommends gpt-4o-mini-transcribe over gpt-4o-transcribe for the best results.

OpenAI says the model builds on GPT-4o, uses specialized audio-centric pretraining data, improved distillation, and an "RL-heavy" training paradigm for speech-to-text. It does not publicly disclose parameter count, layer counts, tokenizer, exact corpus size, exact language coverage in training, or whether the deployed system is fully end-to-end or partly cascaded.

The current model page describes gpt-4o-transcribe as audio-plus-text input with text output, a 16,000-token context window, 2,000 max output tokens, and estimated pricing of $0.006 per minute. The default GPT-4o SKU, by contrast, is a 128,000-context text-and-image model; audio I/O lives in separate GPT-4o audio and realtime variants.

Capabilities and features

  • Input: audio plus text; output: text.
  • A language hint on the request improves both accuracy and latency.
  • A text prompt field can guide style or continue a previous audio segment. The API reference says the prompt should match the audio language.
  • The model can return token log probabilities when include=["logprobs"] is requested.
  • Output formats: OpenAI's general speech guide says gpt-4o-transcribe and gpt-4o-mini-transcribe support JSON or plain-text outputs, while the typed API reference says those models support only json and reserves text and diarized_json for gpt-4o-transcribe-diarize. The two documentation sources are inconsistent on this point.
  • Diarization is available on the related model gpt-4o-transcribe-diarize. With chunking set to auto, the server first normalizes loudness and then uses voice activity detection to choose boundaries. Manual server_vad configuration is supported, along with known-speaker reference clips: up to four speaker references with 2 to 10 second samples.
  • Customer-accessible fine-tuning is not publicly documented for gpt-4o-transcribe. GPT-4o proper supports documented fine-tuning, structured outputs, and function calling.

OpenAI family comparison

Model Primary job I/O and modality surface Prompt / control surface Context and output budget Public cost signal
gpt-4o-transcribe Dedicated speech-to-text Audio + text in, text out; optimized for transcription. Language hint improves accuracy/latency; prompt can guide style/continuation; logprobs available. 16k context, 2k max output. ~$0.006/min estimated.
gpt-4o-mini-transcribe Lower-cost dedicated speech-to-text Same broad role, smaller/faster/cheaper. OpenAI later recommends it over gpt-4o-transcribe. Same prompting/logprob pattern in docs. 16k context, 2k max output. ~$0.003/min estimated.
Default gpt-4o General-purpose flagship model Text + image in, text out on the default model page. Chat-style prompting, structured outputs, tools, fine-tuning. 128k context, 16,384 max output. Token-priced, not minute-priced.
gpt-4o-audio-preview Audio-capable chat / speech-to-speech preview Text + audio in, text + audio out. Chat-completions style prompting; not positioned as the dedicated ASR choice. 128k context, 16,384 max output. Audio tokens: $40 input / $80 output per 1M audio tokens.
gpt-realtime-whisper Low-latency live transcription Audio + text in, text out; realtime sessions. Explicit live delay-vs-accuracy setting; manual commit or server-side turn detection depending on model support. 16k context, 2k max output. $0.017/minute.

Language support

OpenAI publishes a supported-language list of 57 languages meeting its support criterion.

The launch post says the model outperforms Whisper v2 and v3 on the multilingual FLEURS benchmark and matches or outperforms other leading models across most major languages.

Language coverage in the training data is not publicly disclosed.

Performance and benchmarks

Vendor-reported: the March 2025 launch post says gpt-4o-transcribe achieves lower word error rate (WER) than original Whisper models, outperforms Whisper v2 and v3 across all language evaluations shown, and on FLEURS matches or outperforms other leading models across most major languages. OpenAI also claims gains in accents, noisy environments, and varying speech speeds. The post does not include a numeric benchmark table for gpt-4o-transcribe.

Vendor-reported: in its December 2025 audio-model update, OpenAI says the gpt-4o-mini-transcribe-2025-12-15 snapshot delivered lower WER than prior models on Common Voice and FLEURS without language hints, and in an internal hallucination-with-noise evaluation produced roughly 90% fewer hallucinations than Whisper v2 and about 70% fewer than previous GPT-4o-transcribe models.

Third-party evaluations:

  • The AHELM benchmark paper includes GPT-4o-transcribe family models and reports that gpt-4o-transcribe did not show statistically significant ASR bias conditioned on speaker sex in one analysis, while gpt-4o-mini-transcribe showed a male-speaker advantage.
  • The WhisperKit paper describes gpt-4o-transcribe as a frontier cloud baseline and stronger than base GPT-4o for transcription.
  • The Qwen3-ASR report compares directly against GPT-4o-transcribe as one of three leading proprietary services.
  • The Step-Audio 2 report says it preferred GPT-4o Transcribe over GPT-4o Audio because the former gave stronger results, and calls GPT-4o Transcribe one of the specialized ASR systems with leading-edge performance.
  • A 2026 "Back to Basics" ASR paper reports environmental degradation on FLEURS: clipping increased Chinese character error rate (CER) from 6.4 to 17.4, English WER from 2.8 to 8.8, Japanese CER from 3.0 to 7.9, and Korean CER from 4.0 to 10.2.
  • The HiKE code-switching evaluation reports GPT-4o-Transcribe as the only LLM-based model in that study to outperform Whisper-Large.
  • A phoneme-level study of Ukrainian notes the model tended to output Cyrillic and that the researchers explicitly prompted for Cyrillic transcription, indicating that prompt handling affects measured quality.

Public benchmark evidence

Source What it says about GPT-4o-transcribe Why it matters
OpenAI launch post Lower WER than Whisper; outperforms Whisper v2/v3 across shown language evaluations; matches or outperforms other leading models across most major languages on FLEURS. Official positioning, mostly relative rather than numeric.
OpenAI Dec. 2025 audio update New mini snapshot lowers WER on Common Voice/FLEURS and cuts hallucinations ~90% vs Whisper v2 and ~70% vs previous GPT-4o-transcribe models in an internal noise-heavy eval. Later gains concentrated on reliability and hallucination control, not just raw WER.
AHELM 2025 GPT-4o-transcribe used as a major evaluated model; no statistically significant sex-conditioned ASR bias reported there for GPT-4o-transcribe, while the mini variant showed a male-speaker effect in that analysis. The model is used in academic audio-language benchmarking.
WhisperKit 2025 Treats gpt-4o-transcribe as the frontier transcription baseline and stronger than base GPT-4o for this task. gpt-4o-transcribe is compared to ASR systems, not just to GPT chat models.
HiKE 2026 GPT-4o-Transcribe was the only LLM-based model in that benchmark to outperform Whisper-Large. Code-switching performance in at least one setting.
Back to Basics 2026 Strong clean FLEURS performance, but clipping/far-field/reverberation materially degrade results. Robustness is improved, not solved.

Developer reports on the OpenAI community forum are mixed: some report cases where GPT-4o-transcribe handled background noise or language recognition better than Whisper; others report transcript truncation, odd outputs, prompt leakage in related models, language-enforcement issues, and perceived slowdown or instability that later turned out to be networking-specific.

Comparison with major competitors

System Public architecture disclosure Multilingual support in official sources Noise / robustness claim Diarization / customization Public pricing signal Public numeric benchmark transparency
OpenAI gpt-4o-transcribe Builds on GPT-4o; exact topology unspecified. OpenAI publishes a supported-language list of 57 languages meeting its support criterion; launch post references FLEURS and broader multilingual gains. OpenAI directly claims gains in accents, noisy environments, and varying speech speeds. Prompting, language hints, token logprobs; diarization available on related gpt-4o-transcribe-diarize. ~$0.006/min. Relative official WER claims, but few official numeric tables.
OpenAI Whisper Fully public seq2seq encoder-decoder and sizes. Training data covers 98 languages. OpenAI says Whisper improved robustness to accents, noise, and technical language. Open-source prompting/task tokens; not realtime out of the box. Open-source; API whisper-1 pricing not in scope in the source. Stronger research transparency than products, but older quality frontier.
Google Chirp 3 Officially described as a multilingual ASR-specific generative model; topology unspecified. 85+ languages and variants on the product page. Google says it offers enhanced accuracy and speed and can handle noisy audio without extra noise cancellation. Diarization in supported languages; speech adaptation and preview custom prompt support. V2 standard recognition starts at $0.016/min; dynamic batch $0.003/min. The source did not identify a current official, directly comparable WER/CER table in the cited docs.
Azure Speech Uses a Universal Language Model base model; topology unspecified. Official language-support tables cover realtime, fast, and batch locales, but the cited overview does not summarize one headline count. Microsoft positions fast transcription as faster than real time and exposes custom speech for domain adaptation. Diarization up to 35 speakers; phrase lists and custom speech. Search snippets show roughly $1.20/hr realtime and ~$0.225/hr batch, with region-dependent dynamic pricing tables. The source did not identify a current official, directly comparable WER/CER table in the cited docs.
Amazon Transcribe AWS service card describes acoustic features, then candidate word strings, then language-model ranking. 100+ languages and locales. AWS says the 2023 foundation-model update improved accuracy 20-50% across most languages and 30-70% on telephony; service card says it can perform well in noisy and multi-speaker settings. Speaker diarization, custom vocabulary, custom language models, language identification. Tier 1 in us-east-1 is $0.024/min for the first 250k minutes; one-second billing, 15-second minimum. AWS publishes some F1 and diarization accuracy examples in service cards, but not a simple universally comparable WER/CER table.

Latency and throughput

The gpt-4o-transcribe model page labels speed as "Medium."

A language hint on the transcription request improves latency as well as accuracy.

For low-latency live transcription, OpenAI's lineup uses separate realtime models: the gpt-realtime-whisper page labels speed as "Very fast" and exposes an explicit delay/accuracy tradeoff (minimal, low, medium, high, xhigh) for live transcription sessions. Realtime transcription with gpt-realtime-whisper uses 24 kHz mono PCM, with explicit delay settings and optional turn detection depending on model support. The launch post says developers seeking low-latency speech-to-speech experiences should prefer Realtime API speech models.

Throughput and concurrency limits: Not publicly disclosed.

Deployment and integrations

gpt-4o-transcribe is served through the OpenAI Audio API. The speech-to-text guide says OpenAI historically backed transcription and translation endpoints with whisper-1, and now also supports gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize on the transcription side. The guide distinguishes three usage patterns: single-shot file transcription, streaming of completed recordings, and realtime transcription sessions for low-latency live audio.

The model is a developer API SKU, not a ChatGPT model slug. ChatGPT release notes from February 14, 2025 still referred to ChatGPT's voice-to-text dictation feature as Whisper, a month before the API launch of the GPT-4o transcription models. OpenAI's Realtime API announcement framed its speech stack as "similar to ChatGPT's Advanced Voice Mode," indicating that consumer ChatGPT voice and the dedicated transcription API slugs are separate product surfaces.

OpenAI customer stories highlight Retell AI using GPT-4o-based voice systems and Parloa using OpenAI models for enterprise voice-driven customer service. The public stories do not specify whether transcription in those deployments is powered by gpt-4o-transcribe, realtime models, or another routing mix.

Pricing

Model Pricing
gpt-4o-transcribe $2.50 input and $10.00 output per 1M tokens; estimated $0.006 per minute
gpt-4o-mini-transcribe Estimated $0.003 per minute
gpt-4o-audio-preview $40 per 1M input audio tokens; $80 per 1M output audio tokens
gpt-realtime-whisper $0.017 per minute

Development and ownership

gpt-4o-transcribe is developed and operated by OpenAI. For the March 2025 audio-model launch, OpenAI publicly credited four research leads: Christina Kim, Junhua Mao, Yi Shen, and Yu Zhang. Named product leads included Anubha Srivastava, Jackie Shannon, Jeff Harris, Reah Miyara, and Xiaolin Hao. Leadership sponsors included Kevin Weil, Mark Chen, Nick Turley, Olivier Godement, Prafulla Dhariwal, Shengjia Zhao, and Andrew Gibiansky. OpenAI also published broad contributor lists across research, engineering, and product.

Credits from OpenAI's March 2025 4o image-generation release identify Jackie Shannon as ChatGPT Product Lead and Mengchao Zhong and Wayne Chang as ChatGPT Engineering Leads. LinkedIn snippets identify Nick Turley as VP, Head of ChatGPT and Sulman Choudhry as Head of Engineering, ChatGPT. These credits are not transcription-specific.

Several names appear in both the audio launch credits and ChatGPT release credits, including Wayne Chang, Xiaolin Hao, Wanning Jiang, Ola Okelola, and Yilei Qian.

Release history

Date Event
September 2022 Whisper released as an open-source ASR system trained on 680,000 hours of multilingual, multitask supervision from the web
May 13, 2024 GPT-4o enters the API as the flagship "omni" family anchor
October 1, 2024 Realtime API beta ships
October 17, 2024 gpt-4o-audio-preview ships, exposing GPT-4o-family audio I/O for chat completions
February 14, 2025 ChatGPT release notes still refer to the voice-to-text dictation feature as Whisper
March 20, 2025 gpt-4o-transcribe and gpt-4o-mini-transcribe launch in the Audio API, alongside gpt-4o-mini-tts
December 15, 2025 Dated snapshot gpt-4o-mini-transcribe-2025-12-15 released
January 13, 2026 OpenAI updates the moving slug and recommends gpt-4o-mini-transcribe over gpt-4o-transcribe

Sources

The platform

Put these benchmarks to work

The same evaluations behind these dispatches drive OpenTranscription — one API that routes every job to the right speech model for your audio, language, and budget.

© 2026 OpenTranscription · Signal is our journal.Set in system grotesque, serif & mono