OpenAI Whisper large-v3: model profile

Whisper large-v3 is a 1.55B-parameter checkpoint in OpenAI's Whisper speech recognition family. OpenAI released large-v3 in November 2023 as an update to the original Whisper large and large-v2 models.

Specifications

Developer	OpenAI
Original Whisper release	September 21, 2022
large-v3 release	November 2023
Model type	Encoder-decoder Transformer sequence-to-sequence ASR model
Parameters	1.55B for large, large-v2, and large-v3
Input representation	Log-Mel spectrogram; large-v3 uses 128 Mel bins
Training data for original Whisper	680,000 hours of multilingual and multitask supervised audio data from the web
Training data for large-v3	1M hours weakly labeled audio plus 4M hours pseudo-labeled by large-v2
License	MIT License for open-source code and weights
Managed API	`whisper-1` in OpenAI Audio API
API price	$0.006 per minute for `whisper-1`
Primary third-party runtimes	openai/whisper, whisper.cpp, faster-whisper, NVIDIA TensorRT-LLM packaging

Known limitations

The training dataset is not public. OpenAI describes internet-collected audio and transcript pairs at a high level but does not publish source-level provenance.
Whisper can hallucinate text that was not spoken, especially in difficult long-form or non-speech cases. OpenAI's paper and model card both discuss hallucination and long-form decoding failure modes.
Performance varies by language, accent, dialect, data volume, and audio quality.
The model card recommends against high-risk decision contexts and warns against transcribing recordings without consent.
Whisper itself does not provide a robustly evaluated speaker diarization system. OpenAI's managed stack now offers diarization through a separate model, gpt-4o-transcribe-diarize.
Long audio requires chunking and decoding discipline. A naive single-pass pipeline is not equivalent to a production deployment.
In the reviewed sources, OpenAI's hosted whisper-1 API does not expose the model-size choices available in the open-source repository.
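The hallucination and long-form items above are usually handled with decoding safeguards. The following is a minimal sketch of one such safeguard, temperature fallback gated by a compression-ratio check, and not the repository's implementation. The threshold values are assumptions quoted from memory of the openai/whisper defaults, and `decode_fn` and `fake_decode` are hypothetical stand-ins for a real decoder.

```python
import zlib

# Assumed defaults, recalled from the openai/whisper transcribe settings.
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
COMPRESSION_RATIO_THRESHOLD = 2.4   # higher ratio = more repetitive text
AVG_LOGPROB_THRESHOLD = -1.0        # lower value = less confident decode


def compression_ratio(text):
    """Raw byte length divided by zlib-compressed length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode_fn, window):
    """Retry at rising temperatures until the output passes both checks.

    decode_fn(window, temperature) -> (text, avg_logprob) is a hypothetical
    callable. If every temperature fails, the last attempt is returned.
    """
    result = None
    for temperature in TEMPERATURES:
        text, avg_logprob = decode_fn(window, temperature)
        result = (text, avg_logprob, temperature)
        too_repetitive = compression_ratio(text) > COMPRESSION_RATIO_THRESHOLD
        too_uncertain = avg_logprob < AVG_LOGPROB_THRESHOLD
        if not (too_repetitive or too_uncertain):
            break
    return result


def fake_decode(window, temperature):
    # Stand-in decoder: loops on one phrase at temperature 0, recovers above it.
    if temperature == 0.0:
        return ("thank you " * 50, -0.2)
    return ("hello world", -0.3)


text, avg_logprob, used_temperature = decode_with_fallback(fake_decode, window=None)
print(used_temperature, text)  # 0.2 hello world
```

A check like this catches repetitive loops, which compress unusually well, but it does not detect fluent hallucinated sentences.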

Full technical breakdown

Overview

Whisper is a general-purpose speech recognition system trained for multilingual transcription, speech translation into English, language identification, voice activity detection, and timestamped transcription, with the task selected through special tokens. OpenAI released the original Whisper code and model weights under the MIT License in September 2022.

large-v3 keeps the same broad encoder-decoder Transformer design as the previous large checkpoints. OpenAI's release notes identify two model-level differences from earlier large models: 128 Mel frequency bins instead of 80, and a new Cantonese language token. The larger change is the training mix: 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio generated with large-v2, trained for 2.0 epochs over the combined dataset.

OpenAI's current product stack treats Whisper as one model family inside a larger managed speech platform. The Audio API still supports whisper-1, while newer transcription models include gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper.

Architecture and training

The Whisper paper describes an encoder-decoder Transformer trained as a single sequence-to-sequence model across several speech tasks. The model consumes log-Mel spectrogram inputs and predicts text tokens plus task and timestamp tokens. The repository describes this as a single model replacing many stages of a traditional speech-processing pipeline.

The model family uses 30-second audio chunks. Long-form transcription requires sequential decoding and supporting logic such as segmentation, silence trimming, temperature fallback, previous-text conditioning, timestamp constraints, and other decoding safeguards.
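As a rough illustration of the 30-second windowing described above, the sketch below splits a mono sample sequence into fixed windows and zero-pads the last one. The 16 kHz sample rate is an assumption taken from the openai/whisper repository's audio constants, and `split_into_windows` is a hypothetical helper. The repository's long-form decoder advances its window using predicted timestamps rather than a fixed stride, so this is a simplification.

```python
SAMPLE_RATE = 16_000      # Hz; assumed from the openai/whisper audio constants
CHUNK_SECONDS = 30        # window length stated in the text above
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per window


def split_into_windows(samples, chunk_samples=CHUNK_SAMPLES):
    """Split a mono sample sequence into fixed windows, zero-padding the last."""
    windows = []
    for start in range(0, len(samples), chunk_samples):
        window = list(samples[start:start + chunk_samples])
        window.extend([0.0] * (chunk_samples - len(window)))  # pad the tail
        windows.append(window)
    return windows


# 70 seconds of silence -> three windows (30 s, 30 s, and 10 s padded to 30 s)
audio = [0.0] * (70 * SAMPLE_RATE)
windows = split_into_windows(audio)
print(len(windows), len(windows[-1]))  # 3 480000
```

Fixed windows can cut words at the boundaries, which is one reason production pipelines add segmentation and timestamp-aware logic on top.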

OpenAI's original 2022 paper emphasizes data scale rather than architectural novelty. The authors say they used an "off-the-shelf" encoder-decoder Transformer to avoid mixing architecture changes into the study. The original dataset used 680,000 hours of audio and transcript pairs collected from the internet, including 117,000 hours across 96 non-English languages and 125,000 hours for X-to-English translation. The public model card later describes the non-English data as spanning 98 languages.

large-v2 kept the same architecture and size as the original large model but used a different training procedure: 2.5 times more epochs, SpecAugment, stochastic depth, and BPE dropout. OpenAI reported about 5% relative English error reduction and 10% relative non-English error reduction on average compared with the original large model.

large-v3 kept the architecture mostly intact and changed the input features and training data. It uses 128 Mel bins, adds a Cantonese token, and trains on 1M hours of weak labels plus 4M hours of pseudo-labels generated by large-v2.
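The effect of the Mel-bin change on the encoder input can be shown with simple arithmetic. The sketch below assumes a 16 kHz sample rate and a 160-sample hop (10 ms per frame), values recalled from the openai/whisper audio constants rather than stated in this profile; `mel_input_shape` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000   # Hz; assumed from the openai/whisper audio constants
HOP_LENGTH = 160       # samples between frames (10 ms); also an assumption
CHUNK_SECONDS = 30     # window length used by the model family


def mel_input_shape(n_mels):
    """Return (mel bins, frames) for one 30-second window."""
    n_frames = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH
    return (n_mels, n_frames)


earlier_large = mel_input_shape(80)    # large and large-v2
large_v3 = mel_input_shape(128)        # large-v3
growth = (large_v3[0] * large_v3[1]) / (earlier_large[0] * earlier_large[1])
print(earlier_large, large_v3, growth)  # (80, 3000) (128, 3000) 1.6
```

Under these assumptions the frame count is unchanged, so only the feature dimension grows. Features extracted with 80 bins cannot be fed to large-v3 without recomputation.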

Capabilities and features

Multilingual transcription.
Speech translation into English.
Language identification.
Timestamp prediction for transcription segments.
Public open-source inference through OpenAI's repository.
Managed API access through whisper-1 in OpenAI's Audio API.
Local and optimized runtimes through projects such as whisper.cpp and faster-whisper.
Commercial deployment options through third-party packaging, including NVIDIA's TensorRT-LLM model card for Whisper large-v3.

Language support

The original paper reports training data across 96 non-English languages, while the public model card describes non-English data across 98 languages. Performance is uneven and strongly related to training-data volume. The paper reports a squared correlation coefficient of 0.83 between log WER and log training hours per language.
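The paper's statistic is a squared correlation coefficient computed on log-transformed values. The sketch below shows how such a figure could be computed; the hours and WER values are hypothetical and are not taken from the paper, and `squared_log_log_correlation` is an illustrative helper.

```python
import math


def squared_log_log_correlation(hours, wers):
    """Squared Pearson correlation between log(hours) and log(WER)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(w) for w in wers]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Hypothetical per-language data: training hours and WER in percent.
hours = [10, 100, 1_000, 10_000, 50_000]
wers = [62.0, 41.0, 17.0, 11.0, 6.5]
r_squared = squared_log_log_correlation(hours, wers)
print(f"R^2 on log-log values: {r_squared:.2f}")
```

Languages that sit far from the fitted line in log-log space are the kind of outliers the paper discusses.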

The paper identifies language-specific outliers such as Hebrew, Telugu, Chinese, and Korean, where performance was worse than the data volume alone would predict. It points to possible causes including linguistic distance, script differences, tokenizer mismatch, or data-quality variation.

large-v3 added a Cantonese language token.

Performance and benchmarks

OpenAI's large-v3 release notes report broad gains over large-v2 on the Common Voice 15 and FLEURS comparisons shown in the release discussion, with 10 to 20% error reductions across many languages. The release notes do not claim that every audio type or language improves uniformly.

The large-v2 release notes report about 5% relative error reduction in English and 10% in other languages on average compared with the original large model, while noting that some audio still favored large-v1.
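The figures above are relative reductions in error rate, not absolute percentage-point drops. The sketch below computes word error rate with a standard word-level edit distance and then a relative reduction. The transcripts and WER values are hypothetical, and the function names are illustrative rather than taken from any OpenAI tooling.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new model."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substituted word out of three.
print(word_error_rate("the cat sat", "the bat sat"))
# Hypothetical WERs: 10.0% falling to 9.0% is a 10% relative reduction.
print(relative_reduction(10.0, 9.0))
```

Published evaluations typically normalize text (casing, punctuation, number formatting) before scoring, which this sketch omits.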

OpenAI's current managed model page rates whisper-1's performance as "Average" and its speed as "Medium", and prices it at $0.006 per minute. Newer OpenAI transcription products are marketed as higher accuracy or lower latency depending on use case.

Independent research has raised concerns about hallucination behavior. The 2024 "Careless Whisper" paper reports that roughly 1% of audio transcriptions in its dataset contained entire hallucinated phrases or sentences absent from the audio, and that 38% of those hallucinations included explicit harms such as violent rhetoric, false authority, or inaccurate associations.

Latency and throughput

The open-source Whisper checkpoints have no single latency figure because performance depends on hardware, runtime, precision, batching, quantization, decoding settings, and segment length.
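Because latency depends on the whole setup, it is usually measured rather than looked up. A common summary is the real-time factor: processing time divided by audio duration. The sketch below is a minimal timing harness; `real_time_factor` and the stand-in `fake_transcribe` are hypothetical and would be replaced by a call into whichever runtime is being evaluated.

```python
import time


def real_time_factor(transcribe_fn, audio_path, audio_seconds):
    """Time one transcription call and divide by the audio duration.

    Values below 1.0 mean faster than real time on this hardware and config.
    """
    start = time.perf_counter()
    transcribe_fn(audio_path)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def fake_transcribe(audio_path):
    # Stand-in for a real runtime call; sleeps briefly to simulate work.
    time.sleep(0.02)
    return "placeholder transcript"


rtf = real_time_factor(fake_transcribe, "sample.wav", audio_seconds=10.0)
print(f"real-time factor: {rtf:.4f}")
```

A single run is noisy. Repeating the measurement after a warm-up call and reporting the median gives a steadier number.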

The ecosystem has produced several faster runtimes. whisper.cpp supports CPU-only inference, quantization, Apple Silicon, NVIDIA, ROCm, OpenVINO, WebAssembly, Android, iOS, Raspberry Pi, and other targets. faster-whisper, based on CTranslate2, reports up to 4x faster inference than openai/whisper at the same accuracy while using less memory, with additional gains from batching and quantization.
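For orientation, the sketch below shows what a large-v3 transcription call through faster-whisper might look like, following the usage pattern in that project's README. The device and compute-type choices are assumptions about the host machine, `audio_path` is supplied by the caller, and both function names are hypothetical. The third-party import sits inside the function so that the formatting helper can be used without the package installed.

```python
def format_segment(start, end, text):
    """Render one timestamped segment as a single line."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe_with_faster_whisper(audio_path, model_size="large-v3"):
    """Transcribe a file with faster-whisper (requires `pip install faster-whisper`)."""
    from faster_whisper import WhisperModel  # third-party, imported lazily

    # Assumes a CUDA GPU; on CPU, device="cpu" with compute_type="int8"
    # is a typical alternative.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(audio_path, beam_size=5)
    print(f"Detected language: {info.language} "
          f"(probability {info.language_probability:.2f})")
    # `segments` is a generator, so decoding happens as it is consumed.
    return [format_segment(s.start, s.end, s.text) for s in segments]


print(format_segment(0.0, 2.5, " Hello there. "))  # [0.00s -> 2.50s] Hello there.
```

Whether the "up to 4x" figure holds depends on the hardware and settings in use.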

OpenAI's managed whisper-1 model page labels speed as "Medium." The same current OpenAI speech stack positions gpt-realtime-whisper as the low-latency realtime option and newer GPT-4o transcription models as the higher-accuracy managed transcription path.

Deployment and integrations

Whisper large-v3 can be used locally through OpenAI's open-source repository or through optimized third-party runtimes. The MIT License allows commercial use of the released code and weights.

OpenAI's Audio API exposes whisper-1, not a customer-selectable open-source size menu. The reviewed sources do not show that the hosted whisper-1 API lets customers choose large-v3 specifically.
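In practice, the only model selector a hosted request carries is the identifier `whisper-1`. The sketch below assumes the official `openai` Python package and an `OPENAI_API_KEY` environment variable. `build_transcription_request` and `transcribe_hosted` are hypothetical helper names, and the request fields shown are a subset of what the API accepts.

```python
def build_transcription_request(model="whisper-1", language=None,
                                response_format="json"):
    """Assemble keyword arguments for a hosted transcription request."""
    params = {"model": model, "response_format": response_format}
    if language is not None:
        params["language"] = language  # optional language hint, e.g. "en"
    return params


def transcribe_hosted(audio_path, language=None):
    """Send one audio file to the Audio API and return the transcript text."""
    from openai import OpenAI  # third-party, imported lazily

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    params = build_transcription_request(language=language)
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(file=audio_file, **params)
    return result.text  # present for the default "json" response format


print(build_transcription_request(language="en"))
```

The request has no field for choosing among tiny, base, small, medium, or large checkpoints.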

Third-party deployment paths include whisper.cpp for broad hardware portability, faster-whisper for CTranslate2 inference, and NVIDIA's TensorRT-LLM packaging for Whisper large-v3.

Pricing

OpenAI's hosted whisper-1 API is priced at $0.006 per minute.

Self-hosted large-v3 has no per-minute model license fee under the MIT License, but the user pays for compute, storage, maintenance, and any deployment platform.
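The trade-off can be estimated with back-of-the-envelope arithmetic. In the sketch below, the hosted price comes from the figure above, while the GPU hourly rate and the real-time factor are hypothetical placeholders to be replaced with measured values. The self-hosted estimate covers compute only and leaves out storage, maintenance, and idle capacity.

```python
HOSTED_PRICE_PER_MINUTE = 0.006  # USD per audio minute for whisper-1


def hosted_cost(audio_minutes):
    """Cost of sending the audio to the hosted whisper-1 API."""
    return audio_minutes * HOSTED_PRICE_PER_MINUTE


def self_hosted_compute_cost(audio_minutes, gpu_usd_per_hour, real_time_factor):
    """Compute-only cost of transcribing the same audio on rented hardware.

    real_time_factor = processing time / audio duration (measured, not assumed).
    """
    gpu_hours = audio_minutes / 60 * real_time_factor
    return gpu_hours * gpu_usd_per_hour


# Hypothetical workload: 10,000 audio minutes, $1.20/hour GPU, RTF of 0.1.
minutes = 10_000
api_total = hosted_cost(minutes)
gpu_total = self_hosted_compute_cost(minutes, gpu_usd_per_hour=1.20,
                                     real_time_factor=0.1)
print(f"hosted: ${api_total:.2f}  self-hosted compute: ${gpu_total:.2f}")
```

With these placeholder inputs the compute-only figure is lower, but the result changes with utilization and with the engineering effort needed to run the service.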

Third-party managed or optimized deployments may have their own pricing. The reviewed sources do not establish a single authoritative cross-provider price for large-v3 hosting.

Development and ownership

The Whisper paper lists Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever as authors, with Radford and Kim as corresponding authors.

OpenAI released Whisper as a research artifact for robust speech processing under large-scale weak supervision. The model card names researchers as the primary intended users and also notes developer usefulness.

Release history

Date	Milestone	Notes
September 21, 2022	Whisper released	OpenAI released the paper, code, and weights under MIT License
December 2022	large-v2	Same architecture and size as large, with additional training and regularization
November 2023	large-v3	128 Mel bins, Cantonese token, 1M weak-label hours plus 4M pseudo-label hours
September 2024	turbo	Model card records turbo as a later speed-oriented checkpoint in the Whisper family

Sources

GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision. https://github.com/openai/whisper
Introducing Whisper. https://openai.com/index/whisper/
Robust Speech Recognition via Large-Scale Weak Supervision. https://cdn.openai.com/papers/whisper.pdf
Announcing the large-v2 model, openai/whisper Discussion #661. https://github.com/openai/whisper/discussions/661
large-v3 release, openai/whisper Discussion #1762. https://github.com/openai/whisper/discussions/1762
Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Speech to text, OpenAI API docs. https://developers.openai.com/api/docs/guides/speech-to-text
Whisper Model, OpenAI API docs. https://developers.openai.com/api/docs/models/whisper-1
Careless Whisper: Speech-to-Text Hallucination Harms. https://arxiv.org/abs/2402.08021
Whisper model card, openai/whisper repository. https://github.com/openai/whisper/blob/main/model-card.md
Whisper.cpp. https://github.com/ggml-org/whisper.cpp
faster-whisper. https://github.com/SYSTRAN/faster-whisper
NVIDIA Whisper large-v3 model card. https://build.nvidia.com/openai/whisper-large-v3