OpenAI Whisper large-v3: model profile
Reference profile of OpenAI Whisper large-v3: architecture, training data, release history, deployment options, pricing, limitations, and sources.
Whisper large-v3 is a 1.55B-parameter checkpoint in OpenAI's Whisper speech recognition family. OpenAI released large-v3 in November 2023 as an update to the original Whisper large and large-v2 models.
Specifications
| Field | Value |
|---|---|
| Developer | OpenAI |
| Original Whisper release | September 21, 2022 |
| large-v3 release | November 2023 |
| Model type | Encoder-decoder Transformer sequence-to-sequence ASR model |
| Parameters | 1.55B for large, large-v2, and large-v3 |
| Input representation | Log-Mel spectrogram; large-v3 uses 128 Mel bins |
| Training data for original Whisper | 680,000 hours of multilingual and multitask supervised audio data from the web |
| Training data for large-v3 | 1M hours weakly labeled audio plus 4M hours pseudo-labeled by large-v2 |
| License | MIT License for open-source code and weights |
| Managed API | whisper-1 in OpenAI Audio API |
| API price | $0.006 per minute for whisper-1 |
| Primary third-party runtimes | openai/whisper, whisper.cpp, faster-whisper, NVIDIA TensorRT-LLM packaging |
Overview
Whisper is a general-purpose speech recognition system trained for multilingual transcription, speech translation into English, language identification, voice activity detection, and timestamped transcription, with each task selected through special tokens. OpenAI released the original Whisper code and model weights under the MIT License in September 2022.
large-v3 keeps the same broad encoder-decoder Transformer design as the previous large checkpoints. OpenAI's release notes identify two model-level differences from earlier large models: 128 Mel frequency bins instead of 80, and a new Cantonese language token. The larger change is the training mix: 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio generated with large-v2, trained for 2.0 epochs over the combined dataset.
OpenAI's current product stack treats Whisper as one model family inside a larger managed speech platform. The Audio API still supports whisper-1, while newer transcription models include gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper.
Architecture and training
The Whisper paper describes an encoder-decoder Transformer trained as a single sequence-to-sequence model across several speech tasks. The model consumes log-Mel spectrogram inputs and predicts text tokens plus task and timestamp tokens. The repository describes this as a single model replacing many stages of a traditional speech-processing pipeline.
The model family uses 30-second audio chunks. Long-form transcription requires sequential decoding and supporting logic such as segmentation, silence trimming, temperature fallback, previous-text conditioning, timestamp constraints, and other decoding safeguards.
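Temperature fallback, one of the safeguards listed above, can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: `Candidate`, `decode_with_fallback`, and the `decode` callback are hypothetical names, and the thresholds (compression ratio 2.4, average log-probability -1.0, temperatures 0.0 to 1.0 in steps of 0.2) are assumed to mirror the defaults in openai/whisper.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    """Hypothetical container for one decoding attempt."""
    text: str
    avg_logprob: float


def compression_ratio(text: str) -> float:
    """Highly repetitive text compresses well, so a high ratio hints at a decoding loop."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode: Callable[[float], Candidate],
    temperatures: Sequence[float] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    max_compression_ratio: float = 2.4,   # assumed default
    min_avg_logprob: float = -1.0,        # assumed default
) -> Candidate:
    """Retry at higher temperatures until a candidate passes both checks."""
    if not temperatures:
        raise ValueError("at least one temperature is required")
    candidate = None
    for temperature in temperatures:
        candidate = decode(temperature)
        too_repetitive = compression_ratio(candidate.text) > max_compression_ratio
        too_unlikely = candidate.avg_logprob < min_avg_logprob
        if not (too_repetitive or too_unlikely):
            break  # accept the first candidate that passes
    return candidate  # if every attempt failed, return the last one
```

In practice `decode` would wrap a real model call for one 30-second window; here it is left abstract so the control flow stays visible.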
OpenAI's original 2022 paper emphasizes data scale rather than architectural novelty. The authors say they used an "off-the-shelf" encoder-decoder Transformer to avoid mixing architecture changes into the study. The original dataset used 680,000 hours of audio and transcript pairs collected from the internet, including 117,000 hours across 96 non-English languages and 125,000 hours for X-to-English translation. The public model card later describes the non-English data as spanning 98 languages.
large-v2 kept the same architecture and size as the original large model but used a different training procedure: 2.5 times more epochs, SpecAugment, stochastic depth, and BPE dropout. OpenAI reported about 5% relative English error reduction and 10% relative non-English error reduction on average compared with the original large model.
large-v3 kept the architecture mostly intact and changed the input features and training data. It uses 128 Mel bins, adds a Cantonese token, and trains on 1M hours of weak labels plus 4M hours of pseudo-labels generated by large-v2.
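The effect of the Mel-bin change on the encoder input can be shown with simple arithmetic. The sketch below assumes the audio constants used in the openai/whisper repository (16 kHz sample rate, hop length of 160 samples, 30-second windows); these values are not stated in this profile, and `mel_input_shape` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed from openai/whisper)
HOP_LENGTH = 160       # samples between spectrogram frames (assumed)
CHUNK_SECONDS = 30     # window length used by the model family


def mel_input_shape(n_mels: int, seconds: int = CHUNK_SECONDS) -> tuple[int, int]:
    """Return (mel bins, frames) for one fixed-length audio window."""
    n_samples = seconds * SAMPLE_RATE
    n_frames = n_samples // HOP_LENGTH
    return (n_mels, n_frames)


large_v2_shape = mel_input_shape(80)    # earlier large checkpoints
large_v3_shape = mel_input_shape(128)   # large-v3
```

Under these assumptions both checkpoints see the same number of frames per window; only the frequency resolution differs, which is why features computed for an 80-bin checkpoint cannot be fed to large-v3 unchanged.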
Capabilities and features
- Multilingual transcription.
- Speech translation into English.
- Language identification.
- Timestamp prediction for transcription segments.
- Public open-source inference through OpenAI's repository.
- Managed API access through whisper-1 in OpenAI's Audio API.
- Local and optimized runtimes through projects such as whisper.cpp and faster-whisper.
- Commercial deployment options through third-party packaging, including NVIDIA's TensorRT-LLM model card for Whisper large-v3.
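Several of the capabilities above map onto a single call in the open-source package. The following is a minimal sketch that assumes the `openai-whisper` Python package and ffmpeg are installed; `transcribe_file` and `format_segment` are hypothetical helper names, and the file path is a placeholder.

```python
def transcribe_file(path: str, model_name: str = "large-v3", task: str = "transcribe") -> dict:
    """Run local inference; task may be "transcribe" or "translate" (into English)."""
    import whisper  # imported lazily so the helpers below work without the package

    model = whisper.load_model(model_name)
    # The result dict is expected to hold "text", "language", and "segments".
    return model.transcribe(path, task=task)


def format_segment(segment: dict) -> str:
    """Render one timestamped segment as a readable line."""
    return f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] {segment['text'].strip()}"


def print_transcript(path: str) -> None:
    """Print the detected language and each timestamped segment."""
    result = transcribe_file(path)
    print("language:", result["language"])
    for segment in result["segments"]:
        print(format_segment(segment))
```

Calling `print_transcript("meeting.mp3")` would download the checkpoint on first use, which is a multi-gigabyte file for the large models.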
Language support
The original paper reports training data across 96 non-English languages, while the public model card describes non-English data across 98 languages. Performance is uneven and strongly related to training-data volume. The paper reports a squared correlation coefficient of 0.83 between log WER and log training hours per language.
The paper identifies language-specific outliers such as Hebrew, Telugu, Chinese, and Korean, where performance was worse than the data volume alone would predict. It points to possible causes including linguistic distance, script differences, tokenizer mismatch, or data-quality variation.
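A relationship of this kind can be measured with a squared Pearson correlation on log-transformed values. The sketch below shows the calculation only; the hours and WER figures are invented for illustration and are not taken from the paper, and `squared_log_log_correlation` is a hypothetical helper.

```python
import math
from typing import Sequence


def squared_log_log_correlation(hours: Sequence[float], wers: Sequence[float]) -> float:
    """Squared Pearson correlation between log(training hours) and log(WER)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(w) for w in wers]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Invented data: WER halves for every 10x increase in hours (a perfect power law).
example_hours = [10, 100, 1_000, 10_000]
example_wers = [80.0, 40.0, 20.0, 10.0]
example_r2 = squared_log_log_correlation(example_hours, example_wers)
```

A perfect power law gives a value of 1; outlier languages such as those named above pull the statistic down because their WER sits off the fitted line.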
large-v3 added a Cantonese language token.
Performance and benchmarks
OpenAI's large-v3 release notes report broad gains over large-v2 on the Common Voice 15 and FLEURS comparisons shown in the release discussion, with 10 to 20% error reductions across many languages. The release notes do not claim that every audio type or language improves uniformly.
The large-v2 release notes report about 5% relative error reduction in English and 10% in other languages on average compared with the original large model, while noting that some audio still favored large-v1.
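The figures in this section are relative reductions in word error rate, which is easy to confuse with an absolute drop in percentage points. The sketch below computes WER as word-level edit distance divided by reference length and then the relative change between two scores. Both function names are hypothetical, and the example omits the text normalization that published evaluations typically apply before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which the error rate fell, relative to the old value."""
    return (old_wer - new_wer) / old_wer
```

For example, a model that moves from 10.0% to 9.5% WER has improved by 0.5 percentage points, which is a 5% relative reduction.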
OpenAI's current managed model page rates whisper-1 as "Average" for performance and "Medium" for speed, and prices it at $0.006 per minute. Newer OpenAI transcription products are marketed as higher accuracy or lower latency depending on use case.
Independent research has raised concerns about hallucination behavior. The 2024 "Careless Whisper" paper reports that roughly 1% of audio transcriptions in its dataset contained entire hallucinated phrases or sentences absent from the audio, and that 38% of those hallucinations included explicit harms such as violent rhetoric, false authority, or inaccurate associations.
Latency and throughput
The open-source Whisper checkpoints have no single latency figure because performance depends on hardware, runtime, precision, batching, quantization, decoding settings, and segment length.
The ecosystem has produced several faster runtimes. whisper.cpp supports CPU-only inference, quantization, Apple Silicon, NVIDIA, ROCm, OpenVINO, WebAssembly, Android, iOS, Raspberry Pi, and other targets. faster-whisper, based on CTranslate2, reports up to 4x faster inference than openai/whisper at the same accuracy while using less memory, with additional gains from batching and quantization.
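As an illustration of how precision and hardware choices enter the picture, the sketch below loads large-v3 through faster-whisper. It assumes the `faster-whisper` package is installed. `pick_compute_type` and `transcribe_fast` are hypothetical helpers, and the pairing of float16 with GPU and int8 with CPU is a common starting point rather than a recommendation from the sources.

```python
def pick_compute_type(device: str) -> str:
    """Choose a precision per device; unknown devices fall back to the runtime default."""
    return {"cuda": "float16", "cpu": "int8"}.get(device, "default")


def transcribe_fast(path: str, device: str = "cpu") -> tuple[str, list[tuple[float, float, str]]]:
    """Transcribe with CTranslate2-backed inference and return (language, segments)."""
    from faster_whisper import WhisperModel  # lazy import: optional dependency

    model = WhisperModel("large-v3", device=device, compute_type=pick_compute_type(device))
    segments, info = model.transcribe(path, beam_size=5)
    # `segments` is lazy, so the work happens while this list is built.
    return info.language, [(s.start, s.end, s.text) for s in segments]
```

Measured speedups will vary with the factors listed above, so any comparison should be run on the target hardware with representative audio.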
OpenAI's managed whisper-1 model page labels speed as "Medium." The same current OpenAI speech stack positions gpt-realtime-whisper as the low-latency realtime option and newer GPT-4o transcription models as the higher-accuracy managed transcription path.
Deployment and integrations
Whisper large-v3 can be used locally through OpenAI's open-source repository or through optimized third-party runtimes. The MIT License allows commercial use of the released code and weights.
OpenAI's Audio API exposes whisper-1, not a customer-selectable open-source size menu. The reviewed sources do not show that the hosted whisper-1 API lets customers choose large-v3 specifically.
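A hosted request therefore names whisper-1 rather than a checkpoint size. The sketch below assumes the official `openai` Python package and an `OPENAI_API_KEY` environment variable. `build_request_params` and `transcribe_hosted` are hypothetical helpers, and the optional `language` hint is included only to show how extra parameters would be passed.

```python
from typing import Optional


def build_request_params(language: Optional[str] = None) -> dict:
    """Assemble keyword arguments for a hosted transcription request."""
    params = {"model": "whisper-1"}  # the only Whisper model name the API exposes
    if language is not None:
        params["language"] = language  # optional language hint such as "en"
    return params


def transcribe_hosted(path: str, language: Optional[str] = None) -> str:
    """Send one audio file to the Audio API and return the transcript text."""
    from openai import OpenAI  # lazy import: optional dependency

    client = OpenAI()  # reads the API key from the environment
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            file=audio_file, **build_request_params(language)
        )
    return result.text
```

Teams that need large-v3 specifically would use one of the self-hosted or third-party paths described in this section instead.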
Third-party deployment paths include whisper.cpp for broad hardware portability, faster-whisper for CTranslate2 inference, and NVIDIA's TensorRT-LLM packaging for Whisper large-v3.
Pricing
OpenAI's hosted whisper-1 API is priced at $0.006 per minute.
Self-hosted large-v3 has no per-minute model license fee under the MIT License, but the user pays for compute, storage, maintenance, and any deployment platform.
Third-party managed or optimized deployments may have their own pricing. The reviewed sources do not establish a single authoritative cross-provider price for large-v3 hosting.
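The hosted per-minute price converts to a budget estimate with simple arithmetic. The sketch below prorates by the second, which is an assumption: the sources reviewed here do not state the billing granularity. `hosted_cost_usd` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_MINUTE_USD = Decimal("0.006")  # whisper-1 list price cited above


def hosted_cost_usd(audio_seconds: int) -> Decimal:
    """Estimate hosted cost for a given audio duration, prorated per second (assumed)."""
    minutes = Decimal(audio_seconds) / Decimal(60)
    return minutes * PRICE_PER_MINUTE_USD


one_hour = hosted_cost_usd(3_600)           # 60 minutes of audio
thousand_hours = hosted_cost_usd(3_600_000)  # 1,000 hours of audio
```

At this rate one hour of audio costs $0.36 and 1,000 hours cost $360, which gives a baseline for comparing against the compute and operations cost of self-hosting.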
Development and ownership
The Whisper paper lists Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever as authors, with Radford and Kim as corresponding authors.
OpenAI released Whisper as a research artifact for robust speech processing under large-scale weak supervision. The model card names researchers as the primary intended users and also notes developer usefulness.
Release history
| Date | Milestone | Notes |
|---|---|---|
| September 21, 2022 | Whisper released | OpenAI released the paper, code, and weights under MIT License |
| December 2022 | large-v2 | Same architecture and size as large, with additional training and regularization |
| November 2023 | large-v3 | 128 Mel bins, Cantonese token, 1M weak-label hours plus 4M pseudo-label hours |
| September 2024 | turbo | Model card records turbo as a later speed-oriented checkpoint in the Whisper family |
Sources
- GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision. https://github.com/openai/whisper
- Introducing Whisper. https://openai.com/index/whisper/
- Robust Speech Recognition via Large-Scale Weak Supervision. https://cdn.openai.com/papers/whisper.pdf
- Announcing the large-v2 model, openai/whisper Discussion #661. https://github.com/openai/whisper/discussions/661
- large-v3 release, openai/whisper Discussion #1762. https://github.com/openai/whisper/discussions/1762
- Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Speech to text, OpenAI API docs. https://developers.openai.com/api/docs/guides/speech-to-text
- Whisper Model, OpenAI API docs. https://developers.openai.com/api/docs/models/whisper-1
- Careless Whisper: Speech-to-Text Hallucination Harms. https://arxiv.org/abs/2402.08021
- Whisper model card, openai/whisper repository. https://github.com/openai/whisper/blob/main/model-card.md
- Whisper.cpp. https://github.com/ggml-org/whisper.cpp
- faster-whisper. https://github.com/SYSTRAN/faster-whisper
- NVIDIA Whisper large-v3 model card. https://build.nvidia.com/openai/whisper-large-v3