OpenTranscription/ Blog
2026-07-03 · MODEL PROFILE

OpenAI Whisper large-v3: model profile

Reference profile of OpenAI Whisper large-v3: architecture, training data, release history, deployment options, pricing, limitations, and sources.

Whisper large-v3 is a 1.55B-parameter checkpoint in OpenAI's Whisper speech recognition family. OpenAI released large-v3 in November 2023 as an update to the original Whisper large and large-v2 models.

Specifications

Developer: OpenAI
Original Whisper release: September 21, 2022
large-v3 release: November 2023
Model type: Encoder-decoder Transformer sequence-to-sequence ASR model
Parameters: 1.55B for large, large-v2, and large-v3
Input representation: Log-Mel spectrogram; large-v3 uses 128 Mel bins
Training data for original Whisper: 680,000 hours of multilingual and multitask supervised audio data from the web
Training data for large-v3: 1M hours of weakly labeled audio plus 4M hours pseudo-labeled by large-v2
License: MIT License for open-source code and weights
Managed API: whisper-1 in OpenAI Audio API
API price: $0.006 per minute for whisper-1
Primary third-party runtimes: openai/whisper, whisper.cpp, faster-whisper, NVIDIA TensorRT-LLM packaging

Overview

Whisper is a general-purpose speech recognition system trained for multilingual transcription, speech translation into English, language identification, voice activity detection, and timestamped transcription, with the task selected through special tokens. OpenAI released the original Whisper code and model weights under the MIT License in September 2022.

large-v3 keeps the same broad encoder-decoder Transformer design as the previous large checkpoints. OpenAI's release notes identify two model-level differences from earlier large models: 128 Mel frequency bins instead of 80, and a new Cantonese language token. The larger change is the training mix: 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio generated with large-v2, trained for 2.0 epochs over the combined dataset.
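The scale of that training mix is easier to see as arithmetic. The short sketch below only restates the figures quoted in this profile; the function name and dictionary keys are illustrative choices, not anything OpenAI publishes.

```python
WEAK_HOURS = 1_000_000      # weakly labeled audio, per the release notes
PSEUDO_HOURS = 4_000_000    # audio pseudo-labeled by large-v2
EPOCHS = 2.0                # passes over the combined dataset
ORIGINAL_HOURS = 680_000    # original 2022 Whisper dataset

def training_mix_summary():
    """Derive totals and ratios from the published hour counts."""
    total = WEAK_HOURS + PSEUDO_HOURS
    return {
        "total_hours": total,
        "pseudo_label_share": PSEUDO_HOURS / total,
        "hours_seen_in_training": total * EPOCHS,
        "ratio_to_original_dataset": round(total / ORIGINAL_HOURS, 2),
    }

for key, value in training_mix_summary().items():
    print(f"{key}: {value}")
```

On these numbers, pseudo-labeled audio makes up four fifths of the mix, and the combined dataset is roughly seven times the size of the original one.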

OpenAI's current product stack treats Whisper as one model family inside a larger managed speech platform. The Audio API still supports whisper-1, while newer transcription models include gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-transcribe-diarize, and gpt-realtime-whisper.

Architecture and training

The Whisper paper describes an encoder-decoder Transformer trained as a single sequence-to-sequence model across several speech tasks. The model consumes log-Mel spectrogram inputs and predicts text tokens plus task and timestamp tokens. The repository describes this as a single model replacing many stages of a traditional speech-processing pipeline.

The model family uses 30-second audio chunks. Long-form transcription requires sequential decoding and supporting logic such as segmentation, silence trimming, temperature fallback, previous-text conditioning, timestamp constraints, and other decoding safeguards.
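To make the 30-second windowing concrete, here is a deliberately naive sketch that cuts a recording into fixed windows. The function name and the fixed stepping are assumptions for illustration only: the reference implementation advances its window using the timestamps the model predicts rather than a fixed stride. The 16 kHz sample rate comes from the Whisper paper.

```python
SAMPLE_RATE = 16_000   # Hz; Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30     # window length used by the model family

def fixed_windows(num_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample offsets for consecutive fixed-length windows."""
    chunk = sample_rate * chunk_seconds
    return [(start, min(start + chunk, num_samples))
            for start in range(0, num_samples, chunk)]

# A 70-second recording needs three windows. The last one is shorter and
# would be zero-padded to 30 seconds before feature extraction.
print(fixed_windows(70 * SAMPLE_RATE))
```

A fixed stride like this can cut a word in half at a window boundary, which is one reason real long-form decoding relies on the supporting logic listed above.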

OpenAI's original 2022 paper emphasizes data scale rather than architectural novelty. The authors say they used an "off-the-shelf" encoder-decoder Transformer to avoid mixing architecture changes into the study. The original dataset used 680,000 hours of audio and transcript pairs collected from the internet, including 117,000 hours across 96 non-English languages and 125,000 hours for X-to-English translation. The public model card later describes the non-English data as spanning 98 languages.

large-v2 kept the same architecture and size as the original large model but used a different training procedure: 2.5 times more epochs, SpecAugment, stochastic depth, and BPE dropout. OpenAI reported about 5% relative English error reduction and 10% relative non-English error reduction on average compared with the original large model.

large-v3 kept the architecture mostly intact and changed the input features and training data. It uses 128 Mel bins, adds a Cantonese token, and trains on 1M hours of weak labels plus 4M hours of pseudo-labels generated by large-v2.
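The effect of the Mel-bin change on the encoder input can be worked out from the front-end parameters. The sketch below assumes the values described in the Whisper paper (16 kHz audio, a 10 ms stride, 30-second windows); the helper name is hypothetical.

```python
SAMPLE_RATE = 16_000   # Hz, per the Whisper paper
HOP_LENGTH = 160       # samples between frames, i.e. a 10 ms stride
CHUNK_SECONDS = 30     # one input window

def mel_shape(n_mels):
    """Shape (mel bins, frames) of the log-Mel input for one 30-second window."""
    frames = CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH
    return (n_mels, frames)

print(mel_shape(80))    # earlier large checkpoints
print(mel_shape(128))   # large-v3
```

The number of frames per window is unchanged, so large-v3 sees the same time resolution with 1.6 times as many frequency channels per frame.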

Capabilities and features

  • Multilingual transcription.
  • Speech translation into English.
  • Language identification.
  • Timestamp prediction for transcription segments.
  • Public open-source inference through OpenAI's repository.
  • Managed API access through whisper-1 in OpenAI's Audio API.
  • Local and optimized runtimes through projects such as whisper.cpp and faster-whisper.
  • Commercial deployment options through third-party packaging, including NVIDIA's TensorRT-LLM model card for Whisper large-v3.

Language support

The original paper reports training data across 96 non-English languages, while the public model card describes non-English data across 98 languages. Performance is uneven and strongly related to training-data volume: the paper reports a squared correlation coefficient of 0.83 between the log of WER and the log of training hours per language.
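A statistic of this kind is computed on log-transformed values. The sketch below shows one way to calculate a squared Pearson correlation between log training hours and log WER. The data points are invented purely for illustration and are not taken from the paper.

```python
import math

def squared_log_correlation(hours, wers):
    """Squared Pearson correlation between log(hours) and log(WER)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(w) for w in wers]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

# Hypothetical (training hours, WER %) pairs, invented for illustration only.
hours = [10, 100, 1_000, 10_000]
wers = [60.0, 35.0, 18.0, 9.0]
print(round(squared_log_correlation(hours, wers), 3))
```

A value near 1 means a straight line on a log-log plot explains most of the variation, which is the sense in which data volume predicts per-language error.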

The paper identifies language-specific outliers such as Hebrew, Telugu, Chinese, and Korean, where performance was worse than the data volume alone would predict. It points to possible causes including linguistic distance, script differences, tokenizer mismatch, or data-quality variation.

large-v3 added a Cantonese language token.

Performance and benchmarks

OpenAI's large-v3 release notes report broad gains over large-v2 on Common Voice 15 and FLEURS, with error reductions of 10% to 20% across many languages. The release notes do not claim that every audio type or language improves uniformly.

The large-v2 release notes report about 5% relative error reduction in English and 10% in other languages on average compared with the original large model, while noting that some audio still favored large-v1.
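Because these figures are relative rather than absolute reductions, a small worked example may help. The WER values below are hypothetical and chosen only to make the arithmetic visible.

```python
def relative_reduction(old_wer, new_wer):
    """Fraction by which the error rate fell, relative to the old value."""
    return (old_wer - new_wer) / old_wer

# Hypothetical WERs: a drop from 10.0% to 9.5% is half a point in absolute
# terms but a 5% relative reduction.
print(f"{relative_reduction(10.0, 9.5):.0%}")
# A 20% relative reduction from the same baseline would land at 8.0% WER.
print(f"{relative_reduction(10.0, 8.0):.0%}")
```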

OpenAI's current model page for whisper-1 rates its performance as "Average" and its speed as "Medium", and lists a price of $0.006 per minute. OpenAI markets its newer transcription products as higher accuracy or lower latency, depending on the use case.

Independent research has raised concerns about hallucination behavior. The 2024 "Careless Whisper" paper reports that roughly 1% of audio transcriptions in its dataset contained entire hallucinated phrases or sentences absent from the audio, and that 38% of those hallucinations included explicit harms such as violent rhetoric, false authority, or inaccurate associations.

Latency and throughput

The open-source Whisper checkpoints do not have one latency number because performance depends on hardware, runtime, precision, batching, quantization, decoding settings, and segment length.

The ecosystem has produced several faster runtimes. whisper.cpp supports CPU-only inference, quantization, Apple Silicon, NVIDIA, ROCm, OpenVINO, WebAssembly, Android, iOS, Raspberry Pi, and other targets. faster-whisper, based on CTranslate2, reports up to 4x faster inference than openai/whisper at the same accuracy while using less memory, with additional gains from batching and quantization.

OpenAI's whisper-1 model page labels its speed as "Medium." OpenAI's current speech stack positions gpt-realtime-whisper as the low-latency realtime option and the newer GPT-4o transcription models as the higher-accuracy managed path.

Deployment and integrations

Whisper large-v3 can be used locally through OpenAI's open-source repository or through optimized third-party runtimes. The MIT License allows commercial use of the released code and weights.
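As a minimal sketch of the local path, the example below uses the `load_model` and `transcribe` calls from the openai-whisper package. The helper names and the output format are illustrative choices, and the package is imported lazily so the formatting helper can be used without it installed.

```python
def format_segment(segment):
    """Render one transcription segment as '[start -> end] text'."""
    return (f"[{segment['start']:.2f}s -> {segment['end']:.2f}s] "
            f"{segment['text'].strip()}")

def transcribe_local(audio_path, model_name="large-v3"):
    """Transcribe a file locally and return one formatted line per segment."""
    import whisper  # pip install openai-whisper; also requires ffmpeg
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return [format_segment(s) for s in result["segments"]]

# Example (needs openai-whisper, ffmpeg, and a real audio file):
#   for line in transcribe_local("meeting.mp3"):
#       print(line)
```

The first call downloads the checkpoint, and a model of this size generally needs a GPU with several gigabytes of memory to run at a practical speed.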

OpenAI's Audio API exposes whisper-1, not a customer-selectable open-source size menu. The reviewed sources do not show that the hosted whisper-1 API lets customers choose large-v3 specifically.
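For comparison, a hosted request names the managed model rather than a checkpoint size. The sketch below assumes the official `openai` Python package and its `audio.transcriptions.create` method; the wrapper function is hypothetical and takes the client as an argument so it can be exercised without network access.

```python
def transcribe_hosted(client, audio_file, model="whisper-1"):
    """Send an open binary audio file to the hosted API and return the text."""
    response = client.audio.transcriptions.create(model=model, file=audio_file)
    return response.text

# Example (needs the openai package and an OPENAI_API_KEY in the environment):
#   from openai import OpenAI
#   client = OpenAI()
#   with open("meeting.mp3", "rb") as f:
#       print(transcribe_hosted(client, f))
```

The `model` argument is the only model selector in the request, which matches the point above: the caller picks whisper-1, not a specific open-source checkpoint.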

Third-party deployment paths include whisper.cpp for broad hardware portability, faster-whisper for CTranslate2 inference, and NVIDIA's TensorRT-LLM packaging for Whisper large-v3.

Pricing

OpenAI's hosted whisper-1 API is priced at $0.006 per minute.

Self-hosted large-v3 has no per-minute model license fee under the MIT License, but the user pays for compute, storage, maintenance, and any deployment platform.
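A rough comparison of the two cost models can be sketched as follows. The hosted price is the whisper-1 figure quoted above; the GPU hourly rate and the throughput factor are hypothetical placeholders that vary widely by hardware and runtime, and the self-hosted figure covers compute only.

```python
API_PRICE_PER_MINUTE = 0.006  # USD, hosted whisper-1 price quoted above

def hosted_cost(audio_minutes):
    """Cost of sending the audio to the hosted API."""
    return audio_minutes * API_PRICE_PER_MINUTE

def self_hosted_cost(audio_minutes, gpu_hourly_usd, realtime_factor):
    """Compute-only cost of self-hosting.

    realtime_factor is the number of audio minutes processed per
    wall-clock minute (an assumption you must measure yourself).
    """
    compute_hours = audio_minutes / realtime_factor / 60
    return compute_hours * gpu_hourly_usd

minutes = 10_000
print(f"hosted: ${hosted_cost(minutes):.2f}")
# Hypothetical values: a $1.50/hour GPU running at 20x real time.
print(f"self-hosted: ${self_hosted_cost(minutes, gpu_hourly_usd=1.50, realtime_factor=20):.2f}")
```

The comparison leaves out storage, maintenance, idle capacity, and engineering time, any of which can outweigh the raw compute difference at low volume.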

Third-party managed or optimized deployments may have their own pricing. The reviewed sources do not establish a single authoritative cross-provider price for all large-v3 hosting options.

Development and ownership

The Whisper paper lists Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever as authors, with Radford and Kim as corresponding authors.

OpenAI released Whisper as a research artifact for robust speech processing under large-scale weak supervision. The model card names AI researchers as the primary intended users and also notes that the models may be useful to developers as an ASR solution.

Release history

Date | Milestone | Notes
September 21, 2022 | Whisper released | OpenAI released the paper, code, and weights under the MIT License
December 2022 | large-v2 | Same architecture and size as large, with additional training and regularization
November 2023 | large-v3 | 128 Mel bins, Cantonese token, 1M weak-label hours plus 4M pseudo-label hours
September 2024 | turbo | Model card records turbo as a later speed-oriented checkpoint in the Whisper family

Sources

  1. GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision. https://github.com/openai/whisper
  2. Introducing Whisper. https://openai.com/index/whisper/
  3. Robust Speech Recognition via Large-Scale Weak Supervision. https://cdn.openai.com/papers/whisper.pdf
  4. Announcing the large-v2 model, openai/whisper Discussion #661. https://github.com/openai/whisper/discussions/661
  5. large-v3 release, openai/whisper Discussion #1762. https://github.com/openai/whisper/discussions/1762
  6. Language Models are Unsupervised Multitask Learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
  7. Speech to text, OpenAI API docs. https://developers.openai.com/api/docs/guides/speech-to-text
  8. Whisper Model, OpenAI API docs. https://developers.openai.com/api/docs/models/whisper-1
  9. Careless Whisper: Speech-to-Text Hallucination Harms. https://arxiv.org/abs/2402.08021
  10. Whisper model card, openai/whisper repository. https://github.com/openai/whisper/blob/main/model-card.md
  11. Whisper.cpp. https://github.com/ggml-org/whisper.cpp
  12. faster-whisper. https://github.com/SYSTRAN/faster-whisper
  13. NVIDIA Whisper large-v3 model card. https://build.nvidia.com/openai/whisper-large-v3