Microsoft Azure Whisper: model profile
Reference spec sheet for OpenAI's Whisper model as offered on Microsoft Azure: delivery paths, limits, languages, pricing, benchmarks, and release history.
Microsoft Azure Whisper is a set of Azure delivery paths for OpenAI's Whisper speech-to-text model, comprising a managed Azure OpenAI endpoint for file-based transcription and speech-to-English translation and Azure Speech batch transcription support for Whisper.
Specifications
| Developer | Model created by OpenAI; Azure offering productized and hosted by Microsoft |
| Released | Whisper released by OpenAI September 21, 2022; Azure public preview September 15, 2023; general availability in Azure OpenAI and Azure Speech March 13, 2024 |
| Model type | General-purpose speech recognition model for transcription and speech-to-English translation |
| Parameters | Not publicly disclosed for the managed Azure OpenAI whisper SKU. The separate catalog model openai-whisper-large-v3 is listed as 1.55B parameters |
| Training data | 680,000 hours of multilingual and multitask supervised audio data from the web |
| Languages | Transcription in 57 languages; translation from those languages into English |
| Modes (batch / streaming) | File-based and batch. The core whisper lane is not for streaming; Microsoft directs low-latency streaming to GPT Realtime Whisper or Azure Speech real-time and fast transcription |
| Latency | Not publicly disclosed. Microsoft publishes no Whisper-specific latency SLA or p50/p95 latency table |
| Throughput / concurrency | Azure Speech Whisper batch transcription: up to 1,000 files per request. Not publicly disclosed for the Azure OpenAI endpoint |
| Deployment | Managed Azure OpenAI endpoint; Azure Speech batch transcription API; openai-whisper-large-v3 on managed compute in the Azure model catalog |
| Pricing | Consumption-based, billed by audio hours in one-second increments, region-sensitive. Microsoft moderators cited $0.36 per audio hour in the US region in late 2023 |
Not disclosedLicense
Full technical breakdown9 sections
Overview
Whisper on Azure is not a single product. It is a set of Azure delivery paths around OpenAI's Whisper speech model. The two primary Azure surfaces are a managed Azure OpenAI model endpoint named whisper for file-based speech-to-text and speech-to-English translation, and Azure Speech batch transcription support for Whisper, which adds enterprise transcription features such as larger files, asynchronous batch operation, speaker diarization, and word-level timestamps. Microsoft's newer audio roadmap layers successors around that base offer, notably GPT-4o transcription models and GPT Realtime Whisper for low-latency streaming transcription.
The core Azure OpenAI Whisper endpoint exposes one managed Azure model ID, whisper, through /audio/transcriptions and /audio/translations, with a documented maximum request size of 25 MB. Microsoft does not publicly expose model-size choices such as tiny, base, small, medium, or large on that managed endpoint. This differs from the open-source Whisper ecosystem, where OpenAI's repository exposes six model sizes, and from Azure's broader model catalog, which separately lists openai-whisper-large-v3 as a distinct managed-compute catalog model.
The underlying model was created by OpenAI. It was trained on 680,000 hours of multilingual and multitask supervised audio data from the web. Microsoft's role has been productization, hosting, governance, and service integration.
Capabilities and features
Azure exposes Whisper in three ways:
| Surface | What Azure exposes | Model-size choice exposed to customer | Technical specialization | Key documented limits and caveats | Sources |
|---|---|---|---|---|---|
| Azure OpenAI Whisper | Model ID whisper, version 001, via /audio/transcriptions and /audio/translations | No public size choice on the managed endpoint | File-based STT and speech translation into English; managed Azure OpenAI experience | 25 MB max audio request; prerecorded/file-based rather than streaming; Microsoft publicly documents one managed SKU rather than tiny/base/small/medium/large choices | |
| Azure Speech with Whisper | Whisper selectable through Azure Speech batch transcription | No public size choice in the batch API docs | Large-volume prerecorded transcription with async processing, diarization, and timestamps | Up to 1 GB per file, up to 1,000 files per request; intended for batch, not live streaming | |
| Azure model catalog openai-whisper-large-v3 | Separate catalog model on managed compute | Yes, implicitly, because the catalog entry is specifically large-v3 | For teams that want a specific open Whisper checkpoint on Azure infrastructure rather than the Azure OpenAI managed API SKU | Distinct from Azure OpenAI whisper; listed as 1.55B parameters and generally available in catalog |
The Azure OpenAI REST reference defines two core operations: /audio/transcriptions, which transcribes into the input language, and /audio/translations, which transcribes and translates into English text. The REST API documents a language hint to improve accuracy and latency. The quickstart and samples show REST/cURL, Python, JavaScript/TypeScript via the OpenAI client library with AzureOpenAI, and PowerShell examples. The quickstart lists common input formats including mp3, mp4, mpeg, mpga, m4a, wav, and webm.
Azure Speech's Whisper support is built around batch transcription. The preview and GA announcements state that Azure Speech adds asynchronous processing, speaker diarization, word-level timestamps, and larger file and batch capacity: files up to 1 GB and up to 1,000 files in a single batch request.
Microsoft's broader speech stack also includes fast transcription, which "returns results synchronously and faster than real-time," and LLM Speech, which supports multilingual transcription and translation, diarization, and prompt-tuning capabilities at up to 500 MB and under 5 hours per file.
Documented input-size limits across the Azure transcription paths: 25 MB for Azure OpenAI whisper, under 500 MB for Azure Speech LLM Speech, and up to 1 GB for Azure Speech Whisper batch transcription.
Microsoft's recommended use cases for Whisper specifically are small prerecorded audio files, translation of non-English speech into English, call-center post-call analysis, accessibility captions, and mining large audio and video corpora for searchable text and downstream insight extraction.
Language support
Microsoft's launch and GA materials describe Whisper on Azure as supporting transcription in 57 languages and translation from those languages into English. OpenAI described Whisper as handling multilingual transcription and translation into English, and positioned it as "robust" to accents, background noise, and technical language.
Performance and benchmarks
Microsoft does not publish a public Azure-hosted WER chart for the managed whisper endpoint.
Vendor-reported: at GA, Microsoft said "thousands of customers" had already used the Whisper API in Azure across healthcare, education, finance, manufacturing, media, and agriculture. Microsoft's featured customer at GA was Lightbulb.AI, which used Whisper in Azure OpenAI Service for call-center workflows and stated its product was "500X more scalable, 90X faster, and 20X more cost-effective than manual call reviews." Those are customer-reported figures rather than third-party audited benchmarks.
Third-party evaluation: in the PennSound evaluation, researchers tested AWS, Azure, Google, IBM, Rev.ai, Whisper, and Whisper.cpp on nearly 10 hours of varied archival audio. Rev.ai was the top performer overall and Whisper was the top open-source performer "as long as hallucinations were avoided," with relatively slim WER and diarization gaps across systems.
Third-party evaluation: a 2024 child-speech study compared Microsoft Azure Speech, Google Cloud Speech-to-Text, and multiple Whisper sizes. On that dataset, the best Whisper model achieved a WER of 21.3%, versus 30.3% for Azure and 49.0% for Google. The same study found that local Whisper variants could beat cloud systems on responsiveness in some setups, a result that partly reflects hardware choice and network overhead rather than model quality alone.
Third-party evaluation: a large comparative study on ASR solution accuracy found that Microsoft and Whisper tended to report lower confidence than measured accuracy, whereas some rivals appeared overconfident. The same study observed that higher accuracy did not necessarily correlate with shorter processing time.
Latency and throughput
Microsoft does not publish a Whisper-specific latency SLA or p50/p95 latency table. Microsoft's product positioning is: Azure OpenAI Whisper is recommended for smaller files and time-sensitive work; Azure Speech fast transcription "returns results synchronously and faster than real-time" for prerecorded audio; GPT Realtime Whisper is intended for low-latency streaming captions and monitoring.
Documented capacity figures: 25 MB maximum request size for the Azure OpenAI whisper endpoint; up to 1 GB per file and up to 1,000 files per batch request for Azure Speech Whisper batch transcription; up to 500 MB and under 5 hours per file for LLM Speech.
Deployment and integrations
Microsoft's architecture guidance separates the two main services by role. Azure OpenAI audio models are for scenarios that combine speech with language reasoning or flexible prompt-based control, or where the team is already in the Azure OpenAI stack. Azure Speech is for high-volume real-time or batch transcription, diarization, custom speech models, custom vocabulary, and deployments that may require container or sovereign-cloud options.
Azure Speech broadly supports containers and on-premises deployment options, but the retrieved first-party documentation does not package the Whisper batch route itself as an on-premises or containerized Whisper product.
Data handling for Azure OpenAI: Microsoft states that customer data, prompts, and completions are not available to other customers, not available to OpenAI or other model providers, not used to improve Microsoft or third-party products or services without explicit permission, and not used to train foundation models without customer permission or instruction. Prompts and responses are processed within the customer-specified geography unless the customer uses Global or DataZone deployment types. Stored data is encrypted at rest with AES-256 by default, with customer-managed keys available for supported scenarios.
Microsoft's Transparency Note states that Azure OpenAI wraps OpenAI models inside Microsoft-managed guardrails and abuse-detection systems. Microsoft announced a confidential inferencing preview for Azure OpenAI Whisper in September 2024, targeting end-to-end privacy in regulated industries.
Compliance: Azure inherits the broader Microsoft Azure compliance program rather than Whisper having a separately branded compliance regime. Microsoft says Azure has more than 100 compliance offerings and points to third-party reports and certifications such as ISO 27001, SOC 2, FedRAMP, and HITRUST across the Azure platform. Microsoft states that Azure supports customers subject to HIPAA and that Azure and Azure Government align with the NIST Cybersecurity Framework and are certified under ISO/IEC 27001. Teams with regulated deployments need to validate service-specific scope in the Service Trust Portal and their contractual setup.
Comparison across Azure offerings and external services, as stated in the source:
| Offering | Core lane | Languages stated publicly | Streaming / real-time | Diarization | Custom adaptation or tuning | On-prem / container path | Best fit | Sources |
|---|---|---|---|---|---|---|---|---|
| Azure OpenAI Whisper | Managed file-based transcription and speech-to-English translation | 57 at launch/GA messaging | Not for the core whisper lane; Microsoft now points low-latency streaming to GPT Realtime Whisper instead | Not documented for core whisper endpoint | No public customer-visible tuning flow documented for the core managed Azure OpenAI whisper SKU | No first-party Whisper container documented in retrieved sources | Small prerecorded multilingual files inside Azure OpenAI workflows | |
| Azure Speech with Whisper | Batch transcription using Whisper | Whisper multilingual plus translation to English | Batch only | Yes | For domain tuning, Microsoft generally steers customers to Azure Speech custom speech features rather than the basic Whisper batch lane | Azure Speech as a service supports containers more broadly, but Whisper-specific container packaging is not documented in the retrieved sources | Large prerecorded files, batch jobs, call recordings, post-call analytics | |
| Azure Speech native STT / fast / LLM Speech | Speech-specialized platform | Broad locale matrix | Yes | Yes | Yes | Yes | Deterministic enterprise STT, custom vocabulary/acoustics, regulated deployment constraints | |
| Google Cloud Speech-to-Text / Chirp | Managed cloud STT | 85+ languages and variants | Yes | Yes | Yes, model adaptation | Not documented in retrieved sources | Multilingual cloud STT with strong adaptation support | |
| Amazon Transcribe | Managed cloud STT | Broad supported-language matrix | Yes | Yes | Yes, custom vocabulary and custom language models | Not documented in retrieved sources | AWS-native batch and streaming transcription, especially call analytics workflows |
Pricing
Pricing is consumption-based. The official Azure Speech pricing page says speech-to-text hours are measured by hours of audio sent and billed in one-second increments, with separate SKUs for standard transcription, fast transcription, batch transcription, and LLM Speech.
Microsoft moderators publicly cited Azure OpenAI Whisper pricing at $0.36 per audio hour in the US region in late 2023. The primary Azure pricing pages are dynamically rendered and the retrieved HTML does not expose a stable current Whisper line item; the source states this figure should not be treated as a guaranteed current price and that pricing is region-sensitive.
Development and ownership
The underlying model was created by OpenAI, not Microsoft. OpenAI's Whisper paper is authored by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. The model was trained on 680,000 hours of multilingual and multitask supervised audio data from the web.
Microsoft's role was productization, hosting, governance, and service integration. Public Azure messaging places Whisper under the Azure AI / Azure OpenAI / Azure Speech umbrella. The public preview announcement was published by Microsoft's Heiko Rausch in the Azure AI services blog, and the GA announcement was published by Marco Casalaina, Vice President of Products for Azure AI. Microsoft does not publish a named Azure Whisper engineering team roster in the retrieved sources.
The Azure OpenAI managed whisper endpoint is not the same as the open-source openai/whisper repository, which exposes model sizes tiny, base, small, medium, large, and turbo, with explicit parameter counts and hardware tradeoffs. Azure also separately catalogs openai-whisper-large-v3 on managed compute.
Microsoft's stated motivation was enterprise-facing: its GA announcement says enterprises struggle to analyze voice interactions across many languages while preserving security and privacy guardrails, and it frames Whisper on Azure as giving customers a choice between Azure OpenAI and Azure Speech for call-center analytics, captions and accessibility, and mining audio and video for insights.
The source states there is no meaningful public evidence that Meta played a role in Azure's Whisper offer; Meta models such as MMS belong to a separate multilingual speech line.
Release history
| Date | Milestone | What changed | Why Microsoft updated it | Sources |
|---|---|---|---|---|
| September 21, 2022 | OpenAI releases Whisper | Whisper is launched as an ASR system trained on 680,000 hours of multilingual/multitask audio data | Established the upstream model that Azure later commercialized | |
| September 15, 2023 | Azure public preview | Microsoft announces Whisper preview in both Azure OpenAI Service and Azure AI Speech | To bring OpenAI Whisper into Azure with enterprise controls and two usage paths: simple API and batch speech pipeline | |
| September 2023 | Azure Speech REST API v3.2 preview | Azure Speech v3.2 preview adds the API path that supports Whisper for batch transcription | To enable Whisper inside the existing Speech batch ecosystem | |
| March 13, 2024 | General availability | Whisper becomes GA in Azure OpenAI and Azure Speech | To move Whisper into production workloads under Azure's enterprise-readiness positioning | |
| June 2024 | Azure Speech REST API v3.2 GA | Speech batch API version supporting Whisper goes GA | To stabilize the Azure Speech integration path for production batch transcription | |
| September 24, 2024 | Confidential Inferencing preview | Microsoft announces confidential inferencing preview for Azure OpenAI Whisper | To support verifiable end-to-end privacy for regulated workloads | |
| April 2025 | GPT-4o transcription models released on Azure | gpt-4o-transcribe and gpt-4o-mini-transcribe arrive in Azure OpenAI | To improve accuracy, flexibility, and voice-first application support beyond classic Whisper | |
| May 2026 | GPT Realtime Whisper documentation appears | Azure publishes GPT Realtime Whisper concept docs | To cover low-latency streaming transcription for captions, monitoring, and archival workflows, a gap in the original file-based Whisper API |
The most recent Whisper-family update in the source is the May 2026 publication of GPT Realtime Whisper documentation, described by Microsoft as covering low-latency, stream-based transcription for live captions and monitoring.
Sources
- Foundry Models sold by Azure - Microsoft Foundry | Microsoft Learn. https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure
- Announcing the Preview of OpenAI Whisper in Azure OpenAI service and Azure AI Speech | Microsoft Community Hub. https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-the-preview-of-openai-whisper-in-azure-openai-service/ba-p/3928388
- Introducing Whisper. https://openai.com/index/whisper/
- Accelerate your productivity with the Whisper model in Azure AI now generally available | Microsoft Azure Blog. https://azure.microsoft.com/en-us/blog/accelerate-your-productivity-with-the-whisper-model-in-azure-ai-now-generally-available/
- AI Model Catalog | Microsoft Foundry Models. https://ai.azure.com/catalog/models/whisper
- AI Model Catalog | Microsoft Foundry Models. https://ai.azure.com/catalog/models/openai-whisper-large-v3
- Choose an Azure Speech Recognition and Generation Technology - Azure Architecture Center | Microsoft Learn. https://learn.microsoft.com/en-us/azure/architecture/data-guide/ai-services/speech-recognition-generation
- Pricing - Azure Speech in Foundry Tools | Microsoft Azure. https://azure.microsoft.com/ja-jp/pricing/details/speech/
- What's new - Speech service - Foundry Tools | Microsoft Learn. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/releasenotes
- Microsoft Trustworthy AI: Unlocking human potential starts with trust. https://blogs.microsoft.com/blog/2024/09/24/microsoft-trustworthy-ai-unlocking-human-potential-starts-with-trust/
- What's new in Azure OpenAI in Microsoft Foundry Models? (classic) - Microsoft Foundry (classic) portal | Microsoft Learn. https://learn.microsoft.com/en-us/azure/foundry-classic/openai/whats-new
- Robust Speech Recognition via Large-Scale Weak Supervision. https://arxiv.org/abs/2212.04356
- GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision. https://github.com/openai/whisper
- blog/open-asr-leaderboard.md at main - huggingface/blog - GitHub. https://github.com/huggingface/blog/blob/main/open-asr-leaderboard.md
- Evaluating Speech-to-Text Systems with PennSound. https://arxiv.org/html/2504.05702v1
- Child Speech Recognition in Human-Robot Interaction: Problem Solved? https://arxiv.org/html/2404.17394v2
- Measuring the Accuracy of Automatic Speech Recognition Solutions. https://arxiv.org/html/2408.16287v1
- Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said. https://apnews.com/article/90020cdf5fa16c79ca2e5b6c4c9bbb14
- Data, privacy, and security for Foundry Models sold by Azure in Microsoft Foundry - Microsoft Foundry | Microsoft Learn. https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/openai/data-privacy
- Transparency Note for Azure OpenAI in Microsoft Foundry Models - Microsoft Foundry | Microsoft Learn. https://learn.microsoft.com/en-us/azure/foundry/responsible-ai/openai/transparency-note
- Compliance in the trusted cloud | Microsoft Azure. https://azure.microsoft.com/en-us/explore/trusted-cloud/compliance
- Speech-to-Text: AI voice typing and transcription. https://cloud.google.com/speech-to-text
- What is Amazon Transcribe? - Amazon. https://docs.aws.amazon.com/transcribe/latest/dg/what-is.html
- Speech to text with Whisper - Microsoft Foundry | Microsoft Learn. https://learn.microsoft.com/en-us/azure/foundry/openai/whisper-quickstart
- What is the price for whisper model and what is the quota for normal customer and commitment customer - Microsoft Q&A. https://learn.microsoft.com/en-us/answers/questions/1466656/what-is-the-price-for-whisper-model-and-what-is-th