OpenTranscription/ Blog
2026-07-03 · MODEL PROFILE

Microsoft Azure Whisper: model profile

Reference spec sheet for OpenAI's Whisper model as offered on Microsoft Azure: delivery paths, limits, languages, pricing, benchmarks, and release history.


Microsoft Azure Whisper is a set of Azure delivery paths for OpenAI's Whisper speech-to-text model: a managed Azure OpenAI endpoint for file-based transcription and speech-to-English translation, and Whisper support in Azure Speech batch transcription.

Specifications

Developer: Model created by OpenAI; Azure offering productized and hosted by Microsoft
Released: Whisper released by OpenAI September 21, 2022; Azure public preview September 15, 2023; general availability in Azure OpenAI and Azure Speech March 13, 2024
Model type: General-purpose speech recognition model for transcription and speech-to-English translation
Parameters: Not publicly disclosed for the managed Azure OpenAI whisper SKU. The separate catalog model openai-whisper-large-v3 is listed as 1.55B parameters
Training data: 680,000 hours of multilingual and multitask supervised audio data from the web
Languages: Transcription in 57 languages; translation from those languages into English
Modes (batch / streaming): File-based and batch. The core whisper lane is not for streaming; Microsoft directs low-latency streaming to GPT Realtime Whisper or Azure Speech real-time and fast transcription
Latency: Not publicly disclosed. Microsoft publishes no Whisper-specific latency SLA or p50/p95 latency table
Throughput / concurrency: Azure Speech Whisper batch transcription: up to 1,000 files per request. Not publicly disclosed for the Azure OpenAI endpoint
Deployment: Managed Azure OpenAI endpoint; Azure Speech batch transcription API; openai-whisper-large-v3 on managed compute in the Azure model catalog
Pricing: Consumption-based, billed by audio hours in one-second increments, region-sensitive. Microsoft moderators cited $0.36 per audio hour in the US region in late 2023
License: Not disclosed


Overview

Whisper on Azure is not a single product. It is a set of Azure delivery paths around OpenAI's Whisper speech model. The two primary Azure surfaces are a managed Azure OpenAI model endpoint named whisper for file-based speech-to-text and speech-to-English translation, and Azure Speech batch transcription support for Whisper, which adds enterprise transcription features such as larger files, asynchronous batch operation, speaker diarization, and word-level timestamps. Microsoft's newer audio roadmap layers successors around that base offer, notably GPT-4o transcription models and GPT Realtime Whisper for low-latency streaming transcription.

The core Azure OpenAI Whisper endpoint exposes one managed Azure model ID, whisper, through /audio/transcriptions and /audio/translations, with a documented maximum request size of 25 MB. Microsoft does not publicly expose model-size choices such as tiny, base, small, medium, or large on that managed endpoint. This differs from the open-source Whisper ecosystem, where OpenAI's repository exposes six model sizes, and from Azure's broader model catalog, which separately lists openai-whisper-large-v3 as a distinct managed-compute catalog model.

The underlying model was created by OpenAI. It was trained on 680,000 hours of multilingual and multitask supervised audio data from the web. Microsoft's role has been productization, hosting, governance, and service integration.

Capabilities and features

Azure exposes Whisper in three ways:

| Surface | What Azure exposes | Model-size choice exposed to customer | Technical specialization | Key documented limits and caveats |
|---|---|---|---|---|
| Azure OpenAI Whisper | Model ID whisper, version 001, via /audio/transcriptions and /audio/translations | No public size choice on the managed endpoint | File-based STT and speech translation into English; managed Azure OpenAI experience | 25 MB max audio request; prerecorded/file-based rather than streaming; Microsoft publicly documents one managed SKU rather than tiny/base/small/medium/large choices |
| Azure Speech with Whisper | Whisper selectable through Azure Speech batch transcription | No public size choice in the batch API docs | Large-volume prerecorded transcription with async processing, diarization, and timestamps | Up to 1 GB per file, up to 1,000 files per request; intended for batch, not live streaming |
| Azure model catalog openai-whisper-large-v3 | Separate catalog model on managed compute | Yes, implicitly, because the catalog entry is specifically large-v3 | For teams that want a specific open Whisper checkpoint on Azure infrastructure rather than the Azure OpenAI managed API SKU | Distinct from Azure OpenAI whisper; listed as 1.55B parameters and generally available in catalog |

The Azure OpenAI REST reference defines two core operations: /audio/transcriptions, which transcribes into the input language, and /audio/translations, which transcribes and translates into English text. The REST API documents a language hint to improve accuracy and latency. The quickstart and samples show REST/cURL, Python, JavaScript/TypeScript via the OpenAI client library with AzureOpenAI, and PowerShell examples. The quickstart lists common input formats including mp3, mp4, mpeg, mpga, m4a, wav, and webm.
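The two operations and the documented limits can be sketched as a small pre-flight helper that runs before any upload. This is a minimal illustration, not Microsoft's client: the host pattern, the default `api_version` value, the reading of "25 MB" as binary megabytes, and the helper names `build_audio_url` and `preflight` are all assumptions to check against the current REST reference. The deployment name is whatever the customer chose when deploying the model.

```python
from pathlib import Path
from urllib.parse import urlencode

# Values taken from the profile text; reading "25 MB" as 25 * 1024 * 1024 bytes is an assumption.
MAX_REQUEST_BYTES = 25 * 1024 * 1024
QUICKSTART_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
OPERATIONS = {"transcriptions", "translations"}


def build_audio_url(resource: str, deployment: str, operation: str,
                    api_version: str = "2024-06-01") -> str:
    """Build the request URL for one of the two audio operations.

    The host/path shape and the default api_version are assumptions,
    not values confirmed by this profile.
    """
    if operation not in OPERATIONS:
        raise ValueError(f"operation must be one of {sorted(OPERATIONS)}")
    query = urlencode({"api-version": api_version})
    return (f"https://{resource}.openai.azure.com/openai/deployments/"
            f"{deployment}/audio/{operation}?{query}")


def preflight(filename: str, size_bytes: int) -> list[str]:
    """Return warnings for a file before it is sent to the endpoint."""
    warnings = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in QUICKSTART_FORMATS:
        warnings.append(f"format '{ext}' is not in the quickstart's list")
    if size_bytes > MAX_REQUEST_BYTES:
        warnings.append("file exceeds the 25 MB request limit")
    return warnings


# Hypothetical resource and deployment names, for illustration only.
print(build_audio_url("my-resource", "my-whisper-deployment", "translations"))
print(preflight("call.wav", 30 * 1024 * 1024))
```

A format warning is advisory only: the quickstart describes its list as common input formats, so a file outside it is not necessarily rejected.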

Azure Speech's Whisper support is built around batch transcription. The preview and GA announcements state that Azure Speech adds asynchronous processing, speaker diarization, word-level timestamps, and larger file and batch capacity: files up to 1 GB and up to 1,000 files in a single batch request.

Microsoft's broader speech stack also includes fast transcription, which "returns results synchronously and faster than real-time," and LLM Speech, which supports multilingual transcription and translation, diarization, and prompt-tuning capabilities at up to 500 MB and under 5 hours per file.

Documented input-size limits across the Azure transcription paths: 25 MB for Azure OpenAI whisper, under 500 MB for Azure Speech LLM Speech, and up to 1 GB for Azure Speech Whisper batch transcription.
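Those three limits translate into a simple eligibility check. The sketch below assumes the limits are binary megabytes and gigabytes, treats the LLM Speech bound as strict ("under 500 MB"), and does not check the separate under-5-hours duration limit. The function name `eligible_paths` is hypothetical.

```python
MB = 1024 * 1024  # assumption: documented limits are read as binary units
GB = 1024 * MB


def eligible_paths(size_bytes: int) -> list[str]:
    """List the Azure transcription paths whose documented size limit admits the file."""
    paths = []
    if size_bytes <= 25 * MB:
        paths.append("Azure OpenAI whisper")
    if size_bytes < 500 * MB:  # "under 500 MB", so the bound is exclusive
        paths.append("Azure Speech LLM Speech")
    if size_bytes <= 1 * GB:
        paths.append("Azure Speech Whisper batch transcription")
    return paths


for size in (10 * MB, 200 * MB, 800 * MB, 2 * GB):
    print(size // MB, "MB ->", eligible_paths(size))
```

Size is only one criterion. Diarization, timestamps, and asynchronous operation also separate the paths, so an eligible path is not automatically the recommended one.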

Microsoft's recommended use cases for Whisper specifically are small prerecorded audio files, translation of non-English speech into English, call-center post-call analysis, accessibility captions, and mining large audio and video corpora for searchable text and downstream insight extraction.

Language support

Microsoft's launch and GA materials describe Whisper on Azure as supporting transcription in 57 languages and translation from those languages into English. OpenAI described Whisper as handling multilingual transcription and translation into English, and positioned it as "robust" to accents, background noise, and technical language.

Performance and benchmarks

Microsoft does not publish a public Azure-hosted WER chart for the managed whisper endpoint.

Vendor-reported: at GA, Microsoft said "thousands of customers" had already used the Whisper API in Azure across healthcare, education, finance, manufacturing, media, and agriculture. Microsoft's featured customer at GA was Lightbulb.AI, which used Whisper in Azure OpenAI Service for call-center workflows and stated its product was "500X more scalable, 90X faster, and 20X more cost-effective than manual call reviews." Those are customer-reported figures rather than third-party audited benchmarks.

Third-party evaluation: in the PennSound evaluation, researchers tested AWS, Azure, Google, IBM, Rev.ai, Whisper, and Whisper.cpp on nearly 10 hours of varied archival audio. Rev.ai was the top performer overall and Whisper was the top open-source performer "as long as hallucinations were avoided," with relatively slim WER and diarization gaps across systems.

Third-party evaluation: a 2024 child-speech study compared Microsoft Azure Speech, Google Cloud Speech-to-Text, and multiple Whisper sizes. On that dataset, the best Whisper model achieved a WER of 21.3%, versus 30.3% for Azure and 49.0% for Google. The same study found that local Whisper variants could beat cloud systems on responsiveness in some setups, a result that partly reflects hardware choice and network overhead rather than model quality alone.
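For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a generic implementation, not the cited study's scoring pipeline. Its text normalization (lowercasing and whitespace splitting) is an assumption, and real evaluations usually normalize punctuation and numerals as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words, which is one reason hallucinated output is penalized so heavily.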

Third-party evaluation: a large comparative study on ASR solution accuracy found that Microsoft and Whisper tended to report lower confidence than measured accuracy, whereas some rivals appeared overconfident. The same study observed that higher accuracy did not necessarily correlate with shorter processing time.

Latency and throughput

Microsoft does not publish a Whisper-specific latency SLA or p50/p95 latency table. Microsoft's product positioning is: Azure OpenAI Whisper is recommended for smaller files and time-sensitive work; Azure Speech fast transcription "returns results synchronously and faster than real-time" for prerecorded audio; GPT Realtime Whisper is intended for low-latency streaming captions and monitoring.

Documented capacity figures: 25 MB maximum request size for the Azure OpenAI whisper endpoint; up to 1 GB per file and up to 1,000 files per batch request for Azure Speech Whisper batch transcription; up to 500 MB and under 5 hours per file for LLM Speech.
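The 1,000-files-per-request figure means a large corpus has to be split before submission. A minimal sketch of that chunking step follows; the name `plan_batches` and the idea of passing audio URLs as plain strings are assumptions for illustration, and no call to the Azure Speech API is made.

```python
from typing import Iterable, Iterator

MAX_FILES_PER_REQUEST = 1000  # documented per-request cap for Whisper batch transcription


def plan_batches(audio_urls: Iterable[str],
                 max_files: int = MAX_FILES_PER_REQUEST) -> Iterator[list[str]]:
    """Yield consecutive groups of at most max_files URLs, preserving order."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    batch: list[str] = []
    for url in audio_urls:
        batch.append(url)
        if len(batch) == max_files:
            yield batch
            batch = []
    if batch:  # final partial group
        yield batch


# Hypothetical storage URLs, for illustration only.
urls = [f"https://example.invalid/audio/{i}.wav" for i in range(2500)]
print([len(b) for b in plan_batches(urls)])  # [1000, 1000, 500]
```

Each yielded group would become one batch transcription request. The per-file 1 GB limit still has to be checked separately.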

Deployment and integrations

Microsoft's architecture guidance separates the two main services by role. Azure OpenAI audio models are for scenarios that combine speech with language reasoning or flexible prompt-based control, or where the team is already in the Azure OpenAI stack. Azure Speech is for high-volume real-time or batch transcription, diarization, custom speech models, custom vocabulary, and deployments that may require container or sovereign-cloud options.

Azure Speech broadly supports containers and on-premises deployment options, but the retrieved first-party documentation does not describe an on-premises or containerized version of the Whisper batch route itself.

Data handling for Azure OpenAI: Microsoft states that customer data, prompts, and completions are not available to other customers, not available to OpenAI or other model providers, not used to improve Microsoft or third-party products or services without explicit permission, and not used to train foundation models without customer permission or instruction. Prompts and responses are processed within the customer-specified geography unless the customer uses Global or DataZone deployment types. Stored data is encrypted at rest with AES-256 by default, with customer-managed keys available for supported scenarios.

Microsoft's Transparency Note states that Azure OpenAI wraps OpenAI models inside Microsoft-managed guardrails and abuse-detection systems. Microsoft announced a confidential inferencing preview for Azure OpenAI Whisper in September 2024, targeting end-to-end privacy in regulated industries.

Compliance: Whisper on Azure inherits the broader Microsoft Azure compliance program rather than having a separately branded compliance regime. Microsoft says Azure has more than 100 compliance offerings and points to third-party reports and certifications such as ISO 27001, SOC 2, FedRAMP, and HITRUST across the Azure platform. Microsoft states that Azure supports customers subject to HIPAA and that Azure and Azure Government align with the NIST Cybersecurity Framework and are certified under ISO/IEC 27001. Teams with regulated deployments need to validate service-specific scope in the Service Trust Portal and their contractual setup.

Comparison across Azure offerings and external services, as stated in the source:

| Offering | Core lane | Languages stated publicly | Streaming / real-time | Diarization | Custom adaptation or tuning | On-prem / container path | Best fit |
|---|---|---|---|---|---|---|---|
| Azure OpenAI Whisper | Managed file-based transcription and speech-to-English translation | 57 at launch/GA messaging | Not for the core whisper lane; Microsoft now points low-latency streaming to GPT Realtime Whisper instead | Not documented for core whisper endpoint | No public customer-visible tuning flow documented for the core managed Azure OpenAI whisper SKU | No first-party Whisper container documented in retrieved sources | Small prerecorded multilingual files inside Azure OpenAI workflows |
| Azure Speech with Whisper | Batch transcription using Whisper | Whisper multilingual plus translation to English | Batch only | Yes | For domain tuning, Microsoft generally steers customers to Azure Speech custom speech features rather than the basic Whisper batch lane | Azure Speech as a service supports containers more broadly, but Whisper-specific container packaging is not documented in the retrieved sources | Large prerecorded files, batch jobs, call recordings, post-call analytics |
| Azure Speech native STT / fast / LLM Speech | Speech-specialized platform | Broad locale matrix | Yes | Yes | Yes | Yes | Deterministic enterprise STT, custom vocabulary/acoustics, regulated deployment constraints |
| Google Cloud Speech-to-Text / Chirp | Managed cloud STT | 85+ languages and variants | Yes | Yes | Yes, model adaptation | Not documented in retrieved sources | Multilingual cloud STT with strong adaptation support |
| Amazon Transcribe | Managed cloud STT | Broad supported-language matrix | Yes | Yes | Yes, custom vocabulary and custom language models | Not documented in retrieved sources | AWS-native batch and streaming transcription, especially call analytics workflows |

Pricing

Pricing is consumption-based. The official Azure Speech pricing page says speech-to-text hours are measured by hours of audio sent and billed in one-second increments, with separate SKUs for standard transcription, fast transcription, batch transcription, and LLM Speech.

Microsoft moderators publicly cited Azure OpenAI Whisper pricing at $0.36 per audio hour in the US region in late 2023. The primary Azure pricing pages are dynamically rendered, and the retrieved HTML does not expose a stable current Whisper line item, so this figure should not be treated as a guaranteed current price. Pricing is also region-sensitive.
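As a worked example of per-second metering, the sketch below estimates a bill from audio duration. It makes three assumptions that are not confirmed by current pricing pages: that the late-2023 figure of $0.36 per audio hour still applies, that one-second billing increments apply to the workload in question, and that partial seconds round up. The name `estimate_cost` is hypothetical.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

# Late-2023 figure cited by Microsoft moderators for the US region; not a guaranteed current price.
CITED_RATE_PER_HOUR = Decimal("0.36")


def estimate_cost(duration_seconds: float,
                  rate_per_hour: Decimal = CITED_RATE_PER_HOUR) -> Decimal:
    """Estimate cost in USD for audio billed in one-second increments."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    billable_seconds = math.ceil(duration_seconds)  # assumption: partial seconds round up
    cost = rate_per_hour * Decimal(billable_seconds) / Decimal(3600)
    return cost.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


print(estimate_cost(90 * 60))       # 90-minute recording: 0.5400
print(estimate_cost(1000 * 3600))   # 1,000 hours of audio: 360.0000
```

At the cited rate, one second of audio costs $0.0001, so the rounding behavior for partial seconds matters only for very large numbers of very short clips.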

Development and ownership

The underlying model was created by OpenAI, not Microsoft. OpenAI's Whisper paper is authored by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. The model was trained on 680,000 hours of multilingual and multitask supervised audio data from the web.

Microsoft's role was productization, hosting, governance, and service integration. Public Azure messaging places Whisper under the Azure AI / Azure OpenAI / Azure Speech umbrella. The public preview announcement was published by Microsoft's Heiko Rausch in the Azure AI services blog, and the GA announcement was published by Marco Casalaina, Vice President of Products for Azure AI. Microsoft does not publish a named Azure Whisper engineering team roster in the retrieved sources.

The Azure OpenAI managed whisper endpoint is not the same as the open-source openai/whisper repository, which exposes model sizes tiny, base, small, medium, large, and turbo, with explicit parameter counts and hardware tradeoffs. Azure also separately catalogs openai-whisper-large-v3 on managed compute.

Microsoft's stated motivation was enterprise-facing: its GA announcement says enterprises struggle to analyze voice interactions across many languages while preserving security and privacy guardrails, and it frames Whisper on Azure as giving customers a choice between Azure OpenAI and Azure Speech for call-center analytics, captions and accessibility, and mining audio and video for insights.

The source states there is no meaningful public evidence that Meta played a role in Azure's Whisper offer; Meta models such as MMS belong to a separate multilingual speech line.

Release history

| Date | Milestone | What changed | Why Microsoft updated it |
|---|---|---|---|
| September 21, 2022 | OpenAI releases Whisper | Whisper is launched as an ASR system trained on 680,000 hours of multilingual/multitask audio data | Established the upstream model that Azure later commercialized |
| September 15, 2023 | Azure public preview | Microsoft announces Whisper preview in both Azure OpenAI Service and Azure AI Speech | To bring OpenAI Whisper into Azure with enterprise controls and two usage paths: simple API and batch speech pipeline |
| September 2023 | Azure Speech REST API v3.2 preview | Azure Speech v3.2 preview adds the API path that supports Whisper for batch transcription | To enable Whisper inside the existing Speech batch ecosystem |
| March 13, 2024 | General availability | Whisper becomes GA in Azure OpenAI and Azure Speech | To move Whisper into production workloads under Azure's enterprise-readiness positioning |
| June 2024 | Azure Speech REST API v3.2 GA | Speech batch API version supporting Whisper goes GA | To stabilize the Azure Speech integration path for production batch transcription |
| September 24, 2024 | Confidential Inferencing preview | Microsoft announces confidential inferencing preview for Azure OpenAI Whisper | To support verifiable end-to-end privacy for regulated workloads |
| April 2025 | GPT-4o transcription models released on Azure | gpt-4o-transcribe and gpt-4o-mini-transcribe arrive in Azure OpenAI | To improve accuracy, flexibility, and voice-first application support beyond classic Whisper |
| May 2026 | GPT Realtime Whisper documentation appears | Azure publishes GPT Realtime Whisper concept docs | To cover low-latency streaming transcription for captions, monitoring, and archival workflows, a gap in the original file-based Whisper API |

The most recent Whisper-family update in the source is the May 2026 publication of GPT Realtime Whisper documentation, described by Microsoft as covering low-latency, stream-based transcription for live captions and monitoring.
