OpenTranscription/ Blog
2026-07-03 · ANALYSIS

Deepgram Base in 2026: what the legacy model still does well

Where Deepgram Base fits in 2026: API behavior, variants, latency, concurrency, missing benchmarks, and when to pick Nova-3 or Flux instead.

[Illustration: an older, simpler audio waveform path running parallel to newer, denser signal paths on a slate-teal background]

Deepgram Base is still in Deepgram's 2026 documentation and rate-limit tables, but the company has moved on. Current materials file Base under Legacy Models next to the older Nova and Enhanced lines, recommend Flux for real-time voice agents, and point to Nova-3 for the hardest transcription work: noisy audio, multilingual input, far-field microphones, crosstalk. So why write about Base at all? Because it is still operational in the public API, still the default for the generic model parameter, and still a sensible choice for a specific kind of team: one that needs legacy compatibility, Deepgram's mature streaming API shape, domain variants like base-phonecall and base-video, or a high-volume pipeline where good-enough accuracy plus solid timestamps beats chasing the flagship.

One caveat before anything else, because it shapes every recommendation in this post. Deepgram's current public docs do not publish a self-serve Base price or an official Base WER or CER figure. The public pricing page lists Flux, Nova-3, and Custom STT. Official benchmark materials give platform-wide latency guidance and older comparative WER numbers for Enhanced and Nova-2 against Whisper, but there is no equivalent current single-number benchmark for Base itself. Treat Base as a model family that requires your own evaluation on your own audio, especially for regulated workflows, domain terminology, diarization quality, or difficult acoustics.

If you want the short version for a build decision: for new real-time systems, Deepgram's own positioning favors Flux, which has model-native turn detection and is built for conversational agents. For batch, Base still works if you want Deepgram's API ergonomics and its supported language and domain variants, but benchmark it directly against Nova-3, which is where Deepgram's accuracy, multilingual, and customization investment now lives.

What Deepgram Base is

Deepgram describes Base as built on its end-to-end speech-to-text architecture, offering a "solid combination of accuracy and cost effectiveness in some cases." That phrasing is about as unenthusiastic as vendor copy gets, and the docs reinforce it: Base sits under Legacy Models, and Deepgram says Enhanced generally has higher accuracy and better uncommon-word handling than Base. Yet the API reference still defaults the generic model parameter to base-general. Legacy, but live.

Base is not one unified flagship. It is a classic family of task-oriented variants: base-general (the default), base-meeting, base-phonecall, base-voicemail, base-finance, base-conversationalai, and base-video. Deepgram frames these as optimizations for everyday audio, conference-room multi-speaker audio, low-bandwidth telephony, voicemail, finance calls, human-to-bot conversations such as IVR and kiosks, and audio pulled from video.

| Base variant | Official positioning | Supported languages and dialect tags |
|---|---|---|
| base or base-general | Everyday audio processing; default Base model | Chinese zh, zh-CN, zh-TW; Danish da; Dutch nl; English en, en-US; French fr, fr-CA; German de; Hindi hi, hi-Latn; Indonesian id; Italian it; Japanese ja; Korean ko; Norwegian no; Polish pl; Portuguese pt, pt-BR, pt-PT; Russian ru; Spanish es, es-419, es-LATAM; Swedish sv; Tamasheq taq; Turkish tr; Ukrainian uk |
| base-meeting | Conference-room audio with multiple speakers and one microphone | English en, en-US |
| base-phonecall | Low-bandwidth phone calls | English en, en-US |
| base-voicemail | Low-bandwidth single-speaker audio; derived from phonecall | English en, en-US |
| base-finance | Earnings-call style, multiple speakers, finance-heavy vocabulary | English en, en-US |
| base-conversationalai | Human speaking to an automated bot, IVR, assistant, kiosk | English en, en-US |
| base-video | Audio sourced from video | English en, en-US |

This table comes straight from the current model and language overview docs. One dialect detail worth flagging: Deepgram says its English models handle global English accents and dialects, but transcript output is normalized to standardized American spelling. If your downstream QA expects "colour" rather than "color," you will need a post-processing step.
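
A minimal sketch of that post-processing step, assuming a small hand-maintained lookup is enough for your QA rules. The US_TO_UK mapping and the to_british helper are illustrative names, not part of any Deepgram SDK; a production version would load a full spelling list and handle inflections.

```python
import re

# Hypothetical, deliberately tiny mapping; extend it from your own style guide.
US_TO_UK = {"color": "colour", "center": "centre", "organize": "organise"}

_WORD = re.compile(r"[A-Za-z]+")


def to_british(text: str) -> str:
    """Swap known American spellings for British ones, preserving simple casing."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = US_TO_UK.get(word.lower())
        if replacement is None:
            return word                      # unknown word: leave untouched
        if word.isupper():
            return replacement.upper()       # COLOR -> COLOUR
        if word[0].isupper():
            return replacement.capitalize()  # Color -> Colour
        return replacement

    return _WORD.sub(swap, text)


print(to_british("The Color of the center line."))
```

Running the swap on transcript text only, after timestamps have been extracted, keeps the per-word timing data aligned with what Deepgram returned.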

Model sizes and pricing tiers

Deepgram enumerates the Base family by variant, not by parameter count. The same model-options page publishes parameter counts for Whisper Cloud sizes but not for Base, so the safest reading is that Deepgram does not disclose Base parameter counts in the public docs reviewed here.

Pricing follows the same pattern of quiet omission. The current pricing page is platform-level: Pay As You Go comes with a $200 free credit, Growth starts at $4K+/year with pre-paid credits, and Enterprise is custom. The public STT rate card covers Flux and Nova-3, plus add-ons like redaction and speaker diarization. Base is absent. Deepgram's 2023 benchmark whitepaper did show Base price bands materially below one dollar per hour of audio, varying by annual volume, but those are historical figures, not the 2026 price card.

| Pricing or size question | What the official sources show |
|---|---|
| Current public self-serve Base price | Not listed on the 2026 public pricing page; the current STT price table lists Flux, Nova-3 Monolingual, Nova-3 Multilingual, and Custom |
| Current public Base parameter count | Not disclosed in the model docs reviewed |
| Current public plan tiers | Pay As You Go, Growth, and Enterprise exist at the platform level |
| Historical official Base pricing evidence | Deepgram's 2023 benchmark whitepaper included historical Base enterprise price bands for batch and streaming; these are historical, not current list prices |

[Illustration: two transcription routes, a thick batch pipeline and a thin real-time streaming thread, converging on a single endpoint]

How to use it

Base uses the same core interfaces as the rest of Deepgram's classic transcription stack: REST for pre-recorded audio and WebSocket for live streaming. The REST endpoint is POST /v1/listen; the streaming endpoint is wss://api.deepgram.com/v1/listen. Authentication accepts either Authorization: Token <API_KEY> or Authorization: Bearer <TOKEN>. Deepgram also documents temporary API tokens with a default TTL of 30 seconds, which is handy for browser and mobile handoff patterns where you do not want a long-lived key in client code.

For pre-recorded audio, you send either a JSON body with a remote URI or the binary audio or video directly. Responses are JSON with the standard transcript structures: transcript alternatives, an overall confidence score, and per-word timing and confidence data. Streaming returns a sequence of WebSocket messages including transcript updates plus SpeechStarted, UtteranceEnd, and metadata events.

Deepgram's official SDK surface covers at least JavaScript/TypeScript and Python, and the wider docs ecosystem shows examples for .NET, Go, Java, Python, and JavaScript in self-hosted and STT guides. The examples below use plain HTTP and WebSocket rather than SDK helpers because the protocol-level behavior is stable and maps directly to the endpoint docs.

| Use case | Endpoint | Input style | Output style | Notes |
|---|---|---|---|---|
| Batch transcription | POST https://api.deepgram.com/v1/listen | JSON with url, or direct file upload | JSON response or async callback response | Supports features like punctuation, diarization, redaction, topics, intents, and utterances |
| Live transcription | wss://api.deepgram.com/v1/listen | Continuous audio over WebSocket | Incremental JSON events | Supports interim results, endpointing, speech_final, UtteranceEnd, keepalive/finalize flow |

Python batch example

import os
import requests

API_KEY = os.environ["DEEPGRAM_API_KEY"]

endpoint = (
    "https://api.deepgram.com/v1/listen"
    "?model=base-general"
    "&language=en-US"
    "&punctuate=true"
    "&smart_format=true"
    "&utterances=true"
)

headers = {
    "Authorization": f"Token {API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "url": "https://dpgr.am/spacewalk.wav"
}

resp = requests.post(endpoint, headers=headers, json=payload, timeout=300)
resp.raise_for_status()
data = resp.json()

# channels and alternatives are both lists; take the first entry of each
alt = data["results"]["channels"][0]["alternatives"][0]
print("Transcript:", alt["transcript"])
print("Confidence:", alt["confidence"])
print("First words:", alt["words"][:5])

This matches the official REST contract: POST /v1/listen, API-key auth via Authorization: Token, a JSON body with a media URL, and standard transcript output including transcript-level confidence and per-word timing and confidence.

Node.js streaming example

import WebSocket from "ws";
import fs from "node:fs";

const apiKey = process.env.DEEPGRAM_API_KEY;

const ws = new WebSocket(
  "wss://api.deepgram.com/v1/listen" +
    "?model=base-phonecall" +
    "&language=en-US" +
    "&encoding=linear16" +
    "&sample_rate=16000" +
    "&interim_results=true" +
    "&endpointing=300",
  {
    headers: {
      Authorization: `Token ${apiKey}`,
    },
  }
);

ws.on("open", () => {
  const stream = fs.createReadStream("./audio.raw", { highWaterMark: 3200 }); // ~100 ms @ 16 kHz mono 16-bit
  stream.on("data", (chunk) => {
    if (chunk.length) ws.send(chunk);
  });
  stream.on("end", () => {
    ws.send(JSON.stringify({ type: "Finalize" }));
  });
});

ws.on("message", (msg) => {
  const event = JSON.parse(msg.toString());

  if (event.type === "Results") {
    const alt = event.channel?.alternatives?.[0];
    if (alt?.transcript) {
      console.log({
        transcript: alt.transcript,
        is_final: event.is_final,
        speech_final: event.speech_final,
      });
    }
  } else {
    console.log(event);
  }
});

ws.on("close", () => console.log("stream closed"));
ws.on("error", (err) => console.error(err));

This reflects the documented streaming contract: open a WebSocket to /v1/listen, send audio chunks, set encoding and sample_rate when the stream is raw and non-containerized, optionally request interim_results and endpointing, and send Finalize before closing so Deepgram flushes remaining buffered audio.
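
On the consuming side, a common pattern is to skip interim results for storage, keep segments marked is_final, and emit a full utterance when speech_final arrives. The sketch below assumes Results events shaped like the ones the Node.js example logs (type, is_final, speech_final, channel.alternatives); UtteranceAssembler is a hypothetical helper, not an SDK class.

```python
class UtteranceAssembler:
    """Collect finalized segments and release them as one utterance."""

    def __init__(self):
        self._parts = []  # finalized transcript segments of the current utterance

    def feed(self, event: dict):
        """Return a completed utterance string, or None if still in progress."""
        if event.get("type") != "Results":
            return None  # metadata, SpeechStarted, UtteranceEnd, etc.

        alternatives = event.get("channel", {}).get("alternatives", [])
        text = alternatives[0].get("transcript", "") if alternatives else ""

        # Interim results (is_final false) may still change, so they are skipped.
        if event.get("is_final") and text:
            self._parts.append(text)

        # speech_final marks an endpoint: flush everything collected so far.
        if event.get("speech_final") and self._parts:
            utterance = " ".join(self._parts)
            self._parts = []
            return utterance
        return None
```

A UI can still render interim text for responsiveness; the point of the split is that only finalized text is written to durable storage.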

Real-time and batch behavior

Deepgram splits transcription into pre-recorded and live streaming modes. Pre-recorded works when you can tolerate end-of-job latency or want callback-driven async orchestration. Live mode is for WebSocket-based real-time streaming: captions, telephony, agent-assist. Two operational numbers matter for batch: the quickstart docs note a 2 GB maximum file size, and requests whose processing exceeds 10 minutes on Nova, Base, or Enhanced can return a 504 Gateway Timeout. That 504 behavior is a strong argument for callbacks or an async pattern on longer pipelines instead of waiting synchronously on one HTTP request.
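
One way to avoid holding a synchronous request open is to pass a callback parameter so the result is delivered to your own webhook. A hedged sketch of building that request URL follows; build_batch_url and the example.com webhook are placeholders, and the callback behavior should be confirmed against the pre-recorded API reference before you rely on it.

```python
from urllib.parse import urlencode

LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_batch_url(model, callback_url=None, **options):
    """Build a pre-recorded request URL, optionally with an async callback."""
    params = {"model": model}
    for key, value in options.items():
        # Query strings need lowercase "true"/"false", not Python's True/False.
        params[key] = str(value).lower() if isinstance(value, bool) else value
    if callback_url:
        params["callback"] = callback_url  # assumed webhook you control
    return LISTEN_URL + "?" + urlencode(params)


print(build_batch_url(
    "base-general",
    callback_url="https://example.com/dg-hook",  # hypothetical endpoint
    punctuate=True,
))
```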

For streaming, Deepgram's guidance is specific: streaming transcription latency is optimized to 300 ms or less, with a typical breakdown of 150-300 ms transcription latency and 200-500 ms total transcript latency end to end depending on network, buffering, and client-side processing. Deepgram recommends sending audio in 20-100 ms chunks. Larger buffers add built-in delay; tiny chunks add overhead.
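
The chunk-size arithmetic is simple enough to check by hand. This sketch assumes uncompressed, fixed-width samples (for example linear16 at 2 bytes per sample, or mu-law at 1 byte); chunk_bytes is an illustrative helper.

```python
def chunk_bytes(sample_rate_hz, chunk_ms, bytes_per_sample=2, channels=1):
    """Bytes of raw audio that cover chunk_ms milliseconds."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


print(chunk_bytes(16000, 100))                     # linear16, 16 kHz mono
print(chunk_bytes(16000, 20))                      # lower end of the range
print(chunk_bytes(8000, 20, bytes_per_sample=1))   # 8 kHz mu-law telephony
```

At 16 kHz mono 16-bit, 100 ms works out to 3,200 bytes, which is the highWaterMark the Node.js example uses; 20 ms is 640 bytes, and 20 ms of 8 kHz mu-law is 160 bytes.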

Latency, throughput, and benchmark evidence

The benchmark picture for Base is lopsided. There is solid official evidence for platform-level latency and throughput, and much weaker current public disclosure of Base-specific WER or CER. The honest reading: Deepgram's stack is operationally fast, and Base's present-day accuracy is something you have to measure yourself.

| Source | Metric | Result | Relevance |
|---|---|---|---|
| Deepgram streaming latency guide | Typical transcription latency | 150-300 ms transcription latency; 200-500 ms total transcript latency | Applies to Deepgram streaming workloads generally, including the Base-family streaming endpoint |
| Deepgram API rate limits | Base concurrency | Pay As You Go: 50 concurrent pre-recorded / 150 concurrent streaming; Growth: 50 / 225 in North America; Enterprise starts at 200 / 300 | Strong evidence that Base remains a supported production model family in 2026 |
| Deepgram batch autoscaling guidance | Throughput | Deepgram states its self-hosted batch engine can transcribe 1 hour of audio in under 30 seconds | Platform-level throughput guidance, not a Base-only number |
| Deepgram 2023 Whisper benchmark | WER on 254 real-world phone-call/meeting files | Deepgram Enhanced 10.6% WER, Nova-2 8.4% WER; Whisper sizes ranged 13.1-15.3% WER in that study | Official benchmark, but notably not Base-specific |
| WhisperX paper | Speedup on long-form transcription | WhisperX reports a 12x transcription speedup using VAD segmentation and batched inference | Useful competitor throughput reference for Whisper-family pipelines |
| 2026 independent named-entity audit of speech providers | High-stakes proper-noun difficulty | Study included base-general and base-phonecall; across 15 ASR models the average transcription error rate on street/business names was 44% | Not a vanilla WER benchmark, but a strong external warning about proper nouns and address-like entities in production audio |

That missing official benchmark for Base is probably not an accident of documentation. Deepgram publishes detailed accuracy claims for Nova-2 and Nova-3 but nothing fresh and Base-specific. When a vendor stops benchmarking a model in public, that is a reasonable hint about where its accuracy investment went.
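
The concurrency figures in the table translate directly into a client-side cap. A minimal asyncio sketch, assuming each job is a coroutine function that performs one pre-recorded request; run_bounded, demo, and fake_transcribe are hypothetical names, and the limit you pass should come from your own plan's rate-limit row.

```python
import asyncio


async def run_bounded(jobs, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:      # waits here once `limit` jobs are running
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo(n_jobs=20, limit=5):
    """Exercise run_bounded with a fake job and record peak concurrency."""
    in_flight = 0
    peak = 0

    async def fake_transcribe():   # stand-in for a real HTTP request
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return "ok"

    results = await run_bounded([fake_transcribe] * n_jobs, limit)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(len(results), "jobs finished; peak concurrency:", peak)
```

Capping on the client keeps a burst of uploads from tripping rate-limit errors, at the cost of queueing work locally.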

[Illustration: a proper-noun failure, shown as a clean waveform passing through a lattice where several nodes fracture into scattered fragments]

Features, customization, and where Base breaks

At the transcript layer, Base covers the production features most teams expect: punctuation and capitalization, smart formatting, timestamps, confidence scores, profanity handling, speaker diarization, and word-level speaker assignment. The pre-recorded API docs show transcript alternatives with a transcript-level confidence number plus per-word {word, start, end, confidence} objects, and example responses include speaker, speaker_confidence, and punctuated_word fields when diarization and formatting are enabled.
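
Those per-word fields are enough to rebuild speaker turns without any SDK help. The sketch below assumes diarization and formatting were enabled, so each word object carries speaker and punctuated_word; speaker_turns is an illustrative helper and the sample words are made up.

```python
def speaker_turns(words):
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for w in words:
        token = w.get("punctuated_word", w["word"])  # fall back to the raw word
        speaker = w.get("speaker")
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + token
            turns[-1]["end"] = w["end"]              # extend the current turn
        else:
            turns.append({
                "speaker": speaker,
                "start": w["start"],
                "end": w["end"],
                "text": token,
            })
    return turns


sample = [  # fabricated example data, not real API output
    {"word": "hello", "punctuated_word": "Hello,", "start": 0.0, "end": 0.4, "speaker": 0},
    {"word": "there", "punctuated_word": "there.", "start": 0.5, "end": 0.9, "speaker": 0},
    {"word": "hi", "punctuated_word": "Hi.", "start": 1.2, "end": 1.5, "speaker": 1},
]
for turn in speaker_turns(sample):
    print(turn)
```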

| Feature | How it works in Deepgram docs | Practical note |
|---|---|---|
| Punctuation and casing | punctuate=true adds punctuation and capitalization | Lowest-friction readability improvement for Base pipelines |
| Smart formatting | smart_format=true adds richer formatting for readability, including dates/currency style transformations | Deepgram's pricing page treats smart formatting as included on current STT pricing tiers |
| Word timestamps | words array returns start and end per word | Suitable for subtitles, searchable transcripts, and timeline alignment |
| Confidence scores | Transcript-level and word-level confidence values are returned on a 0-1 scale | Useful for QA gating but should not be treated as calibrated truth probability without your own validation |
| Speaker diarization | diarize_model enables speaker change detection and labels words by speaker number; batch supports latest/v1/v2, streaming supports latest/v1 | Streaming diarization uses older v1 while batch can use v2; this matters if diarization quality is central to your use case |
| Channel separation | multichannel=true transcribes each channel independently | Prefer multichannel over diarization when your audio is already channel-separated, such as stereo telephony |
| Profanity filtering | profanity_filter=true converts recognized profanity to the nearest non-profane word or removes it | Useful for safe-display workflows but can damage literal or legal fidelity |
| Utterance segmentation | utterances=true segments speech into semantic units | Helpful for subtitles, agent-assist panes, and UI chunking |
| Redaction | redact= can redact PII/PHI/PCI classes; Deepgram documents 50+ entity types | Better fit than profanity_filter when the concern is regulated data rather than toning down language |
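
For the QA-gating idea in the confidence row, a rough sketch: flag words under a threshold and route the transcript to review if too many are flagged. The 0.6 threshold and 10% cutoff are arbitrary placeholders to be tuned on your own audio, not Deepgram recommendations, and both function names are illustrative.

```python
def low_confidence_words(words, threshold=0.6):
    """Return word objects whose confidence falls below the threshold."""
    return [w for w in words if w["confidence"] < threshold]


def needs_review(words, threshold=0.6, max_fraction=0.1):
    """True if the share of low-confidence words exceeds max_fraction."""
    if not words:
        return True  # an empty transcript is itself worth a human look
    flagged = low_confidence_words(words, threshold)
    return len(flagged) / len(words) > max_fraction


words = [  # fabricated example data
    {"word": "transfer", "confidence": 0.97},
    {"word": "to", "confidence": 0.99},
    {"word": "kowalczyk", "confidence": 0.41},
    {"word": "street", "confidence": 0.93},
]
print(low_confidence_words(words))
print(needs_review(words))
```

Because the scores are not guaranteed to be calibrated, it is worth checking on a labeled sample that flagged words really are wrong more often than unflagged ones before trusting the gate.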

Customization and fine-tuning

Customization is where Base shows its age most plainly. Base supports classic keywords boosting and suppression, but keyterm prompting is Nova-3-only, and the "instant self-serve customization without model retraining" messaging in current Deepgram materials applies to Nova-3, not Base. Deepgram does support account-linked custom trained models via custom_id, but the model-options page restricts those to Enterprise customers.

The practical hierarchy: if you stay on Base, your light-touch tuning tools are keywords and careful variant choice (base-phonecall, base-finance, and so on). If you need stronger domain adaptation or faster self-serve vocabulary steering, Deepgram's product direction pushes you toward Nova-3 or a formal custom Enterprise model rather than deeper Base-specific training.
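
As a sketch of the light-touch route: the classic keywords parameter is repeated once per term, with an optional intensifier after a colon (positive to boost, negative to suppress). The helper name and the sample terms are assumptions, and the accepted intensifier ranges should be checked in the keywords docs before shipping.

```python
from urllib.parse import urlencode


def keywords_query(model, boosts):
    """Build a query string with one keywords=term:intensifier pair per term."""
    params = [("model", model)]
    for term, intensifier in boosts.items():
        params.append(("keywords", f"{term}:{intensifier}"))
    # safe=":" keeps the term:intensifier separator readable in the URL
    return urlencode(params, safe=":")


print(keywords_query("base-finance", {"EBITDA": 2, "um": -10}))
```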

Noise robustness and common failure modes

Deepgram does not publish a current Base noise-robustness benchmark. What it does publish is telling in its own way: the pricing and product pages recommend Nova-3 for background noise, crosstalk, and far-field input, while Base gets much more conservative language. Base can function in noisy real-world audio, but it is no longer Deepgram's answer for the hardest acoustics.

| Common failure mode | Why it happens | Mitigation |
|---|---|---|
| Lower accuracy than newer Deepgram models | Deepgram explicitly says Enhanced is generally more accurate than Base, and current flagship claims center on Nova-3/Flux | Treat Base as a legacy/cost/compatibility choice; benchmark against Nova-3 before locking in |
| Domain jargon and uncommon proper nouns | Base has weaker rare-word handling than Enhanced; independent work shows named entities remain hard even for modern ASR systems | Choose a domain variant, use keywords, or move to a custom model or Nova-3 keyterm prompting; add post-ASR entity correction where needed |
| Streaming errors from wrong audio encoding | Raw vs. containerized audio handling is easy to misconfigure | For raw audio, set correct encoding and sample_rate; for containerized formats, omit them; test new sources on small samples first |
| Timeouts on quiet or idle streams | Deepgram can close idle streams when no audio arrives | Send KeepAlive during silence, or start sending audio within 10 seconds of connection open; avoid empty-byte sends |
| Misaligned timestamps after reconnect | Each new streaming session starts a fresh local timeline | Maintain a running offset and add it to returned timestamps after reconnects |
| Endpointing mistakes in noisy audio | Endpointing uses VAD and silence duration thresholds | Tune endpointing, use interim results in the UI, and do not over-trust end-of-turn signals in heavy background noise |
| Diarization confusion on mixed single-channel speech | Same-channel overlap and speaker similarity are intrinsically hard | Prefer multichannel=true whenever channels are available; otherwise evaluate diarization quality separately from WER |
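
The running-offset mitigation for reconnects can be sketched in a few lines. This version assumes the client tracks how many seconds of audio it has sent in each session and that word objects carry start and end in seconds; TimelineOffset is a hypothetical helper, not part of any SDK.

```python
class TimelineOffset:
    """Map per-session timestamps onto one continuous timeline."""

    def __init__(self):
        self.offset = 0.0           # seconds of audio sent in earlier sessions
        self._session_audio = 0.0   # seconds of audio sent in the current session

    def audio_sent(self, seconds):
        """Call after each chunk is sent, with that chunk's duration."""
        self._session_audio += seconds

    def on_reconnect(self):
        """Call when a new streaming session replaces the old one."""
        self.offset += self._session_audio
        self._session_audio = 0.0

    def adjust(self, words):
        """Return copies of word objects shifted onto the global timeline."""
        return [
            {**w, "start": w["start"] + self.offset, "end": w["end"] + self.offset}
            for w in words
        ]
```

Tracking audio sent, instead of the last timestamp received, avoids drift when the dropped session never returned results for its final chunks.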

One robustness footnote. Whisper-family systems get praised for noise and accent robustness in general-purpose settings, but open-source Whisper and Whisper-derived pipelines have also been scrutinized for hallucination and fabrication failure modes, especially in sensitive domains. That does not automatically make Base more accurate. It is, though, one reason many production teams still prefer managed STT APIs with stronger operational controls, callbacks, redaction, and streaming state signals.

Privacy, compliance, and deployment

Deepgram's public pricing and security materials say the platform is SOC 2 Type 1 and Type 2 certified, HIPAA compliant with BAAs for Enterprise customers handling ePHI, GDPR ready with an EU endpoint (api.eu.deepgram.com), and CCPA compliant. That is the strongest current public compliance evidence tied directly to the platform.

On retention mechanics, the public docs emphasize data-residency options, a model-improvement opt-out (mip_opt_out), and flexible retention language in trust and security materials. What they do not offer is one simple, globally applicable default-retention statement for Base. For regulated workloads, confirm retention, logging, and model-improvement settings in your contract or support documentation rather than assuming a default from marketing pages.

Deployment options and hardware

Deepgram's self-hosted hardware guidance for STT is concrete: 1 NVIDIA GPU with compute capability 7.0+, 16 GB VRAM, 4 CPU cores, 32 GB RAM, and 50 GB storage as a recommended baseline. The self-hosted docs also note that authentication is not built in for self-hosted deployments, so teams typically put Deepgram behind their own API gateway, reverse proxy, or network controls.

| Deployment path | What the sources support |
|---|---|
| Deepgram-hosted cloud API | Primary/default path; North America endpoint plus EU endpoint for residency considerations |
| Self-hosted in customer cloud or on-prem | Deepgram explicitly documents self-hosted deployments and says they can run on your own infrastructure, including cloud or on-prem |
| Amazon SageMaker | Deepgram documents deployment via SageMaker and provides autoscaling/batch guidance |
| Kubernetes / cloud VM patterns | Deepgram documents deployment guidance for GCP/Kubernetes-oriented environments and highlights self-hosted patterns generally |
| Edge / on-device | No separate public on-device Base runtime found; the closest documented private-deployment option is self-hosted infrastructure under your control |

Integration patterns with AWS, GCP, Azure, and real-time pipelines

For cloud integration, the strongest documented patterns are S3-backed batch pipelines using presigned URLs, SageMaker for managed deployment on AWS, self-hosted cloud or on-prem for private inference, and regional endpoints for residency. Deepgram also publishes migration guides from AWS Transcribe, Google Speech-to-Text, and OpenAI Whisper, which signals that the company expects drop-in API replacement or staged migration in multi-cloud estates.

For real-time media pipelines, the docs show clear compatibility with telephony and agent frameworks. One official Twilio guide highlights 8 kHz raw mu-law as the telephony audio shape Twilio sends, which maps naturally to base-phonecall. Deepgram also documents a LiveKit integration path for agent use cases. These are the reference patterns to reach for when your ingress is browser WebRTC, SIP or Twilio, or a real-time agent orchestrator rather than prerecorded files.
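
A hedged sketch of the Twilio side of that bridge. It assumes Twilio Media Streams messages are JSON with an event field and, for media events, a base64-encoded media.payload of 8 kHz mu-law audio; the matching Deepgram stream would be opened with encoding=mulaw and sample_rate=8000. twilio_media_to_audio is a hypothetical helper.

```python
import base64
import json


def twilio_media_to_audio(message: str):
    """Return raw mu-law bytes from a Twilio media message, else None."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None  # e.g. connected / start / stop control messages
    # Assumed message shape: {"event": "media", "media": {"payload": "<base64>"}}
    return base64.b64decode(event["media"]["payload"])


# Fabricated message for illustration only
example = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff\x7f\x00").decode("ascii")},
})
print(twilio_media_to_audio(example))
```

The decoded bytes can be forwarded to the Deepgram WebSocket as-is; no resampling is needed as long as the declared encoding and sample rate match what Twilio sends.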

Azure is the odd one out. The reviewed materials point toward self-hosted deployment on Azure infrastructure rather than a fully Azure-native managed Deepgram service. The self-hosted hardware guidance includes Azure GPU-instance examples, which is enough to treat Azure as a documented hosting target for private deployment, even though there is no separate Azure-managed Deepgram product surface in the docs reviewed.

[Illustration: four distinct signal paths of different weights and textures converging toward a single selection point, rendered as an abstract circuit map]

How Base stacks up against the alternatives

The fair comparison is not "Deepgram Base versus everyone's current flagship." It is "Base as a still-supported legacy managed model family versus the three alternatives teams actually consider in 2026." Base's biggest drawback is not missing functionality. It is that its public benchmark, pricing, and product-investment surface is thinner and staler than Deepgram's own newer models.

| System | Accuracy and benchmark posture | Latency and streaming | Languages | Pricing posture | Customization and deployment | Best fit |
|---|---|---|---|---|---|---|
| Deepgram Base | No current public official Base WER/CER found in reviewed docs; positioned below Enhanced/Nova-3 in current messaging | Native WebSocket streaming; Deepgram guidance targets 150-300 ms transcription latency and 200-500 ms total latency; strong concurrency controls | base-general supports 20+ languages plus regional BCP-47 tags; specialty variants are mostly English | Current public Base price not listed; public pricing focuses on Flux/Nova-3; historical official Base price bands exist only in older benchmark material | Keywords-based adaptation, domain variants, Enterprise custom models, cloud or self-hosted | Legacy Deepgram compatibility, classic telephony/video variants, cost/ergonomics-sensitive transcript pipelines |
| OpenAI Whisper / WhisperX | Whisper is trained on 680k hours and is robust to accents/noise/technical language; WhisperX adds VAD, alignment, diarization, and word-level timestamps rather than changing the core acoustic model | Core Whisper is not a managed real-time streaming API; WhisperX improves long-form throughput and reports 12x speedup in its paper | Broad multilingual support plus translation and language ID | Open-source/MIT; no official per-minute managed Whisper/WhisperX price in the primary sources reviewed | Self-hosted; excellent for offline/private/open-weight work; WhisperX adds alignment and diarization tooling | Researchers, local/private inference, open-source stacks, offline transcription where you control infrastructure |
| Google Speech-to-Text | Official docs do not surface one simple flagship WER number in the reviewed sources; current model emphasis is on Chirp 3 | Streaming available, but official docs say streaming recognition is gRPC only; strong cloud integration | Broad multilingual coverage via large supported-language tables; Chirp 3 is positioned as multilingual with diarization and auto language detection | Public pricing is clear: V2 standard recognition starts at $0.016/min, dynamic batch at $0.003/min | PhraseSets, CustomClasses, and model adaptation; cloud-native, strong Google ecosystem fit | Teams already deep in GCP, especially if gRPC streaming and cloud-native adaptation are acceptable |
| AWS Transcribe | Official docs do not publish a directly comparable simple flagship WER number in the reviewed sources; current marketing emphasizes a speech foundation model | Batch and real-time streaming are both official first-class modes | AWS marketing says 100+ languages and language-specific features; docs include large language tables | Public pricing in us-east-1 starts at $0.0300/min tier 1 for transcription | Custom vocabularies, custom language models, rich AWS integration, cloud-managed | Existing AWS estates, S3-centric batch workflows, AWS-native governance/ops patterns |

For real-time work

Base is still viable in real time when you want Deepgram's WebSocket API and event model plus the classic telephony and video variants, and especially if you already run a mature application on base-phonecall or base-conversationalai. For a new real-time build, Deepgram's own docs argue against picking Base first. Flux is now the explicitly positioned real-time conversational model, and Nova-3 is the high-accuracy general model for live noisy or multilingual transcription. Choose Base for legacy fit, not for best current product fit.

For batch work

For batch processing, Base remains a reasonable candidate when your workload is stable, you are willing to run your own bakeoff, and the draw is Deepgram's API shape, its supported language list, timestamps, or the self-hosted private deployment path. Starting fresh with accuracy as an economic factor? Benchmark Nova-3 first and pick Base only if it wins on your cost and compatibility constraints. Everything in the public evidence supports that ordering: current pricing, accuracy messaging, and self-serve customization all center on Nova-3 and Flux.

What is still unresolved

The gap is simple to state. There is no current public Base-specific official WER, CER, or self-serve list price in the reviewed sources. There is strong official documentation for Base's API behavior, feature set, concurrency, and supported languages, plus historical pricing evidence and platform-wide latency guidance, but not the modern apples-to-apples benchmark card Deepgram now publishes for Nova-2 and Nova-3. For a rigorous procurement decision, the missing step is a controlled bakeoff on your own real audio, with particular attention to proper nouns, overlapping speakers, telephony codecs, and any regulated-content requirements.
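
For the bakeoff itself, the core metric is easy to compute locally. Below is a minimal word error rate sketch using word-level edit distance; it assumes both strings are already normalized the same way (casing, punctuation, number formatting), which in practice matters as much as the model choice. The wer function is a generic implementation, not Deepgram's scoring method.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))
print(wer("call main street", "call maine street"))
```

Scoring the same reference set against Base and Nova-3 output, and separately on a proper-noun-heavy subset, gives the apples-to-apples comparison the public docs do not provide.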

Sources

  1. Models & Languages Overview - https://developers.deepgram.com/docs/models-languages-overview
  2. Deepgram Pricing | Scalable Speech-to-Text, Text-to-Speech & Voice Agent APIs - https://deepgram.com/pricing
  3. Model Options | Deepgram's Docs - https://developers.deepgram.com/docs/model
  4. Pre-Recorded Audio | Deepgram's Docs - https://developers.deepgram.com/reference/speech-to-text/listen-pre-recorded
  5. Official JavaScript SDK for Deepgram - https://developers.deepgram.com/docs/js-sdk-v2-to-v3-migration-guide
  6. Live Audio | Deepgram's Docs - https://developers.deepgram.com/reference/speech-to-text/listen-streaming
  7. Getting Started with Speech to Text - https://developers.deepgram.com/docs/stt/getting-started
  8. Measuring STT Latency | Deepgram's Docs - https://developers.deepgram.com/docs/measuring-streaming-latency
  9. API Rate Limits | Deepgram's Docs - https://developers.deepgram.com/reference/api-rate-limits
  10. Auto-Scaling - https://developers.deepgram.com/docs/autoscaling-best-practices
  11. Deepgram vs Whisper Benchmark whitepaper - https://offers.deepgram.com/hubfs/Whitepaper%20Deepgram%20vs%20Whisper%20Benchmark.pdf
  12. WhisperX: Time-Accurate Speech Transcription of Long-Form Audio - https://arxiv.org/abs/2303.00747
  13. "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most - https://arxiv.org/html/2602.12249v2
  14. Getting Started | Deepgram's Docs - https://developers.deepgram.com/docs/pre-recorded-audio
  15. Supported Entity Types - https://developers.deepgram.com/docs/supported-entity-types
  16. Determining Your Audio Format for Live Streaming Audio - https://developers.deepgram.com/docs/determining-your-audio-format-for-live-streaming-audio
  17. Audio Keep Alive - https://developers.deepgram.com/docs/audio-keep-alive
  18. Recovering From Connection Errors & Timeouts When Live Streaming - https://developers.deepgram.com/docs/recovering-from-connection-errors-and-timeouts-when-live-streaming-audio
  19. Endpointing | Deepgram's Docs - https://developers.deepgram.com/docs/endpointing
  20. Introducing Whisper - https://openai.com/index/whisper/
  21. Ingress Authentication - https://developers.deepgram.com/docs/self-hosted-ingress-auth
  22. Amazon Web Services | Deepgram's Docs - https://developers.deepgram.com/docs/aws-docker-podman
  23. Google Cloud Platform | Deepgram's Docs - https://developers.deepgram.com/docs/gcp-k8s
  24. AWS S3 Presigned URLs and Deepgram - https://developers.deepgram.com/docs/using-aws-s3-presigned-urls-with-the-deepgram-api
  25. Twilio and Deepgram Voice Agent - https://developers.deepgram.com/docs/twilio-and-deepgram-voice-agent
  26. Robust Speech Recognition via Large-Scale Weak Supervision - https://arxiv.org/abs/2212.04356
  27. Chirp 3 Transcription: Enhanced multilingual accuracy - https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
  28. Amazon Transcribe - Speech to Text - AWS - https://aws.amazon.com/transcribe/
