Deepgram Base in 2026: what the legacy model still does well

Deepgram Base is still in Deepgram's 2026 documentation and rate-limit tables, but the company has moved on. Current materials file Base under Legacy Models next to the older Nova and Enhanced lines, recommend Flux for real-time voice agents, and point to Nova-3 for the hardest transcription work: noisy audio, multilingual input, far-field microphones, crosstalk. So why write about Base at all? Because it is still operational in the public API, still the default for the generic model parameter, and still a sensible choice for a specific kind of team: one that needs legacy compatibility, Deepgram's mature streaming API shape, domain variants like base-phonecall and base-video, or a high-volume pipeline where good-enough accuracy plus solid timestamps beats chasing the flagship.

One caveat before anything else, because it shapes every recommendation in this post. Deepgram's current public docs do not publish a self-serve Base price or an official Base WER or CER figure. The public pricing page lists Flux, Nova-3, and Custom STT. Official benchmark materials give platform-wide latency guidance and older comparative WER numbers for Enhanced and Nova-2 against Whisper, but there is no equivalent current single-number benchmark for Base itself. Treat Base as a model family that requires your own evaluation on your own audio, especially for regulated workflows, domain terminology, diarization quality, or difficult acoustics.

If you want the short version for a build decision: for new real-time systems, Deepgram's own positioning favors Flux, which has model-native turn detection and is built for conversational agents. For batch, Base still works if you want Deepgram's API ergonomics and its supported language and domain variants, but benchmark it directly against Nova-3, which is where Deepgram's accuracy, multilingual, and customization investment now lives.

What Deepgram Base is

Deepgram describes Base as built on its end-to-end speech-to-text architecture, offering a "solid combination of accuracy and cost effectiveness in some cases." That phrasing is about as unenthusiastic as vendor copy gets, and the docs reinforce it: Base sits under Legacy Models, and Deepgram says Enhanced generally has higher accuracy and better uncommon-word handling than Base. Yet the API reference still defaults the generic model parameter to base-general. Legacy, but live.

Base is not one unified flagship. It is a classic family of task-oriented variants: base-general (the default), base-meeting, base-phonecall, base-voicemail, base-finance, base-conversationalai, and base-video. Deepgram frames these as optimizations for everyday audio, conference-room multi-speaker audio, low-bandwidth telephony, voicemail, finance calls, human-to-bot conversations such as IVR and kiosks, and audio pulled from video.

Base variant	Official positioning	Supported languages and dialect tags
base or base-general	Everyday audio processing; default Base model	Chinese zh, zh-CN, zh-TW; Danish da; Dutch nl; English en, en-US; French fr, fr-CA; German de; Hindi hi, hi-Latn; Indonesian id; Italian it; Japanese ja; Korean ko; Norwegian no; Polish pl; Portuguese pt, pt-BR, pt-PT; Russian ru; Spanish es, es-419, es-LATAM; Swedish sv; Tamasheq taq; Turkish tr; Ukrainian uk
base-meeting	Conference-room audio with multiple speakers and one microphone	English en, en-US
base-phonecall	Low-bandwidth phone calls	English en, en-US
base-voicemail	Low-bandwidth single-speaker audio; derived from phonecall	English en, en-US
base-finance	Earnings-call style, multiple speakers, finance-heavy vocabulary	English en, en-US
base-conversationalai	Human speaking to an automated bot, IVR, assistant, kiosk	English en, en-US
base-video	Audio sourced from video	English en, en-US

This table comes straight from the current model and language overview docs. One dialect detail worth flagging: Deepgram says its English models handle global English accents and dialects, but transcript output is normalized to standardized American spelling. If your downstream QA expects "colour" rather than "color," you will need a post-processing step.
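A post-processing step of that kind can be sketched in a few lines. This is only an illustration: the US_TO_UK table and the to_british helper are hypothetical names, not part of any Deepgram API, and the three-entry mapping stands in for whatever dictionary your style guide requires.

```python
import re

# Hypothetical, hand-maintained mapping; extend it for your own style guide.
US_TO_UK = {
    "color": "colour",
    "center": "centre",
    "organize": "organise",
}

_WORD = re.compile(r"[A-Za-z]+")


def to_british(text: str) -> str:
    """Replace whole words found in US_TO_UK, preserving an initial capital."""

    def swap(match):
        word = match.group(0)
        replacement = US_TO_UK.get(word.lower())
        if replacement is None:
            return word  # not in the mapping, leave untouched
        return replacement.capitalize() if word[0].isupper() else replacement

    return _WORD.sub(swap, text)


print(to_british("Color the center panel."))  # Colour the centre panel.
```

Matching whole words keeps derived forms such as "colorful" unchanged unless you add them explicitly, which is usually safer than substring replacement on transcripts.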

Model sizes and pricing tiers

Deepgram enumerates the Base family by variant, not by parameter count. The same model-options page publishes parameter counts for Whisper Cloud sizes but not for Base, so the safest reading is that Deepgram does not disclose Base parameter counts anywhere in its public docs.

Pricing follows the same pattern of quiet omission. The current pricing page is platform-level: Pay As You Go comes with a $200 free credit, Growth starts at $4K+/year with pre-paid credits, and Enterprise is custom. The public STT rate card covers Flux and Nova-3, plus add-ons like redaction and speaker diarization. Base is absent. Deepgram's 2023 benchmark whitepaper did show Base price bands materially below one dollar per hour of audio, varying by annual volume, but those are historical figures, not the 2026 price card.

Pricing or size question	What the official sources show
Current public self-serve Base price	Not listed on the 2026 public pricing page; the current STT price table lists Flux, Nova-3 Monolingual, Nova-3 Multilingual, and Custom
Current public Base parameter count	Not disclosed in the model docs reviewed
Current public plan tiers	Pay As You Go, Growth, and Enterprise exist at the platform level
Historical official Base pricing evidence	Deepgram's 2023 benchmark whitepaper included historical Base enterprise price bands for batch and streaming; these are historical, not current list prices

Diagram-style illustration of two transcription routes, a thick batch pipeline and a thin real-time streaming thread, converging on a single endpoint

How to use it

Base uses the same core interfaces as the rest of Deepgram's classic transcription stack: REST for pre-recorded audio and WebSocket for live streaming. The REST endpoint is POST /v1/listen; the streaming endpoint is wss://api.deepgram.com/v1/listen. Authentication accepts either Authorization: Token <API_KEY> or Authorization: Bearer <TOKEN>. Deepgram also documents temporary API tokens with a default TTL of 30 seconds, which is handy for browser and mobile handoff patterns where you do not want a long-lived key in client code.

For pre-recorded audio, you send either a JSON body with a remote URI or the binary audio or video directly. Responses are JSON with the standard transcript structures: transcript alternatives, an overall confidence score, and per-word timing and confidence data. Streaming returns a sequence of WebSocket messages including transcript updates plus SpeechStarted, UtteranceEnd, and metadata events.

Deepgram's official SDK surface covers at least JavaScript/TypeScript and Python, and the wider docs ecosystem shows examples for .NET, Go, Java, Python, and JavaScript in self-hosted and STT guides. The examples below use plain HTTP and WebSocket rather than SDK helpers because the protocol-level behavior is stable and maps directly to the endpoint docs.

Use case	Endpoint	Input style	Output style	Notes
Batch transcription	POST https://api.deepgram.com/v1/listen	JSON with url, or direct file upload	JSON response or async callback response	Supports features like punctuation, diarization, redaction, topics, intents, and utterances
Live transcription	wss://api.deepgram.com/v1/listen	Continuous audio over WebSocket	Incremental JSON events	Supports interim results, endpointing, speech_final, UtteranceEnd, keepalive/finalize flow

Python batch example

import os
import requests

API_KEY = os.environ["DEEPGRAM_API_KEY"]

endpoint = (
    "https://api.deepgram.com/v1/listen"
    "?model=base-general"
    "&language=en-US"
    "&punctuate=true"
    "&smart_format=true"
    "&utterances=true"
)

headers = {
    "Authorization": f"Token {API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "url": "https://dpgr.am/spacewalk.wav"
}

resp = requests.post(endpoint, headers=headers, json=payload, timeout=300)
resp.raise_for_status()
data = resp.json()

# "channels" and "alternatives" are lists; take the first channel's top alternative.
alt = data["results"]["channels"][0]["alternatives"][0]
print("Transcript:", alt["transcript"])
print("Confidence:", alt["confidence"])
print("First words:", alt["words"][:5])

This matches the official REST contract: POST /v1/listen, API-key auth via Authorization: Token, a JSON body with a media URL, and standard transcript output including transcript-level confidence and per-word timing and confidence.
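Because every word carries its own confidence, a common follow-up is to flag weak words for human review. The sketch below assumes the response shape described above; the low_confidence_words helper, the 0.6 threshold, and the sample payload are illustrative assumptions, not Deepgram recommendations or real API output.

```python
from typing import Any


def low_confidence_words(response: dict[str, Any], threshold: float = 0.6) -> list[dict[str, Any]]:
    """Return word objects whose confidence falls below the threshold."""
    alt = response["results"]["channels"][0]["alternatives"][0]
    return [w for w in alt.get("words", []) if w["confidence"] < threshold]


# Hand-written sample in the documented response shape (not real API output).
sample = {
    "results": {
        "channels": [
            {
                "alternatives": [
                    {
                        "transcript": "call deepgram today",
                        "confidence": 0.91,
                        "words": [
                            {"word": "call", "start": 0.0, "end": 0.3, "confidence": 0.98},
                            {"word": "deepgram", "start": 0.35, "end": 0.9, "confidence": 0.42},
                            {"word": "today", "start": 0.95, "end": 1.3, "confidence": 0.95},
                        ],
                    }
                ]
            }
        ]
    }
}

for w in low_confidence_words(sample):
    print(f'{w["word"]} ({w["start"]}-{w["end"]}s) confidence={w["confidence"]}')
```

Pick the threshold from your own labeled audio; the scores are useful for ranking review work but are not guaranteed to be calibrated probabilities.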

Node.js streaming example

import WebSocket from "ws";
import fs from "node:fs";

const apiKey = process.env.DEEPGRAM_API_KEY;

const ws = new WebSocket(
  "wss://api.deepgram.com/v1/listen" +
    "?model=base-phonecall" +
    "&language=en-US" +
    "&encoding=linear16" +
    "&sample_rate=16000" +
    "&interim_results=true" +
    "&endpointing=300",
  {
    headers: {
      Authorization: `Token ${apiKey}`,
    },
  }
);

ws.on("open", () => {
  const stream = fs.createReadStream("./audio.raw", { highWaterMark: 3200 }); // ~100 ms @ 16 kHz mono 16-bit
  stream.on("data", (chunk) => {
    if (chunk.length) ws.send(chunk);
  });
  stream.on("end", () => {
    ws.send(JSON.stringify({ type: "Finalize" }));
  });
});

ws.on("message", (msg) => {
  const event = JSON.parse(msg.toString());

  if (event.type === "Results") {
    const alt = event.channel?.alternatives?.[0]; // top alternative, if present
    if (alt?.transcript) {
      console.log({
        transcript: alt.transcript,
        is_final: event.is_final,
        speech_final: event.speech_final,
      });
    }
  } else {
    console.log(event);
  }
});

ws.on("close", () => console.log("stream closed"));
ws.on("error", (err) => console.error(err));

This reflects the documented streaming contract: open a WebSocket to /v1/listen, send audio chunks, set encoding and sample_rate when the stream is raw and non-containerized, optionally request interim_results and endpointing, and send Finalize before closing so Deepgram flushes remaining buffered audio.

Real-time and batch behavior

Deepgram splits transcription into pre-recorded and live streaming modes. Pre-recorded works when you can tolerate end-of-job latency or want callback-driven async orchestration. Live mode is for WebSocket-based real-time streaming: captions, telephony, agent-assist. Two operational numbers matter for batch: the quickstart docs note a 2 GB maximum file size, and requests whose processing exceeds 10 minutes on Nova, Base, or Enhanced can return a 504 Gateway Timeout. That 504 behavior is a strong argument for callbacks or an async pattern on longer pipelines instead of waiting synchronously on one HTTP request.
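One way to avoid holding a long synchronous request open is to ask Deepgram to deliver the result to a webhook. The sketch below only builds the request URL; it assumes the callback query parameter from Deepgram's callback documentation, and the webhook address and the build_callback_url helper are placeholders invented for illustration.

```python
from urllib.parse import urlencode

LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_callback_url(callback_url: str, model: str = "base-general") -> str:
    """Build a /v1/listen URL that requests asynchronous delivery to a webhook."""
    params = {
        "model": model,
        "punctuate": "true",
        "callback": callback_url,  # assumed parameter name; confirm in the docs
    }
    return f"{LISTEN_URL}?{urlencode(params)}"


# Placeholder webhook; replace with an endpoint you control.
url = build_callback_url("https://example.com/hooks/deepgram")
print(url)
```

The resulting URL can be posted with the same headers and JSON body as the batch example; the transcript then arrives at the webhook rather than in the HTTP response, so your service needs to correlate it with the original job.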

For streaming, Deepgram's guidance is specific: streaming transcription latency is optimized to 300 ms or less, with a typical breakdown of 150-300 ms transcription latency and 200-500 ms total transcript latency end to end depending on network, buffering, and client-side processing. Deepgram recommends sending audio in 20-100 ms chunks. Larger buffers add built-in delay; tiny chunks add overhead.
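The 20-100 ms guidance translates into a byte count once you know the encoding and sample rate. The helper below is a small arithmetic sketch; the chunk_bytes name and the two-entry BYTES_PER_SAMPLE table are assumptions covering only 16-bit linear PCM and 8-bit mu-law.

```python
# Bytes per sample for two common raw encodings (assumed subset).
BYTES_PER_SAMPLE = {"linear16": 2, "mulaw": 1}


def chunk_bytes(sample_rate: int, chunk_ms: int, encoding: str = "linear16", channels: int = 1) -> int:
    """Number of bytes that hold chunk_ms milliseconds of raw audio."""
    return sample_rate * BYTES_PER_SAMPLE[encoding] * channels * chunk_ms // 1000


print(chunk_bytes(16000, 100))          # 3200 bytes: 100 ms of 16 kHz mono linear16
print(chunk_bytes(16000, 20))           # 640 bytes: 20 ms of the same stream
print(chunk_bytes(8000, 20, "mulaw"))   # 160 bytes: 20 ms of 8 kHz mu-law telephony
```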

Latency, throughput, and benchmark evidence

The benchmark picture for Base is lopsided. There is solid official evidence for platform-level latency and throughput, and much weaker current public disclosure of Base-specific WER or CER. The honest reading: Deepgram's stack is operationally fast, and Base's present-day accuracy is something you have to measure yourself.

Source	Metric	Result	Relevance
Deepgram streaming latency guide	Typical transcription latency	150-300 ms transcription latency; 200-500 ms total transcript latency	Applies to Deepgram streaming workloads generally, including the Base-family streaming endpoint
Deepgram API rate limits	Base concurrency	Pay As You Go: 50 concurrent pre-recorded / 150 concurrent streaming; Growth: 50 / 225 in North America; Enterprise starts at 200 / 300	Strong evidence that Base remains a supported production model family in 2026
Deepgram batch autoscaling guidance	Throughput	Deepgram states its self-hosted batch engine can transcribe 1 hour of audio in under 30 seconds	Platform-level throughput guidance, not a Base-only number
Deepgram 2023 Whisper benchmark	WER on 254 real-world phone-call/meeting files	Deepgram Enhanced 10.6% WER, Nova-2 8.4% WER; Whisper sizes ranged 13.1-15.3% WER in that study	Official benchmark, but notably not Base-specific
WhisperX paper	Speedup on long-form transcription	WhisperX reports a 12x transcription speedup using VAD segmentation and batched inference	Useful competitor throughput reference for Whisper-family pipelines
2026 independent named-entity audit of speech providers	High-stakes proper-noun difficulty	Study included base-general and base-phonecall; across 15 ASR models the average transcription error rate on street/business names was 44%	Not a vanilla WER benchmark, but a strong external warning about proper nouns and address-like entities in production audio

That missing official benchmark for Base is not an accident of documentation. Deepgram publishes detailed accuracy claims for Nova-2 and Nova-3 but nothing fresh and Base-specific. When a vendor stops benchmarking a model in public, that tells you where its accuracy investment went.

Abstract illustration of a proper-noun failure: a clean waveform passing through a lattice where several nodes fracture into scattered fragments

Features, customization, and where Base breaks

At the transcript layer, Base covers the production features most teams expect: punctuation and capitalization, smart formatting, timestamps, confidence scores, profanity handling, speaker diarization, and word-level speaker assignment. The pre-recorded API docs show transcript alternatives with a transcript-level confidence number plus per-word {word, start, end, confidence} objects, and example responses include speaker, speaker_confidence, and punctuated_word fields when diarization and formatting are enabled.

Feature	How it works in Deepgram docs	Practical note
Punctuation and casing	punctuate=true adds punctuation and capitalization	Lowest-friction readability improvement for Base pipelines
Smart formatting	smart_format=true adds richer formatting for readability, including dates/currency style transformations	Deepgram's pricing page treats smart formatting as included on current STT pricing tiers
Word timestamps	words array returns start and end per word	Suitable for subtitles, searchable transcripts, and timeline alignment
Confidence scores	Transcript-level and word-level confidence values are returned on a 0-1 scale	Useful for QA gating but should not be treated as calibrated truth probability without your own validation
Speaker diarization	diarize_model enables speaker change detection and labels words by speaker number; batch supports latest/v1/v2, streaming supports latest/v1	Streaming diarization uses older v1 while batch can use v2; this matters if diarization quality is central to your use case
Channel separation	multichannel=true transcribes each channel independently	Prefer multichannel over diarization when your audio is already channel-separated, such as stereo telephony
Profanity filtering	profanity_filter=true converts recognized profanity to the nearest non-profane word or removes it	Useful for safe-display workflows but can damage literal or legal fidelity
Utterance segmentation	utterances=true segments speech into semantic units	Helpful for subtitles, agent-assist panes, and UI chunking
Redaction	The redact parameter can redact PII/PHI/PCI classes; Deepgram documents 50+ entity types	Better fit than profanity_filter when the concern is regulated data rather than toning down language

Customization and fine-tuning

Customization is where Base shows its age most plainly. Base supports classic keywords boosting and suppression, but keyterm prompting is Nova-3-only, and the "instant self-serve customization without model retraining" messaging in current Deepgram materials applies to Nova-3, not Base. Deepgram does support account-linked custom trained models via custom_id, but the model-options page restricts those to Enterprise customers.

The practical hierarchy: if you stay on Base, your light-touch tuning tools are keywords and careful variant choice (base-phonecall, base-finance, and so on). If you need stronger domain adaptation or faster self-serve vocabulary steering, Deepgram's product direction pushes you toward Nova-3 or a formal custom Enterprise model rather than deeper Base-specific training.
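For teams staying on Base, keyword boosting is mostly a matter of building the right query string. The sketch below assumes the term:weight form and repeated keywords parameters as I understand them from Deepgram's keywords documentation, with negative weights used for suppression; verify the exact syntax and sensible weight ranges against the current docs. The keywords_query helper and the example terms are hypothetical.

```python
from urllib.parse import urlencode


def keywords_query(boosts: dict[str, float]) -> str:
    """Encode {term: weight} as repeated keywords=term:weight parameters."""
    pairs = [("keywords", f"{term}:{weight:g}") for term, weight in boosts.items()]
    return urlencode(pairs)


# Example terms and weights are placeholders; tune them on your own audio.
query = keywords_query({"EBITDA": 2, "um": -10})
print(query)  # keywords=EBITDA%3A2&keywords=um%3A-10
```

Append the result to the /v1/listen query string alongside the model parameter, and re-run your evaluation set after each change, since aggressive boosts can introduce false positives.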

Noise robustness and common failure modes

Deepgram does not publish a current Base noise-robustness benchmark. What it does publish is telling in its own way: the pricing and product pages recommend Nova-3 for background noise, crosstalk, and far-field input, while Base gets much more conservative language. Base can function in noisy real-world audio, but it is no longer Deepgram's answer for the hardest acoustics.

Common failure mode	Why it happens	Mitigation
Lower accuracy than newer Deepgram models	Deepgram explicitly says Enhanced is generally more accurate than Base, and current flagship claims center on Nova-3/Flux	Treat Base as a legacy/cost/compatibility choice; benchmark against Nova-3 before locking in
Domain jargon and uncommon proper nouns	Base has weaker rare-word handling than Enhanced; independent work shows named entities remain hard even for modern ASR systems	Choose a domain variant, use keywords, or move to a custom model or Nova-3 keyterm prompting; add post-ASR entity correction where needed
Streaming errors from wrong audio encoding	Raw vs. containerized audio handling is easy to misconfigure	For raw audio, set correct encoding and sample_rate; for containerized formats, omit them; test new sources on small samples first
Timeouts on quiet or idle streams	Deepgram can close idle streams when no audio arrives	Send KeepAlive during silence, or start sending audio within 10 seconds of connection open; avoid empty-byte sends
Misaligned timestamps after reconnect	Each new streaming session starts a fresh local timeline	Maintain a running offset and add it to returned timestamps after reconnects
Endpointing mistakes in noisy audio	Endpointing uses VAD and silence duration thresholds	Tune endpointing, use interim results in the UI, and do not over-trust end-of-turn signals in heavy background noise
Diarization confusion on mixed single-channel speech	Same-channel overlap and speaker similarity are intrinsically hard	Prefer multichannel=true whenever channels are available; otherwise evaluate diarization quality separately from WER
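The running-offset mitigation for reconnects can be sketched as a small bookkeeping class. This version derives the offset from the amount of raw audio sent in earlier sessions, which assumes a constant byte rate (raw PCM or mu-law); the StreamTimeline class and its method names are hypothetical and not part of any Deepgram SDK.

```python
class StreamTimeline:
    """Map per-session word timestamps onto one continuous timeline."""

    def __init__(self, bytes_per_second: int):
        self.bytes_per_second = bytes_per_second
        self.offset_s = 0.0      # audio seconds sent in all earlier sessions
        self.session_bytes = 0   # audio bytes sent in the current session

    def sent(self, chunk: bytes) -> None:
        """Call for every audio chunk written to the socket."""
        self.session_bytes += len(chunk)

    def reconnected(self) -> None:
        """Call after opening a replacement connection."""
        self.offset_s += self.session_bytes / self.bytes_per_second
        self.session_bytes = 0

    def to_global(self, word: dict) -> dict:
        """Return a copy of a word object with start/end shifted by the offset."""
        return {**word, "start": word["start"] + self.offset_s, "end": word["end"] + self.offset_s}


timeline = StreamTimeline(bytes_per_second=32000)  # 16 kHz mono linear16
timeline.sent(b"\x00" * 64000)                     # 2 seconds sent before the drop
timeline.reconnected()
print(timeline.to_global({"word": "hello", "start": 0.5, "end": 1.0}))
```

If audio buffered during the outage is replayed on the new connection, count those bytes in the new session only, otherwise the offset will drift.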

One robustness footnote. Whisper-family systems get praised for noise and accent robustness in general-purpose settings, but open-source Whisper and Whisper-derived pipelines have also been scrutinized for hallucination and fabrication failure modes, especially in sensitive domains. That does not automatically make Base more accurate. It is, though, one reason many production teams still prefer managed STT APIs with stronger operational controls, callbacks, redaction, and streaming state signals.

Privacy, compliance, and deployment

Deepgram's public pricing and security materials say the platform is SOC 2 Type 1 and Type 2 certified, HIPAA compliant with BAAs for Enterprise customers handling ePHI, GDPR ready with an EU endpoint (api.eu.deepgram.com), and CCPA compliant. That is the strongest current public compliance evidence tied directly to the platform.

On retention mechanics, the public docs emphasize data-residency options, a model-improvement opt-out (mip_opt_out), and flexible retention language in trust and security materials. What they do not offer is one simple, globally applicable default-retention statement for Base. For regulated workloads, confirm retention, logging, and model-improvement settings in your contract or support documentation rather than assuming a default from marketing pages.

Deployment options and hardware

Deepgram's self-hosted hardware guidance for STT is concrete: 1 NVIDIA GPU with compute capability 7.0+, 16 GB VRAM, 4 CPU cores, 32 GB RAM, and 50 GB storage as a recommended baseline. The self-hosted docs also note that authentication is not built in for self-hosted deployments, so teams typically put Deepgram behind their own API gateway, reverse proxy, or network controls.

Deployment path	What the sources support
Deepgram-hosted cloud API	Primary/default path; North America endpoint plus EU endpoint for residency considerations
Self-hosted in customer cloud or on-prem	Deepgram explicitly documents self-hosted deployments and says they can run on your own infrastructure, including cloud or on-prem
Amazon SageMaker	Deepgram documents deployment via SageMaker and provides autoscaling/batch guidance
Kubernetes / cloud VM patterns	Deepgram documents deployment guidance for GCP/Kubernetes-oriented environments and highlights self-hosted patterns generally
Edge / on-device	No separate public on-device Base runtime found; the closest documented private-deployment option is self-hosted infrastructure under your control

Integration patterns with AWS, GCP, Azure, and real-time pipelines

For cloud integration, the strongest documented patterns are S3-backed batch pipelines using presigned URLs, SageMaker for managed deployment on AWS, self-hosted cloud or on-prem for private inference, and regional endpoints for residency. Deepgram also publishes migration guides from AWS Transcribe, Google Speech-to-Text, and OpenAI Whisper, which signals that the company expects drop-in API replacement or staged migration in multi-cloud estates.

For real-time media pipelines, the docs show clear compatibility with telephony and agent frameworks. One official Twilio guide highlights 8 kHz raw mu-law as the telephony audio shape Twilio sends, which maps naturally to base-phonecall. Deepgram also documents a LiveKit integration path for agent use cases. These are the reference patterns to reach for when your ingress is browser WebRTC, SIP or Twilio, or a real-time agent orchestrator rather than prerecorded files.
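For the Twilio path, the bridge between the two services is mostly message unwrapping: Twilio Media Streams deliver JSON messages whose media events carry base64-encoded 8 kHz mu-law audio. The sketch below shows that step only; the twilio_audio helper and the DEEPGRAM_URL constant are illustrative names, and the WebSocket plumbing on both sides is left out.

```python
import base64
import json
from typing import Optional

# Streaming URL matching Twilio's raw 8 kHz mu-law audio.
DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=base-phonecall&encoding=mulaw&sample_rate=8000"
)


def twilio_audio(message: str) -> Optional[bytes]:
    """Return raw audio bytes from a Twilio media message, or None for other events."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None  # e.g. "connected", "start", "stop"
    return base64.b64decode(event["media"]["payload"])


# Synthetic message in Twilio's media-event shape (not captured traffic).
example = json.dumps({"event": "media", "media": {"payload": base64.b64encode(b"\xff\x7f").decode()}})
print(twilio_audio(example))
```

Each non-None result can be forwarded as a binary WebSocket frame to the streaming endpoint; because the audio is raw, encoding and sample_rate must be set as in DEEPGRAM_URL.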

Azure is the odd one out. The reviewed materials point toward self-hosted deployment on Azure infrastructure rather than a fully Azure-native managed Deepgram service. The self-hosted hardware guidance includes Azure GPU-instance examples, which is enough to treat Azure as a documented hosting target for private deployment, even though there is no separate Azure-managed Deepgram product surface in the docs reviewed.

Illustration of four distinct signal paths of different weights and textures converging toward a single selection point, rendered as an abstract circuit map

How Base stacks up against the alternatives

The fair comparison is not "Deepgram Base versus everyone's current flagship." It is "Base as a still-supported legacy managed model family versus the three alternatives teams actually consider in 2026." Base's biggest drawback is not missing functionality. It is that its public benchmark, pricing, and product-investment surface is thinner and staler than Deepgram's own newer models.

System	Accuracy and benchmark posture	Latency and streaming	Languages	Pricing posture	Customization and deployment	Best fit
Deepgram Base	No current public official Base WER/CER found in reviewed docs; positioned below Enhanced/Nova-3 in current messaging	Native WebSocket streaming; Deepgram guidance targets 150-300 ms transcription latency and 200-500 ms total latency; strong concurrency controls	base-general supports 20+ languages plus regional BCP-47 tags; specialty variants are mostly English	Current public Base price not listed; public pricing focuses on Flux/Nova-3; historical official Base price bands exist only in older benchmark material	Keywords-based adaptation, domain variants, Enterprise custom models, cloud or self-hosted	Legacy Deepgram compatibility, classic telephony/video variants, cost/ergonomics-sensitive transcript pipelines
OpenAI Whisper / WhisperX	Whisper is trained on 680k hours and is robust to accents/noise/technical language; WhisperX adds VAD, alignment, diarization, and word-level timestamps rather than changing the core acoustic model	Core Whisper is not a managed real-time streaming API; WhisperX improves long-form throughput and reports 12x speedup in its paper	Broad multilingual support plus translation and language ID	Open-source/MIT; no official per-minute managed Whisper/WhisperX price in the primary sources reviewed	Self-hosted; excellent for offline/private/open-weight work; WhisperX adds alignment and diarization tooling	Researchers, local/private inference, open-source stacks, offline transcription where you control infrastructure
Google Speech-to-Text	Official docs do not surface one simple flagship WER number in the reviewed sources; current model emphasis is on Chirp 3	Streaming available, but official docs say streaming recognition is gRPC only; strong cloud integration	Broad multilingual coverage via large supported-language tables; Chirp 3 is positioned as multilingual with diarization and auto language detection	Public pricing is clear: V2 standard recognition starts at $0.016/min, dynamic batch at $0.003/min	PhraseSets, CustomClasses, and model adaptation; cloud-native, strong Google ecosystem fit	Teams already deep in GCP, especially if gRPC streaming and cloud-native adaptation are acceptable
AWS Transcribe	Official docs do not publish a directly comparable simple flagship WER number in the reviewed sources; current marketing emphasizes a speech foundation model	Batch and real-time streaming are both official first-class modes	AWS marketing says 100+ languages and language-specific features; docs include large language tables	Public pricing in us-east-1 starts at $0.0300/min tier 1 for transcription	Custom vocabularies, custom language models, rich AWS integration, cloud-managed	Existing AWS estates, S3-centric batch workflows, AWS-native governance/ops patterns

For real-time work

Base is still viable in real time when you want Deepgram's WebSocket API and event model plus the classic telephony and video variants, and especially if you already run a mature application on base-phonecall or base-conversationalai. For a new real-time build, Deepgram's own docs argue against picking Base first. Flux is now the explicitly positioned real-time conversational model, and Nova-3 is the high-accuracy general model for live noisy or multilingual transcription. Choose Base for legacy fit, not for best current product fit.

For batch work

For batch processing, Base remains a reasonable candidate when your workload is stable, you are willing to run your own bakeoff, and the draw is Deepgram's API shape, its supported language list, timestamps, or the self-hosted private deployment path. If you are starting fresh and transcription errors carry a real cost, benchmark Nova-3 first and pick Base only if it wins on your cost and compatibility constraints. Everything in the public evidence supports that ordering: current pricing, accuracy messaging, and self-serve customization all center on Nova-3 and Flux.

What is still unresolved

The gap is simple to state. There is no current public Base-specific official WER, CER, or self-serve list price in the reviewed sources. There is strong official documentation for Base's API behavior, feature set, concurrency, and supported languages, plus historical pricing evidence and platform-wide latency guidance, but not the modern apples-to-apples benchmark card Deepgram now publishes for Nova-2 and Nova-3. For a rigorous procurement decision, the missing step is a controlled bakeoff on your own real audio, with particular attention to proper nouns, overlapping speakers, telephony codecs, and any regulated-content requirements.
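The scoring half of such a bakeoff needs very little code. The function below is a minimal word error rate sketch using word-level edit distance; it lowercases and splits on whitespace only, so punctuation and number formatting should be normalized beforehand, and the wer name and sample sentences are illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion in six words
print(wer("main street", "maine street"))                   # 0.5
```

Run it per file for each candidate model on the same human-checked references, and track proper-noun errors separately, since a low aggregate WER can hide exactly the entity mistakes that matter in production.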

Sources

Models & Languages Overview - https://developers.deepgram.com/docs/models-languages-overview
Deepgram Pricing | Scalable Speech-to-Text, Text-to-Speech & Voice Agent APIs - https://deepgram.com/pricing
Model Options | Deepgram's Docs - https://developers.deepgram.com/docs/model
Pre-Recorded Audio | Deepgram's Docs - https://developers.deepgram.com/reference/speech-to-text/listen-pre-recorded
Official JavaScript SDK for Deepgram - https://developers.deepgram.com/docs/js-sdk-v2-to-v3-migration-guide
Live Audio | Deepgram's Docs - https://developers.deepgram.com/reference/speech-to-text/listen-streaming
Getting Started with Speech to Text - https://developers.deepgram.com/docs/stt/getting-started
Measuring STT Latency | Deepgram's Docs - https://developers.deepgram.com/docs/measuring-streaming-latency
API Rate Limits | Deepgram's Docs - https://developers.deepgram.com/reference/api-rate-limits
Auto-Scaling - https://developers.deepgram.com/docs/autoscaling-best-practices
Deepgram vs Whisper Benchmark whitepaper - https://offers.deepgram.com/hubfs/Whitepaper%20Deepgram%20vs%20Whisper%20Benchmark.pdf
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio - https://arxiv.org/abs/2303.00747
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most - https://arxiv.org/html/2602.12249v2
Getting Started | Deepgram's Docs - https://developers.deepgram.com/docs/pre-recorded-audio
Supported Entity Types - https://developers.deepgram.com/docs/supported-entity-types
Determining Your Audio Format for Live Streaming Audio - https://developers.deepgram.com/docs/determining-your-audio-format-for-live-streaming-audio
Audio Keep Alive - https://developers.deepgram.com/docs/audio-keep-alive
Recovering From Connection Errors & Timeouts When Live Streaming - https://developers.deepgram.com/docs/recovering-from-connection-errors-and-timeouts-when-live-streaming-audio
Endpointing | Deepgram's Docs - https://developers.deepgram.com/docs/endpointing
Introducing Whisper - https://openai.com/index/whisper/
Ingress Authentication - https://developers.deepgram.com/docs/self-hosted-ingress-auth
Amazon Web Services | Deepgram's Docs - https://developers.deepgram.com/docs/aws-docker-podman
Google Cloud Platform | Deepgram's Docs - https://developers.deepgram.com/docs/gcp-k8s
AWS S3 Presigned URLs and Deepgram - https://developers.deepgram.com/docs/using-aws-s3-presigned-urls-with-the-deepgram-api
Twilio and Deepgram Voice Agent - https://developers.deepgram.com/docs/twilio-and-deepgram-voice-agent
Robust Speech Recognition via Large-Scale Weak Supervision - https://arxiv.org/abs/2212.04356
Chirp 3 Transcription: Enhanced multilingual accuracy - https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
Amazon Transcribe - Speech to Text - AWS - https://aws.amazon.com/transcribe/