Deepgram Base: model profile
Reference profile of Deepgram Base, a legacy speech-to-text model family with task-specific variants, batch and streaming APIs, and self-hosted deployment.
Deepgram Base is a legacy speech-to-text model family built on Deepgram's end-to-end architecture, offered through the company's REST and WebSocket transcription APIs.
Specifications
| Attribute | Details |
|---|---|
| Developer | Deepgram |
| Model type | End-to-end speech-to-text model family; positioned under Legacy Models in current documentation |
| Languages | base-general supports 20 languages plus regional BCP-47 tags; specialty variants are English only |
| Modes (batch / streaming) | Both: REST pre-recorded transcription and WebSocket live streaming |
| Latency | Deepgram-wide streaming guidance: 150-300 ms transcription latency, 200-500 ms total transcript latency |
| Throughput / concurrency | Pay As You Go: 50 concurrent pre-recorded / 150 concurrent streaming; Growth: 50 / 225 in North America; Enterprise starts at 200 / 300 |
| Deployment | Deepgram-hosted cloud API (North America and EU endpoints), self-hosted cloud or on-prem, Amazon SageMaker, Kubernetes and cloud VM patterns |
| Pricing | Current public self-serve Base price not listed; the current STT price table lists Flux, Nova-3, and Custom |
| Not disclosed | Release date, parameter count, training data, license |
Overview
Deepgram describes Base as being built on its end-to-end speech-to-text architecture and says it offers a "solid combination of accuracy and cost effectiveness in some cases." Current documentation places Base under Legacy Models, alongside older Nova and Enhanced lines, and states that Enhanced generally has higher accuracy and better uncommon-word handling than Base. Base remains present in Deepgram's 2026 documentation and rate-limit tables, and current API reference pages still default the generic model parameter to base-general.
In current official materials, Deepgram recommends Flux for real-time voice-agent use cases and Nova-3 for highest-accuracy general transcription, especially for noisy, multilingual, far-field, or crosstalk-heavy audio. Base is a family of task-oriented variants rather than one unified model: base-general (default), base-meeting, base-phonecall, base-voicemail, base-finance, base-conversationalai, and base-video.
Capabilities and features
Base variants and their documented positioning:
| Base variant | Official positioning | Supported languages and dialect tags |
|---|---|---|
| base or base-general | Everyday audio processing; default Base model | Chinese zh, zh-CN, zh-TW; Danish da; Dutch nl; English en, en-US; French fr, fr-CA; German de; Hindi hi, hi-Latn; Indonesian id; Italian it; Japanese ja; Korean ko; Norwegian no; Polish pl; Portuguese pt, pt-BR, pt-PT; Russian ru; Spanish es, es-419, es-LATAM; Swedish sv; Tamasheq taq; Turkish tr; Ukrainian uk |
| base-meeting | Conference-room audio with multiple speakers and one microphone | English en, en-US |
| base-phonecall | Low-bandwidth phone calls | English en, en-US |
| base-voicemail | Low-bandwidth single-speaker audio; derived from phonecall | English en, en-US |
| base-finance | Earnings-call style, multiple speakers, finance-heavy vocabulary | English en, en-US |
| base-conversationalai | Human speaking to an automated bot, IVR, assistant, kiosk | English en, en-US |
| base-video | Audio sourced from video | English en, en-US |
Transcript-layer features documented for Base:
| Feature | How it works in Deepgram docs | Documented note |
|---|---|---|
| Punctuation and casing | punctuate=true adds punctuation and capitalization | Readability improvement for Base pipelines |
| Smart formatting | smart_format=true adds richer formatting for readability, including dates/currency style transformations | Deepgram's pricing page treats smart formatting as included on current STT pricing tiers |
| Word timestamps | words array returns start and end per word | Suitable for subtitles, searchable transcripts, and timeline alignment |
| Confidence scores | Transcript-level and word-level confidence values are returned on a 0-1 scale | Returned in the standard response structure |
| Speaker diarization | diarize_model enables speaker change detection and labels words by speaker number; batch supports latest/v1/v2, streaming supports latest/v1 | Streaming diarization uses older v1 while batch can use v2 |
| Channel separation | multichannel=true transcribes each channel independently | Applies when audio is already channel-separated, such as stereo telephony |
| Profanity filtering | profanity_filter=true converts recognized profanity to the nearest non-profane word or removes it | Alters literal transcript content |
| Utterance segmentation | utterances=true segments speech into semantic units | Used for subtitles, agent-assist panes, and UI chunking |
| Redaction | redact= can redact PII/PHI/PCI classes; Deepgram documents 50+ entity types | Distinct from profanity filtering |
The pre-recorded API returns transcript alternatives that include a transcript-level confidence number plus per-word {word, start, end, confidence} objects. Example responses include speaker, speaker_confidence, and punctuated_word fields when diarization and formatting are enabled.
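As a sketch of how the per-word fields can be consumed, the code below groups words into consecutive same-speaker turns, for example to render a diarized transcript. It assumes the JSON layout results.channels[i].alternatives[0].words used in Deepgram's pre-recorded responses; Turn and speaker_turns are hypothetical helper names, not part of any Deepgram SDK. When diarization is off, speaker is absent and every word falls into a single turn with speaker None.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Turn:
    speaker: Optional[int]  # None when diarization was not requested
    start: float            # seconds, from the first word's "start"
    end: float              # seconds, from the last word's "end"
    text: str


def speaker_turns(response: Dict[str, Any], channel: int = 0) -> List[Turn]:
    """Group the first alternative's words into consecutive same-speaker turns."""
    alt = response["results"]["channels"][channel]["alternatives"][0]
    turns: List[Turn] = []
    for w in alt.get("words", []):
        # Prefer the formatted token when punctuation/formatting is enabled
        token = w.get("punctuated_word", w["word"])
        spk = w.get("speaker")
        if turns and turns[-1].speaker == spk:
            # Same speaker as the previous word: extend the current turn
            turns[-1].end = w["end"]
            turns[-1].text += " " + token
        else:
            turns.append(Turn(spk, w["start"], w["end"], token))
    return turns
```

Keying turns on the speaker value alone keeps the logic simple; a production pipeline might also split turns on long pauses using the start/end gaps.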
Customization
Base supports classic keyword boosting and suppression through the keywords parameter. Keyterm prompting is Nova-3 only, and Deepgram's "instant self-serve customization without model retraining" messaging applies to Nova-3, not Base. Deepgram also supports account-linked custom trained models via custom_id; the model-options page says these custom models are available only to Enterprise customers.
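A minimal sketch of building a query string with keyword boosting, assuming the keywords=TERM:INTENSIFIER form from Deepgram's keywords documentation, where a positive intensifier boosts a term and a negative one suppresses it, and the parameter is repeated once per term. The keyword_query helper is hypothetical; verify the exact syntax and allowed intensifier range against current docs.

```python
from typing import Dict
from urllib.parse import urlencode


def keyword_query(model: str, boosts: Dict[str, float]) -> str:
    """Build a /v1/listen query with one keywords=term:intensifier pair per term."""
    params = [("model", model)]
    for term, intensifier in boosts.items():
        # Positive values boost recognition of the term; negative values suppress it
        params.append(("keywords", f"{term}:{intensifier:g}"))
    # urlencode percent-encodes the ":" separator, which the server decodes
    return urlencode(params)
```

For example, keyword_query("base-general", {"Deepgram": 2, "um": -5}) boosts a brand name and suppresses a filler word in a single request.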
Language support
The base-general variant supports the following languages and dialect tags: Chinese (zh, zh-CN, zh-TW), Danish (da), Dutch (nl), English (en, en-US), French (fr, fr-CA), German (de), Hindi (hi, hi-Latn), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Norwegian (no), Polish (pl), Portuguese (pt, pt-BR, pt-PT), Russian (ru), Spanish (es, es-419, es-LATAM), Swedish (sv), Tamasheq (taq), Turkish (tr), and Ukrainian (uk).
All specialty variants (base-meeting, base-phonecall, base-voicemail, base-finance, base-conversationalai, base-video) support English (en, en-US) only.
Deepgram's language documentation states that its English models are designed to handle global English accents and dialects, but transcript output is normalized to standardized American spelling.
Performance and benchmarks
Deepgram's current public documentation does not publish an official Base-specific WER or CER figure. It publishes detailed accuracy claims for Nova-2 and Nova-3, but no equivalent single-number benchmark for Base.
| Source | Metric | Result | Relevance |
|---|---|---|---|
| Deepgram streaming latency guide | Typical transcription latency | 150-300 ms transcription latency; 200-500 ms total transcript latency | Applies to Deepgram streaming workloads generally, including the Base-family streaming endpoint |
| Deepgram API rate limits | Base concurrency | Pay As You Go: 50 concurrent pre-recorded / 150 concurrent streaming; Growth: 50 / 225 in North America; Enterprise starts at 200 / 300 | Base remains a supported production model family in 2026 |
| Deepgram batch autoscaling guidance | Throughput | Deepgram states its self-hosted batch engine can transcribe 1 hour of audio in under 30 seconds | Platform-level throughput guidance, not a Base-only number |
| Deepgram 2023 Whisper benchmark | WER on 254 real-world phone-call/meeting files | Vendor-reported: Deepgram Enhanced 10.6% WER, Nova-2 8.4% WER; Whisper sizes ranged 13.1-15.3% WER in that study | Official benchmark, not Base-specific |
| WhisperX paper | Speedup on long-form transcription | WhisperX reports a 12x transcription speedup using VAD segmentation and batched inference | Competitor throughput reference for Whisper-family pipelines |
| 2026 independent named-entity audit of speech providers | High-stakes proper-noun difficulty | Third-party evaluation: study included base-general and base-phonecall; across 15 ASR models the average transcription error rate on street/business names was 44% | Not a vanilla WER benchmark; measures proper nouns and address-like entities in production audio |
Latency and throughput
Deepgram's current guidance states that streaming transcription latency is optimized to 300 ms or less, with a typical breakdown of 150-300 ms transcription latency and 200-500 ms total transcript latency end to end depending on network, buffering, and client-side processing. Deepgram recommends sending audio in 20-100 ms chunks; larger buffers increase built-in delay, while smaller chunks increase overhead.
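To turn the 20-100 ms chunking guidance into byte counts for raw audio, multiply samples per chunk by channels and bytes per sample. The sketch below covers only two raw encodings (16-bit linear PCM and 8-bit mu-law); chunk_bytes is a hypothetical helper, and containerized formats such as WAV or Opus are out of scope.

```python
# Bytes per sample for two common raw encodings (linear16 is 16-bit PCM, mulaw is 8-bit)
BYTES_PER_SAMPLE = {"linear16": 2, "mulaw": 1}


def chunk_bytes(encoding: str, sample_rate: int, channels: int = 1, chunk_ms: int = 20) -> int:
    """Size in bytes of one chunk of raw, headerless audio."""
    if not 20 <= chunk_ms <= 100:
        # Enforces the recommended range only; this is guidance, not an API limit
        raise ValueError("chunk_ms outside the recommended 20-100 ms range")
    samples_per_channel = sample_rate * chunk_ms // 1000
    return samples_per_channel * channels * BYTES_PER_SAMPLE[encoding]
```

For 16 kHz mono linear16, a 20 ms chunk is 640 bytes; for 8 kHz mu-law telephony audio it is 160 bytes, rising to 800 bytes at 100 ms.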
For pre-recorded jobs, Deepgram's quickstart documentation notes a 2 GB maximum file size and states that requests whose processing exceeds 10 minutes for Nova/Base/Enhanced can return a 504 Gateway Timeout.
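A hedged sketch of guarding against both limits: reject files over the size cap before upload, and recognize a 504 so the caller can resubmit asynchronously. It assumes "2 GB" means 2 GiB and uses Deepgram's callback query parameter, which asks Deepgram to POST the result to a URL you host; prerecorded_query and is_processing_timeout are hypothetical helpers, and switching to a callback is one plausible mitigation rather than a documented guarantee.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Assumption: "2 GB" is read here as 2 GiB; confirm the exact limit in current docs.
MAX_PRERECORDED_BYTES = 2 * 1024**3


def prerecorded_query(size_bytes: int, model: str = "base-general",
                      callback_url: Optional[str] = None) -> str:
    """Validate upload size and build the query string for a /v1/listen job."""
    if size_bytes > MAX_PRERECORDED_BYTES:
        raise ValueError(f"{size_bytes} bytes exceeds the documented pre-recorded limit")
    params: Dict[str, str] = {"model": model}
    if callback_url:
        # With a callback, the transcript is delivered to this URL asynchronously
        params["callback"] = callback_url
    return urlencode(params)


def is_processing_timeout(status_code: int) -> bool:
    """A 504 indicates processing exceeded the synchronous window."""
    return status_code == 504
```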
Deepgram states its self-hosted batch engine can transcribe 1 hour of audio in under 30 seconds; this is platform-level guidance, not a Base-only figure.
Concurrency limits documented for Base: Pay As You Go allows 50 concurrent pre-recorded and 150 concurrent streaming requests; Growth allows 50 / 225 in North America; Enterprise starts at 200 / 300.
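One way to stay under a plan's concurrency ceiling on the client side is an asyncio semaphore sized to the limit, for example 50 for Pay As You Go pre-recorded requests. In this sketch, run_bounded is a hypothetical helper and each job is any zero-argument coroutine factory, such as a function that submits one transcription request.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def run_bounded(jobs: Iterable[Callable[[], Awaitable[T]]], limit: int = 50) -> List[T]:
    """Run jobs concurrently with at most `limit` in flight; results keep input order."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        # Each job waits for a free slot before starting its request
        async with sem:
            return await job()

    return list(await asyncio.gather(*(guarded(j) for j in jobs)))
```

A client-side cap does not replace handling 429 responses; it only reduces how often the server-side limit is hit.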
Deployment and integrations
API surface
Base uses the same core STT interfaces as the rest of Deepgram's classic transcription stack.
| Use case | Endpoint | Input style | Output style | Notes |
|---|---|---|---|---|
| Batch transcription | POST https://api.deepgram.com/v1/listen | JSON with url, or direct file upload | JSON response or async callback response | Supports features like punctuation, diarization, redaction, topics, intents, and utterances |
| Live transcription | wss://api.deepgram.com/v1/listen | Continuous audio over WebSocket | Incremental JSON events | Supports interim results, endpointing, speech_final, UtteranceEnd, keepalive/finalize flow |
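Authentication sends a Deepgram API key in the HTTP Authorization header using the Token scheme, for example Authorization: Token YOUR_DEEPGRAM_API_KEY.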
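A minimal standard-library sketch of a pre-recorded request for a remotely hosted file: POST to /v1/listen with the model and features as query parameters, a JSON body carrying the audio url, and the API key in an Authorization: Token header. The EU host is the api.eu.deepgram.com endpoint mentioned under privacy and compliance; build_listen_request and the region keys are hypothetical, and feature values are passed as strings.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical region keys mapped to Deepgram's North America and EU API hosts
HOSTS = {"na": "https://api.deepgram.com", "eu": "https://api.eu.deepgram.com"}


def build_listen_request(api_key: str, audio_url: str, region: str = "na",
                         **features: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/listen request for a remote audio URL."""
    params = {"model": "base-general", **features}
    url = f"{HOSTS[region]}/v1/listen?{urlencode(params)}"
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Token {api_key}", "Content-Type": "application/json"},
    )
```

Sending is then a matter of calling urllib.request.urlopen(req) and decoding the JSON response; for direct file upload the body would be the raw audio bytes with an audio Content-Type instead of the JSON url object.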
For pre-recorded audio, a request can send a JSON body with a remote URI or upload binary audio/video directly. Responses return transcript alternatives, overall confidence, and per-word timing/confidence data. Streaming returns a sequence of WebSocket messages including transcript updates plus SpeechStarted, UtteranceEnd, and metadata events.
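The streaming event flow can be sketched as a small state machine that keeps the latest interim hypothesis, records finalized text, and closes an utterance on speech_final or an UtteranceEnd event. It assumes Deepgram-style messages with a type field ("Results", "UtteranceEnd", "SpeechStarted", "Metadata") and Results payloads carrying is_final, speech_final, and channel.alternatives[0].transcript; TranscriptAssembler is a hypothetical class, and SpeechStarted and Metadata are simply ignored here.

```python
import json
from typing import List


class TranscriptAssembler:
    """Accumulate streaming transcript messages into finalized utterances (sketch)."""

    def __init__(self) -> None:
        self.finals: List[str] = []      # every finalized segment, in order
        self.utterances: List[str] = []  # segments joined at utterance boundaries
        self.interim = ""                # latest non-final hypothesis, for live display
        self._pending: List[str] = []    # finalized segments not yet closed into an utterance

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "Results":
            text = msg["channel"]["alternatives"][0]["transcript"]
            if msg.get("is_final"):
                self.interim = ""
                if text:
                    self.finals.append(text)
                    self._pending.append(text)
                if msg.get("speech_final"):
                    self._close_utterance()
            else:
                self.interim = text
        elif kind == "UtteranceEnd":
            self._close_utterance()
        # SpeechStarted and Metadata could drive UI state; ignored in this sketch

    def _close_utterance(self) -> None:
        if self._pending:
            self.utterances.append(" ".join(self._pending))
            self._pending = []
```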
Deepgram publishes official SDKs for at least JavaScript/TypeScript and Python, and its self-hosted and STT guides also include examples in .NET, Go, and Java.
Deployment paths
| Deployment path | What the sources support |
|---|---|
| Deepgram-hosted cloud API | Primary/default path; North America endpoint plus EU endpoint for residency considerations |
| Self-hosted in customer cloud or on-prem | Deepgram documents self-hosted deployments running on customer infrastructure, cloud or on-prem |
| Amazon SageMaker | Deepgram documents deployment via SageMaker and provides autoscaling/batch guidance |
| Kubernetes / cloud VM patterns | Deepgram documents deployment guidance for GCP/Kubernetes-oriented environments and self-hosted patterns generally |
| Edge / on-device | No separate public on-device Base runtime found in the reviewed sources; the closest documented private-deployment option is self-hosted infrastructure |
Deepgram's self-hosted hardware guidance for STT lists a recommended baseline of 1 NVIDIA GPU with compute capability 7.0+, 16 GB VRAM, 4 CPU cores, 32 GB RAM, and 50 GB storage. The self-hosted docs also note that authentication is not built in for self-hosted deployments, so teams place Deepgram behind their own API gateway, reverse proxy, or network controls.
Cloud and pipeline integrations
Documented cloud integration patterns include S3-backed batch pipelines using presigned URLs, SageMaker for managed deployment on AWS, self-hosted cloud/on-prem for private inference, and regional endpoints for residency. Deepgram publishes migration guides from AWS Transcribe, Google Speech-to-Text, and OpenAI Whisper to Deepgram.
For real-time media pipelines, an official Twilio/Deepgram guide identifies 8 kHz raw mu-law as the audio format Twilio sends, which matches the low-bandwidth phone-call audio that base-phonecall is positioned for. Deepgram also documents a LiveKit integration path for agent use cases.
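A hedged sketch of the bridge's two halves: a streaming URL that declares raw 8 kHz mono mu-law and selects base-phonecall, and a decoder that extracts audio bytes from Twilio Media Streams frames. It assumes Twilio's JSON frame shape, where "media" events carry base64 audio in media.payload; twilio_listen_url and twilio_media_to_bytes are hypothetical helpers, and the WebSocket plumbing between the two is omitted.

```python
import base64
import json
from typing import Optional
from urllib.parse import urlencode


def twilio_listen_url(host: str = "wss://api.deepgram.com") -> str:
    """Streaming URL declaring raw telephony audio so no container header is needed."""
    params = {
        "model": "base-phonecall",
        "encoding": "mulaw",    # 8-bit mu-law, as Twilio media streams deliver it
        "sample_rate": "8000",  # 8 kHz telephony audio
        "channels": "1",
        "punctuate": "true",
    }
    return f"{host}/v1/listen?{urlencode(params)}"


def twilio_media_to_bytes(frame: str) -> Optional[bytes]:
    """Return raw mu-law bytes from a Twilio "media" frame, or None for other events."""
    msg = json.loads(frame)
    if msg.get("event") != "media":
        return None  # e.g. "connected", "start", "stop" frames carry no audio
    return base64.b64decode(msg["media"]["payload"])
```

Declaring encoding and sample_rate matters because raw mu-law has no header from which the format could be inferred.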
For Azure, the reviewed materials point toward self-hosted deployment on Azure infrastructure; Deepgram's self-hosted hardware guidance includes Azure GPU-instance examples. The reviewed docs do not expose a separate Azure-managed Deepgram product surface.
Privacy and compliance
Deepgram's public pricing/security materials state that the platform is SOC 2 Type 1 and Type 2 certified, HIPAA compliant with BAAs for Enterprise customers handling ePHI, GDPR ready with an EU endpoint (api.eu.deepgram.com), and CCPA compliant.
On retention and privacy mechanics, the reviewed public docs describe data-residency options, a model-improvement opt-out (mip_opt_out), and flexible retention language in trust/security materials, but they do not present one globally applicable default-retention statement for Base.
Pricing
| Pricing or size question | What the official sources show |
|---|---|
| Current public self-serve Base price | Not listed on the 2026 public pricing page; the current STT price table lists Flux, Nova-3 Monolingual, Nova-3 Multilingual, and Custom |
| Current public Base parameter count | Not disclosed in the model docs reviewed |
| Current public plan tiers | Pay As You Go, Growth, and Enterprise exist at the platform level |
| Historical official Base pricing evidence | Deepgram's 2023 benchmark whitepaper included historical Base enterprise price bands for batch and streaming; these are historical, not current list prices |
Platform-level plan details on the current public pricing page: Pay As You Go includes a $200 free credit, Growth starts at $4K+/year with pre-paid credits, and Enterprise is custom. Historically, Deepgram's 2023 benchmark whitepaper showed Base price bands materially below one dollar per hour of audio, varying by annual volume; those figures are not the current 2026 public price card.
Development and ownership
Deepgram develops and operates the Base model family as part of its speech-to-text platform. Base is built on Deepgram's end-to-end speech-to-text architecture. Release date, training data, and parameter counts are not publicly disclosed in the reviewed sources.
Release history
Not publicly disclosed. The reviewed sources establish that Base is listed under Legacy Models in Deepgram's 2026 documentation, that it remains operational in the public API surface with base-general as the default model parameter, and that Deepgram's 2023 benchmark whitepaper contained historical Base pricing; the sources do not state original release dates or version history for the Base family.
Sources
- Models & Languages Overview: https://developers.deepgram.com/docs/models-languages-overview
- Deepgram Pricing | Scalable Speech-to-Text, Text-to-Speech & Voice Agent APIs: https://deepgram.com/pricing
- Model Options | Deepgram's Docs: https://developers.deepgram.com/docs/model
- Pre-Recorded Audio | Deepgram's Docs: https://developers.deepgram.com/reference/speech-to-text/listen-pre-recorded
- Official JavaScript SDK for Deepgram: https://developers.deepgram.com/docs/js-sdk-v2-to-v3-migration-guide
- Live Audio | Deepgram's Docs: https://developers.deepgram.com/reference/speech-to-text/listen-streaming
- Getting Started with Speech to Text: https://developers.deepgram.com/docs/stt/getting-started
- Measuring STT Latency | Deepgram's Docs: https://developers.deepgram.com/docs/measuring-streaming-latency
- API Rate Limits | Deepgram's Docs: https://developers.deepgram.com/reference/api-rate-limits
- Auto-Scaling: https://developers.deepgram.com/docs/autoscaling-best-practices
- Deepgram vs Whisper Benchmark whitepaper: https://offers.deepgram.com/hubfs/Whitepaper%20Deepgram%20vs%20Whisper%20Benchmark.pdf
- WhisperX: Time-Accurate Speech Transcription of Long-Form Audio: https://arxiv.org/abs/2303.00747
- "Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most: https://arxiv.org/html/2602.12249v2
- Getting Started | Deepgram's Docs: https://developers.deepgram.com/docs/pre-recorded-audio
- Supported Entity Types: https://developers.deepgram.com/docs/supported-entity-types
- Determining Your Audio Format for Live Streaming Audio: https://developers.deepgram.com/docs/determining-your-audio-format-for-live-streaming-audio
- Audio Keep Alive: https://developers.deepgram.com/docs/audio-keep-alive
- Recovering From Connection Errors & Timeouts When Live Streaming | Deepgram's Docs: https://developers.deepgram.com/docs/recovering-from-connection-errors-and-timeouts-when-live-streaming-audio
- Endpointing | Deepgram's Docs: https://developers.deepgram.com/docs/endpointing
- Introducing Whisper: https://openai.com/index/whisper/
- Ingress Authentication: https://developers.deepgram.com/docs/self-hosted-ingress-auth
- Amazon Web Services | Deepgram's Docs: https://developers.deepgram.com/docs/aws-docker-podman
- Google Cloud Platform | Deepgram's Docs: https://developers.deepgram.com/docs/gcp-k8s
- AWS S3 Presigned URLs and Deepgram: https://developers.deepgram.com/docs/using-aws-s3-presigned-urls-with-the-deepgram-api
- Twilio and Deepgram Voice Agent: https://developers.deepgram.com/docs/twilio-and-deepgram-voice-agent
- Robust Speech Recognition via Large-Scale Weak Supervision: https://arxiv.org/abs/2212.04356
- Chirp 3 Transcription: Enhanced multilingual accuracy: https://docs.cloud.google.com/speech-to-text/docs/models/chirp-3
- Amazon Transcribe - Speech to Text - AWS: https://aws.amazon.com/transcribe/