Deepgram Flux: turn detection moves inside the speech model

Most voice agents in production today run the same brittle stack: a streaming ASR model, an external voice activity detector, a pile of silence-timeout rules, and custom orchestration code to guess when the caller has finished talking. Deepgram Flux is a bet that this whole layer belongs inside the recognition model itself. Flux is a real-time conversational speech recognition model that combines streaming speech-to-text with model-native turn detection, interruption handling, and a structured turn-state machine, all delivered over a WebSocket API. Deepgram is not shy about the framing: it calls this a new category beyond conventional ASR, pitches Flux directly against the "stitched pipeline" architecture, and positions it alongside its broader Voice Agent API stack.

Whether you buy the category framing or not, the engineering claim underneath it is specific and testable, and that makes Flux worth a close look.

What Flux is and when it shipped

Flux is Deepgram's "first conversational speech recognition model built specifically for voice agents." The company describes it as a streaming model that knows "when to listen, when to think, and when to speak," combining transcription with integrated end-of-turn detection and barge-in awareness. In Deepgram's official model overview, Flux is the recommendation for real-time agents, customer support bots, and interactive turn-based experiences. Nova-3 remains the recommendation for general transcription, meetings, multilingual noisy audio, and far-field use cases that do not need turn-aware behavior.

The launch chronology is well documented. Deepgram published its launch article on October 1, 2025, which is the clearest official public release date. The developer changelog entry followed on October 2, 2025, and Cloudflare announced availability on Workers AI the same day. The next major milestone came on April 29, 2026, when Deepgram announced general availability of Flux Multilingual, which supports 10 languages in a single streaming model.

Two gaps in the public record are worth flagging. Deepgram's materials do not expose a formal preview or beta page for Flux before October 2025, so any earlier private preview dates are unspecified. And while the docs say Flux is available for self-hosted deployments, no single launch post pins down the first self-hosted release date; later changelog entries do document self-hosted metrics and fixes.

There is also useful prehistory. Deepgram launched a Voice Agent API in 2024 with "end-of-thought detection," and the company's technical story repeatedly roots Flux in experience from that stack. Deepgram says it spent "the last two years rethinking how transcription should work for real-time voice agents" before launch.

Who built it

Deepgram does not publish a Flux org chart, but the public bylines tell a reasonably clear story. The most visible product owner is Nick Kaimakis, credited on the launch article as Senior Product Manager. His official Deepgram author page confirms the role, and Twilio SIGNAL's speaker page goes further, describing him as Senior Product Manager, STT, and saying he leads Speech-to-Text at Deepgram, including Flux.

The most visible technical lead is Jack Kearney, Staff Research Scientist, who wrote the "Flux Chronicles" architecture post and the reinforcement-learning update post, and coauthored writeups on turn-detection evaluation and keyterm boosting. Public coauthors on Flux technical work include Chau Luu (Senior Research Scientist), Federico Landini (Research Scientist), and Julia Strout (Deep Learning Engineer). That byline pattern suggests a team spanning product, research science, and applied deep-learning engineering, but a full internal ownership map is not public.

The motivation is unusually explicit for a vendor. In its explainer introducing the term CSR (conversational speech recognition), Deepgram argues that ASR was built for transcription, not dialogue: real conversations contain pauses that do not mean "I'm done," plus overlapping speech and interruptions, so voice agents need turn-aware recognition and sub-second responsiveness rather than a bare word stream. The stated design goals were better end-of-turn accuracy than external detectors, lower conversational latency, fewer false interruptions, and simpler integration. Kearney's technical post frames Flux as a step toward a more integrated speech-to-speech future, and Deepgram's Coval post places it inside a broader "Neuroplex" architecture meant to connect STT, LLMs, and TTS with shared context signals. Treat Neuroplex as roadmap and research framing, not a documented Flux implementation spec.

The architecture: a state machine, not a transcript stream

At the API level, Flux uses Deepgram's /v2/listen WebSocket endpoint rather than the older /v1/listen, with model names flux-general-en and flux-general-multi. Deepgram recommends 80 ms audio chunks and mono audio, and supports common telephony and streaming encodings: linear16, linear32, mulaw, alaw, opus, and ogg-opus, plus containerized WAV, Ogg/Opus, and, since January 2026, WebM/Opus.
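To make the 80 ms recommendation concrete, here is a minimal sketch of the arithmetic for raw mono audio. The host name and the query parameter names (model, encoding, sample_rate) are assumptions for illustration and should be checked against Deepgram's API reference; only the /v2/listen path, the model names, and the encoding names come from the material above.

```python
from urllib.parse import urlencode

# Assumed host; the article only names the /v2/listen path.
BASE_URL = "wss://api.deepgram.com/v2/listen"

# Bytes per sample for the raw encodings with a fixed sample width.
BYTES_PER_SAMPLE = {"linear16": 2, "linear32": 4, "mulaw": 1, "alaw": 1}


def listen_url(model: str = "flux-general-en", encoding: str = "linear16",
               sample_rate: int = 16000) -> str:
    """Build the WebSocket URL with stream parameters as a query string (names assumed)."""
    query = urlencode({"model": model, "encoding": encoding, "sample_rate": sample_rate})
    return f"{BASE_URL}?{query}"


def chunk_size_bytes(sample_rate: int, encoding: str = "linear16", chunk_ms: int = 80) -> int:
    """Size in bytes of one mono chunk covering chunk_ms of audio."""
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]


def iter_chunks(pcm: bytes, size: int):
    """Yield consecutive chunks of raw audio; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At 16 kHz linear16, an 80 ms chunk works out to 1,280 samples, or 2,560 bytes. At 8 kHz mulaw, the common telephony case, it is 640 bytes.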

The interesting part is what comes back. Instead of the usual partial/final transcript stream, Flux emits a turn lifecycle: Update, StartOfTurn, EagerEndOfTurn, TurnResumed, and EndOfTurn. Update messages arrive roughly every 0.25 seconds of transcribed audio. EagerEndOfTurn is optional and only appears if configured. TurnResumed only ever follows an EagerEndOfTurn. And Deepgram guarantees that the final EndOfTurn transcript matches the immediately preceding EagerEndOfTurn transcript unless a TurnResumed intervenes. That guarantee is what makes speculative LLM calls practical: you can start generating a response on EagerEndOfTurn and only throw the work away if the caller resumes. Deepgram also recommends StartOfTurn for barge-in handling, on the grounds that it is more reliable than external VAD and always carries a non-empty transcript.
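The ordering rules above are easy to check mechanically, which is useful when replaying recorded sessions in tests. The sketch below reduces each message to a hypothetical (event name, transcript) pair; the real message schema carries more fields and is not reproduced here. It also assumes that Update messages do not affect the match between EagerEndOfTurn and EndOfTurn, which the text above does not state either way.

```python
from typing import Iterable, List, Optional, Tuple

# Simplified, hypothetical message shape: (event name, transcript).
Event = Tuple[str, str]

KNOWN_EVENTS = {"Update", "StartOfTurn", "EagerEndOfTurn", "TurnResumed", "EndOfTurn"}


def check_turn_sequence(events: Iterable[Event]) -> List[str]:
    """Return descriptions of any violations of the documented ordering rules."""
    problems: List[str] = []
    eager_transcript: Optional[str] = None  # set while an EagerEndOfTurn is pending
    for index, (name, transcript) in enumerate(events):
        if name not in KNOWN_EVENTS:
            problems.append(f"{index}: unknown event {name!r}")
        elif name == "StartOfTurn":
            if not transcript:
                problems.append(f"{index}: StartOfTurn with an empty transcript")
            eager_transcript = None
        elif name == "EagerEndOfTurn":
            eager_transcript = transcript
        elif name == "TurnResumed":
            if eager_transcript is None:
                problems.append(f"{index}: TurnResumed without a pending EagerEndOfTurn")
            eager_transcript = None  # the speculation is void; the turn continues
        elif name == "EndOfTurn":
            if eager_transcript is not None and transcript != eager_transcript:
                problems.append(f"{index}: EndOfTurn transcript differs from EagerEndOfTurn")
            eager_transcript = None
        # Update messages are treated as neutral here (an assumption).
    return problems
```

A session that goes StartOfTurn, EagerEndOfTurn, TurnResumed, EndOfTurn with a longer final transcript passes, because the TurnResumed cancels the match requirement.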

In practice the signal flow looks like this: client audio goes into the /v2/listen WebSocket and through the Flux recognition core, which emits TurnInfo updates and the turn events above. An EagerEndOfTurn can trigger a speculative LLM call; a confirmed EndOfTurn hands the transcript to your LLM and business logic, which feeds TTS for the agent's response. A Configure control message flows back into the recognition core mid-stream.
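The application side of that flow can be sketched as a small event router. Everything here except the event names is hypothetical: generate, speak, and stop_speaking stand in for your LLM and TTS calls, and a production version would make the LLM call asynchronous and cancellable rather than synchronous as shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnOrchestrator:
    """Routes Flux turn events to LLM and TTS hooks (all hooks are hypothetical)."""
    generate: Callable[[str], str]      # stand-in for an LLM call
    speak: Callable[[str], None]        # stand-in for starting TTS playback
    stop_speaking: Callable[[], None]   # stand-in for cutting TTS playback
    agent_speaking: bool = False        # real code also needs a playback-finished hook
    _draft_for: Optional[str] = None    # transcript the draft reply was built from
    _draft: Optional[str] = None        # speculative reply, if any

    def on_event(self, name: str, transcript: str) -> None:
        if name == "StartOfTurn":
            # Barge-in: the caller started talking, so cut the agent off.
            if self.agent_speaking:
                self.stop_speaking()
                self.agent_speaking = False
        elif name == "EagerEndOfTurn":
            # Speculate: build the reply before the turn is confirmed.
            self._draft_for = transcript
            self._draft = self.generate(transcript)
        elif name == "TurnResumed":
            # The caller kept going, so the speculative reply is stale.
            self._draft_for = self._draft = None
        elif name == "EndOfTurn":
            # Reuse the draft only if it was built from the same transcript.
            if self._draft is not None and self._draft_for == transcript:
                reply = self._draft
            else:
                reply = self.generate(transcript)
            self._draft_for = self._draft = None
            self.speak(reply)
            self.agent_speaking = True
```

The design choice worth noting is the transcript comparison on EndOfTurn. It relies on the documented guarantee but still falls back to a fresh LLM call if the transcripts differ, so a missed TurnResumed cannot cause a stale reply to be spoken.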

That Configure message is one of the more practical additions. Keyterms and thresholds can be updated mid-stream without disconnecting, which the docs describe as "context injection for speech recognition." The use cases are concrete: switching vocabulary when a call moves into OTP collection, medical terminology, or product-specific jargon. The tunable thresholds are eot_threshold, eager_eot_threshold, and eot_timeout_ms. A CloseStream control message forces final processing before shutdown.
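As an illustration, a client might serialize these control messages as below. The JSON layout is an assumption (a "type" field, thresholds nested under "thresholds", and a "keyterms" list); only the message names and the three threshold names come from the description above, so check the exact schema in the API spec before relying on it.

```python
import json
from typing import Optional, Sequence


def configure_message(keyterms: Optional[Sequence[str]] = None,
                      eot_threshold: Optional[float] = None,
                      eager_eot_threshold: Optional[float] = None,
                      eot_timeout_ms: Optional[int] = None) -> str:
    """Serialize a Configure message containing only the settings being changed."""
    thresholds = {
        name: value
        for name, value in (
            ("eot_threshold", eot_threshold),
            ("eager_eot_threshold", eager_eot_threshold),
            ("eot_timeout_ms", eot_timeout_ms),
        )
        if value is not None
    }
    message: dict = {"type": "Configure"}  # assumed envelope
    if thresholds:
        message["thresholds"] = thresholds  # assumed nesting
    if keyterms is not None:
        message["keyterms"] = list(keyterms)
    return json.dumps(message)


def close_stream_message() -> str:
    """Serialize the CloseStream control message (assumed envelope)."""
    return json.dumps({"type": "CloseStream"})
```

In the OTP example, the application would send something like configure_message(keyterms=["one-time passcode"]) when the call enters the verification step, then send a second Configure with the original vocabulary afterwards.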

Flux Multilingual extends the same surface to 10 languages (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch) through a single model and connection. It adds language_hint biasing and per-turn language reporting in TurnInfo.languages and TurnInfo.languages_hinted. Deepgram's claim is that this removes the need for separate language-detection services and model-routing logic, which is a real operational simplification if it holds up in your traffic.
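A rough sketch of how an application might use these fields follows. It rests on two loudly stated assumptions that the text above does not confirm: that language_hint can be repeated as a query parameter, and that TurnInfo.languages arrives as a list of language codes with the dominant language first.

```python
from typing import Iterable, Mapping
from urllib.parse import urlencode


def multilingual_query(hints: Iterable[str]) -> str:
    """Query string for the multilingual model with repeated language_hint params (assumed)."""
    params = [("model", "flux-general-multi")]
    params += [("language_hint", hint) for hint in hints]
    return urlencode(params)


def reply_language(turn_info: Mapping, default: str = "en") -> str:
    """Pick the agent's reply language from a TurnInfo-like dict.

    Assumes turn_info["languages"] is a list of codes, dominant language first.
    """
    languages = turn_info.get("languages") or []
    return languages[0] if languages else default
```

The point of the second helper is the operational simplification Deepgram describes: the per-turn language comes from the same message as the transcript, so the application can choose a TTS voice or prompt language without a separate detection call.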

Tooling is reasonably complete: official docs, an OpenAPI/AsyncAPI mirror in deepgram-api-specs, official JavaScript/TypeScript, Python, .NET, and Go SDKs, the deepgram/recipes and deepgram/skills repos, deepgram/starter-contracts, and several Flux demos (deepgram-demos-flux-streaming-transcription, deepgram-demos-flux-streaming, deepgram-demos-flux-agent, deepgram-demos-composite-flux-agent). The docs also cover raw WebSocket integration for developers building their own clients.

On the integration side, Flux runs on Cloudflare Workers AI, and Deepgram says Flux Multilingual is supported through Twilio, Vapi, LiveKit, Pipecat, and Jambonz. Deepgram's build guides show Flux composed with OpenAI and Deepgram TTS, and the company increasingly frames the choice between a composable Flux stack and its bundled Voice Agent API as an architectural decision rather than an upsell.

Where Flux sits: against Nova-3 and against competitors

Inside Deepgram's own catalog the split is simple: Flux for conversation, Nova-3 for transcription. Nova-3 is Deepgram's highest-performing general-purpose ASR and the pick for meetings, event captioning, and far-field multilingual audio. The migration guide draws the contrast plainly: Nova-3 streams transcript fragments and leaves turn logic to you, while Flux emits conversation events from a built-in turn-state machine.

Flux also sits below the Voice Agent API in the stack. If you want composability, you build Flux plus your own LLM plus your own TTS. If you want a bundled voice-to-voice path, the Voice Agent API packages STT, LLM orchestration, and TTS with unified pricing and built-in interruption handling. Flux is both a stand-alone model and a component of that larger stack.

Against the field, here is how the main options compare.

Product	Core type	Native turn detection	Turn-state events	Language coverage	Self-hosted option	Public pricing signal
Deepgram Flux	Conversational STT for voice agents	Yes; configurable eot_threshold, eager_eot_threshold, eot_timeout_ms	Yes: StartOfTurn, EagerEndOfTurn, TurnResumed, EndOfTurn, Update	English or 10-language multilingual model	Yes	Flux English $0.0065/min PAYG; Flux Multilingual $0.0078/min PAYG
AssemblyAI Universal-3 Pro Streaming	Streaming STT for voice agents	Yes; low-latency turn detection and voice-agent focus	Yes, though simpler message model such as SpeechStarted and turn finalization	6 real-time languages now, more listed as coming soon	Yes, enterprise/self-hosted streaming docs	Pricing page lists Universal-3 Pro at $0.21/hr; streaming billed per open session
Speechmatics Realtime STT	Realtime STT	Yes; configurable EndOfUtterance silence trigger	Yes, but primarily end-of-utterance signaling rather than Flux-like turn lifecycle	55+ languages	Yes; cloud and on-prem/Kubernetes	Pro pricing from $0.24/hr
OpenAI Realtime API	Broader realtime speech-to-speech / transcription API	Yes; server VAD and semantic VAD depending on session/model	Yes: speech started/stopped and automatic buffer commit in server VAD mode	Model-dependent; broader realtime voice stack rather than dedicated CSR STT	No self-hosted option documented	gpt-realtime-whisper $0.017/min; gpt-realtime-2 audio priced per 1M tokens

The differentiator is how much turn-taking logic gets pushed into the recognition layer. AssemblyAI and Speechmatics both market turn-aware streaming STT now, but Deepgram exposes the most explicit state-machine abstraction, built specifically for speculative LLM calls and barge-in logic. OpenAI's Realtime API goes a different direction entirely: broader and more end-to-end, but not a like-for-like stand-alone conversational STT layer. If you want a narrowly composable voice-agent STT stack, AssemblyAI is the closest competitor. For multilingual deployment flexibility, Speechmatics. For a broader realtime voice substrate, OpenAI.

Reception, benchmarks, pricing, and the complaints

The launch endorsements were strong, which is what launch endorsements are for. LiveKit CTO and cofounder David Zhao said Flux "redefines Speech-to-Text by integrating turn detection." Cloudflare called it the "next evolution" of conversational speech recognition and shipped it on Workers AI as a launch partner. Lindy described Flux as close to a drop-in replacement that eliminated complexity around turn-taking and speculative response generation.

The clearest customer case study is Lindy Gaia. Lindy's quotes emphasize reduced latency and less custom code; Deepgram highlights sub-300 ms turn detection with Nova-3-level accuracy. For the multilingual release, Deepgram quotes Twilio's Omar Paul saying customers no longer need to stitch multiple models and complex routing together to deploy globally.

The benchmark story needs more care. Deepgram cites validation from Coval: 50% lower latency to first token than Nova-3, faster and more reliable turn detection, and WER equivalent to Nova-3. Coval does operate a public benchmark site, but the publicly available extracts do not expose the full supporting detail, and most of the quantitative comparisons are mediated through Deepgram's own write-up. These are strong claims, but they are not independently reproducible from the public record alone.

The most interesting external data point cuts the other way. Daily/Pipecat's February 2026 STT benchmark explicitly excluded Flux because its internal turn detection cannot be disabled, and the benchmark was testing STT under external turn-detection control. Daily still called Flux Deepgram's "flagship model" and said it should be included when evaluation does not require application-level turn control. That exclusion captures the tradeoff neatly: the integrated design that makes Flux valuable in production is exactly what makes it hard to benchmark apples-to-apples against conventional STT.

Developer feedback surfaces real rough edges. In Deepgram's own GitHub discussions, users reported missed short utterances, initial clipping after reconnects, trouble recognizing the acronym "PPF" in an automotive ordering flow, dropped short or soft utterances in some multilingual settings, and earlier SDK friction around Flux's requirement for the /v2/listen protocol. None of this invalidates the architecture, but it shows that production quality varied by audio path, session management pattern, language, and domain vocabulary. If your traffic is heavy on short confirmations ("yes," "the second one") or niche acronyms, test those paths specifically before committing.
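One way to test those paths is to replay a labeled set of short utterances and list the ones that did not come back intact. The sketch below is a generic scoring helper, not part of any Deepgram tooling; the function names and the three-word cutoff are illustrative choices.

```python
import re
from typing import Iterable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Yes.' and 'yes' compare equal."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def short_utterance_misses(pairs: Iterable[Tuple[str, str]],
                           max_words: int = 3) -> List[Tuple[str, str]]:
    """Return (expected, recognized) pairs where a short utterance was not matched exactly."""
    misses = []
    for expected, recognized in pairs:
        want = normalize(expected)
        # Only score utterances short enough to count as confirmations or acronyms.
        if len(want.split()) <= max_words and normalize(recognized) != want:
            misses.append((expected, recognized))
    return misses
```

Exact matching is deliberately strict here: for a one-word confirmation or an acronym such as "PPF", a near miss is still a miss from the application's point of view.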

Pricing is straightforward. Flux English is $0.0065/min pay-as-you-go and Flux Multilingual is $0.0078/min, with lower Growth-tier rates. Deepgram ran an "OktoberFLUX" promotion making Flux free during October 2025 for up to 50 concurrent connections. For scale context, Deepgram's company-level numbers are 200,000+ developers, 1,300+ organizations, 50,000+ years of audio processed, and over 1 trillion words transcribed. Those are not Flux-specific figures; no public Flux-only customer, revenue, or usage totals exist.
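For budgeting, the per-minute rates translate into monthly cost with simple arithmetic. The sketch below uses only the pay-as-you-go rates quoted in this article and includes a helper for converting the hourly rates that other vendors publish. Treat any cross-vendor comparison as rough, since billing units differ (for example, per open session versus per audio minute).

```python
from decimal import Decimal

# Pay-as-you-go rates quoted in this article, in USD per minute.
RATES_PER_MIN = {
    "flux-english": Decimal("0.0065"),
    "flux-multilingual": Decimal("0.0078"),
}


def monthly_cost(minutes: int, rate_per_min: Decimal) -> Decimal:
    """Cost in USD for a month's audio minutes, rounded to cents."""
    return (minutes * rate_per_min).quantize(Decimal("0.01"))


def per_minute(rate_per_hour: Decimal) -> Decimal:
    """Convert an hourly list price to a per-minute rate."""
    return rate_per_hour / 60


english = monthly_cost(100_000, RATES_PER_MIN["flux-english"])            # 650.00
multilingual = monthly_cost(100_000, RATES_PER_MIN["flux-multilingual"])  # 780.00
```

At 100,000 minutes a month, Flux English comes to $650 and Flux Multilingual to $780 at list price, before any Growth-tier discount.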

For anyone digging further, the most useful public assets are the launch article, the "Flux Chronicles" technical post, the "Meet Flux" live session with Nick Kaimakis and Jack Kearney, the Coval discussion between Scott and Brooke Hopkins that Deepgram links, and the build guides and demo repos for Flux voice agents with OpenAI and Deepgram TTS.

What remains open

Several things are still unspecified in public sources: the full Flux org structure, the exact original conception date, and a complete list of builders beyond the visible product and research contributors. Flux-specific adoption metrics, whether active customers, call volume, revenue, or an enterprise customer roster, are also not public.

The bigger open question is benchmarking. Deepgram's strongest performance claims lean on its own summary of Coval's work, while the one prominent independent benchmark excluded Flux for methodological reasons. The fair reading: Flux looks meaningfully differentiated for integrated turn-aware voice-agent pipelines, but cross-vendor benchmarking of turn-aware models is architecture-sensitive and not yet standardized.

My own take is that Flux matters less as another STT model and more as a precedent. It formalizes turn-aware recognition as a first-class API surface, and once one vendor guarantees things like "EndOfTurn will match the preceding EagerEndOfTurn," application developers start designing around those guarantees. The real decision for builders is architectural: do you want turn-taking inside the speech model, or do you want to keep that logic in your application stack where you control it? Flux is the strongest available argument for the first answer.

Sources

Getting Started with Flux, Deepgram Docs - https://developers.deepgram.com/docs/flux/quickstart
Introducing Flux: Conversational Speech Recognition, Deepgram - https://deepgram.com/learn/introducing-flux-conversational-speech-recognition
Nick Kaimakis, Deepgram author page - https://deepgram.com/authors/nick-kaimakis
Flux Feature Overview, Deepgram Docs - https://developers.deepgram.com/docs/flux/feature-overview
From ASR to CSR: Why Conversation Changes Everything, Deepgram - https://deepgram.com/learn/from-asr-to-csr-why-conversation-changes-everything
Coval validates Flux: no tradeoff between latency and interruption, Deepgram - https://deepgram.com/learn/coval-validates-flux-no-tradeoff-between-latency-and-interruption
Fluxing Conversational State and Speech-to-Text, Deepgram - https://deepgram.com/learn/fluxing-conversational-state-and-speech-to-text
Evaluating End-of-Turn (Turn Detection) Models, Deepgram - https://deepgram.com/learn/evaluating-end-of-turn-detection-models
Understanding the Flux State Machine, Deepgram Docs - https://developers.deepgram.com/docs/flux/state
Introducing Flux Multilingual, Deepgram - https://deepgram.com/learn/introducing-flux-multilingual
Deepgram API Specs, GitHub - https://github.com/deepgram/deepgram-api-specs
New Deepgram Flux model available on Workers AI, Cloudflare changelog - https://developers.cloudflare.com/changelog/post/2025-10-02-deepgram-flux/
Models and Languages Overview, Deepgram Docs - https://developers.deepgram.com/docs/models-languages-overview
Introducing Deepgram's Voice Agent API, Deepgram - https://deepgram.com/learn/introducing-ai-voice-agent-api
Universal-3 Pro Streaming, AssemblyAI Docs - https://assemblyai.com/docs/streaming/universal-3-pro
Turn detection, Speechmatics Docs - https://docs.speechmatics.com/speech-to-text/realtime/turn-detection
Voice activity detection (VAD), OpenAI API Docs - https://developers.openai.com/api/docs/guides/realtime-vad
Lindy Gaia Launches with Deepgram Flux, Deepgram - https://deepgram.com/learn/lindy-gaia-launches-with-deepgram-flux
Benchmarking STT for Voice Agents, Daily - https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
Issues with Deepgram Flux Model: Missed Speech Events and Initial Clipping, Deepgram GitHub Discussion #1463 - https://github.com/orgs/deepgram/discussions/1463
Deepgram Pricing - https://deepgram.com/pricing
October 2, 2025, Deepgram changelog - https://developers.deepgram.com/changelog/2025/10/2
Model selection, AssemblyAI Docs - https://assemblyai.com/docs/streaming/select-the-speech-model
Speaker Details: SIGNAL San Francisco 2026, Twilio - https://signal.twilio.com/2026/speaker/2309636/nick-kaimakis