Add live audio transcription streaming support to Foundry Local Python SDK #612

Open
rui-ren wants to merge 4 commits into main from ruiren/live-audio-stream-python

@rui-ren (Contributor) commented Apr 8, 2026

Description

Adds real-time audio streaming support to the Foundry Local Python SDK, enabling live microphone-to-text transcription via ONNX Runtime GenAI's StreamingProcessor API (Nemotron ASR).

This is the Python port of C# PR #485 with full feature parity. The existing AudioClient only supports file-based transcription. This PR introduces LiveAudioTranscriptionSession, which accepts continuous PCM audio chunks (e.g., from a microphone) and yields partial and final transcription results through a synchronous generator.

What's included

New files

  • src/openai/live_audio_transcription_client.py — Streaming session with start(), append(), get_transcription_stream(), stop()
  • src/openai/live_audio_transcription_types.py — LiveAudioTranscriptionResponse (ConversationItem-shaped), LiveAudioTranscriptionOptions, CoreErrorResponse, TranscriptionContentPart
  • test/openai/test_live_audio_transcription.py — 22 unit tests for deserialization, settings, state guards, streaming pipeline
  • test/openai/test_live_audio_transcription_e2e.py — E2E test with real native DLLs and nemotron model
  • test/openai/conftest.py — DLL preload for E2E tests
  • samples/python/live-audio-transcription/src/app.py — Live microphone transcription demo
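
The ConversationItem-shaped response described above can be illustrated with a small deserialization sketch. This is not the code in live_audio_transcription_types.py: the class names ContentPart and ToyResponse, the parse_response helper, and the JSON key names are all assumptions chosen to mirror the fields the description lists (content[0].text/transcript, is_final, start_time, end_time).

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentPart:
    # Hypothetical stand-in for one content part of the response.
    text: str = ""
    transcript: str = ""


@dataclass
class ToyResponse:
    # Hypothetical stand-in for a ConversationItem-shaped result.
    content: List[ContentPart] = field(default_factory=list)
    is_final: bool = False
    start_time: Optional[float] = None
    end_time: Optional[float] = None


def parse_response(raw: str) -> ToyResponse:
    """Parse a JSON string into ToyResponse, tolerating missing keys."""
    data = json.loads(raw)
    parts = [
        ContentPart(text=p.get("text", ""), transcript=p.get("transcript", ""))
        for p in data.get("content", [])
    ]
    return ToyResponse(
        content=parts,
        is_final=bool(data.get("is_final", False)),
        start_time=data.get("start_time"),
        end_time=data.get("end_time"),
    )


# The payload below is invented for illustration; the real wire format may differ.
sample = (
    '{"content": [{"text": "hello", "transcript": "hello"}], '
    '"is_final": true, "start_time": 0.0, "end_time": 1.2}'
)
result = parse_response(sample)
```

Defaulting every field keeps the parser usable for partial results that omit timing information, which is one plausible reason to test deserialization separately from the streaming pipeline.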

Modified files

  • src/openai/audio_client.py — Added create_live_transcription_session() factory method
  • src/detail/core_interop.py — Added StreamingRequestBuffer struct, execute_command_with_binary(), start_audio_stream, push_audio_data, stop_audio_stream methods, and _load_dll_win() for robust DLL loading on Windows
  • src/openai/__init__.py — Exported new live transcription types
  • test/conftest.py — Pre-load ORT/GenAI DLLs before brotli import to avoid Windows DLL search conflicts

API surface

audio_client = model.get_audio_client()
session = audio_client.create_live_transcription_session()

session.settings.sample_rate = 16000
session.settings.channels = 1
session.settings.language = "en"

session.start()

# Push audio from microphone callback (thread-safe)
session.append(pcm_bytes)

# Read results as synchronous generator
for result in session.get_transcription_stream():
    print(result.content[0].text)

session.stop()
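
Because append() takes raw PCM bytes, the caller decides how much audio goes into each chunk. A minimal sketch of that arithmetic follows, assuming 16-bit samples and a 100 ms chunk duration (both illustrative choices). The helper names chunk_size_bytes and iter_chunks are hypothetical and not part of the SDK.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate: int, channels: int, bits_per_sample: int, chunk_ms: int) -> int:
    """Number of bytes of interleaved PCM that cover chunk_ms milliseconds."""
    bytes_per_frame = channels * bits_per_sample // 8
    frames = sample_rate * chunk_ms // 1000
    return frames * bytes_per_frame


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


# 16 kHz, mono, 16-bit, 100 ms: 1600 frames of 2 bytes each.
size = chunk_size_bytes(16000, 1, 16, 100)
one_second_of_silence = bytes(16000 * 2)
chunks = list(iter_chunks(one_second_of_silence, size))
```

Each element of chunks could then be passed to session.append() in order; with these settings one second of audio splits into ten equal chunks.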

C# parity

| C# API | Python API | Notes |
| --- | --- | --- |
| CreateLiveTranscriptionSession() | create_live_transcription_session() | |
| StartAsync(ct) | start() | Sync (matches Python SDK convention) |
| AppendAsync(ReadOnlyMemory<byte>, ct) | append(bytes) | Thread-safe, copies data |
| GetTranscriptionStream() | get_transcription_stream() | Generator (sync equivalent of IAsyncEnumerable) |
| StopAsync(ct) | stop() | Drains push queue, sends native stop, surfaces final result |
| IAsyncDisposable | Context manager (with) | Idiomatic Python equivalent |
| LiveAudioTranscriptionOptions | LiveAudioTranscriptionOptions | Same fields: sample_rate, channels, bits_per_sample, language, push_queue_capacity |
| LiveAudioTranscriptionResponse | LiveAudioTranscriptionResponse | ConversationItem-shaped: content[0].text/transcript, is_final, start_time, end_time |
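
To make the sync mapping concrete, here is a toy stand-in that shows only the lifecycle shape: a context manager in place of IAsyncDisposable, a generator in place of IAsyncEnumerable, and a settings snapshot taken at start(). ToyOptions and ToySession are hypothetical names, the default values are guesses, and the class performs no transcription; it is a sketch of the pattern, not the SDK's implementation.

```python
import queue
from dataclasses import dataclass, replace
from typing import Iterator, Optional


@dataclass
class ToyOptions:
    # Field names follow the parity table; defaults are illustrative guesses.
    sample_rate: int = 16000
    channels: int = 1
    bits_per_sample: int = 16
    language: Optional[str] = None
    push_queue_capacity: int = 100


class ToySession:
    """Lifecycle-only stand-in; it echoes chunk sizes instead of transcribing."""

    _DONE = object()  # sentinel that ends the result stream

    def __init__(self) -> None:
        self.settings = ToyOptions()
        self._active: Optional[ToyOptions] = None
        self._results: queue.Queue = queue.Queue()

    def start(self) -> None:
        if self._active is not None:
            raise RuntimeError("session already started")
        # Snapshot: later edits to self.settings do not affect this session.
        self._active = replace(self.settings)

    def append(self, pcm: bytes) -> None:
        if self._active is None:
            raise RuntimeError("session not started")
        self._results.put(f"{len(pcm)} bytes")

    def get_transcription_stream(self) -> Iterator[str]:
        while True:
            item = self._results.get()
            if item is self._DONE:
                return
            yield item

    def stop(self) -> None:
        if self._active is not None:
            self._active = None
            self._results.put(self._DONE)

    def __enter__(self) -> "ToySession":
        self.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self.stop()


with ToySession() as session:
    session.append(bytes(3200))
# In this toy, results queued before stop() remain readable afterwards.
received = list(session.get_transcription_stream())
```

The with block guarantees stop() runs even if the body raises, which is the role IAsyncDisposable plays on the C# side.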

Design highlights

  • Output type alignment — LiveAudioTranscriptionResponse uses the OpenAI Realtime ConversationItem shape (content[0].text/transcript) for forward compatibility
  • Internal push queue — Bounded queue.Queue serializes audio pushes from any thread (safe for mic callbacks) with backpressure
  • Fail-fast on errors — Push loop terminates immediately on any native error (no retry logic)
  • Settings freeze — Audio format settings are snapshot-copied at start() and immutable during the session
  • Buffer copy — append() copies input data to avoid issues with callers reusing buffers (e.g., PyAudio)
  • Routes through existing exports — start_audio_stream and stop_audio_stream route through execute_command; push_audio_data routes through execute_command_with_binary, so no new native entry points are required
  • DLL loading fix — Uses LoadLibraryExW with LOAD_WITH_ALTERED_SEARCH_PATH on Windows to prevent conflicts with stale system-level ORT DLLs
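
The push-queue, buffer-copy, and fail-fast points above can be sketched together in a few lines. This is an illustration of the pattern under stated assumptions, not the PR's code: PushQueue is a hypothetical class, and the sink callable stands in for the native push_audio_data call.

```python
import queue
import threading
from typing import Callable, Optional

_STOP = object()  # sentinel marking the end of input


class PushQueue:
    """Toy bounded push queue; `sink` stands in for the native push call."""

    def __init__(self, sink: Callable[[bytes], None], capacity: int = 4) -> None:
        self._sink = sink
        self._queue: queue.Queue = queue.Queue(maxsize=capacity)
        self.error: Optional[BaseException] = None
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def append(self, data) -> None:
        # bytes(...) copies mutable buffers (bytearray, memoryview), so the
        # caller may reuse its buffer as soon as append() returns.
        item = bytes(data)
        while True:
            if self.error is not None:
                raise RuntimeError("push loop failed") from self.error
            try:
                # Blocking while the queue is full is the backpressure.
                self._queue.put(item, timeout=0.1)
                return
            except queue.Full:
                continue

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is _STOP:
                return
            try:
                self._sink(item)
            except Exception as exc:  # fail fast: record the error, no retry
                self.error = exc
                return

    def close(self) -> None:
        # Deliver the sentinel unless the worker has already exited.
        while self._worker.is_alive():
            try:
                self._queue.put(_STOP, timeout=0.1)
                break
            except queue.Full:
                continue
        self._worker.join()


received = []
pq = PushQueue(received.append, capacity=2)
buf = bytearray(b"\x01\x02")
pq.append(buf)
buf[0] = 0xFF  # the caller reuses its buffer; the queued copy is unaffected
pq.close()
```

A single worker thread is what serializes pushes from any number of producer threads, and the short put() timeout lets a blocked producer notice a failed push loop instead of waiting forever.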

Verified working

  • ✅ 22 unit tests passing (deserialization, settings, state guards, streaming pipeline with mocked core)
  • ✅ E2E test passing (SDK → Core.dll → onnxruntime-genai.dll → onnxruntime.dll with nemotron model)
  • ✅ Full session lifecycle: start → push synthetic PCM → stop → verify results
  • ✅ Existing tests unaffected
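
For readers who want to reproduce a "push synthetic PCM" style check, one simple way to produce test input is a sine tone encoded as 16-bit little-endian mono PCM. The sketch below is an assumption about how such input could be generated, not the PR's test code; sine_pcm16 is a hypothetical helper.

```python
import math
import struct


def sine_pcm16(freq_hz: float, duration_s: float, sample_rate: int = 16000,
               amplitude: float = 0.5) -> bytes:
    """Return mono 16-bit little-endian PCM for a sine tone."""
    n = int(sample_rate * duration_s)
    samples = (
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return struct.pack(f"<{n}h", *samples)


pcm = sine_pcm16(440.0, 0.5)  # half a second at 16 kHz: 8000 samples
```

A pure tone exercises the audio path and session lifecycle but contains no speech, so it says nothing about transcription accuracy.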

Copilot AI review requested due to automatic review settings April 8, 2026 18:48

vercel bot commented Apr 8, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| foundry-local | Ready | Preview, Comment | Apr 8, 2026 11:25pm |

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds end-to-end live audio (PCM chunk) streaming transcription to the Foundry Local Python SDK, including session lifecycle management, native interop support for binary payloads, and tests/samples to validate Windows DLL loading and Nemotron ASR streaming.

Changes:

  • Introduces LiveAudioTranscriptionSession + supporting response/options/error types for streaming microphone-style PCM input.
  • Extends CoreInterop with a StreamingRequestBuffer and execute_command_with_binary() to push raw audio to native core.
  • Adds unit + E2E coverage and a sample app, including Windows DLL preload workarounds for brotli/LoadLibrary behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| sdk/python/src/openai/live_audio_transcription_client.py | Implements the streaming session (start/append/stream/stop) and background push loop. |
| sdk/python/src/openai/live_audio_transcription_types.py | Adds response/options/error DTOs and JSON parsing helpers. |
| sdk/python/src/detail/core_interop.py | Adds binary-command execution path and Windows DLL loading hardening for ORT/GenAI. |
| sdk/python/src/openai/audio_client.py | Adds factory method to create the live transcription session. |
| sdk/python/src/openai/__init__.py | Exports new session and types from the openai package surface. |
| sdk/python/test/openai/test_live_audio_transcription.py | Unit tests for parsing/options/state guards and mocked streaming behavior. |
| sdk/python/test/openai/test_live_audio_transcription_e2e.py | Windows-only E2E test exercising real native DLLs and nemotron model pipeline. |
| sdk/python/test/openai/conftest.py | Preloads ORT/GenAI DLLs for E2E to avoid brotli-related DLL search changes. |
| sdk/python/test/conftest.py | Preloads ORT/GenAI DLLs early in all tests to avoid Windows DLL search conflicts. |
| samples/python/live-audio-transcription/src/app.py | Demonstration app using PyAudio to stream microphone PCM into the session. |
