
Add live audio transcription streaming support to Foundry Local Rust SDK #613

Open
rui-ren wants to merge 8 commits into main from ruiren/live-audio-stream-rust

rui-ren commented Apr 8, 2026

Add live audio transcription streaming support to Foundry Local Rust SDK

Description

Ports the C# live audio transcription feature (PR #485) to the Rust SDK with full API parity.

The existing AudioClient only supports file-based transcription. This PR introduces LiveAudioTranscriptionSession that accepts continuous PCM audio chunks (e.g., from a microphone) and returns partial/final transcription results as an async stream.

What's included

New files

  • sdk/rust/src/openai/live_audio_client.rs — Streaming session with start(), append(), get_transcription_stream(), stop(), plus types, cancellation support, and unit tests
  • sdk/rust/tests/integration/live_audio_test.rs — E2E integration test with synthetic PCM audio
  • samples/rust/live-audio-transcription-example/ — Full sample with real microphone capture (cpal) and resampling

Modified files

  • sdk/rust/src/detail/core_interop.rs — Added StreamingRequestBuffer FFI struct and execute_command_with_binary() for binary audio data
  • sdk/rust/src/openai/audio_client.rs — Added create_live_transcription_session() factory method
  • sdk/rust/src/detail/model.rs, model_variant.rs — Wired factory method to Model
  • sdk/rust/src/openai/mod.rs, src/lib.rs — Module registration and public exports
  • sdk/rust/Cargo.toml — Added tokio-util dependency for CancellationToken

API surface

let audio_client = model.create_audio_client();
let mut session = audio_client.create_live_transcription_session();

session.settings.sample_rate = 16000;
session.settings.channels = 1;
session.settings.language = Some("en".into());

session.start(None).await?;

// Push audio from microphone callback
session.append(&pcm_bytes, None).await?;

// Read results as async stream
use tokio_stream::StreamExt;
let mut stream = session.get_transcription_stream()?;
while let Some(result) = stream.next().await {
    let result = result?;
    println!("{}", result.content[0].text);
}

session.stop(None).await?;

C# API parity

| C# | Rust | Status |
| --- | --- | --- |
| CreateLiveTranscriptionSession() | create_live_transcription_session() | |
| StartAsync(CancellationToken) | start(Option<CancellationToken>) | |
| AppendAsync(ReadOnlyMemory<byte>, CancellationToken) | append(&[u8], Option<CancellationToken>) | |
| GetTranscriptionStream(CancellationToken) | get_transcription_stream() | |
| StopAsync(CancellationToken) + cancel-safe cleanup | stop(Option<CancellationToken>) + cancel-safe cleanup | |
| IAsyncDisposable.DisposeAsync() | Drop with best-effort native stop | |
| LiveAudioTranscriptionResponse.Content[0].Text | response.content[0].text | |
| LiveAudioTranscriptionResponse.Content[0].Transcript | response.content[0].transcript | |
| LiveAudioTranscriptionResponse.IsFinal | response.is_final | |
| LiveAudioTranscriptionResponse.StartTime/EndTime | response.start_time / response.end_time | |
| LiveAudioTranscriptionOptions (SampleRate, Channels, BitsPerSample, Language, PushQueueCapacity) | LiveAudioTranscriptionOptions (sample_rate, channels, bits_per_sample, language, push_queue_capacity) | |
| CoreErrorResponse.TryParse() | CoreErrorResponse::try_parse() | |
| Native commands: audio_stream_start, audio_stream_push, audio_stream_stop | Same commands via execute_command / execute_command_with_binary | |
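
The "Drop with best-effort native stop" row can be pictured with a small, self-contained sketch. It is not the SDK's implementation: `FakeCore`, the `Session` struct, and the `"session-1"` handle are hypothetical stand-ins, and the only idea taken from the PR is that `Drop` issues `audio_stream_stop` when `stop()` was never called.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the native core; it only counts stop calls.
#[derive(Default)]
struct FakeCore {
    stop_calls: AtomicUsize,
}

impl FakeCore {
    /// Plays the role of the native `audio_stream_stop` command.
    fn audio_stream_stop(&self, _handle: &str) -> Result<(), String> {
        self.stop_calls.fetch_add(1, Ordering::SeqCst);
        Ok(())
    }
}

/// Simplified session: `handle` is `Some` only while a native session is open.
struct Session {
    core: Arc<FakeCore>,
    handle: Option<String>,
}

impl Session {
    fn start(core: Arc<FakeCore>) -> Self {
        Session { core, handle: Some("session-1".to_string()) }
    }

    /// Explicit stop: takes the handle so a later Drop has nothing left to do.
    fn stop(&mut self) -> Result<(), String> {
        match self.handle.take() {
            Some(h) => self.core.audio_stream_stop(&h),
            None => Ok(()),
        }
    }
}

impl Drop for Session {
    fn drop(&mut self) {
        // Best effort: Drop cannot return an error, so the result is ignored.
        if let Some(h) = self.handle.take() {
            let _ = self.core.audio_stream_stop(&h);
        }
    }
}
```

Taking the handle out of the `Option` in both paths means the native stop runs at most once per session, whether the caller stops explicitly or simply lets the value go out of scope.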

Design highlights

  • CancellationToken support: start/append/stop accept Option<CancellationToken> via tokio_util::sync::CancellationToken
  • Cancel-safe stop: stop() always performs native audio_stream_stop even if token fires, preventing native session leaks (matches C# StopAsync pattern)
  • Response envelope: LiveAudioTranscriptionResponse uses content: Vec<ContentPart> matching C#'s ConversationItem.Content[0].Text/Transcript
  • Bounded push queue: Backpressure via bounded channel (capacity=100); prevents unbounded memory growth
  • Push loop on blocking thread: execute_command_with_binary FFI calls run on spawn_blocking, keeping async runtime free
  • Settings freeze: Audio format settings are cloned at start() and immutable during the session
  • Drop safety: Best-effort synchronous audio_stream_stop in Drop to prevent native session leaks
  • FFI null pointer safety: Empty binary slices use std::ptr::null() to avoid dangling pointer across FFI boundary
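
The bounded push queue can be illustrated with a minimal sketch. The PR presumably uses an async bounded channel inside the SDK; this version swaps in `std::sync::mpsc::sync_channel` and a plain thread so it runs without any dependencies. The function names are hypothetical, and only the capacity of 100 comes from the PR.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Pushes `chunks` through a bounded queue to a consumer thread and returns
/// the total number of bytes the consumer received.
fn push_all(chunks: Vec<Vec<u8>>, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);
    let consumer = thread::spawn(move || {
        let mut total = 0;
        // Stands in for the push loop that forwards each chunk to the native core.
        for chunk in rx {
            total += chunk.len();
        }
        total
    });
    for chunk in chunks {
        // Blocks while the queue is full, so a fast producer is slowed
        // down to the consumer's pace instead of buffering without limit.
        tx.send(chunk).expect("consumer hung up");
    }
    drop(tx); // closing the sender ends the consumer loop
    consumer.join().expect("consumer panicked")
}

/// Shows the "queue full" condition without blocking.
fn is_full_after(capacity: usize) -> bool {
    let (tx, _rx) = sync_channel::<u8>(capacity);
    for _ in 0..capacity {
        tx.try_send(0).expect("queue should have room");
    }
    matches!(tx.try_send(0), Err(TrySendError::Full(_)))
}
```

With a bounded queue, memory use is capped at roughly capacity times chunk size; the cost is that a producer such as a microphone callback has to wait, or drop audio, once the consumer falls behind.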

Verified working

  • ✅ SDK build succeeds (0 errors, 0 clippy warnings)
  • ✅ 13 unit tests passing (JSON deserialization, settings defaults, error parsing, content envelope)
  • ✅ E2E pipeline: Microphone (48kHz/2ch/F32) → Resample (16kHz/mono/16-bit) → SDK → Core.dll → onnxruntime-genai.dll → nemotron model
  • ✅ Synthetic audio test: 30 chunks (96KB PCM) pushed with clean session lifecycle
  • ✅ Live microphone test: real-time capture, session start/stop, no native errors

Stats

  • 14 files changed, 1,329 additions, 2 deletions

Copilot AI review requested due to automatic review settings April 8, 2026 19:45
Copilot AI left a comment

Pull request overview

This PR adds live (chunked) PCM audio transcription streaming to the Foundry Local Rust SDK, aligning the Rust API with the existing C# live audio transcription session feature and extending the SDK beyond file-based transcription.

Changes:

  • Introduces LiveAudioTranscriptionSession + associated response/options/types and stream wrapper in the Rust SDK.
  • Extends the Rust FFI bridge with execute_command_with_binary() to send JSON params + binary PCM payloads.
  • Adds integration test coverage and a new Rust sample demonstrating microphone capture (cpal) and streaming transcription.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sdk/rust/src/openai/live_audio_client.rs | New streaming session implementation, types, cancellation support, and unit tests |
| sdk/rust/src/detail/core_interop.rs | Adds StreamingRequestBuffer + optional execute_command_with_binary symbol support |
| sdk/rust/src/openai/audio_client.rs | Adds create_live_transcription_session() factory method |
| sdk/rust/src/detail/model.rs | Exposes Model::create_live_transcription_session() |
| sdk/rust/src/detail/model_variant.rs | Wires variant factory for live transcription sessions |
| sdk/rust/src/openai/mod.rs | Registers and re-exports live audio transcription module/types |
| sdk/rust/src/lib.rs | Public re-exports for the new live transcription session/types |
| sdk/rust/Cargo.toml | Adds tokio-util for CancellationToken |
| sdk/rust/tests/integration/main.rs | Registers the new integration test module |
| sdk/rust/tests/integration/live_audio_test.rs | New E2E-ish integration test using synthetic PCM audio |
| samples/rust/live-audio-transcription-example/src/main.rs | New microphone/synthetic streaming transcription sample |
| samples/rust/live-audio-transcription-example/Cargo.toml | Declares sample dependencies (cpal, tokio, sdk path dep) |
| samples/rust/Cargo.toml | Adds the new sample crate to the workspace |
| codex-feedback.md | Adds review/validation notes for the feature |

Comment on lines +315 to +319
    _ = token.cancelled() => {
        return Err(FoundryLocalError::CommandExecution {
            reason: "Start cancelled".into(),
        });
    }
Copilot AI Apr 8, 2026

When the cancellation token fires during start(), this branch returns without awaiting start_future. The spawn_blocking task will continue running; if it ends up creating a native session successfully, the session handle is dropped and the native session may leak. To keep the “clean (not-started) state” guarantee, ensure you await start_future and, if a handle is produced after cancellation, issue a best-effort audio_stream_stop cleanup (possibly in a detached background task) before returning.
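
One possible shape for that cleanup, sketched with std threads and channels rather than tokio's `spawn_blocking` and `select!`: the worker that creates the native session is itself responsible for stopping it when nobody is left to receive the handle. `NativeHandle`, `start_with_cleanup`, and the `cancel_first` flag are hypothetical; the stop counter exists only to make the behaviour observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a native session handle.
struct NativeHandle {
    stops: Arc<AtomicUsize>,
}

impl NativeHandle {
    /// Plays the role of a best-effort `audio_stream_stop`.
    fn stop(self) {
        self.stops.fetch_add(1, Ordering::SeqCst);
    }
}

/// Creates a "native session" on a worker thread. When `cancel_first` is true
/// the caller gives up before the worker has produced a handle.
fn start_with_cleanup(cancel_first: bool, stops: Arc<AtomicUsize>) -> Option<NativeHandle> {
    let (result_tx, result_rx) = mpsc::channel::<NativeHandle>();
    let (go_tx, go_rx) = mpsc::channel::<()>();

    let worker = thread::spawn(move || {
        let _ = go_rx.recv(); // wait until the caller has made its decision
        let handle = NativeHandle { stops };
        // If the caller is gone, the send fails and hands the handle back,
        // so the worker stops the orphaned session instead of leaking it.
        if let Err(mpsc::SendError(orphan)) = result_tx.send(handle) {
            orphan.stop();
        }
    });

    let outcome = if cancel_first {
        drop(result_rx); // cancellation: nobody will receive the handle
        let _ = go_tx.send(());
        None
    } else {
        let _ = go_tx.send(());
        result_rx.recv().ok()
    };
    worker.join().expect("worker panicked");
    outcome
}
```

The same ownership idea should carry over to an async version: whichever task ends up holding a handle that the caller no longer wants is the one that issues the stop.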

ruiren_microsoft and others added 7 commits April 8, 2026 14:28
Port the C# live audio transcription feature (PR #485) to the Rust SDK
with full API parity.

New files:
- src/openai/live_audio_client.rs: LiveAudioTranscriptionSession with
  start/append/get_transcription_stream/stop lifecycle, response types,
  CoreErrorResponse, and unit tests
- tests/integration/live_audio_test.rs: E2E test with synthetic PCM audio

Modified files:
- src/detail/core_interop.rs: StreamingRequestBuffer FFI struct and
  execute_command_with_binary method for binary audio data
- src/openai/audio_client.rs: create_live_transcription_session() factory
- src/detail/model.rs, model_variant.rs: create_live_transcription_session()
- src/openai/mod.rs, src/lib.rs: Module and public type exports

API surface:
  let audio_client = model.create_audio_client();
  let session = audio_client.create_live_transcription_session();
  session.settings.sample_rate = 16000;
  session.start().await?;
  session.append(&pcm_bytes).await?;
  let mut stream = session.get_transcription_stream()?;
  // use tokio_stream::StreamExt;
  while let Some(result) = stream.next().await { ... }
  session.stop().await?;

Design highlights:
- Bounded push channel with backpressure (capacity=100)
- Push loop runs on blocking thread via spawn_blocking
- Fail-fast on native errors (no retry logic)
- Settings frozen at start() via clone snapshot
- Output channel completed on stop() after final result
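
The "settings frozen at start()" item can be sketched as a clone-on-start snapshot. Field names follow the PR's parity table, but the field types, the `active` field, and the `Session` shape are assumptions made for illustration only.

```rust
/// Field names follow the PR's parity table; the types are assumptions.
#[derive(Clone, Debug, PartialEq)]
struct LiveAudioTranscriptionOptions {
    sample_rate: u32,
    channels: u16,
    bits_per_sample: u16,
    language: Option<String>,
    push_queue_capacity: usize,
}

/// Simplified session holding editable settings plus a frozen copy.
struct Session {
    /// Editable by the caller at any time.
    settings: LiveAudioTranscriptionOptions,
    /// Snapshot taken at start(); this is what the running session would use.
    active: Option<LiveAudioTranscriptionOptions>,
}

impl Session {
    fn new(settings: LiveAudioTranscriptionOptions) -> Self {
        Session { settings, active: None }
    }

    /// Freezes the current settings by cloning them.
    fn start(&mut self) {
        self.active = Some(self.settings.clone());
    }

    /// Returns the frozen settings, if the session has been started.
    fn active(&self) -> Option<&LiveAudioTranscriptionOptions> {
        self.active.as_ref()
    }
}
```

Because the running session only reads the snapshot, later writes to `settings` cannot change the audio format halfway through a stream; they would take effect on the next `start()`.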

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds samples/rust/live-audio-transcription-example/ that demonstrates
the full pipeline: SDK → Core.dll → onnxruntime-genai.dll → nemotron.

Tested E2E with synthetic 440Hz PCM audio (30 chunks, 96000 bytes):
- FoundryLocalManager initialized
- nemotron model loaded
- audio_stream_start (session handle)
- audio_stream_push × 30 (execute_command_with_binary)
- audio_stream_stop (clean shutdown)
- No errors from native core / onnxruntime-genai
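
For reference, a synthetic tone like the one described could be generated along these lines. The function names are hypothetical. The 16 kHz mono 16-bit format matches the session settings shown in the PR description, while the 100 ms chunk size (3,200 bytes) is inferred from 96,000 bytes split into 30 chunks, not stated in the source.

```rust
use std::f32::consts::PI;

/// Generates `seconds` of a sine tone as 16-bit little-endian mono PCM bytes.
fn sine_pcm16(freq_hz: f32, sample_rate: u32, seconds: f32, amplitude: f32) -> Vec<u8> {
    let n = (sample_rate as f32 * seconds).round() as usize;
    let gain = amplitude.clamp(0.0, 1.0);
    let mut bytes = Vec::with_capacity(n * 2);
    for i in 0..n {
        let t = i as f32 / sample_rate as f32;
        let sample = (2.0 * PI * freq_hz * t).sin() * gain;
        let value = (sample * i16::MAX as f32) as i16;
        bytes.extend_from_slice(&value.to_le_bytes());
    }
    bytes
}

/// Splits PCM bytes into fixed-size chunks; `chunk_bytes` must be non-zero.
fn chunk_pcm(bytes: &[u8], chunk_bytes: usize) -> Vec<&[u8]> {
    bytes.chunks(chunk_bytes).collect()
}
```

Calling `sine_pcm16(440.0, 16_000, 3.0, 0.5)` and chunking the result at 3,200 bytes gives 30 chunks totalling 96,000 bytes, which lines up with the figures reported in this commit message.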

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Uses cpal for cross-platform microphone capture with automatic
format adaptation:
- Queries device default config (e.g. 48kHz/2ch/F32)
- Resamples to 16kHz mono via linear interpolation
- Converts f32 → 16-bit PCM little-endian for the SDK
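
The three conversion steps can be sketched as plain functions. This is not the sample's code: the function names are hypothetical, and the downmix-by-averaging step is an assumption, since the commit message only says the audio is resampled to mono.

```rust
/// Averages interleaved frames down to one channel; `channels` must be non-zero.
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resamples by linear interpolation between neighbouring input samples.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == 0 || to_hz == 0 {
        return Vec::new();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

/// Converts samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes.
fn f32_to_pcm16_le(samples: &[f32]) -> Vec<u8> {
    samples
        .iter()
        .flat_map(|s| ((s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16).to_le_bytes())
        .collect()
}
```

Linear interpolation is cheap but applies no low-pass filter, so downsampling from 48 kHz to 16 kHz this way can alias high-frequency content; that trade-off may be acceptable for a demo and less so for production capture.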

Two modes:
  cargo run              # Live microphone (press ENTER to stop)
  cargo run -- --synth   # Synthetic 440Hz sine wave

Tested E2E: Microphone → SDK → Core.dll → onnxruntime-genai.dll

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- core_interop.rs: Use std::ptr::null() for empty binary_data slices
  to avoid passing dangling pointer across FFI boundary
- live_audio_client.rs: Call native audio_stream_stop synchronously
  in Drop to prevent native session leaks when stop() is not called
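
The empty-slice fix can be shown in isolation. `BinaryArg` is a hypothetical layout used only for this sketch; the real `StreamingRequestBuffer` in core_interop.rs may have different fields.

```rust
use std::os::raw::c_uchar;

/// Hypothetical pointer-plus-length argument passed across the FFI boundary.
#[repr(C)]
struct BinaryArg {
    data: *const c_uchar,
    len: usize,
}

/// Builds the FFI argument for `bytes`. The pointer is only valid while the
/// borrowed slice is alive, so the native call must finish before it is freed.
fn to_binary_arg(bytes: &[u8]) -> BinaryArg {
    BinaryArg {
        // An empty slice still has a non-null pointer, but it may be dangling,
        // so native code gets an explicit null instead.
        data: if bytes.is_empty() {
            std::ptr::null()
        } else {
            bytes.as_ptr()
        },
        len: bytes.len(),
    }
}
```

Passing null together with a length of zero gives the native side an unambiguous "no data" signal, rather than an address it must never dereference.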

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address codex-feedback.md parity gaps:

1. CancellationToken support: start/append/stop now accept
   Option<CancellationToken> (via tokio_util::sync::CancellationToken).
   stop() uses cancel-safe pattern matching C# StopAsync: native
   session stop is always performed even if token fires.

2. Response envelope matches C#: LiveAudioTranscriptionResponse now
   has content: Vec<ContentPart> with text/transcript fields, so
   callers use result.content[0].text (identical to C# Content[0].Text).

3. Added tokio-util dependency for CancellationToken.

Updated E2E sample and integration test to use new API shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Sample: update download progress callback from &str to f64 to match
  upstream API change (PR #608)
- Apply cargo fmt to all SDK and sample files for CI compliance

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The function and its AudioClient import triggered -D warnings (dead_code)
in the CI build. The E2E test creates the session directly via
model.create_audio_client() and doesn't use this helper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>