Streaming Speech Translation Pipeline
Real-time English → Russian speech translation: Audio In → ASR → NMT → TTS → Audio Out
Translates spoken English into spoken Russian with streaming output over WebSocket.
Input is currently limited to English because of the NeMo ASR model. The output language depends on TranslateGemma (NMT) and XTTSv2 (TTS), so you can change the target language by reconfiguring those two stages.
Architecture
Audio Input → ASR (ONNX)      → NMT (GGUF)     → TTS (ONNX) → Audio Output
(PCM16)       Conformer RNN-T   TranslateGemma   XTTSv2       (PCM16)
- ASR: NVIDIA NeMo Conformer RNN-T (cache-aware streaming, ONNX)
- NMT: TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging
- TTS: XTTSv2 with GPT-2 AR model + HiFi-GAN vocoder (ONNX), 24kHz output
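The NMT stage mentions streaming segmentation. The repository's actual segmenter is not described in this README, so the following is only a simplified sketch of the general idea: buffer incoming ASR text and release it to the translator in small word groups. The class name `WordGroupSegmenter`, the `group_size` parameter, and the assumption that ASR text arrives as whole words are all hypothetical.

```python
class WordGroupSegmenter:
    """Hypothetical segmenter: emits groups of `group_size` words as text arrives."""

    def __init__(self, group_size: int = 4):
        self.group_size = group_size
        self._pending: list[str] = []

    def feed(self, text: str) -> list[str]:
        """Add newly recognized text and return any complete word groups."""
        # Assumption: each call carries whole words, never a partial word.
        self._pending.extend(text.split())
        groups = []
        while len(self._pending) >= self.group_size:
            groups.append(" ".join(self._pending[: self.group_size]))
            del self._pending[: self.group_size]
        return groups

    def flush(self) -> list[str]:
        """Return leftover words as a final, possibly shorter, group."""
        if not self._pending:
            return []
        tail = " ".join(self._pending)
        self._pending.clear()
        return [tail]
```

Smaller groups lower latency but give the translator less context, which is presumably why a merging step follows segmentation in the real pipeline.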
See ARCHITECTURE.md for detailed design documentation.
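To illustrate how the three stages can run concurrently, here is a minimal sketch of a threaded pipeline joined by bounded queues. It is not the repository's `PipelineOrchestrator`; the stage functions `fake_asr`, `fake_nmt`, and `fake_tts` are hypothetical stand-ins for the real models, and `translate_stream` is an illustrative name.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def run_stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown downstream
            return
        outbox.put(fn(item))


# Hypothetical stand-ins for the real ASR, NMT, and TTS models.
def fake_asr(pcm: bytes) -> str:
    return f"text({len(pcm)} bytes)"


def fake_nmt(text: str) -> str:
    return f"ru[{text}]"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def translate_stream(chunks, maxsize=4):
    """Push audio chunks through ASR -> NMT -> TTS, one thread per stage."""
    q_audio, q_text, q_tts, q_out = (queue.Queue(maxsize) for _ in range(4))
    stages = [
        threading.Thread(target=run_stage, args=(fake_asr, q_audio, q_text)),
        threading.Thread(target=run_stage, args=(fake_nmt, q_text, q_tts)),
        threading.Thread(target=run_stage, args=(fake_tts, q_tts, q_out)),
    ]

    def feed():
        for chunk in chunks:
            q_audio.put(chunk)  # blocks when the queue is full (backpressure)
        q_audio.put(_STOP)

    feeder = threading.Thread(target=feed)
    for t in stages + [feeder]:
        t.start()

    results = []
    while (item := q_out.get()) is not _STOP:
        results.append(item)
    for t in stages + [feeder]:
        t.join()
    return results
```

Bounding every queue means a slow stage (typically TTS on CPU) slows the producers down instead of letting buffered audio grow without limit.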
Requirements
- Python 3.10+
- Model files:
  - ASR: NeMo Conformer RNN-T ONNX model directory
  - NMT: TranslateGemma 4B GGUF file
  - TTS: XTTSv2 ONNX model directory, BPE vocab, mel normalization stats, reference audio
Installation
pip install -r requirements.txt
System Dependencies
# Ubuntu/Debian
apt-get install libsndfile1 libportaudio2
Usage
Start the Server
When running on CPU, passing --tts-int8-gpt (INT8 quantized GPT) is recommended.
python app.py \
--asr-onnx-path models/asr/nemo-cache-aware-streaming-560ms-onnx/ \
--nmt-gguf-path models/nmt/translategemma-4b-it-q8_0-gguf/translategemma-4b-it-q8_0.gguf \
--tts-model-dir models/tts/xttsv2-onnx/ \
--tts-vocab-path models/tts/xttsv2-onnx/vocab.json \
--tts-mel-norms-path models/tts/xttsv2-onnx/mel_stats.npy \
--tts-ref-audio-path audio_ref/male_stewie.mp3 \
--tts-int8-gpt \
--host 0.0.0.0 \
--port 8765
CLI Options
| Flag | Default | Description |
|---|---|---|
| --asr-onnx-path | (required) | ASR ONNX model directory |
| --asr-chunk-ms | 10 | ASR audio chunk duration (ms) |
| --asr-sample-rate | 16000 | ASR expected sample rate |
| --nmt-gguf-path | (required) | NMT GGUF model file |
| --nmt-n-threads | 4 | NMT CPU threads |
| --tts-model-dir | (required) | TTS ONNX model directory |
| --tts-vocab-path | (required) | TTS BPE vocab.json |
| --tts-mel-norms-path | (required) | TTS mel_stats.npy |
| --tts-ref-audio-path | (required) | TTS reference speaker audio |
| --tts-language | ru | TTS target language code |
| --tts-int8-gpt | False | Use INT8 quantized GPT |
| --tts-threads-gpt | 2 | TTS GPT ONNX threads |
| --tts-chunk-size | 20 | TTS AR tokens per vocoder chunk |
| --audio-queue-max | 256 | Audio input queue max size |
| --text-queue-max | 64 | Text queue max size |
| --tts-queue-max | 16 | NMT→TTS text queue max size |
| --audio-out-queue-max | 32 | Audio output queue max size |
| --host | 0.0.0.0 | Server bind host |
| --port | 8765 | Server port |
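To make the chunk and sample-rate defaults concrete, the arithmetic below converts between chunk duration and PCM16 byte counts. It is a small sketch that assumes mono audio (the README does not state the channel count); the function names are illustrative and not part of the project.

```python
def pcm16_chunk_bytes(chunk_ms: int, sample_rate: int, channels: int = 1) -> int:
    """Bytes in one PCM16 chunk: samples * channels * 2 bytes per sample."""
    samples = sample_rate * chunk_ms // 1000
    return samples * channels * 2


def pcm16_duration_ms(num_bytes: int, sample_rate: int, channels: int = 1) -> float:
    """Playback duration in milliseconds of a PCM16 byte buffer."""
    return 1000.0 * num_bytes / (2 * channels * sample_rate)


# Defaults from the table: 10 ms chunks at 16 kHz (mono is an assumption).
asr_chunk = pcm16_chunk_bytes(chunk_ms=10, sample_rate=16000)  # 160 samples, 320 bytes

# One second of 24 kHz mono output audio occupies 48000 bytes.
one_second_ms = pcm16_duration_ms(48000, sample_rate=24000)
```

If each audio queue entry holds one such chunk (also an assumption), the default --audio-queue-max of 256 corresponds to roughly 2.56 seconds of buffered input.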
Python Client
Captures microphone audio and plays back translated speech:
pip install -r requirements_client.txt
python clients/python_client.py --uri ws://localhost:8765
Web Client
Open clients/web_client.html in a browser. Click "Connect" to start streaming.
WebSocket Protocol
| Direction | Type | Format | Description |
|---|---|---|---|
| Client→ | Binary | PCM16 | Raw audio at declared sample rate |
| Client→ | Text | JSON | {"action": "start", "sample_rate": 16000} |
| Client→ | Text | JSON | {"action": "stop"} |
| →Client | Binary | PCM16 | Synthesized audio at 24kHz |
| →Client | Text | JSON | {"type": "transcript", "text": "..."} |
| →Client | Text | JSON | {"type": "translation", "text": "..."} |
| →Client | Text | JSON | {"type": "status", "status": "started"} |
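The message shapes in the table can be built and parsed with the standard library alone. The helpers below are a sketch for a custom client, not code from `clients/python_client.py`; the function names are hypothetical, and little-endian byte order for PCM16 is an assumption, since the table does not specify endianness. Actually sending and receiving frames is left to whichever WebSocket library you use.

```python
import json
import struct


def start_message(sample_rate: int = 16000) -> str:
    """Text frame that opens a session and declares the input sample rate."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})


def stop_message() -> str:
    """Text frame that ends the session."""
    return json.dumps({"action": "stop"})


def floats_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM.

    Assumption: little-endian byte order.
    """
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def parse_server_frame(frame):
    """Classify a frame received from the server as (kind, payload).

    Binary frames carry synthesized audio; text frames carry JSON events
    whose "type" is transcript, translation, or status.
    """
    if isinstance(frame, (bytes, bytearray)):
        return "audio", bytes(frame)
    event = json.loads(frame)
    return event.get("type", "unknown"), event
```

A client would send `start_message()` first, stream `floats_to_pcm16(...)` chunks as binary frames, pass every received frame through `parse_server_frame`, and finish with `stop_message()`.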
Docker
docker build -t streaming-translation .
docker run -p 8765:8765 \
-v /path/to/models:/models \
streaming-translation \
--asr-onnx-path /models/asr/ \
--nmt-gguf-path /models/translategemma-4b-it-q8_0.gguf \
--tts-model-dir /models/xtts/ \
--tts-vocab-path /models/xtts/vocab.json \
--tts-mel-norms-path /models/xtts/mel_stats.npy \
--tts-ref-audio-path /models/reference.wav
Project Structure
streaming_speech_translation/
├── app.py                              # Main entry point
├── requirements.txt
├── Dockerfile
├── ARCHITECTURE.md
├── models/
│   ├── asr/
│   │   └── nemo-cache-aware-streaming-560ms-onnx/
│   ├── nmt/
│   │   └── translategemma-4b-it-q8_0-gguf/
│   └── tts/
│       └── xttsv2-onnx/
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── pipeline.py                 # ThreadedSpeechTranslator (reference)
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── nmt/
│   │   ├── streaming_nmt.py            # StreamingNMT wrapper
│   │   ├── streaming_segmenter.py      # Word-group segmentation
│   │   ├── streaming_translation_merger.py
│   │   └── translator_module.py        # TranslateGemma via llama-cpp
│   ├── tts/
│   │   ├── streaming_tts.py            # StreamingTTS wrapper
│   │   ├── xtts_streaming_pipeline.py  # Full TTS pipeline
│   │   ├── xtts_onnx_orchestrator.py   # GPT-2 AR + vocoder
│   │   ├── xtts_tokenizer.py           # BPE tokenizer
│   │   └── zh_num2words.py             # Chinese text normalization
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    ├── python_client.py                # Python CLI client
    └── web_client.html                 # Browser client
Model tree for pltobing/streaming-speech-translation
Base model: nvidia/nemotron-speech-streaming-en-0.6b