Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
Abstract
Multimodal conversational context improves LLM-based speech recognition, and abstract compression reduces computational overhead by replacing audio sequences with learned latent tokens while preserving transcripts.
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
Community
Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning (2026)
- From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs (2026)
- CLAR: CIF-Localized Alignment for Retrieval-Augmented Speech LLM-Based Contextual ASR (2026)
- NLE: Non-autoregressive LLM-based ASR by Transcript Editing (2026)
- Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models (2026)
- TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment (2026)
- ALARM: Audio-Language Alignment for Reasoning Models (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2603.26246 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper