Saying “speech tokens are just another language” is technically true but practically incomplete. You don’t know whether the LLM is actually using the speech unless you break things on purpose.
For example:
If you mask the speech tokens while keeping the text prompt unchanged and the model suddenly fails, that’s a strong signal the information was coming from the audio.
If swapping the spoken content (but not the text) changes the output, the model is clearly extracting meaning from the speech itself.
If attention and gradient attribution consistently point to the specific speech-token regions tied to the answer, that’s another indicator.
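The first two probes can be sketched as a small ablation harness. This is a minimal illustration, not any particular toolkit’s API: `predict` stands in for whatever function maps (speech tokens, text prompt) to an answer, `MASK` is a hypothetical mask-token id, and `toy_predict` is a made-up stand-in model so the script runs end to end.

```python
MASK = -1  # hypothetical mask-token id; a real model would use its own special token

def mask_ablation(predict, speech_tokens, text_prompt):
    """Probe 1: replace every speech token with MASK, keep the text prompt fixed.
    If the answer changes or degrades, information was flowing from the audio."""
    baseline = predict(speech_tokens, text_prompt)
    masked = predict([MASK] * len(speech_tokens), text_prompt)
    return baseline, masked, baseline != masked

def swap_ablation(predict, speech_tokens, alt_speech_tokens, text_prompt):
    """Probe 2: swap in speech tokens from a different utterance, keep the text fixed.
    A changed answer means the model is extracting meaning from the speech itself."""
    baseline = predict(speech_tokens, text_prompt)
    swapped = predict(alt_speech_tokens, text_prompt)
    return baseline, swapped, baseline != swapped

# Toy stand-in "model" for illustration only: answers with the most frequent speech token.
def toy_predict(speech_tokens, text_prompt):
    return max(set(speech_tokens), key=speech_tokens.count)

if __name__ == "__main__":
    utt_a, utt_b = [7, 7, 3], [5, 5, 2]  # two "utterances" as token-id lists
    print(mask_ablation(toy_predict, utt_a, "transcribe"))
    print(swap_ablation(toy_predict, utt_a, utt_b, "transcribe"))
```

The same harness extends to the third probe by replacing the boolean comparison with a per-token attribution score (e.g. gradients with respect to the speech embeddings) and checking which token regions it concentrates on.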