Attention-Guided Audio Compression for Multimodal LLMs
Audio compression is often proposed as a way to make multimodal large language models more efficient, but its effect on downstream task performance remains underexplored. This talk examines how semantic neural audio codecs behave under token-reduction constraints, using cross-modal attention as a signal for discarding frames with low semantic content. On audio question-answering benchmarks, attention-guided frame selection removes 10–30% of frames while matching baseline accuracy and answer consistency. The results also point to a critical compression threshold (keep ratio of about 0.7) below which performance degrades sharply. Finally, the talk discusses an "answer consistency paradox," in which models remain highly self-consistent (>98%) even as accuracy degrades, and what this decoupling of consistency from correctness means for evaluating compressed multimodal systems in low-resource deployments.
Speaker(s): Prerana
Virtual: https://events.vtools.ieee.org/m/563360
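
For readers who want a concrete picture of the frame selection described in the abstract, here is a minimal sketch. It is not the speaker's implementation. It assumes that each audio frame already has a single cross-modal attention score (how those scores are extracted and aggregated is not specified in the abstract), and the function name `select_frames` and its parameters are hypothetical.

```python
from typing import List, Sequence


def select_frames(attention: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the frames to keep, in temporal order.

    Assumption: `attention` holds one precomputed cross-modal attention
    score per audio frame; higher means more semantically relevant.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = len(attention)
    if n == 0:
        return []
    # Number of frames to keep: nearest integer, but never fewer than one.
    k = max(1, round(n * keep_ratio))
    # Rank frames from highest to lowest score; ties favor earlier frames.
    ranked = sorted(range(n), key=lambda i: (-attention[i], i))
    # Restore temporal order so the kept frames remain a valid sequence.
    return sorted(ranked[:k])


# Illustrative scores only (made up for this example, not real data).
scores = [0.05, 0.30, 0.02, 0.20, 0.10, 0.01, 0.15, 0.08, 0.06, 0.03]
kept = select_frames(scores, keep_ratio=0.7)
print(kept)  # the three lowest-scoring frames (2, 5, 9) are dropped
```

A keep ratio of 0.7 corresponds to removing 30% of frames, the upper end of the 10–30% range quoted in the abstract.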
