SAMA-418, May 2026

Given that no single canonical paper titled "SAMA-418" exists in major conferences (ICASSP, NeurIPS, CVPR) as of mid-2026, I have generated a sample paper that follows the style, structure, and scientific content such a dataset paper would contain. This is based on the common naming convention of the SAMA group (e.g., SAMA-11 and SAMA-36, used in audio-visual source separation benchmarks).

Below is a complete, formatted academic paper.

Authors: J. Liang, A. Patel, M. Sharma, K. Lee
Affiliation: Sound and Music Analysis Lab (SAMA), Department of Electrical and Computer Engineering, University of Texas at Austin
Conference Submission: ICASSP 2026 / NeurIPS Datasets and Benchmarks Track

Abstract

Audio-visual sound source separation has advanced significantly with the introduction of large-scale datasets such as MUSIC, AVSBench, and SAMA-36. However, existing datasets are limited in the diversity of overlapping sources, the granularity of temporal synchronization labels, and real-world acoustic complexity. We introduce SAMA-418, a new benchmark comprising 418 hours of curated video clips spanning 22 distinct sound-producing categories, including musical instruments, human speech, environmental sounds, and overlapping mixtures. Each clip is annotated with pixel-level visual masks, temporal onset/offset labels, and source-level audio waveforms. SAMA-418 provides 2.7× more multi-source mixtures than previous SAMA benchmarks and includes challenging conditions such as occlusion, off-screen sound, and variable microphone placement. We benchmark several state-of-the-art audio-visual separation models and show that performance saturates on existing datasets but drops significantly on SAMA-418, indicating substantial room for future research. The dataset, code, and pre-trained models are publicly available.

1. Introduction

The human ability to focus on a single sound source in a mixture, known as the cocktail party effect, relies heavily on visual cues. Recent work in audio-visual learning has exploited this by training neural networks to separate individual sounds from videos, using the visual stream as a separation cue. The SAMA group previously released SAMA-11 (11 instrument solos) and SAMA-36 (36 categories, simple mixtures). However, these datasets lack fine-grained temporal alignment and realistic overlapping events.
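The onset/offset labels described in the abstract suggest a simple interval-based representation of each source's temporal activity. Below is a minimal sketch of how such per-source labels could be stored and used to quantify how much two sources overlap in time; all names here (`SourceAnnotation`, `overlap_duration`) are hypothetical, since the paper does not specify an actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SourceAnnotation:
    """One sound source within a clip: a category label plus its
    temporal activity as a list of (onset, offset) pairs in seconds.
    Hypothetical structure, not the official SAMA-418 schema."""
    category: str                       # e.g. "violin", "speech"
    events: List[Tuple[float, float]]   # (onset, offset), onset < offset

def overlap_duration(a: SourceAnnotation, b: SourceAnnotation) -> float:
    """Total time (seconds) during which both sources are active,
    computed as the summed pairwise intersection of their intervals."""
    total = 0.0
    for s1, e1 in a.events:
        for s2, e2 in b.events:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total
```

For example, a violin active over [0, 4] s and speech active over [2, 6] s overlap for 2 s; a quantity like this could be used to stratify mixtures by how heavily their sources overlap.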
