STAIRS26 (Sony-Tau Acoustic Images of Real-World Scapes)

Description

STAIRS26 is a spatial audio dataset designed to benchmark Semantic Acoustic Imaging: the task of visualizing sound energy and identifying semantic sounding objects in space. This release serves as the development set for Task 3 of the DCASE 2026 Challenge.

STAIRS26 fundamentally extends the legacy STARSS23 dataset, shifting the paradigm from sparse point-based localization to dense acoustic field estimation. It upgrades the original real-world recordings (captured in Finland and Japan) with two critical features:

- 32-channel raw audio: full microphone array signals enabling high-resolution beamforming and acoustic super-resolution.
- Acoustic radiance maps: high-definition acoustic energy images that serve as ground-truth labels for training models to visually reconstruct acoustic fields.

(Note: for details on the physical recording setup, hardware specifications, and scene scripting, please refer to the original STARSS23 dataset.)

Aim

The primary goal of STAIRS26 is to train and evaluate models on acoustic super-resolution: reconstructing high-fidelity, class-aware energy maps from standard 4-channel inputs. By providing full 32-channel recordings and ground-truth images, the dataset enables researchers to:

- Develop deep learning architectures that output dense polygon masks encoding event class, spatial location, and acoustic energy intensity.
- Evaluate high-resolution direction-of-arrival (DOA) estimation and multi-source tracking algorithms.
- Bridge audio signal processing with computer-vision-based semantic segmentation.

Specifications

Volume and Data Split

- Size: ~7.5 hours of recordings across 168 development clips.
- Scope: this release contains only the development data (audio and labels) used for training and validation.
- Compatibility: file naming and splits are identical to STARSS23. To utilize the full multimodal suite, users should pair this dataset with the STARSS23 audio and video files.
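To make the reconstruction target concrete, a predicted energy map can be scored against a ground-truth map with a simple pixelwise error on the one-degree angular grid used by the labels. This is a minimal sketch with synthetic data; the challenge's official evaluation metric is not specified here, and the source region below is invented for illustration.

```python
import numpy as np

# One-degree equirectangular grid: 180 elevation x 360 azimuth pixels.
H, W = 180, 360

rng = np.random.default_rng(0)
gt_map = np.zeros((H, W))                    # hypothetical ground-truth energy map
gt_map[80:100, 170:200] = 0.8                # one invented active-source region
pred_map = gt_map + 0.05 * rng.standard_normal((H, W))  # noisy "prediction"
pred_map = np.clip(pred_map, 0.0, 1.0)       # amplitudes are standardized to [0, 1]

# Pixelwise mean squared error over the full grid.
mse = float(np.mean((pred_map - gt_map) ** 2))
print(f"MSE: {mse:.4f}")
```

Any dense map-level metric (MSE, IoU over thresholded masks, etc.) plugs in the same way once predictions and labels share this grid.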
Audio Format

- Sampling rate: 24 kHz
- Bit depth: 16-bit
- Format: 32-channel (raw Eigenmike recordings)

Acoustic Maps (Labels)

High-definition acoustic images, generated via proximal gradient descent from the 32-channel recordings, are provided as individual .json files (one per recording).

- Structure: the annotations key contains a list of dictionaries. Each dictionary represents a single active sound object at a specific frame (10 FPS temporal resolution).
- Multi-source frames: if a frame contains multiple sources, multiple dictionaries are present. Silent frames have no annotations.
- Metadata: inherits frame indices and the 13 source classes from the DCASE2023/STARSS23 metadata .csv files.
- Segmentation: polygon masks are stored as an array of shape (n_pixels, 3). Each row represents [x, y, amplitude]:
  - x and y: integer spatial coordinates on a 1-pixel-per-degree angular grid (x ∈ [0, 359], y ∈ [0, 179]).
  - amplitude: standardized acoustic energy intensity in [0.0, 1.0], where 1.0 represents the loudest pixel within the entire training dataset.

File Downloads

- 32ch_audio_dev.zip: development audio in the raw 32-channel Eigenmike format.
- labels_dev_std.zip: generated acoustic-image labels in .json format. (Download and extract using standard compression utilities.)

Citation

If you use this dataset, please cite the following:

- Roman, I. R., Politis, A., Shimada, K., Cheston, H., Sudarsanam, P., Díaz-Guerra, D., Sun, Y., Shibuya, T., Takahashi, S., & Mitsufuji, Y. (2026). STAIRS26: Sony-Tau Acoustic Images of Real-World Scapes [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18171005
- Shimada, K., et al. (2023). STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
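The label format described in the Acoustic Maps section can be loaded and rasterized onto the 1-pixel-per-degree grid as follows. This is a minimal sketch against a synthetic record: only the annotations key is documented above, so the per-entry field names used here (frame, class, mask) are hypothetical placeholders and should be checked against the released .json files.

```python
import json
import numpy as np

# A synthetic record mimicking the described .json structure.
# Field names "frame", "class", and "mask" are assumed, not documented.
label_json = json.dumps({
    "annotations": [
        {
            "frame": 12,    # frame index at 10 FPS -> t = 1.2 s
            "class": 3,     # one of the 13 STARSS23 source classes
            "mask": [[170, 85, 1.0], [171, 85, 0.9], [170, 86, 0.75]],
        }
    ]
})

labels = json.loads(label_json)

# Rasterize each polygon mask onto the 360 x 180 one-degree grid.
energy = np.zeros((180, 360))
for ann in labels["annotations"]:
    mask = np.asarray(ann["mask"])      # shape (n_pixels, 3): [x, y, amplitude]
    x = mask[:, 0].astype(int)          # azimuth pixel, 0..359
    y = mask[:, 1].astype(int)          # elevation pixel, 0..179
    np.maximum.at(energy, (y, x), mask[:, 2])  # keep the louder value on overlap

print(energy.shape, energy.max())
```

Taking the per-pixel maximum across annotations is one reasonable way to merge overlapping sources in a frame; summing energies is an equally valid choice depending on the downstream model.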