* indicates equal contributions
We introduce a novel, general-purpose audio generation framework specifically designed for Audio Anomaly Detection (AAD) and Localization. Unlike existing datasets, which predominantly focus on industrial and machine-related sounds, our framework targets a broader range of environments, making it particularly useful in real-world scenarios where only audio data are available, such as telephonic audio. To generate such data, we propose a new method, Anomalous Audio Data Generation (AADG), inspired by the LLM-Modulo [1] framework, which leverages Large Language Models (LLMs) as world models to simulate such real-world scenarios. The tool is modular, allowing for a plug-and-play approach. It works by first using an LLM to predict plausible real-world scenarios; the LLM then extracts the constituent sounds, their order, and the way in which they should be merged to form a coherent whole. The constituent audios are generated using text-to-audio models and merged according to the LLM's instructions. We include rigorous verification at each output stage, ensuring the reliability of the generated data. The data produced using AADG allows us to train an anomaly detection model, which we test on real-world data to demonstrate the effectiveness of our framework. We also show the shortcomings of current SOTA audio models, particularly in handling out-of-distribution scenarios, making a case for improving their performance by training on data generated with AADG. Our contributions fill a critical void in audio anomaly detection resources and provide a scalable tool for generating diverse, realistic audio data.
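To make the pipeline described above concrete, the following is a minimal sketch of the scenario-to-audio flow, assuming hypothetical callables (scenario_llm, decompose_llm, tta_model, verify) that stand in for any pluggable LLM, text-to-audio model, and stage-wise verifier; none of these names come from the actual implementation.

```python
# Hedged sketch of the AADG flow: scenario -> constituents -> audio -> mix,
# with verification at each stage. All interfaces below are placeholders.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Constituent:
    """One sound event extracted by the LLM from a scenario description."""
    caption: str     # text prompt for the text-to-audio model
    start_s: float   # placement of the clip in the final mix (seconds)
    gain_db: float   # relative loudness requested by the LLM


def aadg_generate(
    scenario_llm: Callable[[str], str],                 # prompt -> scenario text
    decompose_llm: Callable[[str], List[Constituent]],  # scenario -> constituents
    tta_model: Callable[[str], np.ndarray],             # caption -> mono waveform
    verify: Callable[[str, object], bool],              # stage-wise verifier
    sample_rate: int = 16_000,
    duration_s: float = 10.0,
) -> np.ndarray:
    """Generate one (possibly anomalous) audio clip end to end."""
    # Stage 1: the LLM acts as a world model and proposes a plausible scenario.
    scenario = scenario_llm("Describe a realistic scene with one subtle audio anomaly.")
    assert verify("scenario", scenario), "scenario rejected by verifier"

    # Stage 2: the LLM extracts constituent sounds plus timing/mixing instructions.
    constituents = decompose_llm(scenario)
    assert verify("decomposition", constituents), "decomposition rejected"

    # Stage 3: synthesize each constituent and mix it per the LLM's instructions.
    mix = np.zeros(int(sample_rate * duration_s), dtype=np.float32)
    for c in constituents:
        clip = tta_model(c.caption).astype(np.float32)
        clip *= 10.0 ** (c.gain_db / 20.0)              # apply requested gain
        start = min(int(c.start_s * sample_rate), len(mix))
        end = min(start + len(clip), len(mix))
        mix[start:end] += clip[: end - start]           # overlay at the given offset

    # Stage 4: verify the merged output before accepting it into the dataset.
    assert verify("mix", mix), "final mix rejected"
    return mix
```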
Recording the ambient sound of a busy coffee shop, including background music, people chatting, barista making coffee, and occasional clinks of cups and cutlery.
A busy school cafeteria filled with ambient sounds of students chatting, utensils clinking, trays sliding, and occasional laughter. Anomalously, there is a persistent low mechanical whirring resembling a small drone or faulty air conditioner, blending yet standing out upon close hearing.
An outdoor fruit market in a town square with a lion’s roar.
A recording session of nature sounds in a secluded forest with birds chirping, a babbling brook, rustling underbrush, and woodpecker pecking rhythm. A loud, obnoxious human scream suddenly pierces through the serene nature sounds, completely out of place in the tranquil environment.
A street ambience recording with people chatting, cars passing by, birds chirping, and a breeze rustling leaves. Occasionally, a faint and distorted broadcast signal with indistinct speech and fragmented musical tunes interrupts.
An open-plan office area during regular work hours with sounds of typing on keyboards, sporadic low-level conversations among employees, soft hum of the air conditioner, distant ringing of a few telephones, occasional chair movements, and printers working periodically. Anomalous is an extended sound of jingling keys, soft whispers near the recording device, and a faint scratching sound that resembles nails on chalkboard.
The audio is dominated by bird sounds, with occasional human speech and wind noise. The presence of both bird and human sounds suggests an outdoor, natural setting, possibly a park or a garden near a water body. The absence of other environmental sounds like traffic or machinery suggests a peaceful, rural or suburban area.
Recorded ambiance sounds in a regular classroom filled with students are mixed with dense and random wildlife sounds, distinctly audible in the classroom environment.
The audio suggests a busy urban environment with a mix of human and vehicle sounds, possibly a street or a busy intersection. The wind noise could indicate an open window or a windy day. The impact sounds could be from a vehicle or a door closing, adding to the urban, bustling atmosphere.
A busy street environment with typical city sounds such as traffic, distant honks, people talking, and footsteps. Hidden within this environment are subtle tonal changes like an odd string hum or a faint digital interference blip that doesn’t belong to the typical ambience.
The audio is dominated by the sound of a crowd, possibly a group of children, and the sound of a coin dropping, suggesting a casual, outdoor setting.
An open-air market bustling with activities: people chatting, merchants calling out, children shouting, background noises like rustling of bags and clanking of metallic objects. Among these, an inconsistent clicking or beeping sound appears sporadically amidst normal footsteps.
We evaluated our model, trained exclusively on synthetic anomalies, on this real-world dataset using the RTFM architecture. From Table 3, the model achieved a ROC-AUC of 87.00 and a Precision-Recall AUC of 86.79, demonstrating that training on synthetic samples can enable effective anomaly detection in real-world scenarios without requiring labeled real anomalies during training.
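The metrics above follow the standard ROC-AUC and PR-AUC definitions; the sketch below shows one way to compute them with scikit-learn, assuming clip-level anomaly scores from the trained detector and binary ground-truth labels stored in hypothetical files (the paths and array names are illustrative, not part of our release).

```python
# Sketch of the evaluation metrics, assuming precomputed detector scores
# and binary labels for the real-world test set (file names are hypothetical).
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

y_true = np.load("real_world_labels.npy")   # 1 = anomalous, 0 = normal
y_score = np.load("detector_scores.npy")    # anomaly scores from the detector

roc_auc = roc_auc_score(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)

print(f"ROC-AUC: {100 * roc_auc:.2f}  PR-AUC: {100 * pr_auc:.2f}")
```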
The proposed framework, AADG, allows scalable and realistic anomalous audio data generation. Unlike traditional datasets that focus on industrial sounds, it leverages LLMs to simulate real-world scenarios, making it particularly valuable for audio-only applications.
The modular design enables integration of various LLMs and text-to-audio models, allowing the generation of complex, anomalous scenarios that are hard to capture in real-world data. While current text-to-audio models still face challenges with generating realistic audio for complex prompts and anomalies, the framework introduces multi-stage verification processes to minimize logical flaws, misalignment, and inconsistent outputs.
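As a rough illustration of the multi-stage verification idea mentioned above, the helper below regenerates a stage's output a bounded number of times before discarding the sample; the generator and verifier callables are placeholders for whichever LLM, text-to-audio model, or checker is plugged in, and the function itself is a sketch rather than the framework's actual API.

```python
# Hedged sketch of stage-wise verify-and-retry: regenerate a failed output
# a few times, then drop the sample instead of keeping an unreliable one.
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_with_verification(
    generate: Callable[[], T],
    verify: Callable[[T], bool],
    max_retries: int = 3,
) -> Optional[T]:
    """Return the first candidate that passes verification, or None if the
    stage keeps failing (the sample is then discarded)."""
    for _ in range(max_retries):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None
```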
Additionally, we train a benchmark anomaly detection model on synthetic AADG data and demonstrate its effectiveness in real-world test cases. Although this process still has limitations in handling certain out-of-distribution cases, we fill a critical gap in the field by providing a tool for creating diverse and realistic datasets, which are essential for advancing audio anomaly detection.
For further questions and suggestions, please contact Ksheeraja Raghavan (ksheerar@alumni.cmu.edu) and Samiran Gode (sgode@alumni.cmu.edu).