Did You Hear That? Introducing AADG: A Framework for Anomalous Audio Data Generation

Ksheeraja Raghavan*, Samiran Gode*, Ankit Shah, Surabhi Raghavan, Wolfram Burgard, Bhiksha Raj, Rita Singh

* indicates equal contributions

SynthData@ICLR 2025, Poster Presentation

Link: Read the paper


ABSTRACT

Did You Hear That?
Anomalous Audio Data Generation (AADG): a framework that synthetically generates real-life audio data with anomalies by leveraging LLMs as world models

We introduce a novel, general-purpose audio generation framework specifically designed for audio anomaly detection (AAD) and localization. Unlike existing datasets, which predominantly focus on industrial and machine-related sounds, our framework covers a broader range of environments, making it particularly useful in real-world scenarios where only audio data are available, such as telephonic audio. To generate such data, we propose a new method, Anomalous Audio Data Generation (AADG), inspired by the LLM-Modulo [1] framework, which leverages Large Language Models (LLMs) as world models to simulate real-world scenarios. The tool is modular, allowing for a plug-and-play approach. It works by first using an LLM to propose plausible real-world scenarios. An LLM then extracts the constituent sounds, along with the order and manner in which they should be merged into a coherent whole. The constituent audios are generated using text-to-audio models and merged according to the LLM's instructions. We include rigorous verification at each stage of the pipeline, ensuring the reliability of the generated data. The data produced using AADG allows us to train an anomaly detection model, which we test on real-world data to demonstrate the effectiveness of our framework. We also expose the shortcomings of current state-of-the-art (SOTA) audio models, particularly in handling out-of-distribution scenarios, making a case for improving their performance by training on AADG-generated data. Our contributions fill a critical void in audio anomaly detection resources and provide a scalable tool for generating diverse, realistic audio data.
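To make the two LLM stages concrete, here is a minimal sketch of scenario generation followed by constituent-sound extraction. It is illustrative only: the chat helper stands in for any chat-completion backend, and the prompt wording and JSON schema (description, start time, gain, anomaly flag) are assumptions rather than the paper's exact prompts.

```python
import json

def chat(prompt: str) -> str:
    """Stand-in for any chat-completion LLM call; wire up your own backend."""
    raise NotImplementedError

# Stage 1: use the LLM as a world model to propose a plausible acoustic scene.
scene = chat(
    "Describe a plausible real-world acoustic scene containing one subtle "
    "anomalous sound. One paragraph, audio-centric details only."
)

# Stage 2: decompose the scene into constituent sounds plus merge instructions
# (illustrative JSON schema, not the paper's exact format).
plan = json.loads(chat(
    "Decompose this scene into constituent sounds. Return a JSON list of "
    '{"description": str, "start_sec": float, "gain_db": float, '
    '"is_anomaly": bool} objects.\n\nScene: ' + scene
))
for component in plan:
    print(component["description"], component["start_sec"], component["is_anomaly"])
```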


METHOD

Pipeline
Illustration of the pipeline for generating and verifying anomalous audio data. The process begins with scene generation, followed by information extraction using a Large Language Model (LLM). Individual audio components are synthesized from text descriptions, rigorously verified for accuracy, and merged according to the LLM's instructions, culminating in a dataset of realistic anomalous audio.
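To make the synthesis-and-merge step concrete, the sketch below renders each extracted component with a text-to-audio model and mixes the results at the LLM-specified offsets and gains. The text_to_audio stand-in, the sample rate, and the plan fields are assumptions rather than the paper's exact implementation; mixing here is simple gain-scaled summation with peak normalization.

```python
import numpy as np

SR = 16_000  # assumed sample rate

def text_to_audio(description: str) -> np.ndarray:
    """Stand-in for any text-to-audio model; returns mono samples."""
    raise NotImplementedError

def merge(plan: list[dict], total_sec: float = 30.0) -> np.ndarray:
    """Mix constituent sounds into one clip per the LLM's merge instructions."""
    mix = np.zeros(int(total_sec * SR), dtype=np.float32)
    for comp in plan:
        audio = text_to_audio(comp["description"]).astype(np.float32)
        gain = 10.0 ** (comp["gain_db"] / 20.0)   # dB -> linear amplitude
        start = int(comp["start_sec"] * SR)
        if start >= len(mix):
            continue  # component starts past the end of the clip
        end = min(start + len(audio), len(mix))
        mix[start:end] += gain * audio[: end - start]
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix      # normalize only if clipping
```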

EXAMPLES OF AUDIO GENERATED BY OUR MODEL

Prompt

Recording the ambient sound of a busy coffee shop, including background music, people chatting, barista making coffee, and occasional clinks of cups and cutlery.

Our Audio
Prompt

A busy school cafeteria filled with ambient sounds of students chatting, utensils clinking, trays sliding, and occasional laughter. Anomalously, there is a persistent low mechanical whirring resembling a small drone or faulty air conditioner, blending yet standing out upon close hearing.

Our Audio
Prompt

An outdoor fruit market in a town square with a lion’s roar.

Our Audio

EVALUATION

Comparison with Stable Audio

Prompt

A recording session of nature sounds in a secluded forest with birds chirping, a babbling brook, rustling underbrush, and woodpecker pecking rhythm. A loud, obnoxious human scream suddenly pierces through the serene nature sounds, completely out of place in the tranquil environment.

Our Audio
Stable Audio
Prompt

A street ambience recording with people chatting, cars passing by, birds chirping, and a breeze rustling leaves. Occasionally, a faint and distorted broadcast signal with indistinct speech and fragmented musical tunes interrupts.

Our Audio
Stable Audio
Prompt

An open-plan office area during regular work hours with sounds of typing on keyboards, sporadic low-level conversations among employees, soft hum of the air conditioner, distant ringing of a few telephones, occasional chair movements, and printers working periodically. Anomalous is an extended sound of jingling keys, soft whispers near the recording device, and a faint scratching sound that resembles nails on chalkboard.

Our Audio
Stable Audio

What Does GAMA Have to Say?

We feed audio generated by AADG to GAMA, a state-of-the-art audio-language model, and compare its descriptions against the original prompts. GAMA consistently misses the injected anomalies, illustrating how current SOTA audio models struggle with out-of-distribution sounds.

GAMA Output

The audio is dominated by bird sounds, with occasional human speech and wind noise. The presence of both bird and human sounds suggests an outdoor, natural setting, possibly a park or a garden near a water body. The absence of other environmental sounds like traffic or machinery suggests a peaceful, rural or suburban area.

Original Prompt

Recorded ambiance sounds in a regular classroom filled with students are mixed with dense and random wildlife sounds, distinctly audible in the classroom environment.

GAMA Output

The audio suggests a busy urban environment with a mix of human and vehicle sounds, possibly a street or a busy intersection. The wind noise could indicate an open window or a windy day. The impact sounds could be from a vehicle or a door closing, adding to the urban, bustling atmosphere.

Original Prompt

A busy street environment with typical city sounds such as traffic, distant honks, people talking, and footsteps. Hidden within this environment are subtle tonal changes like an odd string hum or a faint digital interference blip that doesn’t belong to the typical ambience.

GAMA Output

The audio is dominated by the sound of a crowd, possibly a group of children, and the sound of a coin dropping, suggesting a casual, outdoor setting.

Original Prompt

An open-air market bustling with activities: people chatting, merchants calling out, children shouting, background noises like rustling of bags and clanking of metallic objects. Among these, an inconsistent clicking or beeping sound appears sporadically amidst normal footsteps.


Real-world Audio Anomaly Detection

Performance of our model on the real-world test dataset: (a) Precision-Recall curve; (b) ROC curve.

Anomaly scores and feature magnitudes from our method: (a) train crash audio; (b) normal office scenario.

We evaluated our model, trained exclusively on synthetic anomalies, on this real-world dataset using the RTFM architecture. As shown in the figures above, the model achieved a ROC-AUC of 87.00 and a Precision-Recall AUC of 86.79, demonstrating that training on synthetic samples can enable effective anomaly detection in real-world scenarios without requiring labeled real anomalies during training.
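For reference, both reported metrics can be computed from per-clip anomaly scores with scikit-learn, as in the sketch below; the label and score arrays are placeholders, and PR-AUC is computed here as average precision, a common summary of the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Placeholder arrays: 1 = anomalous clip, 0 = normal; scores from the detector.
labels = np.array([1, 0, 1, 0, 0, 1])
scores = np.array([0.91, 0.12, 0.78, 0.33, 0.08, 0.66])

print(f"ROC-AUC: {100 * roc_auc_score(labels, scores):.2f}")
print(f"PR-AUC:  {100 * average_precision_score(labels, scores):.2f}")
```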


CONCLUSION

The proposed framework, AADG, enables scalable generation of realistic anomalous audio data. Unlike traditional datasets that focus on industrial sounds, it leverages LLMs to simulate real-world scenarios, making it particularly valuable for audio-only applications.

The modular design enables integration of various LLMs and text-to-audio models, allowing the generation of complex, anomalous scenarios that are hard to capture in real-world data. While current text-to-audio models still face challenges with generating realistic audio for complex prompts and anomalies, the framework introduces multi-stage verification processes to minimize logical flaws, misalignment, and inconsistent outputs.
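One plausible instantiation of such a verification stage is an audio-text alignment check: embed each generated component and its description with a CLAP-style model, then flag the component for regeneration when their similarity falls below a threshold. The encoders, the threshold value, and the retry policy in the sketch below are assumptions, not the paper's exact procedure.

```python
import numpy as np

def embed_audio(audio: np.ndarray) -> np.ndarray:
    """Stand-in for a CLAP-style audio encoder (assumed interface)."""
    raise NotImplementedError

def embed_text(text: str) -> np.ndarray:
    """Stand-in for the matching text encoder (assumed interface)."""
    raise NotImplementedError

def verify(audio: np.ndarray, description: str, threshold: float = 0.3) -> bool:
    """Accept a generated component only if it acoustically matches its description."""
    a, t = embed_audio(audio), embed_text(description)
    cosine = float(a @ t / (np.linalg.norm(a) * np.linalg.norm(t)))
    return cosine >= threshold  # below threshold: flag for regeneration
```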

Additionally, we train a benchmark anomaly detection model on synthetic AADG data and demonstrate its effectiveness in real-world test cases. Although this process still has limitations in handling certain out-of-distribution cases, we fill a critical gap in the field by providing a tool for creating diverse and realistic datasets, which are essential for advancing audio anomaly detection.


CONTACT

For further questions and suggestions, please contact Ksheeraja Raghavan (ksheerar@alumni.cmu.edu) and Samiran Gode (sgode@alumni.cmu.edu).