# 🎙️ Speaker Diarization in SONATA
SONATA provides state-of-the-art speaker diarization capabilities that can identify different speakers in audio recordings. This feature is especially useful for transcribing conversations, interviews, meetings, podcasts, and other multi-speaker audio content.
## Overview
Speaker diarization is the process of partitioning an audio stream into segments according to speaker identity. SONATA implements an advanced diarization pipeline that includes:
- **Voice Activity Detection (VAD)** - Identifies speech segments in the audio using Silero VAD
- **Speaker Embedding Extraction** - Creates speaker-specific embeddings using WavLM
- **Speaker Change Detection** - Identifies points where speakers change
- **Speaker Clustering** - Groups speech segments by speaker identity
- **Speaker Assignment** - Assigns speaker labels to each speech segment and integrates them with the transcription
## Technical Implementation
SONATA’s speaker diarization uses the following components:
### Voice Activity Detection
Silero VAD provides accurate voice activity detection, identifying speech segments while filtering out silence and non-speech audio. It creates a foundation for further speaker analysis by focusing only on relevant speech segments.
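The sketch below shows how Silero VAD is commonly driven via `torch.hub`; SONATA performs this step internally, so treat the call sites as illustrative rather than as SONATA's own API.

```python
import torch

# Load Silero VAD from torch.hub (downloads the model on first use)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

# Silero VAD expects 16 kHz mono audio
wav = read_audio("path/to/audio.wav", sampling_rate=16000)

# Each entry is {"start": sample_index, "end": sample_index} for one speech segment
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
for ts in speech_timestamps:
    print(f"speech: {ts['start'] / 16000:.2f}s - {ts['end'] / 16000:.2f}s")
```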
### Speaker Embeddings

SONATA uses WavLM (specifically the `microsoft/wavlm-base-plus-sv` model) to extract speaker embeddings - high-dimensional vectors that capture the unique vocal characteristics of each speaker. These embeddings are used to distinguish between different speakers in the recording.
WavLM is Microsoft’s state-of-the-art model for audio and speech understanding, and the speaker verification variant is specifically optimized for tasks like speaker diarization.
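As a rough illustration of this step, the checkpoint can be queried with Hugging Face `transformers`; the windowing and pooling SONATA applies internally may differ, and the random array below is only a stand-in for a real speech segment.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")
model.eval()

# Placeholder: one second of 16 kHz audio standing in for a real speech segment
segment = np.random.randn(16000).astype(np.float32)

inputs = extractor(segment, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).embeddings  # shape: (1, 512)

# L2-normalize so segments can be compared with cosine similarity
embedding = torch.nn.functional.normalize(embedding, dim=-1)
```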
### Speaker Change Detection
SONATA implements advanced change-point detection algorithms for improved speaker segmentation (a sketch of the embedding-based approach follows this list):
- **BIC-based Change Detection**: Uses the Bayesian Information Criterion to identify statistical changes in audio features
- **Embedding-based Change Detection**: Analyzes changes in speaker embedding vectors across sliding windows
- **Feature-based Analysis**: Uses MFCCs, delta features, and spectral contrast for robust change detection
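As a minimal sketch of the embedding-based variant (the function below is illustrative, not SONATA's API): flag a change point wherever the cosine distance between embeddings of adjacent sliding windows exceeds a threshold.

```python
import numpy as np

def detect_changes(window_embeddings, threshold=0.35):
    """Return window indices where the speaker likely changes.

    window_embeddings: array of shape (num_windows, dim), one embedding
    per sliding window; threshold is the cosine distance above which
    adjacent windows are treated as different speakers.
    """
    changes = []
    for i in range(1, len(window_embeddings)):
        a, b = window_embeddings[i - 1], window_embeddings[i]
        cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if 1.0 - cosine > threshold:
            changes.append(i)
    return changes
```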
### Speaker Clustering
The extracted embeddings are clustered using hierarchical clustering algorithms (specifically Agglomerative Clustering) to group speech segments by speaker identity. If the number of speakers is known in advance, this can be specified; otherwise, SONATA will attempt to estimate the optimal number of speakers using silhouette analysis.
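A minimal sketch of this step with scikit-learn, assuming L2-normalized embeddings (so the default Euclidean ward linkage behaves similarly to cosine distance); the candidate speaker-count range is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_speakers(embeddings, num_speakers=None, max_speakers=8):
    """Group segment embeddings by speaker; estimate the count if unknown."""
    if num_speakers is not None:
        return AgglomerativeClustering(n_clusters=num_speakers).fit_predict(embeddings)
    best_labels, best_score = None, -1.0
    # Silhouette analysis: keep the speaker count that scores best
    for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```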
## Usage
### Command Line Interface
```bash
# Basic diarization
sonata-asr path/to/audio.wav --diarize

# Specify the number of speakers
sonata-asr path/to/audio.wav --diarize --num-speakers 3

# Full transcription with speaker diarization
sonata-asr path/to/audio.wav --diarize --output transcript.json
```
### Python API
```python
from sonata.core.transcriber import IntegratedTranscriber

# Initialize the transcriber
transcriber = IntegratedTranscriber(device="cpu")

# Transcribe with speaker diarization
result = transcriber.process_audio(
    audio_path="path/to/audio.wav",
    language="en",
    diarize=True,
    num_speakers=3,  # Optional: specify if known
)

# Access speaker-labeled transcript
for segment in result["integrated_transcript"]["rich_text"]:
    if "speaker" in segment:
        print(f"{segment['speaker']}: {segment['text']}")
```
### Standalone Diarization
For applications that only need speaker diarization without transcription:
```python
from sonata.core.speaker_diarization import SpeakerDiarizer

# Initialize the diarizer
diarizer = SpeakerDiarizer(device="cpu")

# Perform diarization
speaker_segments = diarizer.diarize(
    audio_path="path/to/audio.wav",
    num_speakers=3,  # Optional: specify if known
    show_progress=True,
)

# Print results
for segment in speaker_segments:
    print(f"{segment.start:.2f} - {segment.end:.2f}: {segment.speaker}")
```
## Output Format
The speaker diarization output is integrated into SONATA’s transcription results:
```json
{
  "integrated_transcript": {
    "rich_text": [
      {
        "text": "Hello, welcome to the podcast.",
        "start": 0.5,
        "end": 2.3,
        "speaker": "SPEAKER_1"
      },
      {
        "text": "Thanks for having me today.",
        "start": 2.8,
        "end": 4.1,
        "speaker": "SPEAKER_2"
      }
    ]
  }
}
```
## Performance Considerations
- Speaker diarization is a computationally intensive process, especially for longer recordings.
- The accuracy of speaker identification improves with longer speech segments per speaker.
- Performance is generally better when the number of speakers is provided in advance.
- For optimal results, use high-quality audio with minimal background noise.
- GPU acceleration is recommended for processing longer recordings.
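Since the `IntegratedTranscriber` shown above accepts a `device` argument, a conditional fallback keeps the same code working on CPU-only machines:

```python
import torch
from sonata.core.transcriber import IntegratedTranscriber

# Use a CUDA GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
transcriber = IntegratedTranscriber(device=device)
```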
## Advanced Features
- **Overlapped Speech Detection**: SONATA can identify segments where multiple speakers are talking simultaneously
- **Automatic Speaker Number Estimation**: When the number of speakers is not known, SONATA can estimate it
- **Embedding Fusion**: Uses multiple embedding models (WavLM and optionally ECAPA-TDNN) for improved speaker discrimination; a sketch of one fusion strategy follows this list
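A minimal sketch of one common fusion strategy: concatenate the L2-normalized vectors so each model contributes equally to clustering distances. Whether SONATA concatenates, averages, or weights the embeddings is not specified here, so treat this as illustrative.

```python
import numpy as np

def fuse_embeddings(wavlm_emb, ecapa_emb):
    """Concatenate per-model L2-normalized embeddings into one vector."""
    a = wavlm_emb / np.linalg.norm(wavlm_emb)
    b = ecapa_emb / np.linalg.norm(ecapa_emb)
    return np.concatenate([a, b])
```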
## Limitations
- Speaker identities are assigned arbitrary labels (SPEAKER_1, SPEAKER_2, etc.) rather than actual names.
- Very short utterances (less than 1-2 seconds) may be difficult to assign to the correct speaker.
- Performance may vary based on audio quality, background noise, and the number of speakers.
## Advanced Configuration
For advanced users, SONATA provides ways to customize the diarization process:
- **Custom VAD parameters** - Adjust voice activity detection sensitivity (an illustrative Silero example follows this list)
- **Clustering algorithms** - Use different speaker clustering methods
- **Embedding models** - Configure or substitute alternative speaker embedding extraction
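As a concrete example of the first point, Silero VAD itself exposes sensitivity parameters; the knobs below are Silero's own (reusing `wav` and `model` from the VAD snippet earlier), not a documented SONATA configuration surface.

```python
# Stricter VAD: raise the speech-probability threshold, require longer
# speech bursts, and require a longer pause before splitting segments.
speech_timestamps = get_speech_timestamps(
    wav,
    model,
    sampling_rate=16000,
    threshold=0.6,                # default is 0.5; higher rejects borderline audio
    min_speech_duration_ms=300,   # drop very short speech bursts
    min_silence_duration_ms=200,  # longer silence needed to end a segment
)
```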
Please refer to the API documentation for details on advanced configuration options.