SONATA Usage Guide 📝

This document provides detailed examples for using SONATA in different scenarios.

Basic Usage

Initializing the Transcriber

from sonata.core.transcriber import IntegratedTranscriber

# Basic initialization (CPU)
transcriber = IntegratedTranscriber()

# With specific model and device
transcriber = IntegratedTranscriber(
    asr_model="large-v3",  # Options: tiny, base, small, medium, large, large-v2, large-v3
    device="cuda",         # Options: cpu, cuda, mps (Mac)
    compute_type="float16" # Options: float32, float16, int8
)

Processing Audio Files

# Basic transcription with default options
result = transcriber.process_audio("path/to/audio.wav")

# Transcription with specific language
result = transcriber.process_audio(
    audio_path="path/to/audio.wav",
    language="ko"  # Specify language code (e.g., en, ko, zh)
)

# Adjust audio event detection threshold (lower = more sensitive)
result = transcriber.process_audio(
    audio_path="path/to/audio.wav",
    audio_threshold=0.2  # Default is 0.3
)

Saving and Retrieving Results

# Save output to JSON file
transcriber.save_result(result, "output.json")

# Get plain text transcript
plain_text = result["integrated_transcript"]["plain_text"]

# Get formatted transcript with different options
concise_format = transcriber.get_formatted_transcript(result, format_type="concise")
default_format = transcriber.get_formatted_transcript(result, format_type="default")
extended_format = transcriber.get_formatted_transcript(result, format_type="extended")

# Save formatted transcript to a text file
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(concise_format)

Advanced Usage

Speaker Diarization

# Process audio with speaker diarization
result = transcriber.process_audio(
    audio_path="path/to/audio.wav",
    diarize=True,
    hf_token="YOUR_HUGGINGFACE_TOKEN",  # Required for diarization
    min_speakers=2,  # Optional: minimum number of speakers
    max_speakers=5   # Optional: maximum number of speakers
)

# Get transcript with speaker labels
transcript = transcriber.get_formatted_transcript(result, format_type="concise")
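
Per-word speaker labels can also be read from the rich text. The sketch below assumes that, with diarization enabled, word items carry a speaker key; this mirrors WhisperX's convention, so verify it against your SONATA version:

# Print each word with its speaker label
# Assumption: diarized word items include a "speaker" key (WhisperX convention)
for item in result["integrated_transcript"]["rich_text"]:
    if item["type"] == "word":
        speaker = item.get("speaker", "UNKNOWN")
        print(f'{speaker}: {item["content"]} ({item["start"]:.2f}s)')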

Working with Timestamps

# Extract words and audio events with timestamps
for item in result["integrated_transcript"]["rich_text"]:
    if item["type"] == "word":
        word = item["content"]
        start_time = item["start"]
        end_time = item["end"]
        print(f"{word}: {start_time:.2f}s - {end_time:.2f}s")
    # Audio events appear in the same list with their own fields
    elif item["type"] == "audio_event":
        event_type = item["event_type"]
        start_time = item["start"]
        confidence = item["confidence"]
        print(f"{event_type}: {start_time:.2f}s (confidence: {confidence:.2f})")

Processing Long Audio

For very long audio files, split the audio into overlapping segments with the split_audio utility and process each segment separately:

from sonata.utils.audio import split_audio

# Split audio into manageable segments
split_dir = "splits_directory"
segments = split_audio(
    "path/to/long_audio.wav",
    split_dir,
    segment_length=30,  # Length of each segment in seconds
    overlap=5           # Overlap between segments in seconds
)

# Process each segment and combine results
all_results = []
for segment in segments:
    result = transcriber.process_audio(segment["path"])
    all_results.append(result)
    
# Combine results (simplified: this naive join duplicates words that fall
# inside the overlap regions; see the sketch below)
combined_plain_text = " ".join([r["integrated_transcript"]["plain_text"] for r in all_results])
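
Each segment's timestamps are also relative to that segment rather than to the original file. Below is a rough sketch of re-aligning timestamps and skipping duplicated overlap words; it assumes each entry returned by split_audio carries a start offset in seconds (a hypothetical key, so check the actual return value in your version):

# Re-align word timestamps to the original timeline and skip overlap duplicates
# Assumption: each dict returned by split_audio has a "start" offset in seconds
aligned_words = []
last_end = 0.0
for segment, seg_result in zip(segments, all_results):
    offset = segment["start"]  # hypothetical key; verify split_audio's return value
    for item in seg_result["integrated_transcript"]["rich_text"]:
        if item["type"] != "word":
            continue
        start = item["start"] + offset
        if start < last_end:
            # Word lies inside the previous segment's overlap; already emitted
            continue
        aligned_words.append({
            "content": item["content"],
            "start": start,
            "end": item["end"] + offset,
        })
        last_end = item["end"] + offset

combined_text = " ".join(w["content"] for w in aligned_words)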

Audio Preprocessing

from sonata.utils.audio import convert_audio_file, trim_silence

# Convert file format
converted_file = convert_audio_file("input.mp3", "output.wav")

# Trim silence (silence_threshold in dB, min_silence_len in ms)
trimmed_file = trim_silence("input.wav", silence_threshold=-50, min_silence_len=500)

# Process preprocessed audio
result = transcriber.process_audio(trimmed_file)

Working with Results

The result dictionary contains three main sections:

{
    "raw_asr": {
        # Original WhisperX output
        "segments": [...],  # List of transcript segments
        "language": "en"    # Detected language
    },
    "audio_events": [
        # List of detected audio events
        {
            "type": "laughter",
            "start": 12.5,
            "end": 14.2,
            "confidence": 0.89
        },
        ...
    ],
    "integrated_transcript": {
        # Integrated transcript with words and audio events
        "plain_text": "This is a transcript with [laughter] and more text.",
        "rich_text": [
            # Detailed list of words and events with timing information
            {"type": "word", "content": "This", "start": 0.5, "end": 0.7, ...},
            ...
        ]
    }
}
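
For example, using only the keys shown above, you can filter the detected audio events by confidence:

# Keep only audio events detected with high confidence
high_confidence_events = [
    event for event in result["audio_events"]
    if event["confidence"] >= 0.8
]
for event in high_confidence_events:
    print(f'{event["type"]}: {event["start"]:.1f}s - {event["end"]:.1f}s '
          f'(confidence: {event["confidence"]:.2f})')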

Performance Tips

  • Use GPU acceleration (device="cuda") for faster processing
  • Set compute_type="float16" for reduced memory usage with minimal accuracy loss
  • For long audio files, use the split audio functionality
  • Preprocess audio to remove silence for faster and more accurate results
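
A minimal sketch that combines these tips, assuming a CUDA-capable GPU and using only the APIs shown earlier in this guide:

from sonata.core.transcriber import IntegratedTranscriber
from sonata.utils.audio import trim_silence

# GPU transcriber with half-precision compute for lower memory use
transcriber = IntegratedTranscriber(
    asr_model="large-v3",
    device="cuda",
    compute_type="float16",
)

# Trim silence first, then transcribe the cleaned-up file
trimmed = trim_silence("path/to/audio.wav", silence_threshold=-50, min_silence_len=500)
result = transcriber.process_audio(trimmed)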

Copyright © 2024 SONATA. Distributed under the GPLv3 license.