SONATA: SOund and Narrative Advanced Transcription Assistant

SONATA is an advanced audio transcription system that captures human expressions including emotive sounds and non-verbal cues, providing rich and contextual transcription results.



System Architecture

The following diagram illustrates the SONATA system architecture:

```mermaid
graph TD
    A[Audio Input] --> B[WhisperX]
    A --> C[AudioSet AST]
    B --> D[Speech Transcription]
    C --> E[Audio Event Detection]
    D --> F[Integrated Transcriber]
    E --> F
    F --> G[Rich Transcript with<br/>Emotive Sounds]

    subgraph "Speaker Diarization"
    H[PyAnnote Audio] --> I[Speaker Labels]
    end

    I --> F
```

Features

Speech Recognition

High-accuracy speech-to-text transcription powered by WhisperX with support for multiple languages.

Emotive Sounds

Recognition of 523+ emotive sounds and non-verbal cues for more human-like transcripts.

Multi-language

Support for 10 languages including English, Korean, Chinese, Japanese, and major European languages.

Speaker Diarization

Identify and label different speakers in conversations with Silero VAD and WavLM embeddings.

Rich Timestamps

Word-level timing information for accurate alignment between transcript and audio.

Customizable

Adjustable audio event detection thresholds and flexible output formats.
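As a sketch of threshold-based customization, the snippet below filters detected audio events by confidence after transcription, using the result schema shown in the Sample Output section. The helper function is illustrative and not part of the SONATA API:

```python
def filter_audio_events(rich_text, threshold=0.8):
    """Keep all words, and only audio events at or above the confidence threshold."""
    return [
        seg for seg in rich_text
        if seg["type"] != "audio_event" or seg.get("confidence", 0.0) >= threshold
    ]

# Segments in the shape produced by SONATA's rich_text output
segments = [
    {"type": "word", "content": "Hello", "start": 0.5, "end": 0.7},
    {"type": "audio_event", "event_type": "laughter", "start": 1.3, "end": 2.1, "confidence": 0.92},
    {"type": "audio_event", "event_type": "cough", "start": 4.5, "end": 5.0, "confidence": 0.41},
]

kept = filter_audio_events(segments, threshold=0.8)
print([seg.get("event_type", seg.get("content")) for seg in kept])
# -> ['Hello', 'laughter']
```

Raising the threshold trades recall for precision: low-confidence events such as the `cough` above are dropped, while words always pass through untouched.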

Quick Start

Installation

```bash
pip install sonata-asr
```

Basic Usage

```python
from sonata.core.transcriber import IntegratedTranscriber

# Initialize the transcriber
transcriber = IntegratedTranscriber(asr_model="large-v3", device="cpu")

# Transcribe an audio file
result = transcriber.process_audio("path/to/audio.wav", language="en")
print(result["integrated_transcript"]["plain_text"])
```
```bash
# Basic usage
sonata-asr path/to/audio.wav

# With speaker diarization
sonata-asr path/to/audio.wav --diarize --num-speakers 3
```

Sample Output

```json
{
  "integrated_transcript": {
    "plain_text": "Hello everyone [laughter] I'm excited to show you SONATA today.",
    "rich_text": [
      {"type": "word", "content": "Hello", "start": 0.5, "end": 0.7},
      {"type": "word", "content": "everyone", "start": 0.8, "end": 1.2},
      {"type": "audio_event", "event_type": "laughter", "start": 1.3, "end": 2.1, "confidence": 0.92},
      {"type": "word", "content": "I'm", "start": 2.3, "end": 2.4},
      {"type": "word", "content": "excited", "start": 2.5, "end": 2.9},
      {"type": "word", "content": "to", "start": 3.0, "end": 3.1},
      {"type": "word", "content": "show", "start": 3.2, "end": 3.4},
      {"type": "word", "content": "you", "start": 3.5, "end": 3.6},
      {"type": "word", "content": "SONATA", "start": 3.7, "end": 4.0},
      {"type": "word", "content": "today", "start": 4.1, "end": 4.3}
    ]
  }
}
```
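The `rich_text` segments can be flattened back into an annotated string, rendering words verbatim and audio events as bracketed tags. A minimal sketch, assuming the segment schema shown above (the function name is illustrative, not part of the SONATA API):

```python
def to_annotated_text(segments):
    """Join words and render audio events as bracketed tags, e.g. [laughter]."""
    parts = [
        seg["content"] if seg["type"] == "word" else f"[{seg['event_type']}]"
        for seg in segments
    ]
    return " ".join(parts)

# A few segments from the sample output above
rich_text = [
    {"type": "word", "content": "Hello", "start": 0.5, "end": 0.7},
    {"type": "word", "content": "everyone", "start": 0.8, "end": 1.2},
    {"type": "audio_event", "event_type": "laughter", "start": 1.3, "end": 2.1, "confidence": 0.92},
]

print(to_annotated_text(rich_text))
# -> Hello everyone [laughter]
```

This is essentially how `plain_text` relates to `rich_text`: the same sequence of segments, with timing and confidence metadata stripped away.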

Need help? Have questions? Join our community discussions or open an issue on GitHub.

Documentation

Features

Detailed overview of SONATA's capabilities and features

Usage Guide

Comprehensive guide to using SONATA in your projects

CLI Reference

Command-line interface documentation and examples

Audio Events

List of supported audio events and detection capabilities

Speaker Diarization

Guide to using speaker diarization to identify different speakers



Copyright © 2024 SONATA. Distributed under GPLv3 license.