SONATA Command Line Interface ⌨️

SONATA provides a command-line interface, sonata-asr, for transcribing audio files from the terminal.

Basic Usage

sonata-asr <audio_file>

This will transcribe the audio file using default settings and save the results to <filename>.json.

Command Line Options

Input and Output

Option               Description
<audio_file>         Path to the input audio file
-o, --output FILE    Path to save JSON output (default: <filename>.json)
--text-output        Save transcript to text file (default: <filename>.txt)
--format [TYPE]      Format for text output: concise, default, extended (default: default)

Model Options

Option                      Description
-m, --model MODEL           WhisperX model size to use: tiny, base, small, medium, large, large-v2, large-v3 (default: large-v3)
-d, --device DEVICE         Device to run on: cpu, cuda, mps (default: cpu)
-e, --audio-model PATH      Path to custom audio event detection model
-t, --threshold FLOAT       Threshold for audio event detection (default: 0.5)
--custom-thresholds PATH    Path to JSON file with custom audio event thresholds

Language Options

Option                 Description
-l, --language LANG    Language code for transcription: en (English), ko (Korean), zh (Chinese), ja (Japanese), fr (French), de (German), es (Spanish), it (Italian), pt (Portuguese), ru (Russian), and other languages supported by WhisperX (default: en)

Preprocessing Options

Option                     Description
--preprocess               Preprocess audio (convert format and trim silence)
--split                    Split long audio into segments
--split-length SECONDS     Length of split segments in seconds (default: 30)
--split-overlap SECONDS    Overlap between split segments in seconds (default: 5)
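
To see how --split-length and --split-overlap interact, the sketch below lists the segment boundaries a given pair of values would produce. It assumes each new segment starts (length - overlap) seconds after the previous one and that the last segment is cut at the end of the file; SONATA's actual splitting logic may differ. The split_plan function is a hypothetical helper written for this illustration, not part of the CLI.

```shell
# Hypothetical helper: print "start end" (in seconds) for each segment.
# Assumes segments start every (length - overlap) seconds; this is an
# illustration, not SONATA's confirmed splitting algorithm.
# Usage: split_plan DURATION [LENGTH] [OVERLAP]   (whole seconds)
split_plan() {
  duration=$1
  length=${2:-30}    # same default as --split-length
  overlap=${3:-5}    # same default as --split-overlap
  step=$((length - overlap))
  if [ "$step" -le 0 ]; then
    echo "overlap must be smaller than the segment length" >&2
    return 1
  fi
  start=0
  while [ "$start" -lt "$duration" ]; do
    end=$((start + length))
    # Clamp the final segment to the end of the audio.
    if [ "$end" -gt "$duration" ]; then
      end=$duration
    fi
    echo "$start $end"
    if [ "$end" -ge "$duration" ]; then
      break
    fi
    start=$((start + step))
  done
}
```

Under these assumptions, split_plan 70 (a 70-second file with the default values) prints three segments: 0 30, 25 55, and 50 70. A larger overlap shortens the step between segments and therefore increases the number of segments to process.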

Speaker Diarization Options

Option              Description
--diarize           Enable state-of-the-art speaker diarization using Silero VAD and WavLM embeddings
--num-speakers N    Number of speakers if known (estimated automatically if not provided)

Audio Event Detection Options

Option                              Description
--deep-detect                       Enable multi-scale audio event detection with parallel window sizes for better paralinguistic detection
--deep-detect-scales {1,2,3}        Number of scales to use for deep detection (default: 3, fewer scales = faster processing)
--deep-detect-window-sizes SIZES    Comma-separated list of window sizes in seconds for deep detection (default: 0.2,1.0,2.5)
--deep-detect-hop-sizes SIZES       Comma-separated list of hop sizes in seconds for deep detection (default: 0.1,0.5,1.0)
--deep-detect-parallel              Use parallel processing for multi-scale detection (automatically enables --deep-detect)
--deep-detect-progress              Show detailed progress bars for deep detection processing
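
The window and hop sizes determine how much work each scale involves. The sketch below gives a rough estimate, assuming that the window size and hop size at the same list position form one scale and that each scale slides its window over the clip in hop-sized steps. Both points are assumptions made for illustration, not a description of SONATA's internals, and scale_plan is a hypothetical helper.

```shell
# Hypothetical helper: estimate the number of analysis windows per scale.
# Assumes window i is paired with hop i and a plain sliding window;
# SONATA's real framing may differ.
# Usage: scale_plan DURATION_SECONDS [WINDOW_SIZES] [HOP_SIZES]
scale_plan() {
  awk -v dur="$1" -v wins="${2:-0.2,1.0,2.5}" -v hops="${3:-0.1,0.5,1.0}" 'BEGIN {
    nw = split(wins, w, ",")
    nh = split(hops, h, ",")
    if (nw != nh) {
      print "window and hop lists differ in length" > "/dev/stderr"
      exit 1
    }
    # Work in whole milliseconds to avoid floating-point rounding issues.
    d_ms = int(dur * 1000 + 0.5)
    for (i = 1; i <= nw; i++) {
      w_ms = int(w[i] * 1000 + 0.5)
      h_ms = int(h[i] * 1000 + 0.5)
      if (h_ms <= 0) {
        print "hop sizes must be positive" > "/dev/stderr"
        exit 1
      }
      n = (d_ms < w_ms) ? 0 : int((d_ms - w_ms) / h_ms) + 1
      printf "scale %d: window=%ss hop=%ss windows=%d\n", i, w[i], h[i], n
    }
  }'
}
```

Under these assumptions, scale_plan 10 (a 10-second clip with the default sizes) reports 99, 19, and 8 windows for the three scales. The smallest window accounts for most of the windows, which is consistent with fewer scales meaning faster processing.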

Miscellaneous

Option       Description
--version    Show SONATA version and exit
--help       Show help message and exit

Examples

Basic Transcription

sonata-asr recording.wav

Using GPU with a Specific Model

sonata-asr recording.wav --device cuda --model medium

Transcribing Non-English Audio

sonata-asr korean_speech.mp3 --language ko

With Speaker Diarization

sonata-asr interview.wav --diarize --num-speakers 3

Enhanced Audio Event Detection

sonata-asr podcast.wav --deep-detect --deep-detect-parallel

Preprocessing Long Audio

sonata-asr long_podcast.mp3 --preprocess --split --split-length 60 --split-overlap 10

Customizing Output

sonata-asr meeting.wav --output meeting_data.json --text-output --format extended

Adjusting Detection Sensitivity

sonata-asr comedy_show.wav --threshold 0.2

Lower threshold values will detect more audio events but may increase false positives.
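
The trade-off can be illustrated with a few made-up confidence scores. In the sketch below, the event names, the scores, and the "event score" input format are all hypothetical and are not SONATA's output format; whether SONATA keeps events at or only strictly above the threshold is also an assumption here.

```shell
# Hypothetical helper: keep events whose score is at or above a threshold.
# Reads lines of "<event> <score>" on stdin; $1 is the threshold.
filter_events() {
  awk -v t="$1" '$2 + 0 >= t + 0 { print $1 }'
}
```

Given the three made-up lines "laugh 0.6", "cough 0.3", and "sigh 0.1", filter_events 0.5 keeps only laugh, while filter_events 0.2 also keeps cough. If the cough score came from background noise, the lower threshold would have admitted a false positive.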

Exit Codes

  • 0: Success
  • 1: Invalid arguments or file not found
  • 2: Processing error
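
In scripts, these codes make it possible to tell a bad invocation from a failure during processing. The following is a minimal sketch of a wrapper that does so; describe_exit and run_and_report are hypothetical helper names, and only the three codes listed above are taken from this page.

```shell
# Map an exit status to the meaning documented above.
describe_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "invalid arguments or file not found" ;;
    2) echo "processing error" ;;
    *) echo "undocumented exit code: $1" ;;
  esac
}

# Run any command, report its status on stderr, and pass the status on.
# Usage: run_and_report sonata-asr recording.wav
run_and_report() {
  status=0
  "$@" || status=$?
  echo "$1: $(describe_exit "$status")" >&2
  return "$status"
}
```

Because run_and_report returns the original status, it can be used in a conditional or in a batch loop that skips files for which the command fails.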

Advanced Examples

# Using custom audio event thresholds
sonata-asr audio.wav --custom-thresholds thresholds.json

# Deep detection with custom settings
sonata-asr interview.wav --deep-detect --deep-detect-scales 2 --deep-detect-window-sizes 0.5,2.0 --deep-detect-hop-sizes 0.2,1.0

# Combining multiple options
sonata-asr audio.mp3 --language ko --device cuda --diarize --num-speakers 2 --deep-detect --custom-thresholds custom_thresholds.json

Note: Speaker diarization works with all supported languages. The language option affects only the transcription part, not the speaker identification.



Copyright © 2024 SONATA. Distributed under GPLv3 license.