SONATA provides a command-line interface, sonata-asr, for transcribing audio files.
Basic Usage
sonata-asr <audio_file>
This transcribes the audio file with the default settings and saves the results to <filename>.json.
Command Line Options
Input and Output
| Option | Description |
| --- | --- |
| `<audio_file>` | Path to the input audio file |
| `-o, --output FILE` | Path to save JSON output (default: `<filename>.json`) |
| `--text-output` | Save transcript to text file (default: `<filename>.txt`) |
| `--format [TYPE]` | Format for text output: concise, default, extended (default: default) |
Model Options
| Option | Description |
| --- | --- |
| `-m, --model MODEL` | WhisperX model size to use: tiny, base, small, medium, large, large-v2, large-v3 (default: large-v3) |
| `-d, --device DEVICE` | Device to run on: cpu, cuda, mps (default: cpu) |
| `-e, --audio-model PATH` | Path to custom audio event detection model |
| `-t, --threshold FLOAT` | Threshold for audio event detection (default: 0.5) |
| `--custom-thresholds PATH` | Path to JSON file with custom audio event thresholds |
Language Options
| Option | Description |
| --- | --- |
| `-l, --language LANG` | Language code for transcription: en (English), ko (Korean), zh (Chinese), ja (Japanese), fr (French), de (German), es (Spanish), it (Italian), pt (Portuguese), ru (Russian), and other languages supported by WhisperX (default: en) |
Preprocessing Options
| Option | Description |
| --- | --- |
| `--preprocess` | Preprocess audio (convert format and trim silence) |
| `--split` | Split long audio into segments |
| `--split-length SECONDS` | Length of split segments in seconds (default: 30) |
| `--split-overlap SECONDS` | Overlap between split segments in seconds (default: 5) |
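To make the effect of --split-length and --split-overlap concrete, the following Python sketch lays out segment boundaries for a recording of a given duration. It assumes each segment starts (split length minus overlap) seconds after the previous one and that the last segment is cut at the end of the audio. SONATA's actual splitting logic may differ, and `segment_bounds` is a hypothetical helper, not part of SONATA.

```python
def segment_bounds(duration, split_length=30.0, overlap=5.0):
    """Return (start, end) pairs in seconds covering `duration` seconds of audio.

    Assumption: consecutive segments start `split_length - overlap` seconds
    apart. This illustrates the options; it is not SONATA's implementation.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= overlap < split_length:
        raise ValueError("overlap must be at least 0 and less than split_length")

    step = split_length - overlap
    bounds = []
    start = 0.0
    while start < duration:
        # Cut the final segment at the end of the recording.
        end = min(start + split_length, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start += step
    return bounds


# Print the layout for a 70-second recording with the documented defaults.
for start, end in segment_bounds(70.0):
    print(f"{start:5.1f} s -> {end:5.1f} s")
```

Under these assumptions and the default values, a 70-second recording is covered by three segments: 0 to 30 s, 25 to 55 s, and 50 to 70 s.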
Speaker Diarization Options
| Option | Description |
| --- | --- |
| `--diarize` | Enable state-of-the-art speaker diarization using Silero VAD and WavLM embeddings |
| `--num-speakers N` | Number of speakers, if known (estimated automatically if not provided) |
Audio Event Detection Options
| Option | Description |
| --- | --- |
| `--deep-detect` | Enable multi-scale audio event detection with parallel window sizes for better paralinguistic detection |
| `--deep-detect-scales {1,2,3}` | Number of scales to use for deep detection (default: 3; fewer scales run faster) |
| `--deep-detect-window-sizes SIZES` | Comma-separated list of window sizes in seconds for deep detection (default: 0.2,1.0,2.5) |
| `--deep-detect-hop-sizes SIZES` | Comma-separated list of hop sizes in seconds for deep detection (default: 0.1,0.5,1.0) |
| `--deep-detect-parallel` | Use parallel processing for multi-scale detection (automatically enables `--deep-detect`) |
| `--deep-detect-progress` | Show detailed progress bars for deep detection processing |
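The window and hop sizes determine how many analysis windows each scale produces, which is a rough guide to the cost of deep detection. The Python sketch below parses the two comma-separated lists and counts full windows per scale, assuming a plain sliding window with no padding at the end of the audio. That windowing rule is an assumption made for illustration, `windows_per_scale` is a hypothetical helper, and the interaction with --deep-detect-scales is not modeled.

```python
def parse_sizes(text):
    """Parse a comma-separated list such as "0.2,1.0,2.5" into floats."""
    return [float(part) for part in text.split(",")]


def windows_per_scale(duration, window_sizes="0.2,1.0,2.5", hop_sizes="0.1,0.5,1.0"):
    """Return (window, hop, count) for each scale.

    Assumption: a sliding window without padding, so a scale yields
    floor((duration - window) / hop) + 1 windows when the audio is at
    least one window long, and 0 otherwise.
    """
    windows = parse_sizes(window_sizes)
    hops = parse_sizes(hop_sizes)
    if len(windows) != len(hops):
        raise ValueError("window and hop lists must have the same length")

    # Work in whole milliseconds to avoid floating-point rounding surprises.
    duration_ms = round(duration * 1000)
    result = []
    for window, hop in zip(windows, hops):
        window_ms = round(window * 1000)
        hop_ms = round(hop * 1000)
        if hop_ms <= 0 or window_ms <= 0:
            raise ValueError("window and hop sizes must be positive")
        if duration_ms < window_ms:
            count = 0
        else:
            count = (duration_ms - window_ms) // hop_ms + 1
        result.append((window, hop, count))
    return result


# Window counts for a 10-second clip with the documented default sizes.
for window, hop, count in windows_per_scale(10.0):
    print(f"window {window} s, hop {hop} s: {count} windows")
```

With the default sizes, most of the windows come from the smallest scale, which is consistent with the note that using fewer scales speeds up processing.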
Miscellaneous
| Option | Description |
| --- | --- |
| `--version` | Show SONATA version and exit |
| `--help` | Show help message and exit |
Examples
Basic Transcription
sonata-asr recording.wav
Using GPU with a Specific Model
sonata-asr recording.wav --device cuda --model medium
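When sonata-asr is called from a script, for example to process many files, building the argument list in code avoids shell quoting problems. The Python sketch below is one possible wrapper. It uses only options listed in the tables above, and `build_command` and `transcribe` are hypothetical helper names, not part of SONATA.

```python
import shlex
import subprocess


def build_command(audio_file, output=None, device=None, model=None,
                  language=None, diarize=False, num_speakers=None):
    """Build the argument list for one sonata-asr invocation.

    Options left at None (or False) are omitted, so the CLI's own
    defaults apply.
    """
    cmd = ["sonata-asr", str(audio_file)]
    if output is not None:
        cmd += ["--output", str(output)]
    if device is not None:
        cmd += ["--device", device]
    if model is not None:
        cmd += ["--model", model]
    if language is not None:
        cmd += ["--language", language]
    if diarize:
        cmd.append("--diarize")
    if num_speakers is not None:
        cmd += ["--num-speakers", str(num_speakers)]
    return cmd


def transcribe(audio_file, **options):
    """Run sonata-asr and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(audio_file, **options), check=True)


if __name__ == "__main__":
    # Show the command that would be run, without executing it.
    cmd = build_command("recording.wav", device="cuda", model="medium")
    print(shlex.join(cmd))
```

Passing the arguments as a list means file names with spaces need no extra quoting. Calling `transcribe("recording.wav", diarize=True, num_speakers=2)` would run the tool with speaker diarization enabled, assuming sonata-asr is installed and on the PATH.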