Supported Languages

SONATA uses WhisperX (built on Whisper large-v3) for speech recognition. It supports many languages, and transcription accuracy varies by language.

Language Performance

Whisper large-v3 reduces errors by 10-20% compared with large-v2 across a wide variety of languages. The table below summarizes supported languages and their performance characteristics. Accuracy is given as word error rate (WER) or, where marked, character error rate (CER); lower is better.

Language	Code	Transcription Accuracy	Audio Event Detection	Speaker Diarization	Notes
English (US/UK)	`en`	Excellent (<5% WER)	Full Support	Full Support	Best overall performance
Spanish	`es`	Very High (~7% WER)	Full Support	Full Support	Strong performance for most dialects
French	`fr`	Very High (~8% WER)	Full Support	Full Support	Good handling of accents
German	`de`	Very High (~7% WER)	Full Support	Full Support	Reliable for most contexts
Portuguese	`pt`	Very High (~9% WER)	Full Support	Full Support	Works well with both European and Brazilian variants
Italian	`it`	Very High (~8% WER)	Full Support	Full Support	Consistent performance
Dutch	`nl`	High (~10% WER)	Full Support	Full Support	Good overall reliability
Polish	`pl`	High (~12% WER)	Full Support	Full Support	Strong for clear speech
Ukrainian	`uk`	High (~12% WER)	Full Support	Full Support	Improved in large-v3
Russian	`ru`	High (~13% WER)	Full Support	Full Support	Better with formal speech
Japanese	`ja`	High* (~11% CER)	Full Support	Full Support	*Uses Character Error Rate
Chinese (Mandarin)	`zh`	High* (~12% CER)	Full Support	Full Support	*Uses Character Error Rate; supports both traditional and simplified characters
Cantonese	`yue`	High* (~15% CER)	Full Support	Full Support	*Uses Character Error Rate; new in large-v3
Korean	`ko`	High* (~12% CER)	Full Support	Full Support	*Uses Character Error Rate
Hindi	`hi`	Moderate (~17% WER)	Full Support	Full Support	Better with standard accents
Turkish	`tr`	Moderate (~16% WER)	Full Support	Full Support	Improved in large-v3
Arabic	`ar`	Moderate (~18% WER)	Full Support	Full Support	Handles Modern Standard Arabic best
Vietnamese	`vi`	Moderate (~15% WER)	Full Support	Full Support	Better with northern dialects
Tamil	`ta`	Moderate (~19% WER)	Full Support	Full Support	Improving with model updates
Urdu	`ur`	Moderate (~19% WER)	Full Support	Full Support	Better with clear speech

Additionally, Whisper large-v3 supports approximately 80 more languages with varying levels of accuracy. Languages with less training data typically have higher error rates (20-60% WER/CER).
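For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how WER and CER are commonly computed: the edit distance between a reference transcript and the model output, divided by the reference length. The function names are illustrative and not part of SONATA. Published benchmark figures usually also normalize case and punctuation before scoring, which this sketch omits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from output
                curr[j - 1] + 1,     # insertion: extra token in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, used for scripts written without word spacing."""
    if not reference:
        raise ValueError("reference transcript is empty")
    # Whitespace handling varies between benchmarks; characters are kept as-is here.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("こんにちは", "こんばんは"))
```

In the first call one word is substituted and one is dropped, so two edits are counted against six reference words. In the second, two of five characters differ.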

Using Different Languages

To specify a language, pass its code from the table above to the CLI or the Python API:

CLI Usage

# Specify the language with the -l or --language flag
sonata-asr audio.wav -l ko

# Example with multiple options
sonata-asr audio.wav -l zh --diarize --deep-detect
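When the CLI is driven from a script, the command can be assembled as an argument list and run with `subprocess`. This is a sketch only: `build_sonata_command` and `run_sonata` are hypothetical helper names, it uses just the flags shown above (`-l`, `--diarize`, `--deep-detect`), and it assumes `sonata-asr` is on the `PATH`.

```python
import subprocess


def build_sonata_command(audio_path, language=None, diarize=False, deep_detect=False):
    """Build the sonata-asr argument list from the flags shown above."""
    cmd = ["sonata-asr", audio_path]
    if language:
        cmd += ["-l", language]  # omit -l to let SONATA auto-detect
    if diarize:
        cmd.append("--diarize")
    if deep_detect:
        cmd.append("--deep-detect")
    return cmd


def run_sonata(audio_path, **options):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_sonata_command(audio_path, **options), check=True)


print(build_sonata_command("audio.wav", language="zh", diarize=True, deep_detect=True))
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.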

Python API Usage

from sonata.core.transcriber import IntegratedTranscriber

# Initialize the transcriber
transcriber = IntegratedTranscriber(asr_model="large-v3", device="cpu")

# Specify the language in process_audio
result = transcriber.process_audio("audio.wav", language="ja")
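A common mistake is passing a country code where a language code is expected, for example `jp` instead of `ja`. The sketch below validates a code against the table before calling `process_audio`. The helper names and the two lookup tables are illustrative and not part of SONATA; the only SONATA call assumed is `process_audio(path, language=...)` as shown above.

```python
# Codes listed in the table above. Whisper accepts more; extend this set as needed.
DOCUMENTED_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "pt", "it", "nl", "pl", "uk", "ru",
    "ja", "zh", "yue", "ko", "hi", "tr", "ar", "vi", "ta", "ur",
})

# Country codes that are often typed by mistake in place of the language code.
COUNTRY_CODE_MIXUPS = {"jp": "ja", "kr": "ko", "cn": "zh", "ua": "uk"}


def normalize_language_code(code):
    """Return a lowercase, validated language code or raise ValueError."""
    cleaned = code.strip().lower()
    if cleaned in DOCUMENTED_LANGUAGES:
        return cleaned
    hint = COUNTRY_CODE_MIXUPS.get(cleaned)
    if hint:
        raise ValueError(f"Unknown language code {code!r}; did you mean {hint!r}?")
    raise ValueError(f"Unknown language code {code!r}")


def transcribe_in_language(transcriber, audio_path, code):
    """Validate the code, then call process_audio as in the example above."""
    return transcriber.process_audio(audio_path, language=normalize_language_code(code))
```

With the transcriber created above, `transcribe_in_language(transcriber, "audio.wav", "JA")` would pass `language="ja"`, while `"jp"` would fail early with a hint instead of producing a poor transcript.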

Language Auto-detection

If no language is specified, SONATA will attempt to auto-detect the language in the audio. This works best with longer audio samples (>30 seconds).

# Auto-detect language from the CLI (omit the -l flag)
sonata-asr audio.wav

# Auto-detect language in Python (reuses the transcriber created earlier)
result = transcriber.process_audio("audio.wav")  # language not specified
detected_language = result["detected_language"]
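Auto-detection can misfire on short or noisy clips. One possible safeguard, sketched below, is to accept the detected language only when it is one you expect and otherwise rerun with a forced language. `transcribe_auto` is a hypothetical helper, and the code assumes the result behaves like a dict with a `detected_language` key, as in the snippet above.

```python
def transcribe_auto(transcriber, audio_path, expected_languages, fallback_language):
    """Auto-detect first; rerun with fallback_language if detection looks wrong.

    Returns a (result, language_used) tuple.
    """
    result = transcriber.process_audio(audio_path)  # language not specified
    detected = result.get("detected_language")      # assumes a dict-like result
    if detected in expected_languages:
        return result, detected
    # Detection is missing or outside the expected set: force the fallback.
    result = transcriber.process_audio(audio_path, language=fallback_language)
    return result, fallback_language
```

The rerun doubles processing time for misdetected files, so this pattern suits workloads where the set of plausible languages is small and known in advance.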

Performance Factors

Several factors influence transcription accuracy:

Audio Quality: Clear audio with minimal background noise yields better results
Speaker Accents: Non-native accents may reduce accuracy
Technical Terminology: Specialized vocabulary can be challenging
Speech Clarity: Mumbling or very fast speech may increase error rates
Audio Length: Longer samples provide more context for accurate transcription

Tips for Better Transcription

Choose the Right Model: Use the largest model your hardware can support for best accuracy
Specify Language: Always specify the language when known rather than relying on auto-detection
Clean Audio: When possible, use audio with minimal background noise
Pre-processing: Use audio normalization for better results with quiet recordings
Post-processing: For specialized domains, consider domain-specific post-processing, such as correcting recurring mis-transcriptions of technical terms
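The pre-processing and post-processing tips can be illustrated with two small helpers. Both are sketches under stated assumptions and are not SONATA features: `peak_normalize` expects audio already decoded to float samples in the range -1.0 to 1.0, and `apply_glossary` expects a hand-built mapping from known mis-transcriptions to the correct domain term. The word-boundary matching suits space-delimited languages and would need adapting for scripts written without spaces.

```python
import re


def peak_normalize(samples, target_peak=0.9):
    """Scale float samples so the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # empty or silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def apply_glossary(text, glossary):
    """Replace known mis-transcriptions with the intended domain terms."""
    for wrong, right in glossary.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function replacement keeps backslashes in `right` literal.
        text = re.sub(pattern, lambda _match, r=right: r, text, flags=re.IGNORECASE)
    return text


quiet = [0.1, -0.2, 0.05]
print(peak_normalize(quiet))

glossary = {"a trial fibrillation": "atrial fibrillation"}
print(apply_glossary("the patient has a trial fibrillation", glossary))
```

Peak normalization only raises the overall level; it does not remove background noise, so it complements rather than replaces clean source audio.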

Future Language Support

The SONATA team continuously works to improve language support through:

Integration of the latest WhisperX/Whisper models as they become available
Fine-tuning for specific languages and domains
Expanding audio event detection capabilities across languages
Improved handling of code-switching (language mixing within a single audio file)

If you’re interested in contributing to language support or have testing data in specific languages, please see our contribution guidelines.