# Speech Emotion Valence Classifier - Multilingual

A transformer-based emotion classifier trained on multilingual wav2vec2 multi-layer embeddings for classifying emotional valence in speech (9 languages).
## Model Details

### Model Description
- Architecture: Transformer Encoder with patch embedding and learnable positional embeddings
- Input Features: wav2vec2 embeddings (768-dimensional, last hidden state from wav2vec2-large-xlsr-53)
- Pre-trained Base: facebook/wav2vec2-large-xlsr-53 (multilingual)
- Framework: PyTorch
- Task: Audio emotion valence classification
- Model Size: ~3M parameters
## Model Architecture
```
Input (768-dim wav2vec2 last hidden state, mean pooled)
        ↓
Patch Embedding (patch_size=64 → 12 patches, d_model=256)
        ↓
Learnable Positional Embeddings + CLS Token
        ↓
Transformer Encoder (4 layers, 8 heads, 1024-dim FFN, dropout=0.2)
        ↓
Classification Head (256 → 128 → 3 classes)
        ↓
Output: [negative, neutral, positive]
```
**Architecture Details:**
- Input: 768-dim embeddings (last hidden state from wav2vec2-large-xlsr-53, mean pooled across time)
- Patch Embedding: 256-dim with patch_size=64 → 12 patches
- Transformer Encoder: 4 layers, 8 attention heads (head_dim=32), 1024-dim FFN
- Dropout: 0.2
- Total Parameters: ~3M
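For concreteness, here is a minimal PyTorch sketch of an encoder matching the numbers above. It is reconstructed from this card, not the released implementation; the class name and internal details (e.g., the ReLU in the head) are assumptions.

```python
import torch
import torch.nn as nn

class EmotionValenceClassifier(nn.Module):
    """Illustrative reconstruction of the architecture described above."""

    def __init__(self, input_dim=768, patch_size=64, d_model=256, num_layers=4,
                 nhead=8, dim_feedforward=1024, dropout=0.2, num_classes=3):
        super().__init__()
        self.num_patches = input_dim // patch_size            # 768 / 64 = 12
        self.patch_embed = nn.Linear(patch_size, d_model)     # each 64-dim patch -> 256-dim
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Sequential(                            # 256 -> 128 -> 3
            nn.Linear(d_model, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 768) mean-pooled wav2vec2 embedding
        b = x.size(0)
        patches = x.view(b, self.num_patches, -1)             # (batch, 12, 64)
        tokens = self.patch_embed(patches)                    # (batch, 12, 256)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])          # classify via CLS token
```

Instantiated with these defaults, the sketch comes to roughly 3M parameters, consistent with the stated model size.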
## Intended Use
Classify emotional valence in speech audio into three categories:
- Negative: Sad, Angry, Fearful, Disgust
- Neutral: Neutral, Calm
- Positive: Happy, Surprised
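In code, this grouping amounts to a simple label-to-valence lookup. The mapping below is illustrative; the dataset's exact label strings may differ.

```python
# Illustrative emotion-to-valence mapping implied by the grouping above.
VALENCE_MAP = {
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
    "happy": "positive", "surprised": "positive",
}
```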
## Multilingual Support
Trained on wav2vec2 multi-layer embeddings covering 9 languages:
- English (eng) - 78.77% accuracy
- Chinese (cmn) - 95.00% accuracy
- German (deu) - 93.52% accuracy
- Urdu (urd) - 76.47% accuracy
- Portuguese (por) - 78.95% accuracy
- Greek (ell) - 69.57% accuracy
- Persian (pes) - 80.83% accuracy
- Estonian (est) - 55.28% accuracy
- French (fra) - 73.87% accuracy
## Training Data
- Original Dataset: Unified Multilingual Dataset of Emotional Human Utterances (GitHub)
- Source: https://github.com/michen00/unified_multilingual_dataset_of_emotional_human_utterances
- 83,545 audio samples (filtered and processed)
- 22 source emotion datasets combined: CREMA-D, RAVDESS, TESS, EmoDB, ShEMO, and more
- 9 languages: English, Chinese, German, Greek, Urdu, Estonian, French, Portuguese, Persian
- Pre-processed: 16kHz, mono, PCM 16-bit WAV
- Features: Wav2Vec2 Features (facebook/wav2vec2-large-xlsr-53, 768-dim last hidden state, mean pooled)
- Feature Extraction: Embeddings extracted with facebook/wav2vec2-large-xlsr-53 via the Vocametrix platform (see the sketch after this list)
- Sample Rate: Standardized to 16kHz via wav2vec2 preprocessor
- Valence Labels: 3-class (negative, neutral, positive)
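A minimal sketch of the feature-extraction step described above, using the Hugging Face `transformers` API. The file name and pooling choice are illustrative; the released `inference.py` is the authoritative pipeline.

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53").eval()

waveform, sr = torchaudio.load("speech.wav")          # any local PCM WAV file
waveform = waveform.mean(dim=0)                       # down-mix to mono
if sr != 16_000:                                      # standardize to 16 kHz
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    hidden = w2v(inputs.input_values).last_hidden_state  # (1, frames, hidden_dim)
embedding = hidden.mean(dim=1)                           # mean pool across time
```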
## Performance

**Test Set Results (16,709 samples):**
- Overall Accuracy: 86.25%
- Macro F1-Score: 84.60%
- Weighted F1-Score: 86.04%
- Classes: negative, neutral, positive
- Epochs Trained: 111
- Total Samples: 83,545 (80/20 stratified train/test split)
### Per-Class Performance
| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| negative     | 0.92      | 0.83   | 0.87     | 8858    |
| neutral      | 0.78      | 0.86   | 0.82     | 4452    |
| positive     | 0.74      | 0.84   | 0.79     | 3399    |
| macro avg    | 0.81      | 0.84   | 0.83     | 16709   |
| weighted avg | 0.87      | 0.86   | 0.86     | 16709   |
### Per-Language Performance
| Language         | Accuracy | F1-Score   |
|------------------|----------|------------|
| Chinese (cmn)    | 95.00%   | 91.34%     |
| German (deu)     | 93.52%   | 91.32%     |
| Persian (pes)    | 80.83%   | (see logs) |
| Portuguese (por) | 78.95%   | 75.64%     |
| English (eng)    | 78.77%   | (see logs) |
| Urdu (urd)       | 76.47%   | 69.87%     |
| French (fra)     | 73.87%   | (see logs) |
| Greek (ell)      | 69.57%   | (see logs) |
| Estonian (est)   | 55.28%   | (see logs) |

**Average Across Languages:** 84.83% accuracy, 80.77% F1-score
### Confusion Matrix
| Actual \ Predicted | negative | neutral | positive |
|--------------------|----------|---------|----------|
| negative           | 83%      | 9%      | 8%       |
| neutral            | 6%       | 86%     | 8%       |
| positive           | 12%      | 12%     | 84%      |
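These tables follow standard scikit-learn reporting. A sketch of how such numbers are computed; the toy `y_true`/`y_pred` arrays below stand in for the real test labels and predictions.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy placeholders for the test-set labels and model predictions.
y_true = np.array([0, 0, 1, 2, 1, 0, 2, 2])
y_pred = np.array([0, 1, 1, 2, 1, 0, 0, 2])

names = ["negative", "neutral", "positive"]
print(classification_report(y_true, y_pred, target_names=names))
# Row-normalized confusion matrix: rows = actual, columns = predicted.
print(confusion_matrix(y_true, y_pred, normalize="true"))
```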
## Training Configuration

**Model:**
- Patch Size: 64 → 12 patches
- d_model: 256, num_layers: 4, nhead: 8
- dim_feedforward: 1024, dropout: 0.2

**Hyperparameters:**
- Optimizer: AdamW (lr=2e-4, weight_decay=0.01)
- Loss: CrossEntropyLoss (label_smoothing=0.1, class-weighted)
- LR Schedule: Cosine annealing + 5% warmup
- Batch Size: 32, Epochs: 111
- Regularization: Gradient clipping (max_norm=1.0)
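Put together, this configuration corresponds to a training loop like the sketch below. The tensors and the linear placeholder model are toy stand-ins; only the hyperparameters are taken from this card.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins; in practice these are the scaled embeddings and valence labels.
features = torch.randn(256, 768)
labels = torch.randint(0, 3, (256,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)
model = torch.nn.Sequential(torch.nn.Linear(768, 3))  # placeholder for the classifier
class_weights = torch.ones(3)  # the real run used class weighting

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)

epochs = 111
total_steps = len(train_loader) * epochs
warmup_steps = int(0.05 * total_steps)  # 5% linear warmup

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing

scheduler = LambdaLR(optimizer, lr_lambda)

for epoch in range(epochs):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
```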
## Files
- emotion_classifier_complete.safetensors - Model weights in SafeTensors format (safer than PyTorch .pt files, no arbitrary code execution)
- emotion_classifier_scaler.pkl - Feature normalization (StandardScaler for 1024-dim input)
- config.json - Model architecture configuration
- requirements.txt - Python dependencies
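A minimal loading sketch for these files (illustrative; the released `inference.py` bundles the matching model class, so its state-dict keys are authoritative):

```python
import pickle
from safetensors.torch import load_file

# Load the weights without executing arbitrary code (unlike pickle-based .pt files).
state_dict = load_file("emotion_classifier_complete.safetensors")

# The scaler is a pickled sklearn StandardScaler, so only unpickle
# files obtained from a trusted source.
with open("emotion_classifier_scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
```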
## Usage

### Installation

Install required packages:

```bash
pip install torch torchaudio transformers huggingface-hub scikit-learn safetensors
```

Or use the complete requirements:

```bash
pip install -r requirements.txt
```

**Minimum Requirements:**
- `torch>=2.0.0` - PyTorch deep learning framework
- `torchaudio>=2.0.0` - Audio loading and processing (required for exact feature extraction)
- `transformers>=4.30.0` - Hugging Face Transformers (for Wav2Vec2)
- `safetensors>=0.3.0` - Safe model serialization format
- `huggingface-hub>=0.19.0` - Download models from Hugging Face
- `scikit-learn>=1.0.0` - Feature scaling (StandardScaler)
- `numpy>=1.20.0` - Numerical operations
### Quick Start

Download and run the inference script:

```bash
# Download inference script (model classes included)
wget https://huggingface.co/vocametrix/speech-emotion-valence-classifier/resolve/main/inference.py

# Run on your audio file
python inference.py speech.wav
```
The script is completely self-contained and will automatically download the model weights from HuggingFace on first run.
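The script can also be fetched programmatically; a sketch using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Downloads inference.py into the local Hugging Face cache and returns its path.
script_path = hf_hub_download(
    repo_id="vocametrix/speech-emotion-valence-classifier",
    filename="inference.py",
)
print(script_path)
```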
**Output Example:**

```
EMOTION CLASSIFICATION
Device: cuda

Loading model from HuggingFace...
✓ Model loaded (SafeTensors format)
✓ Scaler loaded
Loading Wav2Vec2 models...
✓ Wav2Vec2 loaded

Processing audio: speech.wav
✓ Audio loaded: 3.45s

Extracting features...
✓ Extracted 13 layers

Classifying emotion...

============================================================
RESULTS

Emotion: POSITIVE
Confidence: 87.3%

Probabilities:
  Negative: 5.2%
  Neutral:  7.5%
  Positive: 87.3%
```
## Training
The model was trained with:
- **Framework:** PyTorch
- **Optimizer:** AdamW with learning rate 2e-4, weight decay 0.01
- **Loss:** Cross-entropy with label smoothing (0.1) and class weighting
- **Regularization:** Gradient clipping (max_norm=1.0)
- **LR Schedule:** Cosine annealing with 5% warmup
- **Train/Test Split:** 80/20 stratified
- **Batch Size:** 32, Early Stopping (patience=30)
- **Epochs:** 67
See `kaggle_transformer_training.ipynb` for full training script.
## Model Characteristics
✅ **Multilingual:** 9 languages (English, Chinese, German, Urdu, Portuguese, Greek, Persian, Estonian, French)

✅ **Multi-Layer Features:** All 13 layers (1024-dim) from wav2vec2-large-xlsr-53 for richer representations

✅ **Transformer Architecture:** 4-layer encoder, 8 attention heads, patch embeddings

✅ **Strong Regularization:** Label smoothing, class weighting, gradient clipping

✅ **High Performance:** 86.25% accuracy, 84.60% macro F1-score on the test set

✅ **Stable Training:** Cosine annealing with warmup
## Limitations
- Single vector per audio (no temporal dynamics)
- Best performance on speech; music/singing untested
- 9-language training may not generalize to all languages
- Requires 16kHz audio and wav2vec2 multi-layer preprocessing
- Performance variance across languages (55%-95%)
## Bias & Fairness
- Dataset includes speakers from 9 languages with varied accents
- Performance varies by language (55%-95% accuracy range)
- Chinese and German show strongest performance (93-95%)
- Estonian shows lower performance (55%), may need language-specific tuning
- Gender/age representation varies by language
- Recommended to evaluate on domain-specific data before production use
## Ethical Considerations
- Model predictions should not be used for critical decisions affecting individuals
- Emotion classification from speech is inherently imperfect
- Consider user privacy when processing audio
- Disclose use of AI-based emotion analysis to users
- Be aware of cultural differences in emotion expression
## Related Models & Datasets
- **Base Model:** [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) - Multilingual XLSR Wav2Vec2
- **Dataset Source:** [Unified Multilingual Dataset](https://github.com/michen00/unified_multilingual_dataset_of_emotional_human_utterances) - GitHub repository with 87K+ multilingual emotion samples
- **Organization:** [Vocametrix on Hugging Face](https://huggingface.co/vocametrix)
## License
MIT License - See repository for full license text
## Citation
If you use this model, please cite:
```bibtex
@misc{vcmx-emotions-multilingual,
  author       = {Patrick Marmaroli},
  title        = {Multilingual Speech Emotion Valence Classifier},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/vocametrix/speech-emotion-valence-classifier}}
}

@inproceedings{wav2vec2-xlsr,
  title     = {Unsupervised Cross-lingual Representation Learning for Speech Recognition},
  author    = {Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
  booktitle = {Proc. Interspeech 2021},
  year      = {2021}
}

@article{transformer,
  title   = {Attention Is All You Need},
  author  = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
  journal = {Advances in Neural Information Processing Systems},
  year    = {2017}
}
```
## References

- wav2vec2 XLSR: https://arxiv.org/abs/2006.13979
- Transformers Library: https://huggingface.co/docs/transformers/
- Vocametrix Platform: https://github.com/pmarmaroli/vocametrix-platform
- Facebook Research: https://research.facebook.com/
## Repository
- Model: https://huggingface.co/vocametrix/speech-emotion-valence-classifier
- Organization: https://huggingface.co/vocametrix
- Platform Code: https://github.com/pmarmaroli/vocametrix-platform
Uploaded: 2025-11-14 17:11:20 UTC
Version: 3.0 (Transformer + Multilingual + Multi-Layer Features)