# Speech Emotion Valence Classifier - Multilingual

A Transformer-based emotion classifier trained on multilingual wav2vec2 multi-layer embeddings to predict emotional valence in speech across 9 languages.

## Model Details

### Model Description

- **Architecture:** Transformer encoder with patch embedding and learnable positional embeddings
- **Input Features:** wav2vec2 embeddings (1024-dimensional, mean-pooled multi-layer features from wav2vec2-large-xlsr-53)
- **Pre-trained Base:** facebook/wav2vec2-large-xlsr-53 (multilingual)
- **Framework:** PyTorch
- **Task:** Audio emotion valence classification
- **Model Size:** ~3M parameters

## Model Architecture

```
Input (1024-dim wav2vec2 multi-layer features, mean pooled)
    ↓
Patch Embedding (patch_size=64 → 16 patches, d_model=256)
    ↓
Learnable Positional Embeddings + CLS Token
    ↓
Transformer Encoder (4 layers, 8 heads, 1024-dim FFN, dropout=0.2)
    ↓
Classification Head (256 → 128 → 3 classes)
    ↓
Output: [negative, neutral, positive]
```

**Architecture Details:**

- **Input:** 1024-dim embeddings (multi-layer features from wav2vec2-large-xlsr-53, mean pooled across layers and time)
- **Patch Embedding:** 256-dim with patch_size=64 → 16 patches
- **Transformer Encoder:** 4 layers, 8 attention heads (head_dim=32), 1024-dim FFN
- **Dropout:** 0.2
- **Total Parameters:** ~3M
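Under the assumption that the patch embedding slices the pooled feature vector into 64-dim chunks, the description above corresponds roughly to the following PyTorch sketch (class and attribute names such as `ValenceTransformer` are illustrative, not the repository's actual code):

```python
import torch
import torch.nn as nn

class ValenceTransformer(nn.Module):
    """Minimal reconstruction of the architecture described above (names illustrative)."""

    def __init__(self, input_dim=1024, patch_size=64, d_model=256, nhead=8,
                 num_layers=4, dim_feedforward=1024, dropout=0.2, num_classes=3):
        super().__init__()
        assert input_dim % patch_size == 0
        self.patch_size = patch_size
        num_patches = input_dim // patch_size          # 1024 / 64 = 16 patches

        # Each 64-dim slice of the pooled embedding becomes one "patch" token.
        self.patch_embed = nn.Linear(patch_size, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Classification head: 256 -> 128 -> 3 classes.
        self.head = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, num_classes))

    def forward(self, x):                              # x: (batch, input_dim)
        b = x.size(0)
        tokens = self.patch_embed(x.view(b, -1, self.patch_size))  # (batch, 16, 256)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        encoded = self.encoder(tokens + self.pos_embed)
        return self.head(encoded[:, 0])                # logits from the CLS token
```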

## Intended Use

Classify emotional valence in speech audio into three categories (a mapping sketch follows the list):

- **Negative:** Sad, Angry, Fearful, Disgust
- **Neutral:** Neutral, Calm
- **Positive:** Happy, Surprised
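A minimal sketch of this label-to-valence mapping in Python (the exact label strings in the source dataset are assumptions):

```python
# Hypothetical mapping; the exact label strings in the source dataset may differ.
VALENCE_MAP = {
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
    "happy": "positive", "surprised": "positive",
}
CLASS_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}  # matches the output order

def to_valence_id(emotion: str) -> int:
    """Map a raw emotion label to the 3-class valence id used by the model."""
    return CLASS_TO_ID[VALENCE_MAP[emotion.lower()]]
```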

## Multilingual Support

Trained on wav2vec2 multi-layer embeddings covering 9 languages:

- English (eng) - 78.77% accuracy
- Chinese (cmn) - 95.00% accuracy
- German (deu) - 93.52% accuracy
- Urdu (urd) - 76.47% accuracy
- Portuguese (por) - 78.95% accuracy
- Greek (ell) - 69.57% accuracy
- Persian (pes) - 80.83% accuracy
- Estonian (est) - 55.28% accuracy
- French (fra) - 73.87% accuracy

## Training Data

- **Original Dataset:** Unified Multilingual Dataset of Emotional Human Utterances (GitHub)
- **Features:** wav2vec2 features (facebook/wav2vec2-large-xlsr-53, 1024-dim multi-layer embeddings, mean pooled)
- **Feature Extraction:** Embeddings extracted using facebook/wav2vec2-large-xlsr-53 from Vocametrix (a sketch follows this list)
- **Sample Rate:** Standardized to 16 kHz via the wav2vec2 preprocessor
- **Valence Labels:** 3-class (negative, neutral, positive)
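A sketch of the feature extraction under these assumptions (averaging the first 13 wav2vec2 hidden states, then mean pooling over time; the repository's `inference.py` is the authoritative pipeline):

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "facebook/wav2vec2-large-xlsr-53"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
wav2vec2 = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

def extract_embedding(path: str) -> torch.Tensor:
    """Return a single 1024-dim utterance embedding (assumed pooling scheme)."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)                             # mix down to mono
    if sr != 16000:                                   # the model expects 16 kHz input
        wav = torchaudio.functional.resample(wav, sr, 16000)
    inputs = feature_extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = wav2vec2(**inputs, output_hidden_states=True)
    # Assumption: average the first 13 hidden states ("13 layers" in the logs),
    # then mean-pool over time to get one vector per utterance.
    stacked = torch.stack(out.hidden_states[:13])     # (13, 1, T, 1024)
    return stacked.mean(dim=(0, 2)).squeeze(0)        # (1024,)
```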

## Performance

**Test Set Results (16,709 samples):**

- **Overall Accuracy:** 86.25%
- **Macro F1-Score:** 84.60%
- **Weighted F1-Score:** 86.04%
- **Classes:** negative, neutral, positive
- **Epochs Trained:** 111
- **Training Samples:** 83,545

### Per-Class Performance

| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| negative     | 0.92      | 0.83   | 0.87     | 8858    |
| neutral      | 0.78      | 0.86   | 0.82     | 4452    |
| positive     | 0.74      | 0.84   | 0.79     | 3399    |
| macro avg    | 0.81      | 0.84   | 0.83     | 16709   |
| weighted avg | 0.87      | 0.86   | 0.86     | 16709   |

### Per-Language Performance

| Language         | Accuracy | F1         |
|------------------|----------|------------|
| Chinese (cmn)    | 95.00%   | 91.34%     |
| German (deu)     | 93.52%   | 91.32%     |
| Persian (pes)    | 80.83%   | (see logs) |
| Portuguese (por) | 78.95%   | 75.64%     |
| English (eng)    | 78.77%   | (see logs) |
| Urdu (urd)       | 76.47%   | 69.87%     |
| French (fra)     | 73.87%   | (see logs) |
| Greek (ell)      | 69.57%   | (see logs) |
| Estonian (est)   | 55.28%   | (see logs) |

Average across languages: 84.83% accuracy, 80.77% F1-score.

### Confusion Matrix

| Actual \ Predicted | negative | neutral | positive |
|--------------------|----------|---------|----------|
| negative           | 83%      | 9%      | 8%       |
| neutral            | 6%       | 86%     | 8%       |
| positive           | 12%      | 12%     | 84%      |

## Training Configuration

**Model:**

- Patch size: 64 → 16 patches
- d_model: 256, num_layers: 4, nhead: 8
- dim_feedforward: 1024, dropout: 0.2

**Hyperparameters:**

- Optimizer: AdamW (lr=2e-4, weight_decay=0.01)
- Loss: CrossEntropyLoss (label_smoothing=0.1, class-weighted)
- LR schedule: cosine annealing with 5% warmup
- Batch size: 32; epochs: 111
- Regularization: gradient clipping (max_norm=1.0)
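In PyTorch, that configuration corresponds roughly to the sketch below (the data loader and class weights are placeholders; `ValenceTransformer` is the architecture sketch from earlier):

```python
import math
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and weights; the real run used 83,545 training samples
# and class weighting.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 1024), torch.randint(0, 3, (64,))),
    batch_size=32, shuffle=True)
class_weights = torch.ones(3)

model = ValenceTransformer()  # architecture sketch from above (illustrative)
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)
optimizer = optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)

epochs = 111
total_steps = epochs * len(train_loader)
warmup_steps = int(0.05 * total_steps)                 # 5% warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)             # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(epochs):
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()
```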

## Files

- `emotion_classifier_complete.safetensors` - Model weights in SafeTensors format (safer than PyTorch `.pt` files; no arbitrary code execution)
- `emotion_classifier_scaler.pkl` - Feature normalization (StandardScaler for the 1024-dim input)
- `config.json` - Model architecture configuration
- `requirements.txt` - Python dependencies
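Loading the published files could look roughly like this (a sketch; it assumes the parameter names of the architecture sketch above match the checkpoint, which only the repository's `inference.py` guarantees):

```python
import pickle
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "vocametrix/speech-emotion-valence-classifier"

weights_path = hf_hub_download(REPO, "emotion_classifier_complete.safetensors")
scaler_path = hf_hub_download(REPO, "emotion_classifier_scaler.pkl")

state_dict = load_file(weights_path)   # SafeTensors: plain tensors, no pickle execution
with open(scaler_path, "rb") as f:
    scaler = pickle.load(f)            # sklearn StandardScaler for the 1024-dim input

model = ValenceTransformer()           # architecture sketch from above (illustrative)
model.load_state_dict(state_dict)      # assumes matching parameter names
model.eval()
```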

## Usage

### Installation

Install the required packages:

```bash
pip install torch torchaudio transformers huggingface-hub scikit-learn safetensors
```

Or use the complete requirements:

```bash
pip install -r requirements.txt
```

**Minimum Requirements:**

- `torch>=2.0.0` - PyTorch deep learning framework
- `torchaudio>=2.0.0` - Audio loading and processing (required for exact feature extraction)
- `transformers>=4.30.0` - Hugging Face Transformers (for Wav2Vec2)
- `safetensors>=0.3.0` - Safe model serialization format
- `huggingface-hub>=0.19.0` - Download models from Hugging Face
- `scikit-learn>=1.0.0` - Feature scaling (StandardScaler)
- `numpy>=1.20.0` - Numerical operations

### Quick Start

Download and run the inference script:

```bash
# Download the inference script (model classes included)
wget https://huggingface.co/vocametrix/speech-emotion-valence-classifier/resolve/main/inference.py

# Run on your audio file
python inference.py speech.wav
```

The script is completely self-contained and automatically downloads the model weights from Hugging Face on first run.

**Output Example:**

```
EMOTION CLASSIFICATION

Device: cuda

Loading model from HuggingFace...
✓ Model loaded (SafeTensors format)
✓ Scaler loaded
Loading Wav2Vec2 models...
✓ Wav2Vec2 loaded

Processing audio: speech.wav
✓ Audio loaded: 3.45s

Extracting features...
✓ Extracted 13 layers

Classifying emotion...

============================================================
RESULTS

Emotion: POSITIVE
Confidence: 87.3%

Probabilities:
  Negative:  5.2%
  Neutral:   7.5%
  Positive: 87.3%
```
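For programmatic use, the sketches above combine into a small predict function (same assumptions apply; `inference.py` remains the reference implementation):

```python
import torch

LABELS = ["negative", "neutral", "positive"]  # output order from the model card

def predict(path: str) -> dict:
    emb = extract_embedding(path).numpy()             # (1024,) wav2vec2 features
    emb = scaler.transform(emb.reshape(1, -1))        # StandardScaler from the repo
    with torch.no_grad():
        logits = model(torch.from_numpy(emb).float())
        probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {label: float(p) for label, p in zip(LABELS, probs)}

print(predict("speech.wav"))
# e.g. {'negative': 0.052, 'neutral': 0.075, 'positive': 0.873}
```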
## Training

The model was trained with:
- **Framework:** PyTorch
- **Optimizer:** AdamW with learning rate 2e-4, weight decay 0.01
- **Loss:** Cross-entropy with label smoothing (0.1) and class weighting
- **Regularization:** Gradient clipping (max_norm=1.0)
- **LR Schedule:** Cosine annealing with 5% warmup
- **Train/Test Split:** 80/20 stratified
- **Batch Size:** 32, Early Stopping (patience=30)
- **Epochs:** 111 trained

See `kaggle_transformer_training.ipynb` for full training script.

## Model Characteristics

✅ **Multilingual:** 9 languages (English, Chinese, German, Urdu, Portuguese, Greek, Persian, Estonian, French)  
✅ **Multi-Layer Features:** 13 hidden states (1024-dim) from wav2vec2-large-xlsr-53 for richer representations  
✅ **Transformer Architecture:** 4-layer encoder, 8 attention heads, patch embeddings  
✅ **Strong Regularization:** Label smoothing, class weighting, gradient clipping  
✅ **High Performance:** 86.25% accuracy, 84.60% macro F1 on the test set  
✅ **Stable Training:** Cosine annealing with warmup  

## Limitations

- Single vector per audio (no temporal dynamics)
- Best performance on speech; music/singing untested
- 9-language training may not generalize to all languages
- Requires 16 kHz audio and the wav2vec2 multi-layer preprocessing (a resampling sketch follows this list)
- Performance variance across languages (55%-95%)
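If your audio is not already 16 kHz, a minimal torchaudio resampling step (file names are placeholders):

```python
import torchaudio

wav, sr = torchaudio.load("speech_44k.wav")  # placeholder input file
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("speech_16k.wav", wav, 16000)
```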

## Bias & Fairness

- Dataset includes speakers from 9 languages with varied accents
- Performance varies by language (55%-95% accuracy range)
- Chinese and German show strongest performance (93-95%)
- Estonian shows lower performance (55%), may need language-specific tuning
- Gender/age representation varies by language
- Recommended to evaluate on domain-specific data before production use

## Ethical Considerations

- Model predictions should not be used for critical decisions affecting individuals
- Emotion classification from speech is inherently imperfect
- Consider user privacy when processing audio
- Disclose use of AI-based emotion analysis to users
- Be aware of cultural differences in emotion expression

## Related Models & Datasets

- **Base Model:** [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) - Multilingual XLSR Wav2Vec2
- **Dataset Source:** [Unified Multilingual Dataset](https://github.com/michen00/unified_multilingual_dataset_of_emotional_human_utterances) - GitHub repository with 87K+ multilingual emotion samples
- **Organization:** [Vocametrix on Hugging Face](https://huggingface.co/vocametrix)

## License

MIT License - See repository for full license text

## Citation

If you use this model, please cite:

```bibtex
@misc{vcmx-emotions-multilingual,
    author = {Patrick Marmaroli},
    title = {Multilingual Speech Emotion Valence Classifier},
    year = {2025},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/vocametrix/speech-emotion-valence-classifier}}
}

@inproceedings{wav2vec2-xlsr,
    title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
    author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
    booktitle={Proc. Interspeech 2021},
    year={2021}
}

@article{transformer,
    title={Attention Is All You Need},
    author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
    journal={Advances in Neural Information Processing Systems},
    year={2017}
}
```

**Uploaded:** 2025-11-14 17:11:20 UTC  
**Version:** 3.0 (Transformer + Multilingual + Multi-Layer Features)
