# Speech Emotion Valence Classifier - Multilingual

A Transformer-based emotion classifier trained on multilingual wav2vec2 multi-layer embeddings to predict emotional valence in speech across 9 languages.

## Model Details

### Model Description

- **Architecture:** Transformer encoder with patch embedding and learnable positional embeddings
- **Input Features:** wav2vec2 embeddings (1024-dimensional, mean-pooled multi-layer features from wav2vec2-large-xlsr-53)
- **Pre-trained Base:** facebook/wav2vec2-large-xlsr-53 (multilingual)
- **Framework:** PyTorch
- **Task:** Audio emotion valence classification
- **Model Size:** ~3M parameters

## Model Architecture

```
Input (1024-dim wav2vec2 multi-layer features, mean pooled)
    ↓
Patch Embedding (patch_size=64 → 16 patches, d_model=256)
    ↓
Learnable Positional Embeddings + CLS Token
    ↓
Transformer Encoder (4 layers, 8 heads, 1024-dim FFN, dropout=0.2)
    ↓
Classification Head (256 → 128 → 3 classes)
    ↓
Output: [negative, neutral, positive]
```

**Architecture Details:**

- **Input:** 1024-dim embeddings (multi-layer features from wav2vec2-large-xlsr-53, mean pooled across layers and time)
- **Patch Embedding:** 256-dim with patch_size=64 → 16 patches
- **Transformer Encoder:** 4 layers, 8 attention heads (head_dim=32), 1024-dim FFN
- **Dropout:** 0.2
- **Total Parameters:** ~3M
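Under the assumption that the patch embedding slices the pooled feature vector into 64-dim chunks, the description above corresponds roughly to the following PyTorch sketch (class and attribute names such as `ValenceTransformer` are illustrative, not the repository's actual code):

```python
import torch
import torch.nn as nn

class ValenceTransformer(nn.Module):
    """Minimal reconstruction of the architecture described above (names illustrative)."""

    def __init__(self, input_dim=1024, patch_size=64, d_model=256, nhead=8,
                 num_layers=4, dim_feedforward=1024, dropout=0.2, num_classes=3):
        super().__init__()
        assert input_dim % patch_size == 0
        self.patch_size = patch_size
        num_patches = input_dim // patch_size          # 1024 / 64 = 16 patches

        # Each 64-dim slice of the pooled embedding becomes one "patch" token.
        self.patch_embed = nn.Linear(patch_size, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Classification head: 256 -> 128 -> 3 classes.
        self.head = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, num_classes))

    def forward(self, x):                              # x: (batch, input_dim)
        b = x.size(0)
        tokens = self.patch_embed(x.view(b, -1, self.patch_size))  # (batch, 16, 256)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
        encoded = self.encoder(tokens + self.pos_embed)
        return self.head(encoded[:, 0])                # logits from the CLS token
```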

## Intended Use

Classify emotional valence in speech audio into three categories (a mapping sketch follows the list):

- **Negative:** Sad, Angry, Fearful, Disgust
- **Neutral:** Neutral, Calm
- **Positive:** Happy, Surprised
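A minimal sketch of this label-to-valence mapping in Python (the exact label strings in the source dataset are assumptions):

```python
# Hypothetical mapping; the exact label strings in the source dataset may differ.
VALENCE_MAP = {
    "sad": "negative", "angry": "negative", "fearful": "negative", "disgust": "negative",
    "neutral": "neutral", "calm": "neutral",
    "happy": "positive", "surprised": "positive",
}
CLASS_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}  # matches the output order

def to_valence_id(emotion: str) -> int:
    """Map a raw emotion label to the 3-class valence id used by the model."""
    return CLASS_TO_ID[VALENCE_MAP[emotion.lower()]]
```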

## Multilingual Support

Trained on wav2vec2 multi-layer embeddings covering 9 languages:

- English (eng) - 78.77% accuracy
- Chinese (cmn) - 95.00% accuracy
- German (deu) - 93.52% accuracy
- Urdu (urd) - 76.47% accuracy
- Portuguese (por) - 78.95% accuracy
- Greek (ell) - 69.57% accuracy
- Persian (pes) - 80.83% accuracy
- Estonian (est) - 55.28% accuracy
- French (fra) - 73.87% accuracy

## Training Data

- **Original Dataset:** Unified Multilingual Dataset of Emotional Human Utterances (GitHub)
- **Features:** wav2vec2 features (facebook/wav2vec2-large-xlsr-53, 1024-dim multi-layer embeddings, mean pooled)
- **Feature Extraction:** Embeddings extracted using facebook/wav2vec2-large-xlsr-53 from Vocametrix (a sketch follows this list)
- **Sample Rate:** Standardized to 16 kHz via the wav2vec2 preprocessor
- **Valence Labels:** 3-class (negative, neutral, positive)
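A sketch of the feature extraction under these assumptions (averaging the first 13 wav2vec2 hidden states, then mean pooling over time; the repository's `inference.py` is the authoritative pipeline):

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "facebook/wav2vec2-large-xlsr-53"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
wav2vec2 = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

def extract_embedding(path: str) -> torch.Tensor:
    """Return a single 1024-dim utterance embedding (assumed pooling scheme)."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)                             # mix down to mono
    if sr != 16000:                                   # the model expects 16 kHz input
        wav = torchaudio.functional.resample(wav, sr, 16000)
    inputs = feature_extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = wav2vec2(**inputs, output_hidden_states=True)
    # Assumption: average the first 13 hidden states ("13 layers" in the logs),
    # then mean-pool over time to get one vector per utterance.
    stacked = torch.stack(out.hidden_states[:13])     # (13, 1, T, 1024)
    return stacked.mean(dim=(0, 2)).squeeze(0)        # (1024,)
```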

## Performance

**Test Set Results (16,709 samples):**

- **Overall Accuracy:** 86.25%
- **Macro F1-Score:** 84.60%
- **Weighted F1-Score:** 86.04%
- **Classes:** negative, neutral, positive
- **Epochs Trained:** 111
- **Training Samples:** 83,545

### Per-Class Performance

| Class        | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| negative     | 0.92      | 0.83   | 0.87     | 8858    |
| neutral      | 0.78      | 0.86   | 0.82     | 4452    |
| positive     | 0.74      | 0.84   | 0.79     | 3399    |
| macro avg    | 0.81      | 0.84   | 0.83     | 16709   |
| weighted avg | 0.87      | 0.86   | 0.86     | 16709   |

### Per-Language Performance

| Language         | Accuracy | F1         |
|------------------|----------|------------|
| Chinese (cmn)    | 95.00%   | 91.34%     |
| German (deu)     | 93.52%   | 91.32%     |
| Persian (pes)    | 80.83%   | (see logs) |
| Portuguese (por) | 78.95%   | 75.64%     |
| English (eng)    | 78.77%   | (see logs) |
| Urdu (urd)       | 76.47%   | 69.87%     |
| French (fra)     | 73.87%   | (see logs) |
| Greek (ell)      | 69.57%   | (see logs) |
| Estonian (est)   | 55.28%   | (see logs) |

Average across languages: 84.83% accuracy, 80.77% F1-score.

### Confusion Matrix

| Actual \ Predicted | negative | neutral | positive |
|--------------------|----------|---------|----------|
| negative           | 83%      | 9%      | 8%       |
| neutral            | 6%       | 86%     | 8%       |
| positive           | 12%      | 12%     | 84%      |

## Training Configuration

**Model:**

- Patch size: 64 → 16 patches
- d_model: 256, num_layers: 4, nhead: 8
- dim_feedforward: 1024, dropout: 0.2

**Hyperparameters:**

- Optimizer: AdamW (lr=2e-4, weight_decay=0.01)
- Loss: CrossEntropyLoss (label_smoothing=0.1, class-weighted)
- LR schedule: cosine annealing with 5% warmup
- Batch size: 32; epochs: 111
- Regularization: gradient clipping (max_norm=1.0)
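In PyTorch, that configuration corresponds roughly to the sketch below (the data loader and class weights are placeholders; `ValenceTransformer` is the architecture sketch from earlier):

```python
import math
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and weights; the real run used 83,545 training samples
# and class weighting.
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 1024), torch.randint(0, 3, (64,))),
    batch_size=32, shuffle=True)
class_weights = torch.ones(3)

model = ValenceTransformer()  # architecture sketch from above (illustrative)
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)
optimizer = optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)

epochs = 111
total_steps = epochs * len(train_loader)
warmup_steps = int(0.05 * total_steps)                 # 5% warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)             # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine annealing

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(epochs):
    for features, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()
```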

## Files

- `emotion_classifier_complete.safetensors` - Model weights in SafeTensors format (safer than PyTorch `.pt` files; no arbitrary code execution)
- `emotion_classifier_scaler.pkl` - Feature normalization (StandardScaler for the 1024-dim input)
- `config.json` - Model architecture configuration
- `requirements.txt` - Python dependencies
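Loading the published files could look roughly like this (a sketch; it assumes the parameter names of the architecture sketch above match the checkpoint, which only the repository's `inference.py` guarantees):

```python
import pickle
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "vocametrix/speech-emotion-valence-classifier"

weights_path = hf_hub_download(REPO, "emotion_classifier_complete.safetensors")
scaler_path = hf_hub_download(REPO, "emotion_classifier_scaler.pkl")

state_dict = load_file(weights_path)   # SafeTensors: plain tensors, no pickle execution
with open(scaler_path, "rb") as f:
    scaler = pickle.load(f)            # sklearn StandardScaler for the 1024-dim input

model = ValenceTransformer()           # architecture sketch from above (illustrative)
model.load_state_dict(state_dict)      # assumes matching parameter names
model.eval()
```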

## Usage

### Installation

Install the required packages:

```bash
pip install torch torchaudio transformers huggingface-hub scikit-learn safetensors
```

Or use the complete requirements:

```bash
pip install -r requirements.txt
```

**Minimum Requirements:**

- `torch>=2.0.0` - PyTorch deep learning framework
- `torchaudio>=2.0.0` - Audio loading and processing (required for exact feature extraction)
- `transformers>=4.30.0` - Hugging Face Transformers (for Wav2Vec2)
- `safetensors>=0.3.0` - Safe model serialization format
- `huggingface-hub>=0.19.0` - Download models from Hugging Face
- `scikit-learn>=1.0.0` - Feature scaling (StandardScaler)
- `numpy>=1.20.0` - Numerical operations

### Quick Start

Download and run the inference script:

```bash
# Download the inference script (model classes included)
wget https://huggingface.co/vocametrix/speech-emotion-valence-classifier/resolve/main/inference.py

# Run on your audio file
python inference.py speech.wav
```

The script is completely self-contained and automatically downloads the model weights from Hugging Face on first run.

**Output Example:**

```
EMOTION CLASSIFICATION

Device: cuda

Loading model from HuggingFace...
✓ Model loaded (SafeTensors format)
✓ Scaler loaded
Loading Wav2Vec2 models...
✓ Wav2Vec2 loaded

Processing audio: speech.wav
✓ Audio loaded: 3.45s

Extracting features...
✓ Extracted 13 layers

Classifying emotion...

============================================================
RESULTS

Emotion: POSITIVE
Confidence: 87.3%

Probabilities:
  Negative:  5.2%
  Neutral:   7.5%
  Positive: 87.3%
```
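For programmatic use, the sketches above combine into a small predict function (same assumptions apply; `inference.py` remains the reference implementation):

```python
import torch

LABELS = ["negative", "neutral", "positive"]  # output order from the model card

def predict(path: str) -> dict:
    emb = extract_embedding(path).numpy()             # (1024,) wav2vec2 features
    emb = scaler.transform(emb.reshape(1, -1))        # StandardScaler from the repo
    with torch.no_grad():
        logits = model(torch.from_numpy(emb).float())
        probs = torch.softmax(logits, dim=-1).squeeze(0)
    return {label: float(p) for label, p in zip(LABELS, probs)}

print(predict("speech.wav"))
# e.g. {'negative': 0.052, 'neutral': 0.075, 'positive': 0.873}
```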
## Training

The model was trained with:
- **Framework:** PyTorch
- **Optimizer:** AdamW with learning rate 2e-4, weight decay 0.01
- **Loss:** Cross-entropy with label smoothing (0.1) and class weighting
- **Regularization:** Gradient clipping (max_norm=1.0)
- **LR Schedule:** Cosine annealing with 5% warmup
- **Train/Test Split:** 80/20 stratified
- **Batch Size:** 32, Early Stopping (patience=30)
- **Epochs:** 111 trained

See `kaggle_transformer_training.ipynb` for full training script.

## Model Characteristics

✅ **Multilingual:** 9 languages (English, Chinese, German, Urdu, Portuguese, Greek, Persian, Estonian, French)  
✅ **Multi-Layer Features:** 13 hidden states (1024-dim) from wav2vec2-large-xlsr-53 for richer representations  
✅ **Transformer Architecture:** 4-layer encoder, 8 attention heads, patch embeddings  
✅ **Strong Regularization:** Label smoothing, class weighting, gradient clipping  
✅ **High Performance:** 86.25% accuracy, 84.60% macro F1 on the test set  
✅ **Stable Training:** Cosine annealing with warmup  

## Limitations

- Single vector per audio (no temporal dynamics)
- Best performance on speech; music/singing untested
- 9-language training may not generalize to all languages
- Requires 16 kHz audio and the wav2vec2 multi-layer preprocessing (a resampling sketch follows this list)
- Performance variance across languages (55%-95%)
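If your audio is not already 16 kHz, a minimal torchaudio resampling step (file names are placeholders):

```python
import torchaudio

wav, sr = torchaudio.load("speech_44k.wav")  # placeholder input file
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("speech_16k.wav", wav, 16000)
```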

## Bias & Fairness

- Dataset includes speakers from 9 languages with varied accents
- Performance varies by language (55%-95% accuracy range)
- Chinese and German show strongest performance (93-95%)
- Estonian shows lower performance (55%), may need language-specific tuning
- Gender/age representation varies by language
- Recommended to evaluate on domain-specific data before production use

## Ethical Considerations

- Model predictions should not be used for critical decisions affecting individuals
- Emotion classification from speech is inherently imperfect
- Consider user privacy when processing audio
- Disclose use of AI-based emotion analysis to users
- Be aware of cultural differences in emotion expression

## Related Models & Datasets

- **Base Model:** [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) - Multilingual XLSR Wav2Vec2
- **Dataset Source:** [Unified Multilingual Dataset](https://github.com/michen00/unified_multilingual_dataset_of_emotional_human_utterances) - GitHub repository with 87K+ multilingual emotion samples
- **Organization:** [Vocametrix on Hugging Face](https://huggingface.co/vocametrix)

## License

MIT License - See repository for full license text

## Citation

If you use this model, please cite:

```bibtex
@misc{vcmx-emotions-multilingual,
    author = {Patrick Marmaroli},
    title = {Multilingual Speech Emotion Valence Classifier},
    year = {2025},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/vocametrix/speech-emotion-valence-classifier}}
}

@inproceedings{wav2vec2-xlsr,
    title={Unsupervised Cross-lingual Representation Learning for Speech Recognition},
    author={Conneau, Alexis and Baevski, Alexei and Collobert, Ronan and Mohamed, Abdelrahman and Auli, Michael},
    booktitle={Proc. Interspeech 2021},
    year={2021}
}

@article{transformer,
    title={Attention Is All You Need},
    author={Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and others},
    journal={Advances in Neural Information Processing Systems},
    year={2017}
}
```

**Uploaded:** 2025-11-14 17:11:20 UTC  
**Version:** 3.0 (Transformer + Multilingual + Multi-Layer Features)
