|
|
--- |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
pipeline_tag: audio-classification |
|
|
--- |
|
|
|
|
|
# DeEAR: Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment |
|
|
|
|
|
This repository contains the DeEAR model as presented in the paper [Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment](https://huggingface.co/papers/2510.20513). |
|
|
|
|
|
Project Page: [https://freedomintelligence.github.io/ExpressiveSpeech/](https://freedomintelligence.github.io/ExpressiveSpeech/) |
|
|
Code Repository: [https://github.com/FreedomIntelligence/ExpressiveSpeech](https://github.com/FreedomIntelligence/ExpressiveSpeech) |
|
|
Hugging Face Dataset: [FreedomIntelligence/ExpressiveSpeech](https://huggingface.co/datasets/FreedomIntelligence/ExpressiveSpeech) |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://github.com/FreedomIntelligence/ExpressiveSpeech/raw/main/assets/Architecture.png" alt="DeEAR Framework Diagram" width="45%"/> |
|
|
<br> |
|
|
<em>Figure 1: The DeEAR Framework. (A) The training pipeline involves four stages: decomposition, sub-dimension modeling, learning a fusion function, and distillation. (B) Applications include data filtering and serving as a reward model.</em> |
|
|
</div>
|
|
|
|
|
## Introduction |
|
|
|
|
|
Recent speech-to-speech (S2S) models can generate intelligible speech but often lack natural expressiveness, largely due to the absence of a reliable evaluation metric. To address this, we present **DeEAR (Decoding the Expressive Preference of eAR)**, a novel framework that converts human preferences for speech expressiveness into an objective score. |
|
|
|
|
|
Grounded in phonetics and psychology, DeEAR evaluates speech across three core dimensions: **Emotion**, **Prosody**, and **Spontaneity**. It achieves strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. |
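
For intuition, the fusion step can be pictured as a weighted combination of the three sub-dimension scores. The sketch below is purely illustrative: the score values and weights are made-up placeholders, not the released model's learned parameters (DeEAR learns its fusion function from the annotated preference data).

```python
# Hypothetical sub-dimension scores in [0, 1] from the three scorers
# (Emotion, Prosody, Spontaneity); the values here are made up.
sub_scores = {"emotion": 0.72, "prosody": 0.61, "spontaneity": 0.55}

# Illustrative fusion weights; DeEAR learns its actual fusion function
# from human preference annotations, so these are placeholders.
weights = {"emotion": 0.40, "prosody": 0.35, "spontaneity": 0.25}

overall = sum(weights[k] * sub_scores[k] for k in sub_scores)
print(f"overall expressiveness: {100 * overall:.1f} / 100")
```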
|
|
|
|
|
Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. We applied DeEAR to build **ExpressiveSpeech**, a high-quality expressive-speech dataset, and used that dataset to fine-tune an S2S model, raising its overall expressiveness score from 2.0 to 23.4 (on a 100-point scale).
|
|
|
|
|
## Key Features |
|
|
|
|
|
* **Multi-dimensional Objective Scoring**: Decomposes speech expressiveness into quantifiable dimensions of Emotion, Prosody, and Spontaneity. |
|
|
* **Strong Alignment with Human Perception**: Achieves a Spearman's rank correlation coefficient (SRCC) of **0.86** with human ratings for overall expressiveness.
|
|
* **Data-Efficient and Scalable**: Requires minimal annotated data, making it practical for deployment and scaling. |
|
|
* **Dual Applications**: |
|
|
1. **Automated Model Benchmarking**: Ranks SOTA models with near-perfect correlation (SRCC = **0.96**) to human rankings. |
|
|
    2. **Evaluation-Driven Data Curation**: Efficiently filters and curates high-quality, expressive speech datasets (a minimal filtering sketch follows this list).
|
|
* **Release of ExpressiveSpeech Dataset**: A new large-scale, bilingual (English-Chinese) dataset containing ~14,000 utterances of highly expressive speech. |
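
As a rough illustration of evaluation-driven curation, the sketch below keeps only utterances whose DeEAR score clears a threshold. The JSONL field names (`audio_path`, `score`) and the cutoff value are assumptions for illustration; check the actual output of `inference.py` for the real schema.

```python
import json

THRESHOLD = 60.0  # illustrative cutoff on the 100-point scale

# Assumed per-utterance records like {"audio_path": ..., "score": ...};
# the real output schema of inference.py may differ.
with open("my_scores.jsonl") as f:
    records = [json.loads(line) for line in f]

expressive = [r for r in records if r["score"] >= THRESHOLD]
print(f"kept {len(expressive)} / {len(records)} utterances")
```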
|
|
|
|
|
## Quick Start (Inference) |
|
|
|
|
|
To get started with DeEAR, follow these steps to perform inference: |
|
|
|
|
|
1. **Clone the Repository** |
|
|
   ```bash
   git clone https://github.com/FreedomIntelligence/ExpressiveSpeech.git
   cd ExpressiveSpeech
   ```
|
|
|
|
|
2. **Setup Environment** |
|
|
   ```bash
   conda create -n DeEAR python=3.10
   conda activate DeEAR
   pip install -r requirements.txt
   conda install -c conda-forge ffmpeg
   ```
|
|
|
|
|
3. **Prepare Model** |
|
|
   Download the DeEAR_Base model from [FreedomIntelligence/DeEAR_Base](https://huggingface.co/FreedomIntelligence/DeEAR_Base) and place it in the `./models/DeEAR_Base/` directory (a programmatic alternative is sketched after these steps).
|
|
|
|
|
4. **Run Inference** |
|
|
   ```bash
   python inference.py \
       --model_dir ./models \
       --input_path /path/to/audio_folder \
       --output_file /path/to/save/my_scores.jsonl \
       --batch_size 64
   ```
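
If you prefer to fetch the checkpoint programmatically (step 3), `huggingface_hub` can place it in the directory `inference.py` expects. This is a convenience sketch; the manual download described above works just as well.

```python
from huggingface_hub import snapshot_download

# Download the DeEAR_Base checkpoint into ./models/DeEAR_Base/,
# the location assumed by inference.py.
snapshot_download(
    repo_id="FreedomIntelligence/DeEAR_Base",
    local_dir="./models/DeEAR_Base",
)
```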
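
Once inference finishes (step 4), the scores can be inspected with a few lines of Python. The field names below are assumptions about the JSONL output; consult the actual file for the exact schema.

```python
import json

# Load per-file scores written by inference.py; the keys used here
# ("audio_path", "score") are assumed for illustration.
with open("/path/to/save/my_scores.jsonl") as f:
    rows = [json.loads(line) for line in f]

rows.sort(key=lambda r: r["score"], reverse=True)
for r in rows[:5]:  # five most expressive clips
    print(f'{r["score"]:6.2f}  {r["audio_path"]}')
```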
|
|
|
|
|
## Citation |
|
|
If you use our work in your research, please cite the following paper: |
|
|
```bibtex |
|
|
@article{lin2025decoding, |
|
|
title={Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment}, |
|
|
author={Lin, Zhiyu and Yang, Jingwen and Zhao, Jiale and Liu, Meng and Li, Sunzhu and Wang, Benyou}, |
|
|
journal={arXiv preprint arXiv:2510.20513}, |
|
|
year={2025} |
|
|
} |
|
|
``` |