Improve model card: add metadata, links, and sample usage for DeEAR
This PR significantly enhances the model card for the `FreedomIntelligence/DeEAR_Base` model by adding crucial metadata and detailed information.
Specifically, it:
- Adds `library_name: transformers` metadata, since `config.json` and `preprocessor_config.json` show the model is built from `transformers` components (e.g., `model_type: wav2vec2`, `Wav2Vec2FeatureExtractor`). This enables the automated "how to use" widget on the model page; a rough loading sketch based on these config hints is included below.
- Adds `pipeline_tag: audio-classification` to categorize the model for discovery on the Hugging Face Hub, reflecting its task of evaluating speech expressiveness.
- Links to the official Hugging Face paper page: [Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment](https://huggingface.co/papers/2510.20513).
- Provides a link to the project page: [https://freedomintelligence.github.io/ExpressiveSpeech/](https://freedomintelligence.github.io/ExpressiveSpeech/).
- Includes a link to the GitHub repository: [https://github.com/FreedomIntelligence/ExpressiveSpeech](https://github.com/FreedomIntelligence/ExpressiveSpeech).
- Integrates a comprehensive "Introduction", "Key Features", and "Framework Overview" directly from the GitHub README to better describe the model.
- Incorporates a "Sample Usage" section with code snippets directly from the GitHub README's "Quick Start" guide, enabling users to quickly get started with inference.
- Adds the BibTeX citation for proper academic attribution.
These updates aim to make the model more discoverable, easier to understand, and more user-friendly on the Hugging Face Hub.
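For reference, here is a minimal sketch of what loading the checkpoint through `transformers` could look like, based only on the config hints noted above (`model_type: wav2vec2`, `Wav2Vec2FeatureExtractor`). The expressiveness scoring head is applied by the project's own inference script, so this only illustrates loading the backbone and feature extractor and is not the official usage:

```python
# Hedged sketch: load the wav2vec2 backbone and feature extractor that the
# config files point to. The DeEAR scoring head lives in the project's own
# inference code, so this is illustrative only.
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "FreedomIntelligence/DeEAR_Base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # may warn about weights unused by the bare backbone

# 1 second of silence at 16 kHz as a stand-in for real audio
waveform = torch.zeros(16000).numpy()
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state
print(hidden_states.shape)
```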
Previously, the model card contained only minimal front matter:

```yaml
---
license: apache-2.0
---
```

The updated card is reproduced in full below.
---
license: apache-2.0
library_name: transformers
pipeline_tag: audio-classification
---

# DeEAR: Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment

This repository contains the DeEAR model as presented in the paper [Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment](https://huggingface.co/papers/2510.20513).

Project Page: [https://freedomintelligence.github.io/ExpressiveSpeech/](https://freedomintelligence.github.io/ExpressiveSpeech/)
Code Repository: [https://github.com/FreedomIntelligence/ExpressiveSpeech](https://github.com/FreedomIntelligence/ExpressiveSpeech)
Hugging Face Dataset: [FreedomIntelligence/ExpressiveSpeech](https://huggingface.co/datasets/FreedomIntelligence/ExpressiveSpeech)

<div align="center">
<img src="https://github.com/FreedomIntelligence/ExpressiveSpeech/raw/main/assets/Architecture.png" alt="DeEAR Framework Diagram" width="45%"/>
<br>
<em>Figure 1: The DeEAR Framework. (A) The training pipeline involves four stages: decomposition, sub-dimension modeling, learning a fusion function, and distillation. (B) Applications include data filtering and serving as a reward model.</em>
</div>

## Introduction
Recent speech-to-speech (S2S) models can generate intelligible speech but often lack natural expressiveness, largely due to the absence of a reliable evaluation metric. To address this, we present **DeEAR (Decoding the Expressive Preference of eAR)**, a novel framework that converts human preferences for speech expressiveness into an objective score.

Grounded in phonetics and psychology, DeEAR evaluates speech across three core dimensions: **Emotion**, **Prosody**, and **Spontaneity**. It achieves strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples.

Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. We applied DeEAR to build **ExpressiveSpeech**, a high-quality dataset, and used it to fine-tune an S2S model, improving its overall expressiveness score from 2.0 to 23.4 (on a 100-point scale).

## Key Features

* **Multi-dimensional Objective Scoring**: Decomposes speech expressiveness into quantifiable dimensions of Emotion, Prosody, and Spontaneity.
* **Strong Alignment with Human Perception**: Achieves a Spearman's Rank Correlation (SRCC) of **0.86** with human ratings for overall expressiveness.
* **Data-Efficient and Scalable**: Requires minimal annotated data, making it practical for deployment and scaling.
* **Dual Applications**:
  1. **Automated Model Benchmarking**: Ranks SOTA models in near-perfect agreement with human rankings (SRCC = **0.96**).
  2. **Evaluation-Driven Data Curation**: Efficiently filters and curates high-quality, expressive speech datasets.
* **Release of ExpressiveSpeech Dataset**: A new large-scale, bilingual (English-Chinese) dataset containing ~14,000 utterances of highly expressive speech.

## Quick Start (Inference)
To run inference with DeEAR, follow these steps:

1. **Clone the Repository**
   ```bash
   git clone https://github.com/FreedomIntelligence/ExpressiveSpeech.git
   cd ExpressiveSpeech
   ```

2. **Setup Environment**
   ```bash
   conda create -n DeEAR python=3.10
   conda activate DeEAR
   pip install -r requirements.txt
   conda install -c conda-forge ffmpeg
   ```

3. **Prepare Model**

   Download the DeEAR_Base model from [FreedomIntelligence/DeEAR_Base](https://huggingface.co/FreedomIntelligence/DeEAR_Base) and place it in the `./models/DeEAR_Base/` directory.
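   A rough way to fetch the checkpoint from Python is sketched below (this assumes the `huggingface_hub` package, which `transformers` already depends on; the `huggingface-cli download` command is an equivalent route):
   ```python
   # Hedged sketch: download the checkpoint into the layout expected above.
   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id="FreedomIntelligence/DeEAR_Base",
       local_dir="./models/DeEAR_Base",
   )
   ```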
4. **Run Inference**
   ```bash
   python inference.py \
       --model_dir ./models \
       --input_path /path/to/audio_folder \
       --output_file /path/to/save/my_scores.jsonl \
       --batch_size 64
   ```
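The exact fields written to the output JSONL are defined by `inference.py`; the rough post-processing sketch below just loads and prints a few rows without assuming any particular schema:

```python
# Hedged sketch: inspect the scores file written by inference.py.
# Field names are whatever the script emits; print a few rows to see them.
import json

with open("/path/to/save/my_scores.jsonl") as f:
    rows = [json.loads(line) for line in f]

print(f"{len(rows)} scored files")
for row in rows[:5]:
    print(row)
```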
## Citation

If you use our work in your research, please cite the following paper:

```bibtex
@article{lin2025decoding,
  title={Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment},
  author={Lin, Zhiyu and Yang, Jingwen and Zhao, Jiale and Liu, Meng and Li, Sunzhu and Wang, Benyou},
  journal={arXiv preprint arXiv:2510.20513},
  year={2025}
}
```