Improve model card: Add metadata tags, correct license, and expand content
This PR significantly improves the model card for DataMind-Qwen2.5-14B by:
- Correcting the `license` to `apache-2.0` as explicitly stated in the associated GitHub repository.
- Adding `pipeline_tag: text-generation` to ensure proper categorization and enable the "Use in Transformers" widget on the Hugging Face Hub.
- Adding `library_name: transformers` to indicate compatibility with the Hugging Face Transformers library.
- Integrating the paper's abstract, a direct link to the paper on Hugging Face Papers, and the project's GitHub repository link for easy access to critical information.
- Incorporating additional sections from the official GitHub README, including "News", "Training" details, and "Contributors", to provide a more comprehensive and useful overview of the DataMind project.
- Updating the "Evaluation" section to include the data download script, aligning with the latest information from the GitHub repository.
These updates aim to provide a more informative and user-friendly model card for researchers and developers.
---
base_model:
- Qwen/Qwen2.5-14B-Instruct
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

<h1 align="center"> ✨ DataMind </h1>

This repository contains the **DataMind** model, a fine-tuned Qwen2.5-14B-Instruct model presented in the paper [Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://huggingface.co/papers/2506.19794).

Code: [https://github.com/zjunlp/DataMind](https://github.com/zjunlp/DataMind)

## Abstract

Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

## 🔔 News

- **[2025-06]** We release a new paper: "[Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://arxiv.org/pdf/2506.19794)".

## 🔧 Installation

```bash
conda activate DataMind
pip install -r requirements.txt
```
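Since the card declares `library_name: transformers` and `pipeline_tag: text-generation`, a minimal usage sketch might look like the following. The prompt shape and helper names are our illustration, not the repository's evaluation harness; only the Hub model ID comes from this card.

```python
MODEL_ID = "zjunlp/DataMind-Qwen2.5-14B"

def build_messages(question: str) -> list:
    # Illustrative chat shape only: the exact system prompt used for
    # evaluation lives in the eval scripts of the GitHub repository.
    return [
        {"role": "system", "content": "You are a helpful data-analysis assistant."},
        {"role": "user", "content": question},
    ]

def generate(question: str, max_new_tokens: int = 512) -> str:
    # transformers is imported lazily so the sketch can be read without it installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    # Render the chat with Qwen's chat template, then decode only the new tokens.
    text = tokenizer.apply_chat_template(
        build_messages(question), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```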
## 💻 Training

Our model training was completed using the powerful and user-friendly **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** framework, which provided us with an efficient fine-tuning workflow.

##### 1. Training Data

Our training dataset is available in `train/datamind-da-dataset.json`.

##### 2. Training Configuration

The following is an example configuration for full-parameter fine-tuning using DeepSpeed ZeRO-3. You can save it as a YAML file (e.g., `datamind_sft.yaml`).

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct # Or Qwen/Qwen2.5-14B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
flash_attn: fa2

### dataset
dataset: datamind-da-dataset
template: qwen
cutoff_len: 8192
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: checkpoints/your-model-name
logging_steps: 1
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
report_to: none

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```

##### 3. Launch Training

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train datamind_sft.yaml
```
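Before launching, the dataset file referenced by `dataset: datamind-da-dataset` can be given a quick sanity check. A minimal sketch (the helper name is ours; the exact record fields are defined by the repo's LLaMA-Factory dataset registration, so only the top-level shape is checked):

```python
import json

def load_training_dataset(path: str) -> list:
    """Load an SFT dataset file such as train/datamind-da-dataset.json.

    LLaMA-Factory datasets are JSON lists of example records; we only
    verify the top-level shape here, not the per-record schema.
    """
    with open(path) as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of training examples")
    return data
```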
## 🧐 Evaluation

> Note:
>
> - **Ensure** that your working directory is set to the **`eval`** folder in a virtual environment.
> - If you have more questions, feel free to open an issue with us.
> - If you need to use a local model, deploy it according to **(Optional) `local_model.sh`**.

**Step 1: Download the evaluation datasets and our SFT models**

The evaluation datasets we used are from [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench). The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.

You can also download our SFT models directly from Hugging Face: [DataMind-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-7B), [DataMind-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Qwen2.5-14B).

You can use the following script to download the datasets:

```bash
bash download_eval_data.sh
```
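After downloading, you can verify that the files landed where the eval scripts expect them. A small sketch under the path layout above (the helper is our illustration, not part of the repo):

```python
import glob
import os

def count_eval_csvs(data_root: str) -> dict:
    """Count CSV files at the two locations named in Step 1 above."""
    return {
        "QRData": len(glob.glob(os.path.join(data_root, "QRData", "benchmark", "data", "*.csv"))),
        "DiscoveryBench": len(glob.glob(os.path.join(data_root, "DiscoveryBench", "*.csv"))),
    }
```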
**Step 2: Prepare the parameter configuration**

Here is an example:

**`config.yaml`**

```yaml
api_key: your_api_key # Your API key for models served via an API. Not needed for open-source models.
data_root: /path/to/your/project/DataMind/eval/data # Root directory for the data (must be an absolute path).
```
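Any YAML loader will read this file; as a dependency-free sketch of what the eval scripts need from it (the function name and validation are our assumptions, not the repo's code):

```python
import os

def load_eval_config(path: str) -> dict:
    """Parse the flat `key: value` pairs used by config.yaml above.

    A real setup would use PyYAML; this minimal stand-in handles only
    flat scalar keys and inline comments, which is all this file needs.
    """
    cfg = {}
    with open(path) as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()  # drop inline comments
            if not line or ":" not in line:
                continue
            key, value = line.split(":", 1)
            cfg[key.strip()] = value.strip()
    # The README stresses that data_root must be an absolute path.
    if "data_root" in cfg and not os.path.isabs(cfg["data_root"]):
        raise ValueError("data_root must be an absolute path")
    return cfg
```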
**`run_eval.sh`**

```bash
CUDA_VISIBLE_DEVICES=$i python -m vllm.entrypoints.openai.api_server \
    ... \
    --port $port # API port number, which is consistent with the `api_port` above.
```
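The command above starts vLLM's OpenAI-compatible server, so a deployed model answers at `/v1/chat/completions`. A minimal stdlib client sketch (helper names are ours; the eval harness performs these calls internally):

```python
import json
from urllib import request

def build_chat_request(model: str, question: str, temperature: float = 0.0) -> dict:
    """Payload for the OpenAI-compatible chat-completions route served by
    vllm.entrypoints.openai.api_server (standard OpenAI chat schema)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
    }

def query_local_model(port: int, payload: dict) -> str:
    # Assumes a vLLM server started as in run_eval.sh is listening on `port`.
    req = request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```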
**Step 3: Run the shell script**

**(Optional)** Deploy the local model if you need to.

Run the shell script to start the process.

```bash
bash run_eval.sh
```
## 🎉 Contributors

<a href="https://github.com/zjunlp/DataMind/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=zjunlp/DataMind" />
</a>

We deeply appreciate the collaborative efforts of everyone involved. We will continue to enhance and maintain this repository over the long term. If you encounter any issues, feel free to submit them to us!

## ✍️ Citation

If you find our work helpful, please use the following citations.

```bibtex
@article{zhu2025open,
  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},
  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},