This is the Qwen2.5-3B model trained with the GRPO Ground Truth (GT) method on the MATH training set, as presented in "Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models".

The Co-rewarding framework is a novel self-supervised RL approach that improves training stability by seeking complementary supervision from another view. It aims to enhance the reasoning ability of Large Language Models (LLMs) and addresses issues such as the training collapse observed in other self-rewarding methods.

If you are interested in Co-rewarding, you can find more details in our GitHub repository.
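Below is a minimal usage sketch for loading the released checkpoint with Hugging Face Transformers; the chat-style prompt, the example question, and the generation settings are illustrative assumptions rather than part of the official evaluation setup.

```python
# Minimal inference sketch (assumptions: standard Transformers chat-template usage,
# greedy decoding, and an arbitrary example problem).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/GT-Qwen2.5-3B-MATH"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",   # checkpoint is stored in BF16
    device_map="auto",
)

# Hypothetical math question used only to illustrate the prompt format.
messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens (the model's reasoning and answer).
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```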

Citation

@article{zhang2025coreward,
  title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
  author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
  journal={arXiv preprint arXiv:2508.00410},
  year={2025},
}