During RL training of MoE models, the precision discrepancy between the training and inference engines is more pronounced than for dense models. This gap widens progressively as sequence length and training steps increase, particularly during long-sequence generation and extended training runs. More critically, the original GRPO algorithm begins to break down within a limited number of training steps: the probability discrepancy for the same token between the training and inference phases gradually grows, and once this relative difference exceeds 5%, training effectively fails. This poses a significant challenge for long-horizon reinforcement learning with lengthy sequences.
To address this issue, we introduced a key solution: __distribution calibration via masked bidirectional truncation__, which effectively narrows the gap between training and inference.
- __Bidirectional Truncation__: We truncate not only tokens where the training probability is significantly higher than the inference probability, but also the reverse case, where the training probability is much lower.
- __Masking__: Tokens with excessively large discrepancies are excluded from gradient computation.
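The two mechanisms above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function name, the 5% masking threshold, and the clipping bounds are assumptions for the sake of the example.

```python
import numpy as np

def masked_bidirectional_truncation(p_train, p_infer,
                                    rel_threshold=0.05, clip_eps=0.02):
    """Illustrative sketch (names and thresholds are assumptions).

    p_train / p_infer: per-token probabilities of the sampled tokens
    under the training and inference engines, respectively.
    """
    p_train = np.asarray(p_train, dtype=np.float64)
    p_infer = np.asarray(p_infer, dtype=np.float64)

    # Relative train/inference discrepancy for each token.
    rel_diff = np.abs(p_train - p_infer) / p_infer

    # Masking: tokens whose discrepancy is excessively large are
    # excluded from the gradient computation entirely.
    mask = rel_diff <= rel_threshold

    # Bidirectional truncation: clip the train/inference probability
    # ratio from BOTH sides, i.e. also when p_train is much LOWER
    # than p_infer, not only when it is much higher.
    ratio = p_train / p_infer
    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    return clipped_ratio, mask
```

In a GRPO-style loss, `clipped_ratio` would replace the raw importance ratio and `mask` would zero out the contribution of the divergent tokens, which is what keeps training stable once the engines start to drift apart.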