Rubrics as an Attack Surface (RIPD)
Collection
This collection releases the official artifacts accompanying “Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges.” It contains 10 items.
This checkpoint is part of the artifact release for
“Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges.”
It is a policy model trained under the seed rubric condition to study how evaluation-time preference drift propagates into downstream alignment.
In the seed condition, the preference labels are generated by an LLM judge under the seed rubric variant.
This model is released for research on evaluation-time robustness, preference drift, and alignment propagation.
It is not intended for production deployment.