ClusterMind Chaos Arena: LoRA adapter (GRPO)

Trained on the ClusterMind Chaos Arena environment with an SFT warm start followed by online RL (GRPO). The base weights are frozen; only the LoRA adapter is updated (r=8, target_modules=["q_proj","v_proj"]).

Training stack

  • Model load + LoRA: Unsloth FastLanguageModel when available, otherwise transformers + bitsandbytes 4-bit + peft (this run used the transformers path)
  • SFT phase: trl.SFTTrainer
  • RL phase: in-tree GRPO/PPO/REINFORCE loop. TRL's GRPOTrainer runs out of memory on a T4 because it holds the computation graphs of all K trajectories at once; the in-tree loop is two-phase: rollouts are collected under no-grad, then the backward pass runs one step at a time.
  • Hub push: push_to_hub (model/tokenizer method) + huggingface_hub.upload_file
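
To make the RL phase more concrete, here is a minimal sketch of the group-relative advantage step that sits between rollout collection and the backward pass in an episode-level GRPO loop. It is not the in-tree implementation: the function name, the use of a population standard deviation, the `eps` constant, and the sample rewards are all assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize K episode rewards within one group: (r - mean) / (std + eps).

    Hypothetical helper: whether the in-tree loop divides by the standard
    deviation (or only subtracts the mean) is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the K rollouts
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for K = 4 rollouts of the same scenario,
# collected in the no-grad phase.
episode_rewards = [8.0, 12.0, 10.0, 10.0]
advantages = group_relative_advantages(episode_rewards)

# In the second phase, each step's log-probability would be re-computed with
# gradients enabled and weighted by its episode's advantage, one step at a
# time, so only a single step's graph is alive during backward.
```

Rollouts that beat the group mean get a positive advantage and the rest get a negative one, so no separate value network is needed.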

Training summary

field             value
base model        Qwen/Qwen2.5-0.5B-Instruct
engine            transformers
SFT trainer       trl.SFTTrainer
RL algo           grpo (auto: trl present -> using episode-level GRPO)
trainable params  540,672 of roughly 0.49B (about 0.11%); the logged total 11.973056694274142 and ratio 4515739.08% are a reporting error
SFT episodes      16
RL episodes       24
eval episodes     8
eval mean reward  10.46
frozen base       True
lora only         True
quick mode        True
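
The trainable-parameter count of 540,672 in the summary above can be cross-checked by hand. The sketch below assumes the layer dimensions from the public Qwen2.5-0.5B configuration (hidden size 896, 24 layers, 2 key/value heads of dimension 64); those numbers are not stated in this card.

```python
# Assumed Qwen2.5-0.5B dimensions (from its public config, not from this card).
HIDDEN_SIZE = 896
NUM_LAYERS = 24
NUM_KV_HEADS = 2
HEAD_DIM = 64
RANK = 8  # LoRA r, as stated in this card


def lora_params(in_features, out_features, rank):
    """A LoRA pair adds A (rank x in) and B (out x rank) to one linear layer."""
    return rank * (in_features + out_features)


q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)              # 896 -> 896
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)  # 896 -> 128
total_trainable = NUM_LAYERS * (q_proj + v_proj)

print(f"{total_trainable:,}")
```

Under these assumptions the total matches the 540,672 trainable parameters reported in the summary.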

How to load

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen2.5-0.5B-Instruct"    # frozen base model
adapter = "Kabs-123/clustermind-lora"  # this repo (LoRA adapter weights)

tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
# Attach the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(model, adapter)

Files in this repo

  • adapter_model.safetensors: LoRA weights
  • adapter_config.json: LoRA config (r, alpha, target modules)
  • tokenizer.json etc.: tokenizer of the base model
  • training_logs.jsonl: per-step reward + loss + metrics
  • trained_results.json: full training summary
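
As a rough illustration of how training_logs.jsonl could be inspected, the sketch below parses JSON Lines records and averages a reward field. The field names ("step", "reward", "loss") and the sample records are assumptions; check the actual file for its schema.

```python
import json
from statistics import mean


def summarize_log(lines, reward_key="reward"):
    """Parse JSON Lines records and report the step count and mean reward.

    `reward_key` is a hypothetical field name; adjust it to the real schema.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    rewards = [rec[reward_key] for rec in records if reward_key in rec]
    return {
        "steps": len(records),
        "mean_reward": mean(rewards) if rewards else None,
    }


# Made-up sample records standing in for the real log file.
sample = [
    '{"step": 0, "reward": 1.5, "loss": 0.9}',
    '{"step": 1, "reward": 2.5, "loss": 0.7}',
]
summary = summarize_log(sample)

# With the real file, pass an open handle instead:
#   with open("training_logs.jsonl", encoding="utf-8") as fh:
#       summary = summarize_log(fh)
```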

Evaluation

The trained agent is benchmarked against five heuristic baselines on 8 chaos scenarios at curriculum levels 3–5. See trained_results.json for the full eval breakdown.
