Reason-mxbai-colbert-v0.1-32m

v0.1 of the Reason-mxbai-colbert series — same edge-scale late-interaction retriever as v0, retrained with the correct projection-head architecture (use_residual: true in the 2_Dense layer). Mean BRIGHT nDCG@10 improves from 19.00 → 19.61 (+0.61), with the largest gains on the natural-language and formal-math splits that v0 struggled with.

What changed vs v0

The v0 release inherited use_residual: false in the 2_Dense (768→768) layer from the upstream mixedbread-ai/mxbai-edge-colbert-v0-32m config. That turned out to be a bug in the shipped config: the base weights were trained with a residual connection on 2_Dense, but the config flag said otherwise. PyLate respects the flag, so every downstream fine-tune ran on a silently broken architecture.
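To make the mismatch concrete, here is a minimal sketch of a bias-free dense layer with and without a residual connection. It assumes the residual is a plain skip connection (output = W·x + x) for equal input and output sizes, which is how this card describes 2_Dense; the `dense` function and the toy 2×2 weights are illustrative, not PyLate's implementation.

```python
def dense(x, weight, use_residual=False):
    """Bias-free linear layer; optionally adds the input back (residual)."""
    # y[i] = sum_j weight[i][j] * x[j]
    y = [sum(w * v for w, v in zip(row, x)) for row in weight]
    if use_residual:
        # Assumption: a plain skip connection, which requires matching
        # input and output sizes (true for the 768 -> 768 layer).
        if len(y) != len(x):
            raise ValueError("residual needs matching input/output sizes")
        y = [yi + xi for yi, xi in zip(y, x)]
    return y


# Toy 2x2 weights standing in for the real 768x768 matrix.
weight = [[0.5, 0.0], [0.0, -0.5]]
x = [1.0, 2.0]

as_trained = dense(x, weight, use_residual=True)   # what the weights expect
as_shipped = dense(x, weight, use_residual=False)  # what the v0 config ran
```

The same weights give different outputs under the two flags, so weights trained with the residual are evaluated by a different function when the flag is off.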

v0.1 fixes this by:

  1. Copying the base model and flipping 2_Dense/config.json: "use_residual": true.
  2. Re-widening the 3_Dense projection from 64 → 128 dims on top of the corrected base (small-random init for the new channels).
  3. Retraining the full curriculum (Stage 1 VL warmup → Stage 2 BGE-reasoner + HQ-hn polish) on the corrected architecture.
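Step 1 amounts to a one-key edit of a JSON file in a local copy of the base model. A minimal sketch, assuming the usual layout with a 2_Dense/config.json inside the model directory; the helper name `set_use_residual` and the example path are hypothetical.

```python
import json
from pathlib import Path


def set_use_residual(model_dir, value=True):
    """Rewrite 2_Dense/config.json in a local model copy; return the old value."""
    config_path = Path(model_dir) / "2_Dense" / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    previous = config.get("use_residual")
    config["use_residual"] = value
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Example call on a local copy (hypothetical path):
# set_use_residual("models/mxbai-edge-colbert-v0-32m-fixed")
```

Returning the previous value makes it easy to log whether a given copy was actually affected.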

Under the corrected architecture, stage-2 training loss falls from 38.73 to 31.71 (−7.02), so the model is genuinely learning. The equivalent v0 run under the broken architecture stayed at a loss of roughly 45-51 and never meaningfully updated the weights of the new dimensions.
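The new-dimension weights are the rows added when 3_Dense is widened from 64 to 128 outputs. The sketch below shows one way such a widening can be done: keep the trained rows and append rows of small Gaussian noise. The function name, the standard deviation of 1e-3, and the toy sizes are assumptions; this card only states that the new channels use a small-random init.

```python
import random


def widen_rows(weight, new_out, std=1e-3, seed=0):
    """Return a copy of `weight` with extra output rows of small Gaussian noise."""
    rng = random.Random(seed)
    in_features = len(weight[0])
    widened = [list(row) for row in weight]  # trained rows are kept unchanged
    for _ in range(new_out - len(weight)):
        widened.append([rng.gauss(0.0, std) for _ in range(in_features)])
    return widened


# Toy sizes standing in for the real 64 -> 128 rows over 768 inputs.
old = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
new = widen_rows(old, new_out=4)
```

Because the original rows are copied verbatim, the first output dimensions behave exactly as before until training starts updating them.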

BRIGHT results

Evaluated with MTEB BrightRetrieval (brute-force MaxSim, query_length=256, document_length=2048).

| Split | v0 | v0.1 | Δ vs v0 | Reason-MCB (150M, paper) |
|---|---|---|---|---|
| aops | 5.05 | 4.89 | −0.16 | 9.17 |
| biology | 32.71 | 33.16 | +0.45 | 33.25 |
| earth_science | 43.88 | 44.28 | +0.40 | 41.02 |
| economics | 18.70 | 20.25 | +1.54 | 24.93 |
| leetcode | 17.67 | 17.40 | −0.28 | 31.07 |
| pony | 20.73 | 22.77 | +2.03 | 8.51 |
| psychology | 22.62 | 24.91 | +2.29 | 30.73 |
| robotics | 18.43 | 18.65 | +0.23 | 21.12 |
| stackoverflow | 16.78 | 16.66 | −0.12 | 20.62 |
| sustainable_living | 20.77 | 20.11 | −0.66 | 20.31 |
| theoremqa_questions | 8.38 | 9.04 | +0.66 | 19.51 |
| theoremqa_theorems | 2.25 | 3.19 | +0.94 | 11.24 |
| Full mean | 19.00 | 19.61 | +0.61 | 22.62 |

8 wins / 4 losses. Highlights:

  • Natural-language splits (psychology +2.29, pony +2.03, economics +1.54): the largest per-split gains under the corrected architecture.
  • Formal-math splits (theoremqa_theorems +0.94, theoremqa_questions +0.66): previously stuck near the floor, now improving, though still low in absolute terms.
  • Code-split regressions (leetcode −0.28, stackoverflow −0.12) are small and likely within run-to-run noise.
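The summary figures can be rechecked from the table's rounded per-split values. The short script below recomputes the two means and the win/loss count; note that deltas recomputed from rounded scores can differ from the published Δ column by 0.01, presumably because that column was computed before rounding.

```python
from statistics import mean

# (v0, v0.1) nDCG@10 per split, copied from the table above.
scores = {
    "aops": (5.05, 4.89),
    "biology": (32.71, 33.16),
    "earth_science": (43.88, 44.28),
    "economics": (18.70, 20.25),
    "leetcode": (17.67, 17.40),
    "pony": (20.73, 22.77),
    "psychology": (22.62, 24.91),
    "robotics": (18.43, 18.65),
    "stackoverflow": (16.78, 16.66),
    "sustainable_living": (20.77, 20.11),
    "theoremqa_questions": (8.38, 9.04),
    "theoremqa_theorems": (2.25, 3.19),
}

v0_mean = mean(v0 for v0, _ in scores.values())
v01_mean = mean(v01 for _, v01 in scores.values())
wins = sum(v01 > v0 for v0, v01 in scores.values())
losses = sum(v01 < v0 for v0, v01 in scores.values())

print(f"v0 mean {v0_mean:.2f}, v0.1 mean {v01_mean:.2f}")
print(f"{wins} wins / {losses} losses")
```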

Model Details

  • Model Type: PyLate ColBERT (late-interaction, multi-vector)
  • Base model: mixedbread-ai/mxbai-edge-colbert-v0-32m (with 2_Dense/use_residual: true patched to match trained weights)
  • Parameters: ~32M backbone + widened 128-dim projection
  • Document Length (training): 2048 tokens
  • Query Length (training): 256 tokens
  • Output Dimensionality: 128 per token (widened from base's 64)
  • Similarity Function: MaxSim
  • Training Data: hanhainebula/bge-reasoner-data + reasonir/reasonir-data (VL split for warmup, HQ with hard negatives for polish)
  • Language: en
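For readers new to late interaction, the MaxSim similarity listed above scores a query against a document by taking, for each query token, its best dot product with any document token, then summing over query tokens. A dependency-free sketch with toy 2-dimensional embeddings follows; real embeddings from this model have 128 dimensions per token, and the names here are illustrative only.

```python
def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings: two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "b": [[0.5, 0.5]],
}

# Brute-force ranking: score every document, highest MaxSim first.
ranked = sorted(docs, key=lambda doc_id: maxsim(query, docs[doc_id]), reverse=True)
```

Scoring every document this way is what "brute-force MaxSim" refers to; an approximate index trades some exactness for speed.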

Model Architecture

ColBERT(
  (0): Transformer({'max_seq_length': 127, 'do_lower_case': True}) with ModernBertModel
      hidden_size=384, num_hidden_layers=10, num_attention_heads=6,
      position_embedding_type='sans_pos', max_position_embeddings=7999
  (1): Dense(384 → 768, bias=False, use_residual=False)
  (2): Dense(768 → 768, bias=False, use_residual=True)     # ← THE FIX (was False in v0)
  (3): Dense(768 → 128, bias=False, use_residual=False)    # widened from 64 → 128 on correct base
)
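As a rough size check, the weights in the projection head can be counted directly from the layer shapes above. This sketch assumes the three Dense layers are bias-free, as listed, and that the residual on the 768 → 768 layer is an identity skip that adds no parameters.

```python
# (in_features, out_features) for the three bias-free Dense layers listed above.
head_128 = [(384, 768), (768, 768), (768, 128)]
head_64 = [(384, 768), (768, 768), (768, 64)]  # same head with the base's 64-dim output


def count_weights(layers):
    """Number of weights in a stack of bias-free linear layers."""
    return sum(n_in * n_out for n_in, n_out in layers)


total_128 = count_weights(head_128)            # 983040
extra = total_128 - count_weights(head_64)     # 49152 added by the widening
print(total_128, extra)
```

Under these assumptions the head holds just under 1M weights, and widening the output from 64 to 128 adds about 49k of them.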

Usage

from pylate import indexes, models, retrieve

# Load the model from the Hugging Face Hub.
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0.1-32m")

# Retrieval (see https://lightonai.github.io/pylate/ for full API)
# override=True recreates the index on disk if one already exists.
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)

# Encode documents as multi-vector embeddings and add them to the index.
docs = ["document 1 text", "document 2 text"]
doc_embs = model.encode(docs, is_query=False, batch_size=32, show_progress_bar=True)
index.add_documents(documents_ids=["1", "2"], documents_embeddings=doc_embs)

# Encode the query with its task instruction, then retrieve the top 10 documents.
retriever = retrieve.ColBERT(index=index)
q_embs = model.encode(
    ["Given a Psychology post, retrieve relevant passages that help answer the post.\nQuery: why do I procrastinate?"],
    is_query=True,
)
scores = retriever.retrieve(queries_embeddings=q_embs, k=10)  # one ranked list of hits per query
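The query in the example above is an instruction followed by a newline and "Query: ...". A small helper can keep that layout consistent across queries. The helper name is hypothetical, and only the Psychology instruction is taken from this card; instruction wording for other domains is not given here and would have to be supplied by the user.

```python
def format_query(instruction, query):
    """Prefix a query with a task instruction, matching the example's layout."""
    return f"{instruction}\nQuery: {query}"


instruction = "Given a Psychology post, retrieve relevant passages that help answer the post."
prompt = format_query(instruction, "why do I procrastinate?")
```

The resulting string can be passed to model.encode(..., is_query=True) in place of the hand-written literal.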

Acknowledgements

Thanks to Antoine Chaffin (LightOn / Reason-ModernColBERT author) for flagging the upstream 2_Dense config bug in mxbai-edge-colbert-v0-32m — without that heads-up, v0.1 wouldn't exist.
