Reason-mxbai-colbert-v0.1-32m
v0.1 of the Reason-mxbai-colbert series — same edge-scale late-interaction retriever as v0, retrained with the correct projection-head architecture (use_residual: true in the 2_Dense layer). Mean BRIGHT nDCG@10 improves from 19.00 → 19.61 (+0.61), with the largest gains on the natural-language and formal-math splits that v0 struggled with.
What changed vs v0
The v0 release inherited use_residual: false in the 2_Dense (768→768) layer from the upstream mixedbread-ai/mxbai-edge-colbert-v0-32m config. That turned out to be a shipped-config bug: the base weights were trained with a residual connection on 2_Dense, but the config flag said otherwise. PyLate respects the flag, so every downstream fine-tune loaded those weights into a silently broken architecture.
v0.1 fixes this by:
- Copying the base model and flipping 2_Dense/config.json: "use_residual": true.
- Re-widening the 3_Dense projection from 64 → 128 dims on top of the corrected base (small-random init for the new channels).
- Retraining the full curriculum (Stage 1 VL warmup → Stage 2 BGE-reasoner + HQ-hn polish) on the corrected architecture.
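The config flip in the first step amounts to editing a single JSON field. A minimal sketch, assuming the base model has been copied to a local directory laid out like the `2_Dense/config.json` path named above; the helper name `set_use_residual`, the temporary directory, and the `in_features`/`out_features` keys of the stand-in config are illustrative assumptions, not taken from the actual checkpoint:

```python
import json
import tempfile
from pathlib import Path


def set_use_residual(model_dir: Path, value: bool = True) -> dict:
    """Rewrite 2_Dense/config.json so that use_residual equals `value`."""
    config_path = model_dir / "2_Dense" / "config.json"
    config = json.loads(config_path.read_text())
    config["use_residual"] = value
    config_path.write_text(json.dumps(config, indent=2))
    return config


# Demo on a throwaway directory that mimics the layout (not the real checkpoint).
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    (model_dir / "2_Dense").mkdir()
    # Hypothetical stand-in config; only `use_residual` comes from the text above.
    (model_dir / "2_Dense" / "config.json").write_text(
        json.dumps({"in_features": 768, "out_features": 768, "use_residual": False})
    )
    patched = set_use_residual(model_dir)
    reloaded = json.loads((model_dir / "2_Dense" / "config.json").read_text())

print(reloaded["use_residual"])
```

All other fields in the file are left untouched, so the only difference from the upstream config is the flag itself.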
Under the corrected architecture, stage-2 training loss drops from 38.73 to 31.71 (−7.02), which indicates the model is actually learning. The equivalent v0 run under the broken architecture stayed stuck at a loss of roughly 45 to 51 and never moved the weights of the newly added dimensions.
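The re-widening step listed above (keep the trained channels, initialise the new ones with small random values) can be sketched in plain Python. This is an illustration under stated assumptions, not the training code: the helper name `widen_projection`, the Gaussian initialiser, the standard deviation of 0.02, and the placeholder weights are all hypothetical, since the card only says "small-random init".

```python
import random


def widen_projection(weight, new_out_dim, init_std=0.02, seed=0):
    """Return a copy of `weight` (rows = output channels) grown to `new_out_dim` rows.

    Existing rows are kept unchanged; new rows are drawn from N(0, init_std^2).
    """
    if new_out_dim < len(weight):
        raise ValueError("new_out_dim must be >= the current number of output channels")
    rng = random.Random(seed)
    in_dim = len(weight[0])
    widened = [row[:] for row in weight]  # copy the trained channels as-is
    for _ in range(new_out_dim - len(weight)):
        widened.append([rng.gauss(0.0, init_std) for _ in range(in_dim)])
    return widened


# Toy stand-in for a trained 64 x 768 projection (values are placeholders).
old_weight = [[0.5] * 768 for _ in range(64)]
new_weight = widen_projection(old_weight, new_out_dim=128)
print(len(new_weight), len(new_weight[0]))
```

Because the first 64 rows are copied verbatim, the widened layer initially reproduces the original 64 output channels exactly, and training only has to teach the 64 new channels something useful.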
BRIGHT results
Evaluated with MTEB BrightRetrieval (brute-force MaxSim, query_length=256, document_length=2048).
| Split | v0 | v0.1 | Δ vs v0 | Reason-MCB (150M, paper) |
|---|---|---|---|---|
| aops | 5.05 | 4.89 | −0.16 | 9.17 |
| biology | 32.71 | 33.16 | +0.45 | 33.25 |
| earth_science | 43.88 | 44.28 | +0.40 | 41.02 |
| economics | 18.70 | 20.25 | +1.54 | 24.93 |
| leetcode | 17.67 | 17.40 | −0.28 | 31.07 |
| pony | 20.73 | 22.77 | +2.03 | 8.51 |
| psychology | 22.62 | 24.91 | +2.29 | 30.73 |
| robotics | 18.43 | 18.65 | +0.23 | 21.12 |
| stackoverflow | 16.78 | 16.66 | −0.12 | 20.62 |
| sustainable_living | 20.77 | 20.11 | −0.66 | 20.31 |
| theoremqa_questions | 8.38 | 9.04 | +0.66 | 19.51 |
| theoremqa_theorems | 2.25 | 3.19 | +0.94 | 11.24 |
| Full mean | 19.00 | 19.61 | +0.61 | 22.62 |
8 wins / 4 losses. Highlights:
- Natural-language splits (psychology +2.29, economics +1.54, pony +2.03) — the correct architecture lets the model learn meaningfully richer representations.
- Formal-math splits (theoremqa_theorems +0.94, theoremqa_questions +0.66) — previously stuck near the floor, now moving.
- Code-split regressions (leetcode −0.28, stackoverflow −0.12) are small and plausibly within run-to-run noise.
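The summary figures can be re-derived from the per-split numbers. The sketch below copies the v0 and v0.1 columns from the table above and recomputes the two means and the win/loss count; nothing in it goes beyond the table.

```python
# nDCG@10 per split, copied from the table above: (v0, v0.1).
scores = {
    "aops": (5.05, 4.89),
    "biology": (32.71, 33.16),
    "earth_science": (43.88, 44.28),
    "economics": (18.70, 20.25),
    "leetcode": (17.67, 17.40),
    "pony": (20.73, 22.77),
    "psychology": (22.62, 24.91),
    "robotics": (18.43, 18.65),
    "stackoverflow": (16.78, 16.66),
    "sustainable_living": (20.77, 20.11),
    "theoremqa_questions": (8.38, 9.04),
    "theoremqa_theorems": (2.25, 3.19),
}

mean_v0 = sum(v0 for v0, _ in scores.values()) / len(scores)
mean_v01 = sum(v01 for _, v01 in scores.values()) / len(scores)
wins = sum(v01 > v0 for v0, v01 in scores.values())
losses = sum(v01 < v0 for v0, v01 in scores.values())

print(f"mean v0={mean_v0:.2f}  mean v0.1={mean_v01:.2f}  delta={mean_v01 - mean_v0:+.2f}")
print(f"{wins} wins / {losses} losses")
```

This reproduces the means of 19.00 and 19.61 and the 8/4 split. Per-split differences recomputed from the rounded table values can deviate from the listed Δ column by 0.01 (for example economics), presumably because the listed deltas were computed before rounding.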
Model Details
- Model Type: PyLate ColBERT (late-interaction, multi-vector)
- Base model: mixedbread-ai/mxbai-edge-colbert-v0-32m (with 2_Dense/use_residual: true patched to match trained weights)
- Parameters: ~32M backbone + widened 128-dim projection
- Document Length (training): 2048 tokens
- Query Length (training): 256 tokens
- Output Dimensionality: 128 per token (widened from base's 64)
- Similarity Function: MaxSim
- Training Data: hanhainebula/bge-reasoner-data + reasonir/reasonir-data (VL split for warmup, HQ with hard negatives for polish)
- Language: en
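For readers new to late interaction, the MaxSim similarity listed above can be written out in a few lines. This is a plain-Python sketch of the usual ColBERT-style definition (for each query token take the best-matching document token, then sum), using toy 2-dimensional vectors; PyLate's own implementation operates on batched tensors and may differ in details such as normalisation.

```python
def maxsim(query_emb, doc_emb):
    """Sum, over query tokens, of the best dot-product match among document tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_emb) for q in query_emb)


# Toy 2-dimensional token embeddings (the model itself emits 128 dimensions per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first query token

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Here `doc_a` scores 2.0 and `doc_b` scores 1.0, so `doc_a` ranks first. "Brute-force" evaluation simply applies this score to every query-document pair instead of going through an approximate index.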
Model Architecture
ColBERT(
(0): Transformer({'max_seq_length': 127, 'do_lower_case': True}) with ModernBertModel
hidden_size=384, num_hidden_layers=10, num_attention_heads=6,
position_embedding_type='sans_pos', max_position_embeddings=7999
(1): Dense(384 → 768, bias=False, use_residual=False)
(2): Dense(768 → 768, bias=False, use_residual=True) # ← THE FIX (was False in v0)
(3): Dense(768 → 128, bias=False, use_residual=False) # widened from 64 → 128 on correct base
)
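To make the effect of the use_residual flag on layer (2) concrete, here is a toy sketch of a bias-free dense layer with and without a skip connection. It assumes the usual form of a residual connection, y = x + Wx; the function name `dense` and the 3 → 3 weights are made up for illustration, and PyLate's Dense module may implement the residual path differently in detail.

```python
def dense(x, weight, use_residual=False):
    """Bias-free linear layer y = W x, optionally with a skip connection y = x + W x."""
    y = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if use_residual:
        if len(y) != len(x):
            raise ValueError("a residual connection needs matching input/output sizes")
        y = [xi + yi for xi, yi in zip(x, y)]
    return y


# Toy 3 -> 3 layer standing in for the 768 -> 768 one; the weights are made up.
weight = [
    [0.1, 0.0, 0.0],
    [0.0, 0.2, 0.0],
    [0.0, 0.0, 0.3],
]
x = [1.0, 2.0, 3.0]

without_skip = dense(x, weight, use_residual=False)  # roughly [0.1, 0.4, 0.9]
with_skip = dense(x, weight, use_residual=True)      # roughly [1.1, 2.4, 3.9]
print(without_skip, with_skip)
```

The same weights yield very different outputs depending on the flag, which is why loading weights trained with the skip connection under use_residual=False changes the function the layer computes.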
Usage
from pylate import indexes, models, retrieve

# Load the checkpoint from the Hugging Face Hub.
model = models.ColBERT(model_name_or_path="DataScience-UIBK/Reason-mxbai-colbert-v0.1-32m")

# Retrieval (see https://lightonai.github.io/pylate/ for full API)
# Create a Voyager index on disk; override=True replaces an existing index of the same name.
index = indexes.Voyager(index_folder="pylate-index", index_name="index", override=True)

# Encode the documents and add them to the index under matching IDs.
docs = ["document 1 text", "document 2 text"]
doc_embs = model.encode(docs, is_query=False, batch_size=32, show_progress_bar=True)
index.add_documents(documents_ids=["1", "2"], documents_embeddings=doc_embs)

# Encode an instruction-prefixed query and retrieve the top-k documents for it.
retriever = retrieve.ColBERT(index=index)
q_embs = model.encode(
    ["Given a Psychology post, retrieve relevant passages that help answer the post.\nQuery: why do I procrastinate?"],
    is_query=True,
)
scores = retriever.retrieve(queries_embeddings=q_embs, k=10)
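The query in the example above is prefixed with a task instruction followed by `\nQuery: `. A small helper for building such strings might look like the sketch below. The function name `format_query` is hypothetical, and only the Psychology wording appears in this card; whether other domains use the same template is an assumption.

```python
def format_query(domain: str, query: str) -> str:
    """Prefix a raw query with a task instruction in the style of the example above."""
    instruction = f"Given a {domain} post, retrieve relevant passages that help answer the post."
    return f"{instruction}\nQuery: {query}"


prompt = format_query("Psychology", "why do I procrastinate?")
print(prompt)
```

The resulting string is identical to the one passed to `model.encode(..., is_query=True)` in the usage snippet.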
Acknowledgements
Thanks to Antoine Chaffin (LightOn / Reason-ModernColBERT author) for flagging the upstream 2_Dense config bug in mxbai-edge-colbert-v0-32m — without that heads-up, v0.1 wouldn't exist.