pplx-embed-v1-late-0.6b: Late-Interaction Embeddings

pplx-embed-v1-late-0.6b is a token-level late-interaction embedding model for retrieval with MaxSim scoring. It was obtained by continuing the training of pplx-embed-v1-0.6b with ContrastiveLoss to optimize token-level MaxSim scores.

Each token embedding has 128 dimensions, which hits the fast path of the optional erikkaum/maxsim MaxSim kernel.

Usage

Using PyLate (indexing + retrieval)

from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="perplexity-ai/pplx-embed-v1-late-0.6b",
    trust_remote_code=True,
)

documents_ids = ["1", "2", "3"]
documents = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
]

index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="pplx-embed-v1-late-0.6b",
    override=True,
)
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(documents_ids=documents_ids, documents_embeddings=documents_embeddings)

retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["What motivates scientific discovery?"], is_query=True)
scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=3)
print(scores)
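To read the retrieval output alongside the original texts, a small helper along these lines can be used. It is a sketch that assumes the retriever returns one list per query containing dicts with "id" and "score" keys; the helper name attach_texts and the score values below are invented for illustration and are not produced by the model.

```python
documents_ids = ["1", "2", "3"]
documents = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
]

# Hypothetical retriever output for a single query (scores are made up).
scores = [[
    {"id": "1", "score": 12.5},
    {"id": "3", "score": 10.1},
    {"id": "2", "score": 9.4},
]]

id_to_text = dict(zip(documents_ids, documents))


def attach_texts(results, id_to_text):
    """Return a copy of the results with the document text added to each hit."""
    return [
        [{**hit, "text": id_to_text[hit["id"]]} for hit in hits]
        for hits in results
    ]


ranked = attach_texts(scores, id_to_text)
for hit in ranked[0]:
    print(f'{hit["score"]:.2f}  {hit["id"]}  {hit["text"]}')
```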

Using the erikkaum/maxsim kernel (fast MaxSim scoring)

The kernel provides fused MaxSim for reranking, pair scoring, or evaluation. It supports CUDA (sm_80/86/89) and Metal (Apple Silicon), takes fp32/fp16/bf16 inputs, returns fp32 scores, and is forward-only.

import torch
from kernels import get_kernel
from pylate import models

# The kernel targets CUDA and Metal only, so this example has no CPU fallback.
device = "cuda" if torch.cuda.is_available() else "mps"
model = models.ColBERT(
    model_name_or_path="perplexity-ai/pplx-embed-v1-late-0.6b",
    trust_remote_code=True,
    device=device,
)
maxsim = get_kernel("erikkaum/maxsim", version=1, trust_remote_code=True)

q_emb = model.encode(["What motivates scientific discovery?"], is_query=True, convert_to_tensor=True)
d_emb = model.encode([
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
], is_query=False, convert_to_tensor=True)

# Pad to [B=1, n_candidates, Ld_max, dim] for score_candidates_padded.
Lq, dim = q_emb[0].shape
n, Ld_max = len(d_emb), max(d.shape[0] for d in d_emb)
queries_pad = q_emb[0].unsqueeze(0).to(device, torch.float16)
documents_pad = torch.zeros(1, n, Ld_max, dim, device=device, dtype=torch.float16)
for i, d in enumerate(d_emb):
    documents_pad[0, i, : d.shape[0]] = d.to(device, torch.float16)
query_lengths = torch.tensor([Lq], dtype=torch.int32, device=device)
doc_lengths = torch.tensor([[d.shape[0] for d in d_emb]], dtype=torch.int32, device=device)

scores = maxsim.score_candidates_padded(queries_pad, documents_pad, query_lengths, doc_lengths)
print(scores[0].tolist())  # fp32 scores per candidate

For ragged, variable-length pair scoring (eval, distillation, hard-negative mining), use maxsim.score_pairs_packed(...) instead; see the kernel card for the packed API.
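To make the role of the length arguments concrete, here is a pure-Python sketch of what a padded MaxSim scorer has to do: skip padding rows so that zero vectors never win the max. It assumes, without having checked the kernel source, that score_candidates_padded ignores positions beyond query_lengths and doc_lengths. The function padded_maxsim_reference and the toy data are hypothetical and only mirror the tensor layout used above.

```python
def padded_maxsim_reference(queries, documents, query_lengths, doc_lengths):
    """Reference MaxSim over padded nested lists.

    queries:       [B][Lq_max][dim]
    documents:     [B][n_candidates][Ld_max][dim]
    query_lengths: [B]
    doc_lengths:   [B][n_candidates]
    Returns scores shaped [B][n_candidates].
    """
    scores = []
    for b, query in enumerate(queries):
        row = []
        for c, doc in enumerate(documents[b]):
            valid_doc = doc[: doc_lengths[b][c]]   # drop document padding rows
            total = 0.0
            for q in query[: query_lengths[b]]:    # drop query padding rows
                total += max(
                    sum(x * y for x, y in zip(q, d)) for d in valid_doc
                )
            row.append(total)
        scores.append(row)
    return scores


# One query with two real tokens; two candidates padded to Ld_max = 2.
queries = [[[1.0, 0.0], [0.0, -1.0]]]
documents = [[
    [[1.0, 0.0], [0.0, 1.0]],  # candidate 0: two real tokens
    [[0.0, 1.0], [0.0, 0.0]],  # candidate 1: one real token plus one zero pad row
]]
scores = padded_maxsim_reference(queries, documents, [2], [[2, 1]])
print(scores)
```

In this toy case, counting the zero pad row of candidate 1 as a real token would raise its score from -1.0 to 0.0, which is why the lengths travel with the padded tensors.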

Performance

We evaluate pplx-embed-v1-late-0.6b on two standard late-interaction retrieval suites and report the average nDCG@10:

BEIR — average over 15 English retrieval tasks.
MIRACL — average over 18 languages.

| Benchmark | `pplx-embed-v1-late-0.6b` | Reference |
| --- | --- | --- |
| BEIR (15 tasks) | 56.61 | colbert-zero: 55.43 |
| MIRACL (18 langs) | 66.62 | jina-colbert-v2: 62.28 |
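For readers unfamiliar with the metric, the sketch below computes nDCG@10 for a single query using the linear-gain form (relevance divided by log2 of rank + 1). It is a generic illustration, not the evaluation code behind the numbers above; the function names and the relevance labels are hypothetical. The table values are presumably these per-query scores averaged and scaled by 100.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels, in rank order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: three relevant documents exist in the corpus.
# The system returns relevant documents at ranks 1 and 3 and misses the third.
ranked = [1, 0, 1, 0, 0]
score = ndcg_at_k(ranked, all_relevances=[1, 1, 1], k=10)
print(round(score, 4))
```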

Technical Details

This model uses late interaction: queries and documents are encoded as token-level vectors and scored with MaxSim rather than pooled into a single vector.
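As a rough illustration of the scoring rule described above, the following pure-Python sketch computes MaxSim for one query and one document: each query token is matched to its most similar document token by dot product, and the per-token maxima are summed. The function name maxsim_score and the 2-dimensional toy vectors are hypothetical; real token embeddings from this model have 128 dimensions.

```python
def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the max dot product with any document token."""
    total = 0.0
    for q in query_tokens:
        # Best-matching document token for this query token.
        best = max(sum(qi * di for qi, di in zip(q, d)) for d in doc_tokens)
        total += best
    return total


# Toy 2-dimensional token vectors (made-up values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [0.5, 0.0]]              # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every query token contributes its own best match, a document that covers more of the query's tokens scores higher, which is the behavior a single pooled vector cannot express directly.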

For background on the base embedding family, see the pplx-embed-v1-0.6b model card and the technical report: https://arxiv.org/abs/2602.11151.