kniv-deberta-nlp-base-en-small
A compact multi-task NLP student model that performs the same 5 language analysis tasks as our production teacher from a single DeBERTa-v3-small encoder pass: POS tagging, Named Entity Recognition, Dependency Parsing, Semantic Role Labeling, and Dialog Act Classification.
This is the best size/quality tradeoff in the kniv cascade family — 2.8× smaller than the teacher while staying within 0.1–1.4 pts of teacher quality across all five heads. Recommended for general-purpose deployment.
Part of the Rustic initiative by Dragonscale Industries Inc.
| Property | Value |
|---|---|
| Source code | GitHub |
| Teacher | kniv-deberta-nlp-base-en-large (443M) |
| Encoder | DeBERTa-v3-small (768d, 6 layers) |
| Parameters | 157.1M (141.3M encoder + 15.8M heads) |
| Compression | 2.8× smaller than teacher |
| Download | 628 MB (PyTorch) / 629 MB (ONNX FP32) / 190 MB (ONNX INT8) |
| License | CC-BY-SA-4.0 |
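The parameter, compression, and download figures in the table agree with each other, which a quick back-of-the-envelope check makes visible. The sketch below assumes 4 bytes per FP32 parameter and decimal megabytes; both conventions are assumptions, not something the card states.

```python
# Sanity-check the headline numbers from the table above.
# Assumptions (not stated in the card): FP32 = 4 bytes/parameter, 1 MB = 10**6 bytes.
ENCODER_PARAMS_M = 141.3   # millions, from the table
HEAD_PARAMS_M = 15.8       # millions, from the table
TEACHER_PARAMS_M = 443.0   # millions, from the table

total_params_m = ENCODER_PARAMS_M + HEAD_PARAMS_M
compression = TEACHER_PARAMS_M / total_params_m
fp32_size_mb = total_params_m * 4  # millions of params * 4 bytes = MB

print(f"total parameters: {total_params_m:.1f}M")
print(f"compression vs teacher: {compression:.1f}x")
print(f"estimated FP32 size: {fp32_size_mb:.0f} MB")
```

The INT8 file is larger than a naive one-byte-per-parameter estimate would suggest, which is consistent with some tensors staying in higher precision, though the card does not say which.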
Quick Start
ONNX
pip install torch transformers==5.6.2 onnxruntime
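import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dragonscale-ai/kniv-deberta-nlp-base-en-small")
session = ort.InferenceSession("onnx/cascade.onnx")
# Tokenize straight to NumPy arrays; the graph expects int64 inputs
enc = tokenizer(["The cat sat on the mat."], return_tensors="np")
input_ids = enc["input_ids"].astype(np.int64)
attention_mask = enc["attention_mask"].astype(np.int64)
predicate_idx = np.zeros(input_ids.shape[0], dtype=np.int64)  # SRL unused in this call

pos, ner, arc, label, srl, cls = session.run(None, {
    "input_ids": input_ids,            # int64 [batch, seq]
    "attention_mask": attention_mask,  # int64 [batch, seq]
    "predicate_idx": predicate_idx,    # int64 [batch] (verb token index; 0 if unused)
})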
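The session returns raw per-head outputs, so a decoding step is still needed to get readable tags. The following is a minimal sketch for the POS head only, assuming that `pos` holds logits of shape [batch, seq, 17] and that the labels follow the alphabetical UPOS order shown here. Both are assumptions: the card states neither the output layout nor the label order, and `decode_tags` is a hypothetical helper, not part of the repository.

```python
# Hypothetical UPOS label order; the real index-to-tag mapping ships with the
# model and may differ. Check the repository's label files before relying on it.
UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"]

def decode_tags(logits, attention_mask, labels):
    """Argmax each token's logits, skipping padded positions.

    logits: [batch][seq][num_labels] (nested lists or a NumPy array)
    attention_mask: [batch][seq] with 1 for real tokens and 0 for padding
    """
    decoded = []
    for sent_logits, sent_mask in zip(logits, attention_mask):
        tags = []
        for token_logits, keep in zip(sent_logits, sent_mask):
            if not keep:
                continue  # padding position
            scores = list(token_logits)
            best = max(range(len(scores)), key=scores.__getitem__)
            tags.append(labels[best])
        decoded.append(tags)
    return decoded

# Toy input: one sentence, three positions (the last is padding), 17 labels.
toy_logits = [[[0.0] * 17 for _ in range(3)]]
toy_logits[0][0][5] = 4.0   # highest score at index 5
toy_logits[0][1][7] = 3.0   # highest score at index 7
toy_mask = [[1, 1, 0]]
print(decode_tags(toy_logits, toy_mask, UPOS))
```

With real outputs, special tokens and subword pieces would also need to be aligned back to words, and the NER and SRL heads are marked for Viterbi decoding in the architecture diagram, so a plain argmax is not the intended decoder for them.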
PyTorch
pip install torch transformers==5.6.2 seqeval
python examples/cascade_demo.py --model models/kniv-deberta-nlp-base-en-small
The demo script loads the model, runs all 5 heads, and prints POS tags, NER entities, dependency tree, SRL frames, and dialog acts.
Benchmark Results
All benchmarks use standard public test sets. No benchmark data was used during training. Results are reproducible via the included benchmark scripts.
| Head | Score | Metric | Benchmark | Split |
|---|---|---|---|---|
| POS | 0.970 | Accuracy | UD English EWT | test |
| NER | 0.779 | F1 | CoNLL-2003 (mapped) | test |
| DEP | 0.942 / 0.922 | UAS / LAS | UD English EWT | test |
| SRL | 0.831 | F1 | PropBank EWT | test |
| CLS | 0.947 | Macro F1 | SGD + GPT (8 labels, internal) | dev |
NER on CoNLL-2003 was evaluated by mapping our 18 OntoNotes entity types to the 4 CoNLL types (PER, ORG, LOC, MISC). Numeric entities (DATE, MONEY, PERCENT, QUANTITY, ORDINAL, CARDINAL, TIME) have no CoNLL equivalent and are mapped to O, so the model is scored under a strictly harder protocol than the one CoNLL-trained baselines face.
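The mapping step can be sketched as below. Only the numeric-types-to-O rule comes from the card; the assignment of the remaining eleven OntoNotes types to PER, ORG, LOC, and MISC is an illustrative assumption, as is the BIO tag format, and the benchmark script may use a different table.

```python
# Numeric OntoNotes types; the card states these are mapped to O.
NUMERIC_TYPES = {"DATE", "MONEY", "PERCENT", "QUANTITY", "ORDINAL", "CARDINAL", "TIME"}

# Illustrative assumption: the card does not list the non-numeric mapping.
ASSUMED_MAP = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC", "LOC": "LOC", "FAC": "LOC",
    "NORP": "MISC", "PRODUCT": "MISC", "EVENT": "MISC",
    "WORK_OF_ART": "MISC", "LAW": "MISC", "LANGUAGE": "MISC",
}

def map_bio_tag(tag):
    """Map one BIO tag such as 'B-GPE' to the CoNLL-2003 label space."""
    if tag == "O":
        return "O"
    prefix, _, entity_type = tag.partition("-")
    if entity_type in NUMERIC_TYPES:
        return "O"  # no CoNLL equivalent
    return f"{prefix}-{ASSUMED_MAP[entity_type]}"

predicted = ["B-PERSON", "I-PERSON", "O", "B-GPE", "O", "B-DATE", "I-DATE"]
mapped = [map_bio_tag(t) for t in predicted]
print(mapped)
```

Because a correctly predicted DATE span becomes O, it earns no credit, while any disagreement on the four shared types still counts against the model.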
CLS cross-evaluation on DailyDialog: accuracy 0.593. This comparison requires a lossy 8→4 label mapping, so the score is indicative only.
# Reproduce benchmarks
python models/download_benchmarks.py
python models/student_benchmark_standard.py \
    --model-dir models/kniv-deberta-nlp-base-en-small --backend all
Runtime Performance
Single-call latency on NVIDIA RTX 4070 Laptop GPU (CUDA 13.0, ONNX Runtime 1.25.1 with CUDA 13 build):
| Runtime | bs=1 seq=64 | bs=1 seq=128 | bs=32 seq=128 (sent/s) |
|---|---|---|---|
| ONNX FP32 CUDA | 6.62 ms | 8.96 ms | 355 |
| PyTorch CUDA | 10.04 ms | 13.53 ms | 363 |
| ONNX INT8 CPU | 21.77 ms | 31.84 ms | 51 |
| ONNX FP32 CPU | 29.41 ms | 46.88 ms | 29 |
| PyTorch CPU | 80.59 ms | 99.38 ms | 21 |
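The card does not describe how these latencies were measured. As a rough sketch of one common approach, the harness below discards warm-up calls and reports the median over repeated timed calls; `measure_latency_ms` and `throughput_sent_per_s` are hypothetical helpers, and the warm-up and iteration counts are arbitrary assumptions.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=10, iters=50):
    """Return median and mean wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples),
            "mean_ms": statistics.fmean(samples),
            "iters": iters}

def throughput_sent_per_s(batch_size, batch_latency_ms):
    """Convert a per-batch latency into sentences per second."""
    return batch_size * 1000.0 / batch_latency_ms

# Stand-in workload; swap in something like lambda: session.run(None, feeds).
stats = measure_latency_ms(lambda: sum(range(10_000)))
print(stats)
```

Read in the other direction, the 355 sent/s figure at bs=32 corresponds to roughly 32 × 1000 / 355 ≈ 90 ms per batch. When timing PyTorch on CUDA, the device should be synchronized before reading the clock, since kernel launches are asynchronous.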
Recommended runtimes:
- GPU server (production sweet spot): ONNX FP32 CUDA, about 9 ms latency (bs=1, seq=128)
- CPU edge / embedded: ONNX INT8 CPU, about 32 ms latency (bs=1, seq=128), 190 MB model size
INT8 quality drops vs FP32 (in points): POS −0.54, NER −2.68, DEP −0.95, SRL −1.80, CLS internal −2.58. INT8 hits this 768d encoder harder than xsmall because its weight rows have more dynamic range to compress. For latency-critical CPU deployment where some INT8 quality loss is acceptable, prefer the xsmall variant, which has smaller INT8 quality drops.
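The dynamic-range argument can be illustrated with a toy example. The sketch below assumes symmetric INT8 quantization with one scale per weight row; the card does not specify the quantization scheme actually used, so this shows the general effect rather than reproducing the model's export.

```python
def quantize_row_int8(row):
    """Symmetric INT8 quantization with a single scale per row (assumed scheme)."""
    scale = max(abs(w) for w in row) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def mean_abs_error(row):
    """Mean absolute error after quantizing and dequantizing a row."""
    q, scale = quantize_row_int8(row)
    return sum(abs(w - qi * scale) for w, qi in zip(row, q)) / len(row)

# Two toy rows with the same small weights; the second adds one large outlier.
narrow = [0.01 * i for i in range(-8, 9)]
wide = narrow + [4.0]

print(f"narrow-range row error: {mean_abs_error(narrow):.6f}")
print(f"wide-range row error:   {mean_abs_error(wide):.6f}")
```

A single large weight stretches the row's scale, so the small weights collapse onto only a few integer levels and their rounding error grows.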
Architecture
Identical cascade structure to the teacher, with all sizes auto-scaled to the encoder hidden dimension (768d):
DeBERTa-v3-small + pred_embedding
│
├─ ScalarMix(all) → Linear(17) → POS
├─ ScalarMix(all) → BiLSTM(192) → +POS probs → MLP(37) → NER [Viterbi]
├─ ScalarMix(all) → BiLSTM(192) → +POS/NER probs → Biaffine → DEP
├─ ScalarMix(all) → +pred interaction features +POS+DEP probs
│ → BiLSTM(384) → MLP(42) → SRL [Viterbi]
└─ ScalarMix(all) → AttentionPool → MLP(8) → CLS
The architecture mirrors the teacher's design (see the teacher's model card for the full ScalarMix / cascade / predicate embedding rationale). The student auto-scales each head's internal dimensions proportionally to the encoder width.
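For readers without the teacher's card at hand, the ScalarMix step at the top of each head can be sketched as follows. This assumes the usual ELMo-style formulation, a softmax over one learnable scalar per layer followed by a global scale `gamma`; the card does not spell out the exact variant, and the pure-Python function is for illustration, not the repository's implementation.

```python
import math

def scalar_mix(layer_states, layer_logits, gamma=1.0):
    """Softmax-weighted sum of per-layer hidden states (ELMo-style ScalarMix).

    layer_states: [num_layers][seq][hidden] nested lists
    layer_logits: [num_layers] learnable scalars, one per layer
    gamma: learnable global scale
    """
    peak = max(layer_logits)
    exps = [math.exp(x - peak) for x in layer_logits]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]

    seq_len = len(layer_states[0])
    hidden = len(layer_states[0][0])
    mixed = [[0.0] * hidden for _ in range(seq_len)]
    for w, states in zip(weights, layer_states):
        for t in range(seq_len):
            for d in range(hidden):
                mixed[t][d] += gamma * w * states[t][d]
    return mixed, weights

# Toy example: 3 layers, 2 tokens, hidden size 2.
states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
mixed, weights = scalar_mix(states, layer_logits=[0.0, 0.0, 0.0])
print(weights)
print(mixed)
```

In the diagram each head has its own ScalarMix, which would let every task weight the encoder layers differently.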
Training
This model is a distilled student of kniv-deberta-nlp-base-en-large.
The training pipeline is two-stage:
Stage 1: Distillation (~8 epochs). The student learns from the teacher's soft logits (KL on POS/NER/SRL/CLS, hard CE for DEP arc/relation), the teacher's intermediate hidden states (PKD-style MSE matching), and consistency regularization (R-Drop).
Stage 2: Fine-tune. First, SRL fine-tuning on gold plus teacher silver supervision (Stage 2a); then CLS fine-tuning with the encoder and SRL components frozen, to preserve the SRL peak reached in Stage 2a.
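The three distillation signals named in Stage 1 can be sketched on toy vectors as below. The temperature, the T² scaling, and the equal weighting of the terms are assumptions chosen for illustration; the card gives no hyperparameters, and the repository's actual losses may differ.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distill_kl(student_logits, teacher_logits, temperature=2.0):
    """Soft-label loss: KL(teacher || student) at a shared temperature, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

def hidden_mse(student_hidden, teacher_hidden):
    """Hidden-state matching: mean squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)) / len(student_hidden)

def r_drop(logits_a, logits_b):
    """R-Drop: symmetric KL between two dropout-perturbed forward passes."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]
loss = (distill_kl(student, teacher)
        + hidden_mse([0.1, 0.2], [0.0, 0.25])
        + r_drop(student, [2.9, 1.6, 0.3]))
print(f"combined toy loss: {loss:.4f}")
```

Since the teacher encoder is 1024d and the student 768d, a real hidden-state loss presumably needs a projection between the two widths before the MSE; that step is omitted here.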
See docs/design-knowledge-distillation.md for the distillation methodology in full.
Limitations
- English only. Encoder and training data are English.
- Same data caveats as teacher. NER trained on silver labels (SpanMarker); CLS trained on dialog data (may misclassify news/documents); SRL requires predicate index; DEP uses greedy decoding (no MST).
- Requires transformers==5.6.2. Other versions produce incorrect outputs.
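To make the greedy-decoding caveat in the list above concrete, the sketch below picks each token's head independently and then checks the result for cycles, which an MST decoder would rule out by construction. The score layout (`arc_scores[dependent][head]`, with an artificial ROOT at index 0) is an assumption; the card does not document the shape of the `arc` output.

```python
def greedy_heads(arc_scores):
    """Pick the highest-scoring head for each token independently.

    arc_scores[i][j] is the score of token j being the head of token i.
    Index 0 is an artificial ROOT; real tokens are 1..n.
    Returns the chosen heads for tokens 1..n.
    """
    heads = []
    for dep in range(1, len(arc_scores)):
        candidates = [h for h in range(len(arc_scores)) if h != dep]  # no self-loops
        heads.append(max(candidates, key=lambda h: arc_scores[dep][h]))
    return heads

def has_cycle(heads):
    """Check whether following head pointers from any token loops back on itself."""
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node - 1]
    return False

# Toy scores for ROOT + 3 tokens (row 0 is unused because ROOT has no head).
scores = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 2.0, 0.3],   # token 1 prefers token 2
    [3.0, 0.2, 0.0, 0.1],   # token 2 prefers ROOT
    [0.1, 0.4, 1.5, 0.0],   # token 3 prefers token 2
]
heads = greedy_heads(scores)
print(heads, has_cycle(heads))
```

A check like `has_cycle` is one way for downstream code to flag sentences where the greedy output is not a well-formed tree.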
Model Family
| Model | Encoder | Params | Compression | SRL F1 |
|---|---|---|---|---|
| kniv-deberta-nlp-base-en-xsmall | DeBERTa-v3-xsmall (384d, 12L) | 74.7M | 5.9× | 0.829 |
| kniv-deberta-nlp-base-en-small | DeBERTa-v3-small (768d, 6L) | 157.1M | 2.8× | 0.831 |
| kniv-deberta-nlp-base-en-large | DeBERTa-v3-large (1024d, 24L) | 443M | — | 0.843 |
Citation
@misc{kniv-cascade-2026,
title={kniv-deberta-nlp-base-en-small: Distilled Multi-Task NLP Cascade},
author={Dragonscale Industries Inc.},
year={2026},
url={https://huggingface.co/dragonscale-ai/kniv-deberta-nlp-base-en-small}
}
License
CC-BY-SA-4.0
Built by Dragonscale Industries Inc. | Rustic
Model tree for dragonscale-ai/kniv-deberta-nlp-base-en-small
Base model
microsoft/deberta-v3-small
Datasets used to train dragonscale-ai/kniv-deberta-nlp-base-en-small
dragonscale-ai/kniv-corpus-en