Libraries

How to use chili-lab/Ouro-hybrid-1.4B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="chili-lab/Ouro-hybrid-1.4B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

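# Load model directly
# Note: the model card describes a custom architecture (model_type "student"),
# so this call may need the matching research code; see the Example section.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("chili-lab/Ouro-hybrid-1.4B", dtype="auto")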

Local Apps

vLLM

How to use chili-lab/Ouro-hybrid-1.4B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "chili-lab/Ouro-hybrid-1.4B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "chili-lab/Ouro-hybrid-1.4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
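
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and it assumes the server started successfully on the default port 8000 (for the SGLang commands on this page, pass `base_url="http://localhost:30000"` instead).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="chili-lab/Ouro-hybrid-1.4B"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first returned choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("What is the capital of France?"))` should return the model's reply as a string.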

SGLang

How to use chili-lab/Ouro-hybrid-1.4B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "chili-lab/Ouro-hybrid-1.4B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "chili-lab/Ouro-hybrid-1.4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "chili-lab/Ouro-hybrid-1.4B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "chili-lab/Ouro-hybrid-1.4B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner

How to use chili-lab/Ouro-hybrid-1.4B with Docker Model Runner:

```
docker model run hf.co/chili-lab/Ouro-hybrid-1.4B
```

Ouro-hybrid-1.4B

Ouro-hybrid-1.4B is a research language model distilled from ByteDance/Ouro-1.4B. It was trained as a hybrid/student causal language model using Stage-2 knowledge distillation on an OpenThoughts-derived continuation set.

This release is intended to promote research on efficient distillation, hybrid attention/recurrence designs, and long-context student models. It is not intended for production deployment without independent evaluation.

Model Details

Model name: Ouro-hybrid-1.4B
Organization: chili-lab
Base/teacher model: ByteDance/Ouro-1.4B
Student architecture: gdn_v4
Task: causal language modeling / text generation
Precision: bfloat16 weights
Parameters: approximately 1.4B
Context used during distillation: 32,768 tokens
Maximum positions in config: 65,536
Tokenizer: included with this repository
Research status: experimental
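
As a rough guide to the footprint implied by the figures above, the weights alone can be estimated from the parameter count and precision. This is a back-of-the-envelope sketch, assuming exactly 1.4e9 parameters stored at 2 bytes each (bfloat16); the real count is only approximate, and activations, caches, and framework overhead are not included.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed to hold the weights, in GiB (2 bytes per bfloat16 value)."""
    return num_params * bytes_per_param / 2**30


# Assumption: "approximately 1.4B" is taken as exactly 1.4e9 parameters.
approx_gib = weight_memory_gib(1.4e9)
print(f"bfloat16 weights: about {approx_gib:.1f} GiB")
```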

Distillation Setup

The model was trained from a student initialization checkpoint and distilled against ByteDance/Ouro-1.4B using a temperature-softened KL divergence restricted to the teacher's top-k tokens. The relevant Stage-2 training configuration was:

Training stage: Stage 2
Dataset cache: openthoughts3_50k_ctx32768_rowwise
Max steps: 4,000
Batch size: 8
Micro batch size: 1
Sequence length: 32,768
Learning rate: 7e-6
Attention learning rate: 1e-4
Scheduler: constant
Gradient clipping: 1.0
Gradient checkpointing: enabled
KD temperature: 2.0
KD top-k: 512 teacher tokens, renormalized within the top-k set
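
To make the KD temperature and KD top-k settings concrete, here is a minimal pure-Python sketch of a softened top-k KL term for a single token position. It is an illustration, not the training code: the function name `topk_kd_loss`, the KL direction (teacher to student), and renormalizing the student within the same top-k set are all assumptions, and no temperature-squared scaling is applied.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def topk_kd_loss(teacher_logits, student_logits, k=512, temperature=2.0):
    """KL(teacher || student) over the teacher's top-k tokens at one position.

    Assumption: both distributions are softened by `temperature` and
    renormalized within the teacher's top-k token set.
    """
    k = min(k, len(teacher_logits))
    # Indices of the k tokens the teacher scores highest.
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    log_p = log_softmax([teacher_logits[i] / temperature for i in top])
    log_q = log_softmax([student_logits[i] / temperature for i in top])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
```

The loss is zero when the student matches the teacher on the selected tokens and positive otherwise; tokens outside the teacher's top-k set do not contribute.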

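The KD schedule starts with uniform weights over the four per-step targets and switches to the final-step target alone at step 1,500. Because transition_steps is 0, the switch is immediate rather than gradual: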

kd_schedule:
  type: uniform_to_final
  switch_steps: 1500
  transition_steps: 0
  initial_weights: [0.25, 0.25, 0.25, 0.25]
  final_weights: [0.0, 0.0, 0.0, 1.0]
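
One plausible reading of this schedule is sketched below. The function name `kd_step_weights`, the choice to switch exactly at step `switch_steps`, and the linear interpolation used when `transition_steps` is positive are assumptions; the released configuration only fixes the values shown above.

```python
def kd_step_weights(step, switch_steps=1500, transition_steps=0,
                    initial_weights=(0.25, 0.25, 0.25, 0.25),
                    final_weights=(0.0, 0.0, 0.0, 1.0)):
    """Return the per-step KD weights to use at a given training step."""
    if step < switch_steps:
        return list(initial_weights)
    if transition_steps <= 0 or step >= switch_steps + transition_steps:
        return list(final_weights)
    # Assumed behaviour for a non-zero transition: linear interpolation.
    alpha = (step - switch_steps) / transition_steps
    return [(1 - alpha) * a + alpha * b
            for a, b in zip(initial_weights, final_weights)]


def combined_kd_loss(per_step_losses, step):
    """Weighted sum of the per-step KD losses under the schedule."""
    weights = kd_step_weights(step)
    return sum(w * loss for w, loss in zip(weights, per_step_losses))
```

Under this reading, all four targets contribute equally for the first 1,500 of the 4,000 training steps, and only the last target contributes for the remaining 2,500.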

Architecture Notes

The released config identifies the architecture as StudentForCausalLM with model_type: student. Important config values include:

Hidden size: 2,048
Intermediate size: 5,632
Layers: 24
Attention heads: 16
KV heads: 16
Head dim: 128
Vocabulary size: 49,152
RoPE theta: 1,000,000
Student steps: 4
Sandwich norm: enabled
Full-attention layers kept (from the training config): 7, 9, 10, 11, 12, 14
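
The sketch below lays out the hybrid layer pattern implied by these values and estimates the key/value cache for the full-attention layers at the 32,768-token distillation context. Several things here are assumptions rather than facts from the config: layer indices are treated as 0-based, the remaining layers are assumed to keep no per-token cache, the cache is assumed to be stored in bfloat16, and any effect of the four student steps on cache size is ignored.

```python
NUM_LAYERS = 24
FULL_ATTENTION_LAYERS = {7, 9, 10, 11, 12, 14}  # assumed 0-based indices
KV_HEADS = 16
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: bfloat16 cache


def layer_kinds(num_layers=NUM_LAYERS, full=FULL_ATTENTION_LAYERS):
    """Label each layer as full attention or not, in layer order."""
    return ["full_attention" if i in full else "non_full_attention"
            for i in range(num_layers)]


def kv_cache_bytes(num_tokens, num_attention_layers):
    """Bytes for keys plus values across the given number of attention layers."""
    per_token_per_layer = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return num_tokens * num_attention_layers * per_token_per_layer


kinds = layer_kinds()
hybrid_bytes = kv_cache_bytes(32768, kinds.count("full_attention"))
dense_bytes = kv_cache_bytes(32768, NUM_LAYERS)
print(hybrid_bytes / 2**30, dense_bytes / 2**30)
```

Under these assumptions the six full-attention layers need 1.5 GiB of cache per 32,768-token sequence, compared with 6 GiB if all 24 layers used full attention.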

Because this is a custom student architecture, loading may require the matching research code that defines StudentForCausalLM and the student model type. The standard tokenizer/config artifacts are included to make reproduction and analysis easier.

Included Files

This repository includes the converted Hugging Face artifacts:

model.safetensors
config.json
generation_config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
vocab.json
merges.txt
chat_template.jinja

Example

The exact loading path depends on having the custom student model code available in your environment. A typical research loading flow is:

from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "chili-lab/Ouro-hybrid-1.4B"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

If your environment does not know the student model type, install or import the corresponding research implementation before calling AutoModelForCausalLM.from_pretrained.
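
Before loading, it can help to inspect config.json and see which route applies. The sketch below is a hypothetical helper (`loading_hint` is not part of any library): it checks for an `auto_map` entry, which Transformers uses to locate modeling code shipped inside a repository. Whether this repository's config contains such an entry is not stated here, so both example dictionaries are illustrative.

```python
def loading_hint(config):
    """Suggest a loading route from a parsed config.json dictionary."""
    auto_map = config.get("auto_map") or {}
    if "AutoModelForCausalLM" in auto_map:
        # The repo ships modeling code; trust_remote_code=True can fetch it.
        return "remote_code"
    # Otherwise the model type must be built into Transformers or
    # registered locally before from_pretrained is called.
    return "builtin_or_local_code"


# Values taken from the architecture notes; no auto_map assumed.
released_like = {"model_type": "student",
                 "architectures": ["StudentForCausalLM"]}

# Hypothetical variant with an auto_map entry (the module path is invented).
with_auto_map = dict(released_like, auto_map={
    "AutoModelForCausalLM": "modeling_student.StudentForCausalLM"})

print(loading_hint(released_like), loading_hint(with_auto_map))
```

For the local route, Transformers provides `AutoConfig.register` and `AutoModelForCausalLM.register` for associating a model type with locally defined config and model classes; the class names to pass depend on the research code and are not specified here.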

Intended Use

This model is intended for:

research on language model distillation;
analysis of hybrid student architectures;
long-context training and evaluation experiments;
reproducibility comparisons against the ByteDance/Ouro-1.4B teacher.

Limitations

This is an experimental research checkpoint.
The model has not been safety tuned for deployment.
The model may inherit limitations, biases, or unsafe behaviors from the teacher model and training data.
The custom architecture may require local research code for loading and inference.
Long-context behavior should be independently evaluated before use.

Citation and Attribution

This model is distilled from ByteDance/Ouro-1.4B. Please cite or acknowledge the original Ouro model where appropriate, along with any research artifacts from this release.

License

This checkpoint is released for research purposes. Users are responsible for checking and complying with the license terms of the base model, training data, and any associated research code before use or redistribution.