Ouro-hybrid-1.4B

Ouro-hybrid-1.4B is a research language model distilled from ByteDance/Ouro-1.4B. It was trained as a hybrid/student causal language model using Stage-2 knowledge distillation on an OpenThoughts-derived continuation set.

This release is intended to support research on efficient distillation, hybrid attention/recurrence designs, and long-context student models. It is not intended for production deployment without independent evaluation.

Model Details

  • Model name: Ouro-hybrid-1.4B
  • Organization: chili-lab
  • Base/teacher model: ByteDance/Ouro-1.4B
  • Student architecture: gdn_v4
  • Task: causal language modeling / text generation
  • Precision: bfloat16 weights
  • Parameters: approximately 1.4B
  • Context used during distillation: 32,768 tokens
  • Maximum positions in config: 65,536
  • Tokenizer: included with this repository
  • Research status: experimental

Distillation Setup

The model was trained from a student initialization checkpoint and distilled against ByteDance/Ouro-1.4B using a temperature-softened top-k KL divergence loss. The relevant Stage-2 training configuration was:

  • Training stage: Stage 2
  • Dataset cache: openthoughts3_50k_ctx32768_rowwise
  • Max steps: 4,000
  • Batch size: 8
  • Micro batch size: 1
  • Sequence length: 32,768
  • Learning rate: 7e-6
  • Attention learning rate: 1e-4
  • Scheduler: constant
  • Gradient clipping: 1.0
  • Gradient checkpointing: enabled
  • KD temperature: 2.0
  • KD top-k: 512 teacher tokens, renormalized within the top-k set

The KD schedule switched from uniform weighting over the intermediate-step targets to a final-step-only target at training step 1,500, with no transition period:

kd_schedule:
  type: uniform_to_final
  switch_steps: 1500
  transition_steps: 0
  initial_weights: [0.25, 0.25, 0.25, 0.25]
  final_weights: [0.0, 0.0, 0.0, 1.0]
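
One way to read this schedule is as a function from training step to per-step KD weights. The sketch below is an interpretation, not the release's implementation: the function name kd_step_weights is hypothetical, the switch is assumed to take effect at step 1,500 inclusive, and a nonzero transition_steps is assumed to mean linear interpolation.

```python
def kd_step_weights(step, switch_steps=1500, transition_steps=0,
                    initial=(0.25, 0.25, 0.25, 0.25),
                    final=(0.0, 0.0, 0.0, 1.0)):
    """Return the KD weight for each student step at a given training step."""
    if step < switch_steps:
        return list(initial)
    if transition_steps <= 0 or step >= switch_steps + transition_steps:
        return list(final)
    # ASSUMPTION: linear interpolation between the two weight vectors.
    alpha = (step - switch_steps) / transition_steps
    return [(1 - alpha) * a + alpha * b for a, b in zip(initial, final)]
```

Under these assumptions, the first 1,500 of the 4,000 steps supervise all four student steps equally, and the remaining 2,500 steps supervise only the final one.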

Architecture Notes

The released config identifies the architecture as StudentForCausalLM with model_type: student. Important config values include:

  • Hidden size: 2,048
  • Intermediate size: 5,632
  • Layers: 24
  • Attention heads: 16
  • KV heads: 16
  • Head dim: 128
  • Vocabulary size: 49,152
  • RoPE theta: 1,000,000
  • Student steps: 4
  • Sandwich norm: enabled
  • Full-attention layers kept in the training config: 7, 9, 10, 11, 12, 14
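
A quick arithmetic check on these values is sketched below. The dictionary keys follow common Hugging Face naming and are assumptions; the actual key names in config.json may differ. The KV-cache estimate further assumes that only the six full-attention layers keep a cache that grows with sequence length, that each keeps a single cache (ignoring any multiplier from the four student steps), and that values are stored in bfloat16.

```python
# Values copied from the list above; key names are assumed, not read from config.json.
config = {
    "hidden_size": 2048,
    "num_attention_heads": 16,
    "num_key_value_heads": 16,
    "head_dim": 128,
    "num_hidden_layers": 24,
}
full_attention_layers = [7, 9, 10, 11, 12, 14]

# 16 heads x 128 dims per head equals the hidden size of 2,048.
assert config["num_attention_heads"] * config["head_dim"] == config["hidden_size"]

# KV heads equal attention heads, so there is no grouped-query sharing.
uses_gqa = config["num_key_value_heads"] < config["num_attention_heads"]


def kv_cache_bytes(num_tokens, bytes_per_value=2):
    """Rough KV-cache size for the full-attention layers only (ASSUMPTION, see lead-in)."""
    per_token = (2  # one key and one value tensor
                 * len(full_attention_layers)
                 * config["num_key_value_heads"]
                 * config["head_dim"]
                 * bytes_per_value)
    return per_token * num_tokens


print(f"{kv_cache_bytes(32768) / 2**30:.2f} GiB at 32,768 tokens")
```

Under these assumptions, the cache costs 49,152 bytes per token, or 1.5 GiB per sequence at the 32,768-token distillation context.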

Because this is a custom student architecture, loading may require the matching research code that defines StudentForCausalLM and the student model type. The standard tokenizer/config artifacts are included to make reproduction and analysis easier.

Included Files

This repository includes the converted Hugging Face artifacts:

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • vocab.json
  • merges.txt
  • chat_template.jinja

Example

The exact loading path depends on having the custom student model code available in your environment. A typical research loading flow is:

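from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "chili-lab/Ouro-hybrid-1.4B"

# trust_remote_code=True allows transformers to run custom model code shipped with the repo, if any.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # use the dtype recorded for the checkpoint (bfloat16 here)
    device_map="auto",    # requires the accelerate package
    trust_remote_code=True,
)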

If your environment does not know the student model type, install or import the corresponding research implementation before calling AutoModelForCausalLM.from_pretrained.
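
Before attempting a full load, it can help to inspect a locally downloaded config.json and see which fields transformers will use to resolve the model class. The helper below is a minimal sketch: describe_config is a hypothetical name, and the path passed to it is whatever local snapshot directory you are using.

```python
import json
from pathlib import Path


def describe_config(config_path):
    """Summarize the config fields that determine how the model class is resolved."""
    cfg = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "model_type": cfg.get("model_type"),
        "architectures": cfg.get("architectures", []),
        # auto_map, when present, points at custom code files shipped in the repo.
        "has_auto_map": "auto_map" in cfg,
    }
```

If has_auto_map is false and the reported model_type is not registered in your installed transformers version, the research implementation has to be installed or imported first, as noted above.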

Intended Use

This model is intended for:

  • research on language model distillation;
  • analysis of hybrid student architectures;
  • long-context training and evaluation experiments;
  • reproducibility comparisons against the ByteDance/Ouro-1.4B teacher.

Limitations

  • This is an experimental research checkpoint.
  • The model has not been safety tuned for deployment.
  • The model may inherit limitations, biases, or unsafe behaviors from the teacher model and training data.
  • The custom architecture may require local research code for loading and inference.
  • Long-context behavior should be independently evaluated before use.

Citation and Attribution

This model is distilled from ByteDance/Ouro-1.4B. Please cite or acknowledge the original Ouro model where appropriate, along with any research artifacts from this release.

License

This checkpoint is released for research purposes. Users are responsible for checking and complying with the license terms of the base model, training data, and any associated research code before use or redistribution.
