Avey-D MoE 1B
An attention-free causal language model with Mixture-of-Experts, trained from scratch.
Model Details
| Property | Value |
|---|---|
| Architecture | Avey-D MoE (attention-free, interleaved Static/Dynamic layers) |
| Total Parameters | 1.01B |
| Active Parameters | 205M (per token) |
| Hidden Dimension | 640 |
| Layers | 20 (10 Static MoE + 10 Dynamic dense) |
| Experts | 32 routed + 1 shared, top-4 gating |
| Context Length | 2048 (chunk_size=256, k=3 retrieved chunks) |
| Vocabulary | 50,368 tokens |
| Training Data | FineWeb 10BT sample (~1.3B tokens seen) |
| Training Hardware | 1x AMD Instinct MI300X |
| Training Time | ~4.4 hours |
| Final Train Loss | 4.17 |
| Best Val Loss | 4.23 |
| MFU | 43.6% |
| Throughput | ~86,500 tok/s |
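As a consistency check, the figures in the table can be combined with the step count and batch size listed under Training Configuration. The sketch below is plain arithmetic on the reported numbers; the variable names are illustrative, and the derived values are approximations rather than measurements.

```python
# Figures quoted in this card (throughput is approximate, as reported).
steps = 5_000
tokens_per_step = 262_144      # batch size in tokens per optimizer step
seq_length = 2_048
chunk_size = 256
throughput_tok_s = 86_500
total_params = 1.01e9
active_params = 205e6

# Derived quantities.
tokens_seen = steps * tokens_per_step
hours_at_throughput = tokens_seen / throughput_tok_s / 3600
sequences_per_step = tokens_per_step // seq_length
chunks_per_context = seq_length // chunk_size
active_fraction = active_params / total_params

print(f"tokens seen:         {tokens_seen:,}")
print(f"hours at throughput: {hours_at_throughput:.2f}")
print(f"sequences per step:  {sequences_per_step}")
print(f"chunks per context:  {chunks_per_context}")
print(f"active fraction:     {active_fraction:.1%}")
```

The token count matches the ~1.3B tokens quoted above. At the reported throughput, the 5,000 steps account for roughly 4.2 hours, slightly under the ~4.4 hours of training time in the table; the card does not break down the remainder.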
Architecture
Avey-D replaces self-attention with two types of interleaved layers, supported by a chunk retriever:
- Static Layers (MoE): Learned causal spatial projection for token mixing + MoE Enricher/Fuser (32 routed experts, top-4 gating, 1 shared expert)
- Dynamic Layers (Dense): Cosine-similarity token mixing + dense Enricher/Fuser
- CausalRanker: Neural compression retrieving k=3 most recent preceding chunks
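To give a rough intuition for the cosine-similarity token mixing named in the Dynamic-layer bullet, here is a minimal pure-Python sketch. It is not the model's implementation: the function names, the absence of any learned projection or normalization, and the plain similarity-weighted sum are all assumptions made for illustration. The only property it is meant to show is that each position mixes information from itself and earlier positions only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon guards against zero vectors


def causal_cosine_mix(tokens):
    """Mix each token with itself and its predecessors, weighted by cosine similarity.

    Hypothetical helper for illustration; the real layer may differ in every detail.
    """
    mixed = []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1]  # causal: positions after i are never read
        weights = [cosine(query, key) for key in visible]
        out = [0.0] * len(query)
        for w, key in zip(weights, visible):
            for d, value in enumerate(key):
                out[d] += w * value
        mixed.append(out)
    return mixed


example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_cosine_mix(example):
    print([round(x, 3) for x in row])
```

Because the weights come from the inputs themselves rather than from a fixed learned matrix, the mixing pattern changes with the content of the sequence, which is presumably why these layers are called Dynamic.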
Expert dispatch uses batched torch.bmm via StackedExperts for efficient GPU utilization.
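The routing described for the Static layers (32 routed experts, top-4 gating, 1 shared expert) can be sketched in a few lines. This is a simplified per-token loop in plain Python, not the batched torch.bmm dispatch the model uses. The softmax over only the selected logits, the always-on shared expert being added without a gate, and the toy scalar experts are assumptions for illustration.

```python
import math

NUM_EXPERTS = 32
TOP_K = 4


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    selected = ranked[:top_k]
    # Assumption: gate weights are a softmax over the selected logits only.
    peak = max(router_logits[e] for e in selected)
    exps = [math.exp(router_logits[e] - peak) for e in selected]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(selected, exps)]


def moe_output(x, router_logits, experts, shared_expert):
    """Shared expert output plus the gate-weighted sum of the routed experts."""
    out = shared_expert(x)
    for expert_index, weight in route(router_logits):
        out += weight * experts[expert_index](x)
    return out


# Toy setup: expert e multiplies its scalar input by e; the shared expert is the identity.
experts = [lambda x, scale=e: scale * x for e in range(NUM_EXPERTS)]
logits = [0.0] * NUM_EXPERTS
for favoured in (3, 7, 10, 20):
    logits[favoured] = 2.0

print(route(logits))
print(moe_output(1.0, logits, experts, shared_expert=lambda x: x))
```

Only 4 of the 32 routed experts run for any given token, which is how the active parameter count stays far below the total.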
Usage
import torch
from transformers import AutoTokenizer
# Load model (requires the avey_d module)
from avey_d.modeling_moe import AveyDecoderMoEForCausalLM
model = AveyDecoderMoEForCausalLM.from_pretrained("yashmarathe/avey-d-moe-1b")
tokenizer = AutoTokenizer.from_pretrained("yashmarathe/avey-d-moe-1b")
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
prompt = "The capital of France is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100, temperature=0.8, top_k=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
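The `temperature` and `top_k` arguments above control how the next token is drawn. The card does not document the model's `generate` internals, so the following is only a sketch of the conventional meaning of these two parameters, assuming the custom `generate` follows it; `sample_top_k` is a hypothetical helper, not part of the `avey_d` module.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from the top_k highest logits after temperature scaling.

    temperature must be positive; lower values concentrate probability on the best tokens.
    """
    k = min(top_k, len(logits))
    candidates = sorted(range(len(logits)),
                        key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
rng = random.Random(0)
print([sample_top_k(toy_logits, top_k=2, rng=rng) for _ in range(10)])
```

With `top_k=2`, only the two highest-scoring ids (1 and 3 in the toy example) can ever be returned; with `top_k=1` the procedure reduces to greedy decoding.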
Training Configuration
model: d_embed=640, n_layers=20, num_experts=32, top_k=4, shared_expert=true
training: 5000 steps, batch=262144 tok/step, seq_length=2048
optimizer: AdamW, max_lr=3e-4, cosine schedule, warmup=500 steps
dataset: HuggingFaceFW/fineweb sample-10BT
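The optimizer line implies a learning rate that ramps up over 500 steps and then decays along a cosine curve. A minimal sketch of such a schedule follows; the linear shape of the warmup, the final learning rate of 0, and decay ending exactly at step 5000 are assumptions, since the configuration above does not state them.

```python
import math

MAX_LR = 3e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 5_000
MIN_LR = 0.0  # assumption: the card does not give a floor


def lr_at(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup, reaching MAX_LR on the last warmup step.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for step in (0, 499, 500, 2750, 4999):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the learning rate peaks at 3e-4 at the end of warmup and is at half that value midway through the decay phase (step 2750).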
Limitations
This is a small-scale experimental model trained on limited data (~1.3B tokens). It is not intended for production use. The model may generate incoherent, incorrect, or biased text.
Citation
@inproceedings{2026aveyb,
title={Avey-B},
author={Acharya, Devang and Hammoud, Mohammad},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026}
}