ChessLM Qwen3 - Neuron Traced for AWS Trainium/Inferentia (Continuous Batching)

This is a Neuron-traced version of karanps/ChessLM_Qwen3, optimized for AWS Trainium (trn1) and Inferentia (inf2) instances and compiled for serving with vLLM with continuous batching enabled.

Model Details

  • Base Model: Qwen3-8B fine-tuned for chess
  • Compilation: optimum-neuron[vllm]==0.3.0
  • Compiler Version: neuronxcc 2.21.33363.0
  • Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
  • Precision: BF16
  • Tensor Parallelism: 2 cores
  • Batch Size: 4 (continuous batching enabled)
  • Max Sequence Length: 2048
  • On-Device Sampling: Disabled (due to a Neuron runtime limitation with TP=2)

Requirements

pip install "optimum-neuron[vllm]==0.3.0"
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com

Usage

Loading the Model

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the traced model
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
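
Serving with vLLM

Since the trace targets vLLM, the model can also be run through vLLM's offline LLM API. The sketch below follows the Neuron backend pattern from the AWS documentation; the device, parallelism, and batching arguments mirror the trace configuration above, but exact argument support varies by vLLM version, so treat this as an illustrative sketch rather than a verified recipe.

from vllm import LLM, SamplingParams

# Neuron backend sketch: max_num_seqs, max_model_len, and tensor_parallel_size
# mirror the traced batch size, sequence length, and core count.
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium",
    device="neuron",
    tensor_parallel_size=2,
    max_num_seqs=4,
    max_model_len=2048,
)

# Continuous batching lets vLLM schedule these prompts together.
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=20)
for output in llm.generate(["e2e4", "d2d4 d7d5"], sampling):
    print(output.outputs[0].text)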

Hardware Requirements

  • AWS Trainium (trn1.2xlarge, trn1.32xlarge) or Inferentia (inf2) instances
  • At least 2 Neuron cores (matching the tensor parallelism configured during tracing)
  • At least 32 GB of host memory recommended
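
If the instance exposes more Neuron cores than the trace uses, the runtime can be pinned to two cores before the model loads. A minimal sketch using the documented NEURON_RT_NUM_CORES runtime variable:

import os

# Pin the Neuron runtime to the 2 cores the model was traced for.
# Must be set before the Neuron runtime initializes (i.e., before loading).
os.environ["NEURON_RT_NUM_CORES"] = "2"

from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium")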

Compilation Details

This model was traced with the following parameters:

  • batch_size=4
  • sequence_length=2048
  • num_cores=2
  • auto_cast_type="bf16"
  • continuous_batching=True
  • vLLM-compatible compilation
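
An equivalent trace can be reproduced from the base model with optimum-neuron's export API. The sketch below assumes the standard export=True interface of NeuronModelForCausalLM; the continuous_batching keyword in particular is an assumption and may be spelled differently in your optimum-neuron version.

from optimum.neuron import NeuronModelForCausalLM

# Re-trace sketch: batch_size, sequence_length, num_cores, and auto_cast_type
# are the parameters listed above; continuous_batching is an assumed keyword.
model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
    continuous_batching=True,  # assumed; verify against your optimum-neuron docs
)
model.save_pretrained("chesslm_qwen3_neuron")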

Continuous Batching

This model is compiled with continuous batching enabled, which allows vLLM to:

  • Process multiple requests simultaneously with dynamic batch sizes up to 4
  • Optimize throughput by batching requests with different sequence lengths
  • Reduce latency for concurrent inference workloads

Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
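
Because sampling happens on the host, the standard transformers generation arguments apply unchanged. Continuing the snippet from the Usage section:

# Host-side sampling: ordinary generate() kwargs, no on-device sampler needed
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))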

Compilation Metrics

  • Total compilation time: ~8.1 minutes
  • Token generation graph: 219 seconds
  • Context encoding graph: 165 seconds
  • Compiled model size: 17 GB

License

This model inherits the license from the base model karanps/ChessLM_Qwen3.

Citation

If you use this model, please cite the original ChessLM model and AWS Neuron tools.
