🤖 caca-1M

A Modern Transformer Architecture with Advanced Features

3,524,608 parameters • 3.52M • 6 layers • 1,024 tokens


⚠️ IMPORTANT: Untrained Model

⚠️ WARNING: This model has not been trained. Its weights are still randomly initialized, so its output is random and meaningless.

Model status:

🔴 Untrained - weights are still random (Kaiming/Xavier init)
🟡 For research and experiments - the architecture is ready and only needs training
🟢 Production-ready architecture - tested and optimized

The widget above only shows the expected input format. Once the model has been trained on a suitable dataset, the same format is expected to produce high-quality output.

🎯 What Can It Do?

✅ Can do	❌ Cannot do yet
Load model architecture	Generate meaningful text
Test forward pass	Answer questions
Measure memory & speed	Reasoning & understanding
Start training	Production deployment
Fine-tuning experiments	Real-world applications

📋 Description

CACA (Collaborative Architecture for Contextual AI) is a Large Language Model (LLM) architecture that combines best practices from several State-of-the-Art (SOTA) models such as LLaMA, GPT-4, Gemini, Qwen, and Gemma.

It is designed for compute efficiency, scalability, and high performance, with a modular, production-ready design and multimodal support (text, images, audio).

📖 About the Caca Project

Caca is an experimental open-source Indonesian LLM, built from scratch by one person, step by step. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset.

If it's useful to others, alhamdulillah. If not, it's still fun. This is an exploratory project, so if it fails, that's part of the learning process. If it succeeds, that's a bonus.

— Lyon, Creator

✨ Highlights

🧠 Hybrid Architecture - Combines the best techniques from 5+ SOTA models
🎭 Multimodal Native - Supports text, images, and audio in a single model
⚡ High Performance - Flash Attention, MoE, and modern optimizations
🌍 Indonesian-First - Developed with a focus on Bahasa Indonesia
🔓 Open Source - Transparent, reproducible, collaborative

🌟 Why Caca?

🇮🇩 Indonesian Focus - Designed with the characteristics of Bahasa Indonesia in mind
⚡ High Efficiency - GQA and Flash Attention for up to 3-5x faster inference
💾 Memory Efficient - 50% less memory for the KV cache
🔧 Modular & Extensible - Easy to customize for different use cases
🌐 Bilingual - Optimized for Indonesian and English

CACA takes a different philosophy:

✅ Fully open-source - from architecture to training code
✅ Modular & scalable - configurable from the 1M to the 1T configuration (see CACA Model Family)
✅ Resource-efficient - optimized for limited budgets
✅ Indonesian-centric - Bahasa Indonesia comes first
✅ Community-driven - open for contributions & collaborations

📈 Comparison with Other Models

Feature	LLaMA	GPT-4	Gemini	Qwen	CACA
RMSNorm	✅	❌	❌	✅	✅
RoPE	✅	❌	❌	✅	✅
GQA	✅	❌	❌	✅	✅
MoE	❌	✅	✅	❌	✅
Multimodal	❌	✅	✅	✅	✅
Flash Attention	✅	✅	✅	✅	✅
Sliding Window	❌	❌	❌	✅	✅
Attention Sinks	❌	❌	❌	❌	✅
MoD	❌	❌	❌	❌	✅
Expert Choice	❌	❌	❌	❌	✅
YARN Scaling	❌	❌	❌	✅	✅
Quantization	✅	❌	❌	✅	✅

🎯 Use Cases & Applications

✅ Suitable For

🔬 Research & Development

Transformer architecture experiments
Ablation studies
Novel training techniques
Architecture search

📚 Academic & Education

Thesis & research papers
Teaching materials
Student projects
LLM internals understanding

🚀 Base Model for Fine-tuning

Task-specific models
Domain adaptation
Instruction tuning
RLHF experiments

💡 Prototyping

Proof of concept
Feature testing
A/B testing architectures
Benchmark comparisons

❌ Not Suitable For

🚫 Production Applications - The model is untrained; output is random
🚫 Real-world Deployment - Needs training and safety alignment first
🚫 Safety-critical Systems - No safety guardrails
🚫 Direct User-facing Apps - Output is unpredictable
🚫 Commercial Use (as-is) - Must be trained first

📊 Model Specifications

Parameter	Value	Parameter	Value
Total Parameters	`3,524,608`	Vocab Size	`8,000`
Hidden Size	`128`	Intermediate Size	`512`
Num Layers	`6`	Attention Heads	`4`
KV Heads (GQA)	`2`	Head Dimension	`32`
Max Context Length	`1,024`	RoPE Base (θ)	`10,000`
Model Size (FP16)	`0.01 GB`	Formatted Size	`3.52M`
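
The headline total (3,524,608) and the 2.50M listed for caca-1M-untrained in the model family table can be sanity-checked from the values above. The sketch below assumes a LLaMA-style layout (bias-free linear layers, two RMSNorm weights per layer, one final norm); that layout and the `count_params` helper are assumptions for illustration, not taken from the CACA code. Under those assumptions, tied embeddings give about 2.50M and a separate LM head gives about 3.52M, within a few hundred parameters of the headline figure.

```python
def count_params(vocab, hidden, inter, layers, heads, kv_heads, head_dim,
                 tie_embeddings):
    """Rough parameter count for a LLaMA-style decoder (assumed: no biases)."""
    embed = vocab * hidden
    attn = hidden * heads * head_dim            # Q projection
    attn += 2 * hidden * kv_heads * head_dim    # K and V projections (GQA)
    attn += heads * head_dim * hidden           # output projection
    mlp = 3 * hidden * inter                    # gate, up, down (SwiGLU)
    norms = 2 * hidden                          # two RMSNorm weights per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tie_embeddings:
        total += vocab * hidden                 # separate LM head
    return total


tied = count_params(8000, 128, 512, 6, 4, 2, 32, tie_embeddings=True)
untied = count_params(8000, 128, 512, 6, 4, 2, 32, tie_embeddings=False)
print(tied, untied)
```

Which of the two counts a given tool reports depends on whether it counts the input embedding and the output head once or twice.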

🎯 Core Features


✅ Grouped Query Attention (GQA) - Better memory and compute efficiency
- Query heads: 4
- KV heads: 2
- Ratio: 2:1 (~50% smaller KV cache)
- Benefit: Faster inference with a smaller memory footprint
✅ Rotary Position Embeddings (RoPE) - Better generalization to long contexts
- Theta (θ): 10,000
- Supports extrapolation to contexts longer than the training length
- Benefit: Aims for stable behavior on sequence lengths not seen during training
✅ RMSNorm - More stable and cheaper to compute than LayerNorm (no mean centering)
- Epsilon: 1e-06
- Benefit: More stable training, faster inference, better gradient flow
✅ SwiGLU Activation - Gated activation that typically outperforms ReLU/GELU
- Intermediate size: 512 (4.0x hidden)
- Benefit: A more expressive feed-forward block through gating, at the cost of a third projection matrix
✅ Flash Attention 2 - Up to 3x acceleration with better memory efficiency
- Enabled automatically when available on a CUDA device
- IO-aware algorithm that minimizes HBM access
- Benefit: Much faster training and inference; supports larger batch sizes
✅ Hybrid Architecture - Combines the best techniques from 5+ SOTA models
✅ Multimodal Support - Native support for vision and audio
✅ Mixture of Experts (MoE) - Sparse activation for efficiency
✅ Long Context - Supports up to 8K+ tokens with YARN scaling (this checkpoint is configured for 1,024)
✅ Advanced Attention - Flash Attention, Sliding Window, Attention Sinks
✅ Quantization Ready - Supports 4-bit and 8-bit quantization
✅ Production Features - Extensive error handling & monitoring

🔥 Advanced Features

🎯 Attention Mechanisms

⚡ Flash Attention v2 - IO-aware algorithm, up to 3x faster than standard attention
🔑 Grouped Query Attention (GQA) - 4 Query heads : 2 KV heads
- Compression ratio: 2:1 (~50% smaller KV cache)
🚀 xFormers Support - Fallback memory-efficient attention
🎯 PyTorch SDPA - Native scaled dot product attention
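
As a rough check of the ~50% figure, the KV cache size can be estimated from the head counts. This is a back-of-the-envelope sketch: `kv_cache_bytes` is a hypothetical helper, and FP16 storage (2 bytes per value) is an assumption.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1,
                   bytes_per_value=2):
    # Two tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim].
    return 2 * layers * batch * kv_heads * seq_len * head_dim * bytes_per_value


# Full multi-head attention would cache one K/V pair per query head (4).
mha = kv_cache_bytes(layers=6, kv_heads=4, head_dim=32, seq_len=1024)
# GQA caches only the 2 shared KV heads.
gqa = kv_cache_bytes(layers=6, kv_heads=2, head_dim=32, seq_len=1024)
saving = 1 - gqa / mha
print(mha, gqa, saving)
```

The saving is simply 1 - num_kv_heads / num_heads, so it is independent of sequence length and batch size.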

📍 Position Encodings

🔄 RoPE (Rotary Position Embeddings) - Base frequency θ=10,000
- Generalizes better to long sequences than absolute PE
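
To illustrate why RoPE encodes relative position, the sketch below rotates consecutive pairs of a vector by position-dependent angles and checks that the query-key dot product depends only on the offset between positions. The interleaved pairing is an assumption (some implementations pair the first and second half of the vector instead), and the helper names are illustrative.

```python
import math


def rope_frequencies(head_dim, base=10_000.0):
    # theta_i = base^(-2i/dim) for i in 0 .. dim/2 - 1
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10_000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by the angle position * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(vec), base)):
        c, s = math.cos(position * theta), math.sin(position * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.1 * (i + 1) for i in range(32)]    # head_dim = 32, as in this model
k = [0.05 * (32 - i) for i in range(32)]

# Same relative offset (3) at two different absolute positions.
s1 = dot(apply_rope(q, 5), apply_rope(k, 2))
s2 = dot(apply_rope(q, 105), apply_rope(k, 102))
print(s1, s2)  # equal up to floating-point error
```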

🎓 Training Optimizations

💾 Gradient Checkpointing - Trade compute for memory (supports models up to 100B+ params)
🎯 Mixed Precision Training - Supports FP16, BF16, and TF32
📉 Dropout Regularization
- Hidden dropout: 0.1
- Attention dropout: 0.0
- Residual dropout: 0.1

📦 Quantization Support

4️⃣ 4-bit Quantization - NF4 & FP4 via bitsandbytes
- Memory reduction: ~75% (4GB → 1GB)
- Accuracy loss: <2% on most tasks
- Supports double quantization for maximum compression
8️⃣ 8-bit Quantization - LLM.int8() with outlier handling
- Memory reduction: ~50% (4GB → 2GB)
- Accuracy loss: <1%
🔄 Dynamic Quantization - Runtime quantization without calibration
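
The basic idea behind 8-bit weight quantization can be shown with plain symmetric (absmax) quantization. This is a deliberately simplified sketch and not what bitsandbytes does internally: LLM.int8() adds outlier handling and NF4 uses non-uniform levels. The function names are illustrative.

```python
def quantize_absmax_int8(values):
    """Symmetric per-tensor quantization onto the integer range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.0, -0.07, 0.49]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

The rounding error per weight is at most half a quantization step (`scale / 2`), which is why a single large outlier hurts: it inflates `scale` for every other weight in the tensor.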

🔬 Advanced Features

📊 Automatic Mixed Precision (AMP) - Dynamic loss scaling
🎯 Gradient Clipping - Training stability via max-norm clipping
📈 Learning Rate Scheduling - Supports cosine, linear, warmup
💡 Smart Memory Management - Auto cache clearing & monitoring
🔍 Metrics Tracking - Real-time perplexity, loss, gradient norms
🛡️ NaN/Inf Detection - Automatic recovery from numerical instability
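
A warmup plus cosine schedule, one of the options listed above, can be sketched as follows. The function name, the linear warmup shape, and the `min_lr` floor are assumptions for illustration; the scheduler in the actual training code may differ.

```python
import math


def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [lr_at(s, base_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
print(schedule[0], schedule[100], schedule[1000])
```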

🧩 Architecture Components

1️⃣ From LLaMA (Meta)

CACA adopts efficient components from LLaMA:

✓ RMSNorm                    # Cheaper normalization than LayerNorm
✓ Rotary Position Embeddings # A better positional encoding
✓ SwiGLU Activation          # Activation function with a gating mechanism
✓ Grouped-Query Attention    # Saves memory with shared K/V heads
✓ Pre-normalization          # Better training stability

RMSNorm is an estimated 30% faster than LayerNorm
RoPE lets the model extrapolate to longer contexts
GQA saves an estimated 30-40% memory compared with Multi-Head Attention (the KV cache itself halves at this model's 2:1 head ratio)
SwiGLU improves quality by an estimated 3-5% over ReLU/GELU
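
For reference, RMSNorm and a SwiGLU feed-forward block can be written in a few lines. This is a pure-Python sketch on plain lists rather than the model's tensor code; the helper names and the toy weight matrices are assumptions for illustration.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    # x * rsqrt(mean(x^2) + eps) * gamma, with no mean subtraction
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


def silu(v):
    return v / (1.0 + math.exp(-v))   # v * sigmoid(v)


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = [silu(v) for v in matvec(w_gate, x)]     # gate projection + SiLU
    up = matvec(w_up, x)                            # up projection
    hidden = [g * u for g, u in zip(gate, up)]      # elementwise gating
    return matvec(w_down, hidden)                   # down projection


x = [1.0, -2.0, 3.0, 0.5]
normed = rms_norm(x, gamma=[1.0] * 4)
mean_square = sum(v * v for v in normed) / len(normed)   # close to 1.0

w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hidden=2 -> intermediate=3
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]    # intermediate=3 -> hidden=2
y = swiglu_mlp([0.5, -1.0], w_gate, w_up, w_down)
```

Note that SwiGLU uses three weight matrices (gate, up, down) where a classic MLP uses two, so at the same intermediate size it has 1.5x the feed-forward parameters.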

2️⃣ From GPT-4 (OpenAI)

A Mixture of Experts implementation for scalability:

✓ Mixture of Experts (MoE)   # Sparse activation with multiple expert networks
✓ Top-K Router               # Routes each token to its K best experts
✓ Auxiliary Loss             # Load balancing across experts
✓ Z-Loss                     # Stabilizes router logits
✓ Expert Usage Tracking      # Monitors how often each expert is used

Input Token
    ↓
[Router] → Select Top-K Experts (e.g. K=2 of 8 experts)
    ↓
Expert_1 (weight: 0.6) + Expert_3 (weight: 0.4)
    ↓
Weighted Sum Output

Advantages:

Total parameters can grow (up to ~10x) at roughly the same compute cost per token
Each token activates only 25% of the expert parameters (with K=2, N=8)
Experts can be processed in parallel
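
The routing diagram above can be sketched in plain Python. The toy experts and logits are made up for illustration; only the softmax, top-k selection, and weight renormalization steps mirror the description.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] with weights renormalized to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=2):
    # Only the k selected experts run; all other experts are skipped.
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


experts = [lambda x, s=s: s * x for s in range(8)]   # 8 toy "experts"
logits = [0.1, 2.0, -1.0, 1.6, 0.0, -0.5, 0.3, 0.2]  # hypothetical router output
routes = top_k_route(logits, k=2)
y = moe_forward(1.0, logits, experts, k=2)
active_fraction = 2 / len(experts)                    # K / N = 0.25
print(routes, y)
```

With these logits the router picks experts 1 and 3 with weights of roughly 0.6 and 0.4, matching the diagram.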

3️⃣ From Gemini (Google)

Native multimodality with cross-modal fusion:

✓ Vision Encoder (ViT)           # Processes images with a Vision Transformer
✓ Audio Encoder (Conv1D + Trans) # Processes audio with a CNN + Transformer
✓ Cross-Attention Mechanism      # Fuses multimodal features
✓ Multiple Projector Types:
  - Linear Projector             # Simple and fast
  - MLP Projector                # Non-linear mapping
  - Perceiver Resampler          # Compresses with latent queries
  - Q-Former                     # Query-based projection (BLIP-2 style)
✓ Logit Soft-Capping             # Bounds extreme values for stability
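
Logit soft-capping is commonly written as cap * tanh(x / cap), which bounds values smoothly instead of clipping them hard. A minimal sketch, with the cap value of 50 chosen only for illustration:

```python
import math


def soft_cap(scores, cap=50.0):
    # Near-identity for |score| << cap, saturating smoothly towards +/- cap.
    return [cap * math.tanh(s / cap) for s in scores]


capped = soft_cap([1000.0, -1000.0, 1.0, 0.0])
print(capped)
```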

Multimodal flow:

[Image] → Vision Encoder → [2D patches → 1D tokens]
                              ↓
                         Projector → [Hidden dim = text dim]
                              ↓
[Text] + [Image tokens] → Cross-Attention → Fused representation

Supported formats:

Images: JPEG, PNG (224x224 default)
Audio: Mel-spectrogram (80 bins)

4️⃣ From Qwen (Alibaba)

Long context optimization:

✓ YARN Scaling                # Yet Another RoPE extensioN
✓ Dynamic Position Scaling    # Auto-adjusts for longer sequences
✓ Sliding Window Attention    # Local attention pattern
✓ Context Window 8K-128K      # Flexible context length

YARN vs Standard RoPE:

Standard RoPE:  [====] 4K context → [====????] 8K (error grows)
YARN:          [====] 4K context → [========] 8K (smooth extrapolation)

Sliding Window Mechanism:

Token 0: attends to [0]
Token 1: attends to [0, 1]
Token 2: attends to [0, 1, 2]
Token 10: attends to [0, 6, 7, 8, 9, 10]  ← sliding window = 4 previous tokens + the current token
         (token 0 is kept as an attention sink)

5️⃣ From Gemma (Google)

Optimization techniques:

✓ Layer Scale                # Learnable scaling per layer
✓ Stochastic Depth           # Random layer dropping during training
✓ Normalized Attention       # QK normalization for stability
✓ Knowledge Distillation     # Transfers knowledge from a larger model

Layer Scale formula:

output = input + gamma * layer(input)
# gamma is initialized very small (1e-5) and then learned

Stochastic Depth:

Training: each layer has a 20% chance of being skipped (drop_prob=0.2)
Inference: all layers are active
Benefit: regularization + faster training
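
Layer Scale and Stochastic Depth combine naturally in one residual block. The sketch below follows the formula above and simply returns the input when the layer is dropped. The function name is illustrative, and the common 1 / (1 - drop_prob) rescaling of the kept branch is left out as an assumption, since the description does not mention it.

```python
import random


def residual_block(x, layer_fn, gamma=1e-5, drop_prob=0.2, training=True,
                   rng=random):
    """output = x + gamma * layer(x); the branch is skipped at random in training."""
    if training and rng.random() < drop_prob:
        return list(x)                       # layer dropped: identity path only
    branch = layer_fn(x)
    return [xi + gamma * bi for xi, bi in zip(x, branch)]


def times_ten(v):
    return [10.0 * a for a in v]


# Inference: the layer always runs.
y_eval = residual_block([1.0, 2.0], times_ten, gamma=0.1, training=False)
# Training with drop_prob=1.0: the layer is always skipped.
y_drop = residual_block([1.0, 2.0], times_ten, gamma=0.1, drop_prob=1.0)
print(y_eval, y_drop)
```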

🆕 Experimental & Unique Features

A) Mixture of Depths (MoD)

Tokens can "skip" certain layers for efficiency:

class MixtureOfDepthsRouter:
    # Select the top 50% most "important" tokens to process
    capacity_factor = 0.5
    
    # Method: learned, random, or heuristic
    route_method = "learned"

Illustration:

Layer 1:  [All 100 tokens processed]
Layer 2:  [Top 50 tokens processed, 50 skipped]  ← MoD
Layer 3:  [All 100 tokens processed]
Layer 4:  [Top 50 tokens processed, 50 skipped]  ← MoD

Benefit:

30-40% faster inference with minimal accuracy drop
Dynamic computation based on token importance

Paper: Mixture-of-Depths (2024)
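
The token selection in a MoD layer can be sketched as a top-k over router scores, with skipped tokens passed through unchanged. The scores and the toy layer below are hypothetical; in the learned variant the scores would come from a small router network.

```python
def mod_layer(tokens, scores, layer_fn, capacity_factor=0.5):
    """Apply layer_fn only to the top-scoring tokens; the rest bypass the layer."""
    k = max(1, int(capacity_factor * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [layer_fn(t) if i in keep else t for i, t in enumerate(tokens)]


tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.8, 0.2]            # hypothetical router scores
out = mod_layer(tokens, scores, lambda t: t * 10)
print(out)  # tokens 0 and 2 are processed, tokens 1 and 3 skip the layer
```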

B) Attention Sinks

The first tokens always remain attendable for stability:

attention_sink_size = 4      # Keep first 4 tokens
attention_sink_window = 512  # Sliding window size

Attention Pattern:

Query Token 1000:
├─ Attend to: [0, 1, 2, 3]           ← attention sinks (always)
└─ Attend to: [488, 489, ..., 1000]  ← sliding window

Benefit:

Prevents attention collapse in long sequences
Better streaming generation
Inspired by StreamingLLM (2023)
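
The attention pattern above can be expressed as a small function that lists the key positions a query may see. It assumes the convention used in the examples, where a window of size w covers the current token plus the w tokens before it; the function name is illustrative.

```python
def allowed_positions(query_pos, window=512, sink_size=4):
    """Causal key positions: the first sink_size tokens plus the sliding window."""
    start = max(0, query_pos - window)
    sinks = range(min(sink_size, start))     # sinks not already inside the window
    return list(sinks) + list(range(start, query_pos + 1))


pos = allowed_positions(1000)                # sinks 0..3, then 488..1000
small = allowed_positions(10, window=4, sink_size=1)
print(pos[:5], pos[-1], small)
```

The second call reproduces the sliding-window illustration from the Qwen section: token 10 with a window of 4 and a single sink attends to [0, 6, 7, 8, 9, 10].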

C) Expert Choice Routing

An alternative to Top-K routing:

# Top-K: tokens choose experts
Token → Router → "I want Experts 2 and 5"

# Expert Choice: experts choose tokens
Expert 1 → "I will process Tokens 3, 7, 12, ..."
Expert 2 → "I will process Tokens 1, 5, 9, ..."

Advantages:

Better load balancing (every expert processes the same number of tokens)
More stable training (no expert collapse)
Trade-off: slightly more complex to implement
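
Expert Choice routing can be sketched by letting each expert take the top-scoring tokens from its own column of the router scores. The scores below are made up, and `expert_choice_route` is a hypothetical helper, not the project's API.

```python
def expert_choice_route(scores, capacity):
    """scores[t][e] is the router score of token t for expert e.
    Each expert picks its `capacity` highest-scoring tokens."""
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e],
                        reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


scores = [
    [0.9, 0.1],   # token 0
    [0.2, 0.8],   # token 1
    [0.7, 0.3],   # token 2
    [0.4, 0.6],   # token 3
]
assignment = expert_choice_route(scores, capacity=2)
print(assignment)
```

Every expert receives exactly `capacity` tokens, which is where the load-balancing guarantee comes from. The flip side is that a token may be picked by several experts or by none.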

D) Multi-Backend Attention

Automatic fallback for compatibility:

def select_attention_backend(has_flash_attn, has_xformers, has_sdpa, device):
    """Return the name of the fastest attention backend that is available."""
    if has_flash_attn and device == "cuda":
        return "flash_attn_func"               # fastest (2-4x speedup)
    if has_xformers and device == "cuda":
        return "memory_efficient_attention"    # fallback 1 (xFormers)
    if has_sdpa:
        return "scaled_dot_product_attention"  # fallback 2 (PyTorch 2.0+)
    return "standard_attention"                # safe fallback

Performance Comparison:

Flash Attention:     100ms  (baseline)
xFormers:           150ms  (1.5x slower)
SDPA:               180ms  (1.8x slower)
Standard:           400ms  (4x slower)

🏗️ CACA Model Family

Model	Parameters	Vocab Size	Hidden Size	Intermediate Size	Layers	Attention Heads	KV Heads	Head Dim	Max Position
caca-1M-untrained	2.50M	8,000	128	512	6	4	2	32	1,024
caca-3M-untrained	6.63M	12,000	192	768	8	6	2	32	2,048
caca-4M-untrained	4.02M	16,000	128	512	8	4	2	32	2,048
caca-6M-untrained	11.96M	16,000	256	1024	8	4	2	64	2,048
caca-10M-untrained	21.25M	20,000	320	1280	10	8	2	40	2,048
caca-15M-untrained	35.18M	24,000	384	1536	12	6	2	64	2,048
caca-25M-untrained	67.57M	28,000	512	2048	14	8	2	64	4,096
caca-35M-untrained	95.42M	32,000	576	2304	16	8	2	72	4,096
caca-50M-untrained	138.47M	32,000	640	2560	20	10	2	64	4,096
caca-75M-untrained	178.55M	32,000	768	3072	18	12	3	64	4,096
caca-100M-untrained	232.23M	32,000	768	3072	24	12	4	64	4,096
caca-150M-untrained	336.90M	32,000	1024	4096	20	16	4	64	4,096
caca-200M-untrained	458.55M	32,000	1024	4096	28	16	4	64	4,096
caca-250M-untrained	569.54M	32,000	1152	4608	28	18	3	64	8,192
caca-300M-untrained	701.64M	32,000	1280	5120	28	20	4	64	8,192
caca-400M-untrained	956.36M	32,000	1408	5632	32	22	4	64	8,192
caca-500M-untrained	1.27B	32,000	1536	6144	36	24	4	64	8,192
caca-600M-untrained	1.48B	32,000	1664	6656	36	26	4	64	8,192
caca-700M-untrained	1.71B	32,000	1792	7168	36	28	4	64	8,192
caca-800M-untrained	1.96B	32,000	1920	7680	36	30	5	64	8,192
caca-900M-untrained	2.01B	32,000	2048	8192	32	32	8	64	8,192
caca-1B-untrained	2.26B	32,000	2048	8192	36	32	8	64	8,192
caca-1.5B-untrained	2.98B	32,000	2048	8192	48	32	8	64	8,192
caca-2B-untrained	3.15B	32,000	2304	9216	40	32	8	72	8,192
caca-2.5B-untrained	3.12B	32,000	2560	10240	32	32	8	80	8,192
caca-3B-untrained	3.88B	32,000	2560	10240	40	32	8	80	8,192
caca-3.5B-untrained	4.69B	32,000	2816	11264	40	32	8	88	8,192
caca-4B-untrained	5.02B	32,000	3072	12288	36	32	8	96	8,192
caca-4.5B-untrained	5.45B	32,000	3200	12800	36	32	8	100	8,192
caca-5B-untrained	6.53B	32,000	3328	13312	40	32	8	104	8,192
caca-6B-untrained	8.31B	32,000	3584	14336	44	32	8	112	8,192
caca-7B-untrained	7.11B	32,000	4096	14336	32	32	8	128	8,192
caca-8B-untrained	7.98B	32,000	4096	14336	36	32	8	128	8,192
caca-9B-untrained	9.09B	32,000	4608	16384	32	36	9	128	8,192
caca-10B-untrained	11.23B	32,000	4608	18432	36	32	8	144	8,192
caca-12B-untrained	15.26B	32,000	5120	20480	40	40	8	128	8,192
caca-13B-untrained	13.38B	32,000	5120	13824	48	40	8	128	8,192
caca-14B-untrained	13.40B	32,000	5376	14464	44	48	8	112	8,192
caca-15B-untrained	14.90B	32,000	5632	15104	44	32	8	176	8,192
caca-18B-untrained	18.92B	32,000	6144	16384	48	48	8	128	8,192
caca-20B-untrained	20.48B	32,000	6144	16384	52	48	8	128	8,192
caca-24B-untrained	25.83B	32,000	6656	17920	56	64	8	104	8,192
caca-30B-untrained	32.24B	32,000	6656	17920	70	64	8	104	8,192
caca-35B-untrained	39.02B	32,000	8192	22016	56	64	8	128	8,192
caca-40B-untrained	44.56B	32,000	8192	22016	64	64	8	128	8,192
caca-45B-untrained	50.09B	32,000	8192	22016	72	64	8	128	8,192
caca-50B-untrained	55.63B	32,000	8192	22016	80	64	8	128	8,192
caca-60B-untrained	72.14B	32,000	8192	28672	84	64	8	128	8,192
caca-70B-untrained	68.71B	32,000	8192	28672	80	64	8	128	8,192
caca-80B-untrained	101.77B	32,000	9216	36864	84	72	8	128	8,192
caca-100B-untrained	137.32B	32,000	10240	40960	92	80	8	128	8,192
caca-120B-untrained	173.10B	32,000	11264	45056	96	88	8	128	8,192
caca-150B-untrained	214.31B	32,000	12288	49152	100	96	8	128	8,192
caca-175B-untrained	248.53B	32,000	12288	49152	116	96	8	128	8,192
caca-200B-untrained	324.80B	128,000	14336	57344	110	112	16	128	16,384
caca-250B-untrained	419.35B	128,000	15360	61440	124	120	16	128	16,384
caca-300B-untrained	507.03B	128,000	16384	65536	132	128	16	128	16,384
caca-350B-untrained	591.18B	128,000	16384	65536	154	128	16	128	16,384
caca-400B-untrained	675.34B	128,000	16384	65536	176	128	16	128	16,384
caca-500B-untrained	852.77B	128,000	18432	73728	176	144	16	128	16,384
caca-600B-untrained	1.07T	128,000	20480	81920	180	160	16	128	16,384
caca-700B-untrained	1.23T	128,000	21504	86016	186	168	24	128	16,384
caca-800B-untrained	1.38T	128,000	22528	90112	192	176	16	128	16,384
caca-900B-untrained	1.65T	128,000	24576	94208	198	192	24	128	16,384
caca-1T-untrained	1.75T	128,000	24576	98304	204	192	16	128	16,384

💾 Memory Requirements

Training Requirements

Configuration	Model Weights	+ Optimizer States	Total Training
FP32 (AdamW)	0.01 GB	+0.04 GB	0.06 GB
Mixed Precision	0.01 GB	+0.05 GB	0.06 GB
+ Gradient Checkpointing	Saves ~30-50% activation memory		~0.03 GB

Inference Requirements

Precision	Model Size	Total Memory	Memory Saving
FP16 / BF16	0.01 GB	0.01 GB	Baseline
INT8	0.00 GB	0.01 GB	~50% ↓
INT4 (NF4)	0.00 GB	0.00 GB	~75% ↓

💡 Note: The KV cache grows linearly with sequence length. An 8K context needs 8x the KV cache of this model's 1,024-token context.
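
Because the model is so small, the GB figures above round to almost nothing. Weight storage in megabytes follows directly from the parameter count; the sketch below ignores quantization overhead such as per-block scales, and the helper name is illustrative.

```python
def weight_megabytes(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1e6


n = 3_524_608
sizes = {name: weight_megabytes(n, bits)
         for name, bits in [("FP32", 32), ("FP16/BF16", 16),
                            ("INT8", 8), ("INT4", 4)]}
for name, mb in sizes.items():
    print(f"{name}: {mb:.2f} MB")
```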

Performance Estimates

Metric	Value	Notes
FLOPs per Token	7,049,216	Forward pass only
TFLOPs per Token	0.000007	Training (forward + backward) ≈ 3× the forward cost (~6 FLOPs per parameter)
Bandwidth (FP16)	0.01 GB/token	Memory bandwidth requirement
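
The FLOPs figure follows the usual rule of thumb of about 2 FLOPs per parameter for a forward pass and about 3x that for forward plus backward. The sketch below applies that rule and ignores the attention-score FLOPs, which depend on sequence length; the helper name is illustrative.

```python
def flops_per_token(num_params, training=False):
    forward = 2 * num_params          # one multiply and one add per weight
    # Backward is roughly 2x forward, so a training step is roughly 3x forward.
    return 3 * forward if training else forward


fwd = flops_per_token(3_524_608)
train = flops_per_token(3_524_608, training=True)
print(fwd, train, fwd / 1e12)
```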

📐 Full Architecture Structure


CACA Architecture
│
├─── 📥 INPUT PROCESSING
│    │
│    ├─── Text Input
│    │    ├─── Tokenization (BPE/WordPiece/SentencePiece)
│    │    ├─── Token Embeddings (vocab_size × hidden_size)
│    │    └─── Output: [batch_size, seq_len, hidden_size]
│    │
│    ├─── Vision Input (Optional)
│    │    ├─── Image Preprocessing (resize to 224×224)
│    │    ├─── Vision Encoder (ViT)
│    │    │    ├─── Patch Embedding (Conv2D: 14×14 patches)
│    │    │    ├─── CLS Token + Positional Embeddings
│    │    │    ├─── Vision Transformer Blocks (24 layers)
│    │    │    │    ├─── LayerNorm
│    │    │    │    ├─── Multi-Head Attention
│    │    │    │    ├─── MLP (GELU activation)
│    │    │    │    └─── Residual Connections
│    │    │    └─── Final LayerNorm
│    │    ├─── Vision Projector
│    │    │    ├─── Type: Linear / MLP / Perceiver / Q-Former
│    │    │    └─── Output: [batch_size, num_patches, hidden_size]
│    │    └─── Output: Vision embeddings aligned to text space
│    │
│    └─── Audio Input (Optional)
│         ├─── Audio Preprocessing (Mel-spectrogram, 80 bins)
│         ├─── Audio Encoder
│         │    ├─── Conv1D Layers (feature extraction)
│         │    │    ├─── Conv1D (80 → hidden_size, kernel=3)
│         │    │    └─── Conv1D (stride=2 for downsampling)
│         │    ├─── Positional Embeddings (interpolated)
│         │    ├─── Audio Transformer Blocks (12 layers)
│         │    │    ├─── LayerNorm
│         │    │    ├─── Multi-Head Attention
│         │    │    ├─── MLP (GELU activation)
│         │    │    └─── Residual Connections
│         │    └─── Final LayerNorm
│         ├─── Audio Projector
│         │    ├─── Type: Linear / MLP / Perceiver / Q-Former
│         │    └─── Output: [batch_size, audio_len, hidden_size]
│         └─── Output: Audio embeddings aligned to text space
│
├─── 🔄 MULTIMODAL FUSION
│    │
│    ├─── Early Fusion (when Cross-Attention is not used)
│    │    ├─── Concatenate: [vision_tokens + audio_tokens + text_tokens]
│    │    ├─── Update attention mask
│    │    └─── Output: Combined sequence for the decoder
│    │
│    └─── Late Fusion (when Cross-Attention is used)
│         ├─── Text tokens → Query for cross-attention
│         ├─── Vision+Audio tokens → Key/Value for cross-attention
│         └─── Fusion happens inside the decoder layers
│
├─── 🏗️ DECODER STACK (N = num_layers; 6 for this model)
│    │
│    └─── 🔁 DECODER LAYER i (repeated N times)
│         │
│         ├─── [OPTIONAL] Mixture of Depths (MoD)
│         │    ├─── Input: Hidden states [batch, seq_len, hidden]
│         │    ├─── MoD Router
│         │    │    ├─── Method: learned / random / heuristic
│         │    │    ├─── Score computation per token
│         │    │    └─── Top-K selection (K = capacity_factor × seq_len)
│         │    ├─── Process Mask Generation
│         │    │    └─── Binary mask [batch, seq_len] (1=process, 0=skip)
│         │    └─── Token Selection
│         │         ├─── Selected tokens: processed through layer
│         │         └─── Skipped tokens: bypass layer (identity)
│         │
│         ├─── 🎯 SELF-ATTENTION PATH
│         │    │
│         │    ├─── Input Normalization
│         │    │    ├─── RMSNorm (Root Mean Square Layer Normalization)
│         │    │    ├─── Formula: x * rsqrt(mean(x²) + ε) * γ
│         │    │    └─── More efficient than LayerNorm (no mean centering)
│         │    │
│         │    ├─── Attention Computation
│         │    │    │
│         │    │    ├─── Query/Key/Value Projections
│         │    │    │    ├─── Q: Linear(hidden_size → num_heads × head_dim)
│         │    │    │    ├─── K: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    ├─── V: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    └─── Reshape: [batch, seq, heads, head_dim]
│         │    │    │
│         │    │    ├─── [OPTIONAL] QK Normalization
│         │    │    │    ├─── Q = RMSNorm(Q)
│         │    │    │    └─── K = RMSNorm(K)
│         │    │    │
│         │    │    ├─── Rotary Position Embeddings (RoPE)
│         │    │    │    ├─── Compute frequencies: θ_i = base^(-2i/dim)
│         │    │    │    ├─── Position indices: t ∈ [0, seq_len)
│         │    │    │    ├─── Rotation matrix: cos(t·θ), sin(t·θ)
│         │    │    │    ├─── Apply rotation: Q, K = rotate(Q, K, cos, sin)
│         │    │    │    └─── YARN Scaling (if enabled)
│         │    │    │         ├─── Type: linear / dynamic / yarn
│         │    │    │         ├─── Scaling factor per frequency band
│         │    │    │         └─── Better extrapolation to long contexts
│         │    │    │
│         │    │    ├─── Grouped-Query Attention (GQA)
│         │    │    │    ├─── num_kv_groups = num_heads / num_kv_heads
│         │    │    │    ├─── Repeat K, V: [num_kv_heads → num_heads]
│         │    │    │    └─── Memory saving: 30-40% vs full MHA (KV cache: 1 - num_kv_heads / num_heads)
│         │    │    │
│         │    │    ├─── Attention Score Computation
│         │    │    │    ├─── scores = (Q @ K.T) / sqrt(head_dim)
│         │    │    │    ├─── Logit clamping: [-50, 50] for stability
│         │    │    │    └─── [OPTIONAL] Soft-capping
│         │    │    │         └─── scores = tanh(scores / cap) * cap
│         │    │    │
│         │    │    ├─── Attention Masking
│         │    │    │    ├─── Causal Mask (autoregressive)
│         │    │    │    ├─── Sliding Window Mask (if enabled)
│         │    │    │    │    ├─── Window size (e.g. 512 tokens)
│         │    │    │    │    └─── Attend only to the nearest window
│         │    │    │    ├─── Attention Sinks (if enabled)
│         │    │    │    │    ├─── Always attend to first K tokens
│         │    │    │    │    ├─── Prevent attention collapse
│         │    │    │    │    └─── Better streaming generation
│         │    │    │    └─── [OPTIONAL] ALiBi Bias
│         │    │    │         ├─── Linear bias based on distance
│         │    │    │         └─── Alternative/complement to RoPE
│         │    │    │
│         │    │    ├─── Backend Selection (automatic fallback)
│         │    │    │    ├─── 1️⃣ Flash Attention 2 (PREFERRED)
│         │    │    │    │    ├─── Requirements: CUDA + FP16/BF16
│         │    │    │    │    ├─── Speedup: 2-4x faster
│         │    │    │    │    ├─── Memory: 10-20x less
│         │    │    │    │    ├─── Sliding window support
│         │    │    │    │    └─── IO-aware algorithm
│         │    │    │    ├─── 2️⃣ xFormers Memory Efficient (FALLBACK 1)
│         │    │    │    │    ├─── Requirements: CUDA
│         │    │    │    │    ├─── Block-sparse attention
│         │    │    │    │    └─── Custom attention patterns
│         │    │    │    ├─── 3️⃣ PyTorch SDPA (FALLBACK 2)
│         │    │    │    │    ├─── Requirements: PyTorch 2.0+
│         │    │    │    │    ├─── Built-in scaled_dot_product_attention
│         │    │    │    │    └─── Hardware-agnostic
│         │    │    │    └─── 4️⃣ Standard Attention (SAFE FALLBACK)
│         │    │    │         ├─── Pure PyTorch implementation
│         │    │    │         ├─── Always available
│         │    │    │         └─── Slower but stable
│         │    │    │
│         │    │    ├─── Softmax + Dropout
│         │    │    │    ├─── attn_weights = softmax(scores, dim=-1)
│         │    │    │    └─── attn_weights = dropout(attn_weights)
│         │    │    │
│         │    │    ├─── Value Aggregation
│         │    │    │    ├─── output = attn_weights @ V
│         │    │    │    └─── Reshape: [batch, seq, num_heads × head_dim]
│         │    │    │
│         │    │    └─── Output Projection
│         │    │         ├─── O: Linear(num_heads × head_dim → hidden_size)
│         │    │         └─── Output: [batch, seq, hidden_size]
│         │    │
│         │    ├─── [OPTIONAL] Layer Scale
│         │    │    ├─── Learnable per-layer scaling: γ
│         │    │    ├─── Initialize: γ = 1e-5 (very small)
│         │    │    ├─── output = γ * output
│         │    │    └─── Improves training stability
│         │    │
│         │    ├─── [OPTIONAL] Stochastic Depth
│         │    │    ├─── Training: Random layer dropping
│         │    │    ├─── drop_prob = layer_idx / num_layers × base_prob
│         │    │    ├─── if random() > drop_prob: return output
│         │    │    ├─── else: return 0
│         │    │    └─── Inference: Always apply (no dropping)
│         │    │
│         │    ├─── Residual Dropout
│         │    │    └─── output = dropout(output)
│         │    │
│         │    └─── Residual Connection
│         │         ├─── hidden_states = hidden_states + output
│         │         └─── [Training] Gradient clipping: [-1e4, 1e4]
│         │
│         ├─── 🌐 [OPTIONAL] CROSS-ATTENTION PATH (for Multimodal)
│         │    │
│         │    ├─── Conditional: Only if encoder_hidden_states is not None
│         │    ├─── Frequency: Every cross_attention_frequency layers
│         │    │
│         │    ├─── Input Normalization
│         │    │    └─── RMSNorm(hidden_states)
│         │    │
│         │    ├─── Cross-Attention Computation
│         │    │    ├─── Query: from text hidden states
│         │    │    │    └─── Q: Linear(hidden_size → num_heads × head_dim)
│         │    │    ├─── Key/Value: from encoder_hidden_states (vision+audio)
│         │    │    │    ├─── K: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    └─── V: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    ├─── Attention: Q @ K.T / sqrt(head_dim)
│         │    │    ├─── Softmax + Dropout
│         │    │    ├─── Output: attn_weights @ V
│         │    │    └─── Output Projection
│         │    │
│         │    ├─── [OPTIONAL] Layer Scale
│         │    ├─── [OPTIONAL] Stochastic Depth
│         │    ├─── Residual Dropout
│         │    └─── Residual Connection
│         │         └─── hidden_states = hidden_states + cross_attn_output
│         │
│         └─── 🔮 FEED-FORWARD PATH
│              │
│              ├─── Input Normalization
│              │    └─── RMSNorm(hidden_states)
│              │
│              ├─── Feed-Forward Network
│              │    │
│              │    ├─── ━━━━━ STANDARD MLP ━━━━━
│              │    │    │
│              │    │    ├─── Gate Projection
│              │    │    │    ├─── gate: Linear(hidden_size → intermediate_size)
│              │    │    │    └─── Typical: intermediate_size = 4 × hidden_size
│              │    │    │
│              │    │    ├─── Up Projection
│              │    │    │    └─── up: Linear(hidden_size → intermediate_size)
│              │    │    │
│              │    │    ├─── SwiGLU Activation
│              │    │    │    ├─── gate = silu(gate)  # Swish activation
│              │    │    │    ├─── hidden = gate * up  # Gating mechanism
│              │    │    │    └─── Formula: silu(x) = x * sigmoid(x)
│              │    │    │
│              │    │    ├─── Dropout
│              │    │    │    └─── hidden = dropout(hidden)
│              │    │    │
│              │    │    └─── Down Projection
│              │    │         ├─── down: Linear(intermediate_size → hidden_size)
│              │    │         └─── Output: [batch, seq, hidden_size]
│              │    │
│              │    └─── ━━━━━ MIXTURE OF EXPERTS (MoE) ━━━━━
│              │         │
│              │         ├─── Conditional: use_moe AND (layer_idx % moe_frequency == 0)
│              │         │
│              │         ├─── Router Network
│              │         │    │
│              │         │    ├─── Router Type Selection
│              │         │    │    ├─── Top-K Router (default)
│              │         │    │    └─── Expert Choice Router (alternative)
│              │         │    │
│              │         │    ├─── ━━━ TOP-K ROUTER ━━━
│              │         │    │    │
│              │         │    │    ├─── Gate Normalization
│              │         │    │    │    └─── hidden = LayerNorm(hidden)
│              │         │    │    │
│              │         │    │    ├─── Router Logits
│              │         │    │    │    ├─── logits: Linear(hidden_size → num_experts)
│              │         │    │    │    ├─── Clamping: [-20, 20]
│              │         │    │    │    └─── Temperature scaling: logits / temp
│              │         │    │    │
│              │         │    │    ├─── [Training] Jitter Noise
│              │         │    │    │    ├─── noise = randn_like(logits) × 0.01
│              │         │    │    │    └─── logits = logits + noise
│              │         │    │    │
│              │         │    │    ├─── Routing Weights
│              │         │    │    │    ├─── weights = softmax(logits)
│              │         │    │    │    └─── top_k_weights, top_k_indices = topk(weights, k)
│              │         │    │    │
│              │         │    │    ├─── Weight Normalization
│              │         │    │    │    └─── top_k_weights = top_k_weights / sum(top_k_weights)
│              │         │    │    │
│              │         │    │    └─── Loss Computation
│              │         │    │         ├─── Auxiliary Loss (load balancing)
│              │         │    │         │    ├─── expert_usage = mean(weights, dim=0)
│              │         │    │         │    ├─── mean_usage = mean(expert_usage)
│              │         │    │         │    └─── aux_loss = std(expert_usage) / mean_usage
│              │         │    │         ├─── Z-Loss (router stability)
│              │         │    │         │    ├─── z_loss = mean(logsumexp(logits)²)
│              │         │    │         │    └─── Prevents logits explosion
│              │         │    │         └─── Entropy Loss (diversity)
│              │         │    │              └─── entropy_loss = -mean(weights × log(weights))
│              │         │    │
│              │         │    └─── ━━━ EXPERT CHOICE ROUTER ━━━
│              │         │         │
│              │         │         ├─── Router Logits
│              │         │         │    └─── logits: Linear(hidden → num_experts)
│              │         │         │
│              │         │         ├─── Expert-wise Token Selection
│              │         │         │    ├─── Transpose: [batch×seq, experts]
│              │         │         │    ├─── capacity = expert_choice_k × total_tokens / num_experts
│              │         │         │    ├─── Per expert: topk(logits, k=capacity)
│              │         │         │    └─── Expert mask: [experts, batch×seq]
│              │         │         │
│              │         │         └─── Routing weights from mask
│              │         │
│              │         ├─── Expert Networks (N experts, e.g. N=8)
│              │         │    │
│              │         │    └─── Expert i (i = 0 to N-1)
│              │         │         ├─── Same structure as Standard MLP
│              │         │         ├─── gate_proj: Linear(hidden → intermediate)
│              │         │         ├─── up_proj: Linear(hidden → intermediate)
│              │         │         ├─── SwiGLU activation
│              │         │         ├─── Dropout
│              │         │         └─── down_proj: Linear(intermediate → hidden)
│              │         │
│              │         ├─── Expert Execution
│              │         │    │
│              │         │    ├─── For each expert:
│              │         │    │    ├─── Get tokens routed to this expert
│              │         │    │    ├─── If no tokens: skip
│              │         │    │    ├─── Run expert forward pass
│              │         │    │    ├─── [Training] Track expert usage
│              │         │    │    └─── [Safety] NaN/Inf detection
│              │         │    │
│              │         │    └─── Combine Expert Outputs
│              │         │         ├─── Weighted sum by router weights
│              │         │         └─── final_output = Σ(weight_i × expert_i(x))
│              │         │
│              │         └─── Output: [batch, seq, hidden_size]
│              │
│              ├─── [OPTIONAL] Layer Scale
│              │    └─── output = γ * output
│              │
│              ├─── [OPTIONAL] Stochastic Depth
│              │    └─── Probabilistic dropping (training only)
│              │
│              ├─── Residual Dropout
│              │    └─── output = dropout(output)
│              │
│              └─── Residual Connection
│                   ├─── hidden_states = hidden_states + output
│                   └─── [Training] Gradient clipping: [-1e4, 1e4]
│
├─── 📤 OUTPUT HEAD
│    │
│    ├─── Final Normalization
│    │    ├─── RMSNorm(hidden_states)
│    │    └─── Output: [batch, seq, hidden_size]
│    │
│    ├─── Language Modeling Head
│    │    ├─── Linear Projection
│    │    │    ├─── lm_head: Linear(hidden_size → vocab_size, bias=False)
│    │    │    └─── Output: [batch, seq, vocab_size]
│    │    │
│    │    └─── [OPTIONAL] Logit Soft-Capping
│    │         ├─── Clamp extreme values: [-cap×0.99, cap×0.99]
│    │         ├─── Formula: tanh(logits / cap) × cap
│    │         ├─── Prevents numerical instability
│    │         └─── Typical cap value: 30.0
│    │
│    └─── Output: Logits [batch, seq, vocab_size]
│
├─── 📉 LOSS COMPUTATION (Training Only)
│    │
│    ├─── Shift for Autoregressive
│    │    ├─── shift_logits = logits[:, :-1, :]
│    │    └─── shift_labels = labels[:, 1:]
│    │
│    ├─── Language Modeling Loss
│    │    ├─── CrossEntropyLoss(ignore_index=-100)
│    │    ├─── [OPTIONAL] Label Smoothing
│    │    │    └─── Reduces overconfidence
│    │    └─── lm_loss = CE(shift_logits, shift_labels)
│    │
│    ├─── [OPTIONAL] MoE Auxiliary Losses
│    │    ├─── Router Auxiliary Loss (load balancing)
│    │    │    └─── aux_loss × router_aux_loss_coef (default: 0.01)
│    │    ├─── Router Z-Loss (stability)
│    │    │    └─── z_loss × router_z_loss_coef (default: 0.001)
│    │    └─── Sum across all MoE layers
│    │
│    └─── Total Loss
│         └─── total = lm_loss + aux_losses
│
├─── 📊 MONITORING & METRICS
│    │
│    ├─── MetricsTracker
│    │    ├─── Loss tracking (LM, aux, z-loss)
│    │    ├─── Perplexity: exp(lm_loss)
│    │    ├─── Gradient norms per layer
│    │    ├─── GPU memory usage
│    │    ├─── Expert usage statistics
│    │    ├─── Attention cache hit rate
│    │    └─── Periodic summary & clearing
│    │
│    ├─── Gradient Monitoring
│    │    ├─── Max gradient norm per layer
│    │    ├─── Mean gradient norm (EMA)
│    │    ├─── Gradient clipping count
│    │    └─── NaN/Inf detection
│    │
│    └─── Memory Monitoring
│         ├─── GPU memory allocated
│         ├─── GPU memory reserved
│         ├─── Automatic cache clearing
│         └─── Per-layer memory checkpoints
│
└─── 🔧 OPTIMIZATION FEATURES
     │
     ├─── Gradient Checkpointing
     │    ├─── Trade-off: roughly 30% slower, roughly 50% less activation memory
     │    ├─── Recompute activations during backward
     │    └─── Enable: model.gradient_checkpointing_enable()
     │
     ├─── Mixed Precision Training (AMP)
     │    ├─── FP16/BF16 forward pass
     │    ├─── FP32 master weights
     │    ├─── Dynamic loss scaling
     │    └─── Up to ~2x speedup, up to ~50% lower activation memory
     │
     ├─── Gradient Accumulation
     │    ├─── Simulate larger batch size
     │    ├─── loss = loss / accumulation_steps
     │    └─── optimizer.step() every N steps
     │
     ├─── KV Cache (Inference)
     │    ├─── Cache Key/Value tensors
     │    ├─── Reuse for autoregressive generation
     │    ├─── Memory: O(num_layers × seq_len × 2 × num_kv_heads × head_dim)
     │    └─── Speedup: up to ~10x for long sequences
     │
     └─── Quantization Support
          ├─── 8-bit (LLM.int8)
          │    ├─── bitsandbytes integration
          │    ├─── Mixed precision (outliers in FP16)
          │    └─── 2x memory reduction
          └─── 4-bit (QLoRA)
               ├─── NF4 quantization (normal float 4-bit)
               ├─── Double quantization
               ├─── BF16 compute dtype
               └─── 4x memory reduction
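
The Top-K router steps in the tree above (clamping, temperature scaling, softmax, top-k selection, weight renormalization, auxiliary loss, and z-loss) can be sketched in plain Python. This is a minimal illustration, not the model's implementation: `route_tokens` and its parameters are hypothetical names, jitter noise is left out, and the use of the sample standard deviation (the default of `torch.std`) is an assumption.

```python
import math
import statistics


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(logits, k=2, clamp=20.0, temperature=1.0):
    """Top-k routing for a batch of tokens (hypothetical helper).

    logits: list of per-token lists, one router score per expert.
    Returns (assignments, aux_loss, z_loss), where assignments[t] is a
    list of (expert_index, normalized_weight) pairs for token t.
    """
    assignments, all_weights, z_terms = [], [], []
    for row in logits:
        # Clamp to [-clamp, clamp], then apply temperature scaling.
        row = [max(-clamp, min(clamp, x)) / temperature for x in row]
        weights = softmax(row)
        # Pick the k experts with the largest routing weight.
        top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
        # Renormalize so the selected weights sum to 1.
        norm = sum(weights[i] for i in top)
        assignments.append([(i, weights[i] / norm) for i in top])
        all_weights.append(weights)
        # Z-loss term: squared logsumexp of the (scaled) logits.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        z_terms.append(lse ** 2)

    num_experts = len(logits[0])
    # expert_usage = mean routing weight per expert across tokens.
    expert_usage = [
        sum(w[e] for w in all_weights) / len(all_weights) for e in range(num_experts)
    ]
    mean_usage = sum(expert_usage) / num_experts
    # aux_loss = std(expert_usage) / mean_usage (sample std: an assumption).
    aux_loss = statistics.stdev(expert_usage) / mean_usage
    z_loss = sum(z_terms) / len(z_terms)
    return assignments, aux_loss, z_loss
```

With perfectly uniform logits every expert receives the same usage, so the load-balancing term is zero and only the z-loss remains.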

CacaForCausalLM (3.52M)
│
├─ Embedding: 8,000 × 128
│
├─ Transformer Layers (6x)
│  ├─ RMSNorm
│  ├─ Attention (GQA)
│  │  ├─ Q: 4 heads × 32 dim
│  │  ├─ KV: 2 heads × 32 dim
│  │  ├─ RoPE (θ=10,000)
│  │  └─ Flash Attention v2
│  ├─ Residual
│  ├─ RMSNorm
│  ├─ FFN (SwiGLU)
│  │  ├─ Gate: 128 → 512
│  │  ├─ Up: 128 → 512
│  │  └─ Down: 512 → 128
│  └─ Residual
│
├─ Final RMSNorm
└─ LM Head: 128 → 8,000

═══════════════════════════════════════════════════════════
📊 PARAMETER BREAKDOWN:
═══════════════════════════════════════════════════════════
Embeddings:                 1,024,000 ( 29.1%)
Transformer Layers:         1,474,560 ( 41.8%)
  ├─ Attention:               294,912
  └─ FFN:                   1,179,648
Layer Norms (inferred):         1,920 (  0.1%)
Final Norm:                       128 (  0.0%)
LM Head (inferred, untied): 1,024,000 ( 29.1%)
───────────────────────────────────────────────────────────
TOTAL:                      3,524,608 (100.0%)
═══════════════════════════════════════════════════════════
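
The 3,524,608 total can be reproduced from the configuration values. The sketch below assumes, based only on the arithmetic and not on anything the source states, that the LM head is not tied to the embedding matrix, that each layer has two RMSNorm weight vectors, and that QK-norm adds one `head_dim`-sized weight each for Q and K per layer. `count_caca_params` is a hypothetical helper.

```python
def count_caca_params(vocab=8000, hidden=128, inter=512, layers=6,
                      q_heads=4, kv_heads=2, head_dim=32,
                      tie_embeddings=False, qk_norm=True):
    """Return a per-component parameter count (bias-free linear layers)."""
    embed = vocab * hidden
    attn = (hidden * q_heads * head_dim          # q_proj
            + 2 * hidden * kv_heads * head_dim   # k_proj and v_proj
            + q_heads * head_dim * hidden)       # o_proj
    ffn = 3 * hidden * inter                     # gate, up, down
    # Two RMSNorm weights per layer, plus assumed Q/K norm weights.
    norms = 2 * hidden + (2 * head_dim if qk_norm else 0)
    lm_head = 0 if tie_embeddings else vocab * hidden
    return {
        "embeddings": embed,
        "attention": layers * attn,
        "ffn": layers * ffn,
        "layer_norms": layers * norms,
        "final_norm": hidden,
        "lm_head": lm_head,
    }


breakdown = count_caca_params()
for name, value in breakdown.items():
    print(f"{name:12s} {value:>10,}")
print(f"{'total':12s} {sum(breakdown.values()):>10,}")
```

Under these assumptions the components sum exactly to the reported total; with tied embeddings the count would instead be 2,500,608.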

Key Design Decisions:

GQA over MHA: Halves KV cache memory (2 KV heads instead of 4) with minimal accuracy loss
SwiGLU over GELU: Reported to improve language-modeling quality over GELU feed-forward layers (Shazeer, 2020)
RMSNorm over LayerNorm: Faster and stable, with no bias term
RoPE over Learned: Better extrapolation to sequences longer than those seen in training
No Bias in Linear: Follows modern LLM best practices (LLaMA-style)

📚 Documentation

📦 Installing Dependencies

# Core dependencies (REQUIRED)
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2 (speedup depends on GPU and sequence length)
pip install xformers                          # Memory efficient attention
pip install bitsandbytes                      # 4/8-bit quantization

# Optional: for monitoring & profiling
pip install tensorboard wandb               # Training monitoring
pip install gputil psutil                   # Resource monitoring

Compatibility Matrix:

Component	Version	Note
Python	3.8 - 3.11	3.11 recommended
PyTorch	≥ 2.0.0	2.1+ for best SDPA performance
CUDA	11.8 / 12.1	For Flash Attention
Transformers	≥ 4.35.0	For AutoModel support

Usage

1️⃣ Basic Loading

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Load configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-1M-untrained",
    trust_remote_code=True
)

# Load model (FP16 for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1M-untrained",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"  # Automatic device placement
)

# This model is UNTRAINED and needs training first!
print(f"Model loaded: {model.num_parameters():,} parameters")
print("⚠️  This model has not been trained and cannot be used for inference yet")

2️⃣ Quantized Loading (4-bit/8-bit)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1M-untrained",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)

# Report the measured footprint instead of a hard-coded value
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB (4-bit)")

3️⃣ Training Setup

from transformers import TrainingArguments, Trainer

# Training configuration
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    max_steps=10000,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    logging_steps=10,
    save_steps=500,
    fp16=True,  # Mixed precision (load the model weights in FP32, not torch.float16, when using this)
    gradient_checkpointing=True,  # Memory efficient
)

# Initialize trainer (train_dataset is your own tokenized dataset; it is not defined here)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Start training
trainer.train()
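
The `train_dataset` passed to the trainer is left undefined above. As a rough sketch of one common way to prepare it, the hypothetical helper below packs a long stream of token IDs into fixed-length examples. Labels are a copy of the inputs because the loss computation described earlier shifts logits and labels inside the model. The function name and the choice to drop the trailing partial block are assumptions for illustration.

```python
def pack_into_blocks(token_ids, block_size=1024):
    """Split one long token stream into fixed-size training examples.

    A trailing remainder shorter than block_size is dropped.
    """
    examples = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        examples.append({
            "input_ids": block,
            "attention_mask": [1] * block_size,
            # Same tokens as the inputs; the one-position shift happens in the model.
            "labels": list(block),
        })
    return examples


# Toy example with a tiny block size
examples = pack_into_blocks(list(range(10)), block_size=4)
print(len(examples), examples[1]["input_ids"])
```

The resulting list of dictionaries would typically be wrapped in a dataset object before being handed to the trainer.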

Advanced Usage

Gradient Checkpointing (Memory Efficient)

model.gradient_checkpointing_enable()
print("✅ Gradient checkpointing enabled - trades extra compute for lower activation memory")

Custom Training Loop

import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-4)

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)

    # Mixed-precision forward pass in BF16
    # (BF16 has FP32's exponent range, so no GradScaler is needed)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        outputs = model(**batch)
        loss = outputs.loss

    # Backward pass, gradient clipping, then the optimizer update
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

Multi-GPU Training (DDP)

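import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Initialize process group (launch with torchrun, which sets LOCAL_RANK)
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Move the model to this process's GPU, then wrap it
model = DistributedDataParallel(
    model.to(local_rank),
    device_ids=[local_rank],
    find_unused_parameters=False
)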

⚙️ Detailed Configuration

Full Configuration JSON

{
  "architectures": ["CacaForCausalLM"],
  "model_type": "caca",
  "vocab_size": 8000,
  "hidden_size": 128,
  "intermediate_size": 512,
  "num_hidden_layers": 6,
  "num_attention_heads": 4,
  "num_key_value_heads": 2,
  "head_dim": 32,
  "max_position_embeddings": 1024,
  "rope_theta": 10000,
  "rms_norm_eps": 1e-06,
  "use_cache": true,
  "use_qk_norm": true,
  "use_flash_attn": true,
  "attention_dropout": 0.0,
  "hidden_dropout": 0.1,
  "torch_dtype": "float16"
}

Custom Configuration

from transformers import AutoConfig

# Load and modify the config (custom architecture, so trust_remote_code is required)
config = AutoConfig.from_pretrained("Lyon28/caca-1M-untrained", trust_remote_code=True)

# Custom modifications
config.max_position_embeddings = 2048  # Extend context: 1,024 × scaling factor 2.0
config.rope_scaling = {"type": "linear", "factor": 2.0}
config.use_flash_attn = True
config.hidden_dropout = 0.05

# Save custom config
config.save_pretrained("./custom_config")

🔬 Architecture

Layer Structure

Input Tokens
↓
Embedding Layer (vocab_size → hidden_size)
↓
Decoder Block × N

RMSNorm
Multi-Head Attention (GQA)
- Flash Attention v2
- Query heads, KV heads
- RoPE position encoding
Residual Connection
RMSNorm
Feed-Forward Network (SwiGLU)
- Gate: hidden → intermediate
- Up: hidden → intermediate
- Down: intermediate → hidden
Residual Connection

↓
RMSNorm (Final)
↓
LM Head (hidden → vocab_size)
↓
Output Logits

Attention Mechanism (GQA)

Query:  [4 heads × 32 dim] = 128
Key:    [2 heads × 32 dim] = 64
Value:  [2 heads × 32 dim] = 64

Grouped Query Attention:
- Every 2 query heads share 1 KV head
- KV cache memory: 50% smaller than Multi-Head Attention
- Quality close to MHA, speed close to MQA
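
The 50% figure follows directly from the head counts: the cache stores one key and one value vector per KV head, per layer, per token. A small sketch of that arithmetic, using the configuration values from this card (`kv_cache_bytes` is a hypothetical helper, and 2 bytes per value assumes FP16 or BF16 storage):

```python
def kv_cache_bytes(seq_len, num_layers=6, num_kv_heads=2, head_dim=32,
                   batch_size=1, bytes_per_value=2):
    """Size of the KV cache in bytes: K and V tensors for every layer."""
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_value)


gqa = kv_cache_bytes(seq_len=1024, num_kv_heads=2)  # this model (GQA)
mha = kv_cache_bytes(seq_len=1024, num_kv_heads=4)  # same model with full MHA
print(f"GQA: {gqa / 2**20:.2f} MiB, MHA: {mha / 2**20:.2f} MiB, ratio: {gqa / mha:.0%}")
```

At the full 1,024-token context the GQA cache is 1.5 MiB per sequence, against 3.0 MiB for an equivalent MHA layout.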

Feed-Forward Network (SwiGLU)

FFN(x) = (SiLU(xW_gate) ⊙ xW_up) W_down

Where:
- W_gate: 128 × 512
- W_up:   128 × 512
- W_down: 512 × 128
- SiLU(x) = x · sigmoid(x)
- ⊙ = element-wise multiplication
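
To make the formula concrete, here is a dependency-free sketch of the same computation on plain Python lists. It is meant only to mirror the equation term by term; a real implementation would use batched tensor operations, and the helper names are illustrative.

```python
import math


def silu(x):
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """FFN(x) = (SiLU(x W_gate) ⊙ x W_up) W_down."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise product
    return matvec(hidden, w_down)


# Tiny example: hidden size 2, intermediate size 2, identity weights
identity = [[1.0, 0.0], [0.0, 1.0]]
print(swiglu_ffn([1.0, 2.0], identity, identity, identity))
```

In the actual model the shapes are 128 × 512 for the gate and up projections and 512 × 128 for the down projection.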

💬 Format Chat & Prompt Engineering

📝 Chat Template

The model supports a standard chat format for conversational AI:

# Built-in chat template format
chat_template = """
{% for message in messages %}
{% if message['role'] == 'system' %}
System: {{ message['content'] }}

{% elif message['role'] == 'user' %}
User: {{ message['content'] }}

{% elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}

{% endif %}
{% endfor %}
{% if add_generation_prompt %}Assistant:{% endif %}
"""

# Example usage (assumes `tokenizer` is already loaded and uses the template above)
messages = [
    {"role": "system", "content": "You are a helpful and friendly AI assistant."},
    {"role": "user", "content": "Explain photosynthesis"},
    {"role": "assistant", "content": "Photosynthesis is the process by which plants convert sunlight into chemical energy..."},
    {"role": "user", "content": "How does it benefit humans?"},
]

# Apply template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(formatted)
# Output (every turn is followed by a blank line):
# System: You are a helpful and friendly AI assistant.
#
# User: Explain photosynthesis
#
# Assistant: Photosynthesis is the process by which plants convert sunlight into chemical energy...
#
# User: How does it benefit humans?
#
# Assistant:
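
For quick experiments without a tokenizer, the same layout can be approximated in plain Python. This is a sketch under the assumption that the Jinja template is rendered with block-tag whitespace trimmed, so each turn becomes `Role: content` followed by one blank line; `format_chat` and `ROLE_LABELS` are hypothetical names and not part of the tokenizer API.

```python
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}


def format_chat(messages, add_generation_prompt=True):
    """Render messages as 'Role: content' turns separated by blank lines."""
    parts = []
    for message in messages:
        label = ROLE_LABELS.get(message["role"])
        if label is None:
            continue  # like the template, skip roles it does not know
        parts.append(f"{label}: {message['content']}\n\n")
    if add_generation_prompt:
        parts.append("Assistant:")  # cue the model to write the next reply
    return "".join(parts)


print(format_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]))
```

The trailing `Assistant:` line is what prompts a trained model to continue with its own reply.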

🎯 Use Cases

This model is designed for a range of NLP applications once it has been trained:

Text Generation

✍️ Creative writing & storytelling
📰 Article generation
💬 Conversational AI
🔄 Text completion

Language Understanding

📊 Text classification
🏷️ Named Entity Recognition (NER)
❓ Question Answering
📝 Summarization

Code Generation

💻 Code completion
🐛 Bug fixing suggestions
📚 Documentation generation
🔄 Code translation

Multilingual Tasks

🌏 Translation (ID ↔ EN)
🗣️ Cross-lingual understanding
🌐 Multilingual classification

📈 Benchmark & Evaluation

⚠️ The model has not been evaluated because it is untrained

After training, the model will be evaluated on:

Indonesian Benchmarks

IndoNLU: Comprehensive Indonesian NLU tasks
IndoQA: Indonesian Question Answering
IndoSum: Summarization
IndoNER: Named Entity Recognition

Multilingual Benchmarks

MMLU: Massive Multitask Language Understanding
HellaSwag: Common sense reasoning
ARC: Science QA
TruthfulQA: Truthfulness evaluation

Generation Quality

Perplexity: Language modeling quality
BLEU/ROUGE: Translation & summarization
Human Evaluation: Fluency, coherence, factuality

🛠️ Development & Training Tips

Optimal Batch Size

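# Rule of thumb for the 3.52M model (rough estimates, not measured)
# GPU memory (GB) → batch size per device
import torch

gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9

if gpu_memory >= 80:  # A100 80GB
    batch_size = 4539
    gradient_accumulation = 1
elif gpu_memory >= 40:  # A100 40GB
    batch_size = 2269
    gradient_accumulation = 1
else:  # 24GB cards (RTX 3090/4090) and smaller
    batch_size = 1  # conservative starting point; increase until memory is full
    gradient_accumulation = 1

# Effective batch size = batch_size × gradient_accumulation × num_gpus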

Learning Rate Scheduling

# Recommended for the 3.52M model
learning_rate = 0.0005  # Base LR
warmup_ratio = 0.05  # 5% of total steps
lr_scheduler = "cosine"  # or "linear"

# Learning rate scaling rule:
# LR ∝ sqrt(batch_size)
# For batch size 256: LR = 0.0005
# For batch size 512: LR = 7.07e-04
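
The square-root rule above can be written as a one-line helper. This is only a sketch of the stated heuristic; `scale_lr` is a hypothetical name, and the rule is a starting point for tuning rather than a guarantee.

```python
import math


def scale_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate with the square root of the batch-size ratio."""
    return base_lr * math.sqrt(new_batch_size / base_batch_size)


# Reference point from the text: LR = 0.0005 at batch size 256
for batch_size in (128, 256, 512, 1024):
    print(batch_size, f"{scale_lr(0.0005, 256, batch_size):.2e}")
```

Doubling the batch size multiplies the learning rate by about 1.41, which is where the 7.07e-04 value for batch size 512 comes from.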

Gradient Clipping

# Prevent gradient explosion
max_grad_norm = 1.0  # Clip at 1.0

# Monitor gradients
from torch.nn.utils import clip_grad_norm_

grad_norm = clip_grad_norm_(model.parameters(), max_grad_norm)
if grad_norm > 10.0:
    print(f"⚠️ High gradient norm: {grad_norm:.2f}")

Training Stability

# Tips for stable training:

1. **Warmup**: Start with a low LR
2. **Gradient Checkpointing**: Reduce the memory footprint
3. **Mixed Precision**: Use BF16 if available (more stable than FP16)
4. **Batch Size**: Start small, increase gradually
5. **Monitor**: Track loss, perplexity, gradient norms

🔧 Troubleshooting

Out of Memory (OOM)

# Ways to resolve OOM during training:

# ✅ 1. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# ✅ 2. Reduce batch size
per_device_train_batch_size = 1

# ✅ 3. Increase gradient accumulation
gradient_accumulation_steps = 32

# ✅ 4. Use quantization
load_in_8bit = True  # or load_in_4bit

# ✅ 5. Reduce sequence length
max_length = 512  # the model's maximum is 1,024

# ✅ 6. CPU offloading (if needed)
device_map = "auto"
offload_folder = "offload"

Slow Training

# Ways to speed up training:
import torch
from torch.utils.data import DataLoader

# ✅ 1. Flash Attention
config.use_flash_attn = True  # up to 2-3x speedup

# ✅ 2. Compile model (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

# ✅ 3. DataLoader optimization
dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=4,  # Parallel data loading
    pin_memory=True,  # Faster GPU transfer
    prefetch_factor=2
)

# ✅ 4. Mixed precision
use_fp16 = True  # or bf16

# ✅ 5. Optimize communication (multi-GPU)
find_unused_parameters = False
gradient_as_bucket_view = True

NaN Loss

# If the loss becomes NaN:

# ✅ 1. Reduce learning rate
learning_rate = learning_rate * 0.1

# ✅ 2. Clip gradient norms
clip_grad_norm_(model.parameters(), 1.0)

# ✅ 3. Use BF16 instead of FP16
torch_dtype = torch.bfloat16  # More stable

# ✅ 4. Increase the RMSNorm epsilon
rms_norm_eps = 1e-5  # Raise further if needed

# ✅ 5. Check data
# Token IDs and masks are integers, so validate their ranges
# (a NaN/Inf test can never fail on integer tensors)
assert input_ids.min() >= 0 and input_ids.max() < config.vocab_size
assert ((attention_mask == 0) | (attention_mask == 1)).all()

🚫 Prohibited Uses

This model MUST NOT be used for:

🚫 Harmful content generation (violence, self-harm, illegal acts)
🚫 Misinformation/disinformation campaigns
🚫 Harassment or hate speech
🚫 Impersonation or identity theft
🚫 Child safety violations (CSAM, grooming, exploitation)
🚫 Privacy violations (doxxing, stalking, surveillance abuse)
🚫 Malicious code generation (malware, exploits, etc)
🚫 Spam or manipulation (fake reviews, astroturfing)
🚫 Medical/legal advice (without a disclaimer & expert review)
🚫 Financial fraud (scams, market manipulation)

Violation consequences: Model access revocation + legal action if applicable

📚 References & Papers

Core Architecture

LLaMA - Touvron et al., 2023
- RMSNorm, RoPE, SwiGLU, GQA
GPT-4 - OpenAI Technical Report, 2023
- Mixture of Experts (speculated)
Gemini - Google DeepMind, 2023
- Multimodal architecture, soft-capping
Qwen - Alibaba Cloud, 2023
- YaRN, long context
Gemma - Google, 2024
- Layer scaling, normalization

Advanced Techniques

Flash Attention 2 - Dao, 2023
Mixture-of-Depths - Raposo et al., 2024
StreamingLLM - Xiao et al., 2023
YaRN - Peng et al., 2023
QLoRA - Dettmers et al., 2023

⚠️ Known Limitations

Training Cost - MoE + Multimodal = expensive
Complex Debugging - Many fallback systems
Memory Hungry - When all features are enabled
Dependency Hell - Requires flash-attn, xformers, bitsandbytes
Expert Balancing - MoE needs careful tuning for load balancing

📜 License & Citation

📄 License

This model is released under the Apache License 2.0

✅ You are FREE to:

✔️ Use it commercially
✔️ Modify it as you wish
✔️ Redistribute it
✔️ Patent use
✔️ Private use

⚠️ On the condition that you:

📄 Include the license & copyright notice
📝 State the changes you made
📋 Disclaimer of warranty

❌ No warranty of any kind (use at your own risk)

Full license text: Apache-2.0

📖 Citation

If you use this model in your research, please cite:

@misc{cacacaca1m,
  author = {Lyon},
  title = {Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-1M-untrained}},
  note = {Untrained model with 3,524,608 parameters}
}

APA Style:

Lyon. (2026). Caca-caca-1M: Modern Transformer Architecture with Grouped 
Query Attention [Untrained model]. Hugging Face. 
https://huggingface.co/Lyon28/caca-1M-untrained

MLA Style:

Lyon. "Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention." 
Hugging Face, 2026, huggingface.co/Lyon28/caca-1M-untrained.

🙏 Acknowledgments

This model stands on the shoulders of giants! Thanks to:

🏗️ Core Architecture

LLaMA/LLaMA 2 (Meta AI, 2023) - Decoder-only architecture, RMSNorm, SwiGLU
- Paper: LLaMA: Open and Efficient Foundation Language Models
- Authors: Hugo Touvron et al.
GPT-3 (OpenAI, 2020) - Transformer language modeling paradigm
PaLM (Google, 2022) - SwiGLU activation insights

🎯 Attention Mechanisms

Flash Attention v2 (Tri Dao et al., Stanford, 2023)
- Paper: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- IO-aware exact attention; speedup depends on hardware and sequence length
Grouped Query Attention (Joshua Ainslie et al., Google, 2023)
- Paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- Memory-efficient KV cache
Multi-Query Attention (Noam Shazeer, Google, 2019)
- Fast inference with shared K/V
xFormers (Meta AI, 2022) - Memory efficient attention
PyTorch SDPA (PyTorch Team, 2023) - Native attention optimization

📍 Position Encodings

RoPE (Jianlin Su et al., Zhuiyi Technology, 2021)
- Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding
- Superior length extrapolation
ALiBi (Ofir Press et al., 2022)
- Paper: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- Length generalization without retraining
YaRN (Bowen Peng et al., 2023)
- Paper: YaRN: Efficient Context Window Extension of Large Language Models

🪟 Long Context & Efficiency

Sliding Window Attention (Albert Q. Jiang et al., Mistral AI, 2023)
- Paper: Mistral 7B
StreamingLLM (Guangxuan Xiao et al., MIT, 2023)
- Paper: Efficient Streaming Language Models with Attention Sinks
- Stable generation over very long streams with a fixed-size cache
Logit Softcapping (Google Gemma Team, 2024)
- Paper: Gemma 2: Improving Open Language Models at a Practical Size

🧠 Mixture of Experts

Mixtral 8x7B (Albert Q. Jiang et al., Mistral AI, 2024)
- Paper: Mixtral of Experts
- State-of-the-art sparse MoE
Switch Transformers (William Fedus et al., Google, 2021)
- Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Expert scaling insights
GLaM (Nan Du et al., Google, 2021) - Generalist Language Model
Expert Choice Routing (Yanqi Zhou et al., Google, 2022)
- Better load balancing

🎓 Training Optimizations

Layer Scale (Hugo Touvron et al., Meta, 2021)
- Paper: Going Deeper with Image Transformers
- Training stability for deep networks
Stochastic Depth (Gao Huang et al., 2016)
- Paper: Deep Networks with Stochastic Depth
Mixture of Depths (David Raposo et al., DeepMind, 2024)
- Paper: Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
- Dynamic compute allocation
Gradient Checkpointing (Tianqi Chen et al., 2016)

📦 Quantization

LLM.int8() (Tim Dettmers et al., 2022)
- Paper: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
QLoRA (Tim Dettmers et al., 2023)
- Paper: QLoRA: Efficient Finetuning of Quantized LLMs
- 4-bit efficient fine-tuning
bitsandbytes (Tim Dettmers) - Quantization library

🎨 Multimodal

Vision Transformer (Alexey Dosovitskiy et al., Google, 2020)
- Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Flamingo (Jean-Baptiste Alayrac et al., DeepMind, 2022)
- Paper: Flamingo: a Visual Language Model for Few-Shot Learning
- Perceiver Resampler
BLIP-2 (Junnan Li et al., Salesforce, 2023)
- Paper: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- Q-Former architecture
Whisper (Alec Radford et al., OpenAI, 2022) - Audio encoding

🛠️ Normalization & Activations

RMSNorm (Biao Zhang, Rico Sennrich, 2019)
- Paper: Root Mean Square Layer Normalization
SwiGLU (Noam Shazeer, Google, 2020)
- Paper: GLU Variants Improve Transformer

🔧 Tools & Frameworks

🤗 Hugging Face - Transformers, Accelerate, PEFT
- Making NLP accessible to everyone
PyTorch - Deep learning framework
- Facebook AI Research team
Safetensors - Secure serialization
- Hugging Face team
DeepSpeed - Distributed training
- Microsoft Research
Flash Attention Implementation - Tri Dao & team

🇮🇩 Indonesian NLP Community

Special thanks to the Indonesian NLP researchers & practitioners who have built the foundation for Indonesian language AI.

📄 License

This model is released under the Apache License 2.0.

Terms of Use:

✅ Free to use for commercial and non-commercial purposes
✅ Modification is permitted
✅ Distribution is permitted with attribution
⚠️ No Warranty - the model is provided "as is"
📝 Attribution Required - include the copyright notice

See LICENSE for full details.

🤝 Contributing

We are very open to contributions! Here is how you can contribute:

Training & Fine-tuning

🎓 Train this model on your own dataset
📊 Share benchmark results
🔬 Experiment with hyperparameters

Code & Architecture

🐛 Report bugs or issues
💡 Suggest improvements
🔧 Submit pull requests

Documentation

📚 Improve documentation
🌐 Add translations
✍️ Write tutorials & guides

Dataset & Evaluation

📝 Contribute training data
🧪 Create evaluation benchmarks
🎯 Share fine-tuned versions

👥 Team & Acknowledgments

Core Team

LyonPoy - Architecture design & implementation

Special Thanks

🤗 Hugging Face - Infrastructure & community
⚡ FlashAttention Team - Efficient attention implementation
🧠 Anthropic, Google, Meta, OpenAI, etc - Research inspirations
Meta AI (LLaMA)
OpenAI (GPT series)
Google DeepMind (Gemini, Gemma)
Alibaba Cloud (Qwen)
HuggingFace (Transformers library)
Tri Dao (Flash Attention)
Tim Dettmers (bitsandbytes)

Community

Thanks to the open-source community that has contributed to:

Transformers library
PyTorch framework
Datasets & evaluation tools

📞 Contact & Support

Community

💬 Discussions - Ask questions
🐛 Issues - Report bugs
📧 Email : [email protected]

💝 Made with ❤️ for the Indonesian AI Community

Thank you for using Caca!

If this model is useful, don't forget to ⭐ our repository!

🚀 Happy Training! 🚀

This model is waiting to be trained and to become the foundation for your AI applications.

📈 Quick Stats

Metric	Value
💎 Total Parameters	3,524,608
🏗️ Layers	6
🎯 Attention Heads	4
📖 Max Context	1,024 tokens
💾 Size (FP16)	~7.0 MB (0.01 GB)
💾 Size (INT4)	~1.8 MB (weights only)
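
At this scale the weight sizes are easier to read in megabytes than in gigabytes. The short sketch below derives them from the parameter count alone; it ignores quantization constants and any runtime buffers, and `weight_size_mb` is a hypothetical helper.

```python
def weight_size_mb(num_params, bits_per_param):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


NUM_PARAMS = 3_524_608
for name, bits in (("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{name}: {weight_size_mb(NUM_PARAMS, bits):.2f} MB")
```

Rounded to two decimals in gigabytes, the FP16 and INT4 sizes show up as 0.01 GB and 0.00 GB.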

This model is part of the Caca Project, an open-source initiative to build an Indonesian LLM ecosystem.

Created with 💻 by @Lyon28 | Licensed under Apache 2.0 | Built with 🤗 HuggingFace

🌟 "From zero, for everyone" 🌟

Last updated: January 2026

Built with ❤️ by Caca Transformers Team
Powered by 🤗 Transformers • ⚡ PyTorch • 🔥 Flash Attention
