W2V-BERT 2.0 Hybrid V3 with Per-Language Adapters

This model provides per-language ASR adapters for 6 languages, built on top of a frozen W2V-BERT 2.0 encoder.

Model Architecture

  • Base Model: facebook/w2v-bert-2.0 (frozen)
  • Adapters: MMS-style bottleneck adapters (dim=64) inserted after each encoder layer (sketched after this list)
  • Decoder: Single transformer decoder block with gated attention and FFN
  • Vocabulary: Per-language character-level with extended double-vowel tokens
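
For orientation, here is a minimal sketch of the bottleneck shape described above. The actual implementation lives in adapter_variants_v3; the hidden size of 1024 (W2V-BERT 2.0) and the choice of activation are assumptions, only adapter_dim=64 comes from the list above.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Illustrative MMS-style adapter: LayerNorm -> down-projection to adapter_dim ->
    # nonlinearity -> up-projection back to hidden_size, added to the input residually.
    def __init__(self, hidden_size: int = 1024, adapter_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, adapter_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(adapter_dim, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        x = self.up(self.act(self.down(self.norm(hidden_states))))
        return residual + x

One such block per encoder layer costs roughly 134K parameters, so 24 layers come to about 3.2M, which lines up with the adapter count under Architecture Details below.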

Architecture Details

  • Total parameters: 592M
  • Trainable parameters per language: ~11.7M (1.97%)
    • Decoder: 8.4M (72%)
    • Adapters: 3.2M (27.6%)
    • LM Head: 41K (0.4%)
    • Final Norm: 2K (0.0%)
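
Once the model is instantiated (see create_hybrid_model_v3 in the Usage section below), the split above can be sanity-checked with plain PyTorch parameter counting; this snippet is illustrative, not taken from the training script.

# Count trainable vs. total parameters of the hybrid model.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M "
      f"({100 * trainable / total:.2f}%)")  # expected: ~11.7M of ~592M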

Training Configuration

  • Training samples: 15,000 per language
  • Test samples: 2,000 per language
  • Epochs: 15
  • Batch size: 32 (8 × 4 gradient accumulation)
  • Learning rate: 5e-4 with cosine schedule
  • Extended vocabulary: Yes (includes double-vowel tokens: aa, ee, ii, oo, uu, ĩĩ, ũũ)
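
Whether the original run used the transformers Trainer is not stated; as a rough guide, the listed hyperparameters map onto TrainingArguments as follows (output_dir, warmup, and precision settings are assumptions, not documented values).

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="w2v-bert-v3-hybrid-kik",   # illustrative output path
    num_train_epochs=15,
    per_device_train_batch_size=8,         # 8 x 4 accumulation = effective batch size 32
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                      # assumption: warmup not listed above
    fp16=True,                             # assumption
)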

Results

Language   Code   WER (%)   Status
English    eng    10.81     ✅ Excellent
Kikuyu     kik    13.66     ✅ Excellent
Luo        luo    16.17     ✅ Good
Kamba      kam    28.07     ⚠️ Moderate
Kimeru     mer    32.56     ⚠️ Moderate
Swahili    swh    99.97     ❌ Failed*

*Note: Swahili training failed due to a dataset issue. This adapter should not be used.
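
WER is the standard word error rate, computed over the 2,000 test samples per language. The evaluation script is not included in this repository; one common way to compute the metric is the jiwer package, shown below with placeholder strings.

import jiwer

reference = "nĩ mwega mũno"   # placeholder ground-truth transcript
hypothesis = "nĩ mwega muno"  # placeholder model output
print(f"WER: {100 * jiwer.wer(reference, hypothesis):.2f}%")  # 1 substitution / 3 words = 33.33%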

Repository Structure

├── config.json                 # Base model configuration
├── preprocessor_config.json    # Feature extractor config
├── training_summary.json       # Full training metrics
├── README.md
├── kik/                        # Kikuyu adapter
│   ├── adapter_config.json
│   ├── adapter_weights.pt      # MMS-style adapter weights
│   ├── decoder_weights.pt      # Decoder block weights
│   ├── lm_head_weights.pt      # Language model head
│   ├── final_norm_weights.pt   # Final layer norm
│   ├── vocab.json              # Language-specific vocabulary
│   ├── tokenizer_config.json
│   └── metrics.json            # Per-language metrics
├── kam/                        # Kamba adapter
├── mer/                        # Kimeru adapter
├── luo/                        # Luo adapter
├── swh/                        # Swahili adapter (failed)
└── eng/                        # English adapter
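
Because each language sits in its own subfolder, a single adapter can be fetched without cloning the whole repository; a sketch using huggingface_hub (the repo id mirrors the path used in the Usage section below):

from huggingface_hub import snapshot_download

# Download the shared configs plus only the Kikuyu folder.
local_dir = snapshot_download(
    repo_id="mutisya/w2v-bert-v3Hybrid-lora-6lang-25_50-v2",
    allow_patterns=["*.json", "README.md", "kik/*"],
)
adapter_dir = f"{local_dir}/kik"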

Usage

import torch
from transformers import Wav2Vec2BertProcessor
from adapter_variants_v3 import create_hybrid_model_v3

# Load base model with adapters
model = create_hybrid_model_v3(
    vocab_size=40,  # Adjust per language
    pad_token_id=0,
    adapter_dim=64
)

# Load language-specific weights
lang = "kik"
# adapter_dir must be a local path to the language folder
# (for example the directory produced by snapshot_download above)
adapter_dir = f"mutisya/w2v-bert-v3Hybrid-lora-6lang-25_50-v2/{lang}"

model.adapters.load_state_dict(torch.load(f"{adapter_dir}/adapter_weights.pt", map_location="cpu"))
model.decoder_blocks.load_state_dict(torch.load(f"{adapter_dir}/decoder_weights.pt", map_location="cpu"))
model.lm_head.load_state_dict(torch.load(f"{adapter_dir}/lm_head_weights.pt", map_location="cpu"))
model.final_layer_norm.load_state_dict(torch.load(f"{adapter_dir}/final_norm_weights.pt", map_location="cpu"))

# Load processor
processor = Wav2Vec2BertProcessor.from_pretrained(adapter_dir)

# Transcribe (audio_array: 1-D float array of 16 kHz mono audio)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
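
The processor expects 16 kHz mono audio, so audio_array should be a 1-D float array at that rate; one way to obtain it (assuming librosa is installed, the file name is a placeholder):

import librosa

# Load an audio file and resample it to 16 kHz mono.
audio_array, _ = librosa.load("sample.wav", sr=16000, mono=True)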

Known Issues

  1. Double-vowel CTC collapse: The model tends to collapse double vowels (aa→a, ii→i) in ~40-50% of cases. See the investigation report for details and the illustration after this list.

  2. Swahili failure: The Swahili adapter failed to train properly (99.97% WER). This is likely due to a dataset loading issue.
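
Issue 1 is a direct consequence of the CTC collapse rule: greedy decoding merges consecutive repeated tokens and then removes blanks, so a double vowel written as two single-character tokens survives only if the model emits a blank frame between them (or emits the dedicated double-vowel token). A toy illustration of the rule, independent of this model's code:

def ctc_greedy_collapse(ids, blank_id=0):
    # Standard CTC post-processing: drop consecutive repeats, then drop blanks.
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

vocab = {0: "<blank>", 1: "a"}
print("".join(vocab[i] for i in ctc_greedy_collapse([1, 1, 1])))  # "a"  (repeats collapse)
print("".join(vocab[i] for i in ctc_greedy_collapse([1, 0, 1])))  # "aa" (blank keeps both)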

Citation

If you use this model, please cite:

@misc{mutisya2024multilingual,
  title={Multilingual ASR Adapters for East African Bantu Languages},
  author={Mutisya},
  year={2024},
  publisher={HuggingFace}
}

License

Apache 2.0
