# W2V-BERT 2.0 Hybrid V3 with Per-Language Adapters

This repository provides per-language ASR adapters for six languages, built on top of a frozen W2V-BERT 2.0 encoder.
## Model Architecture
- Base Model: facebook/w2v-bert-2.0 (frozen)
- Adapters: MMS-style bottleneck adapters (dim=64) inserted after each encoder layer
- Decoder: Single transformer decoder block with gated attention and FFN
- Vocabulary: Per-language character-level with extended double-vowel tokens
## Architecture Details
- Total parameters: 592M
- Trainable parameters per language: ~11.7M (1.97%)
  - Decoder: 8.4M (72.0%)
  - Adapters: 3.2M (27.6%)
  - LM head: 41K (0.4%)
  - Final norm: 2K (0.0%)
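For orientation, below is a minimal sketch of an MMS-style bottleneck adapter of the kind described above (LayerNorm, down-projection to the bottleneck, non-linearity, up-projection, residual connection). The exact module names, activation, and hidden size are assumptions, not this repo's code.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Sketch of an MMS-style bottleneck adapter inserted after each encoder layer."""

    def __init__(self, hidden_size: int = 1024, adapter_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, adapter_dim)  # project down to the bottleneck
        self.up = nn.Linear(adapter_dim, hidden_size)    # project back up to the encoder width
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        x = self.norm(hidden_states)
        x = self.up(self.act(self.down(x)))
        return residual + x  # residual path keeps the frozen encoder output intact
```

With a hidden size of 1024 and one adapter per encoder layer, this design lands at roughly 3.2M adapter parameters, consistent with the breakdown above.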
## Training Configuration
- Training samples: 15,000 per language
- Test samples: 2,000 per language
- Epochs: 15
- Batch size: 32 (8 × 4 gradient accumulation)
- Learning rate: 5e-4 with cosine schedule
- Extended vocabulary: Yes (includes double-vowel tokens: aa, ee, ii, oo, uu, ĩĩ, ũũ)
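A sketch of how such an extended character vocabulary could be assembled; the character set, special tokens, and token IDs here are illustrative and not the repo's exact `vocab.json`.

```python
import json

# Characters observed in one language's training transcripts (illustrative set)
chars = sorted(set("abcdefghijklmnoprstuwyĩũ "))

# Double-vowel tokens layered on top of the single characters
double_vowels = ["aa", "ee", "ii", "oo", "uu", "ĩĩ", "ũũ"]

# Special tokens first so that <pad> gets id 0 (matching pad_token_id=0 in the usage example)
tokens = ["<pad>", "<unk>"] + chars + double_vowels
vocab = {tok: idx for idx, tok in enumerate(tokens)}

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
```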
## Results
| Language | Code | WER (%) | Status |
|---|---|---|---|
| English | eng | 10.81 | ✅ Excellent |
| Kikuyu | kik | 13.66 | ✅ Excellent |
| Luo | luo | 16.17 | ✅ Good |
| Kamba | kam | 28.07 | ⚠️ Moderate |
| Kimeru | mer | 32.56 | ⚠️ Moderate |
| Swahili | swh | 99.97 | ❌ Failed* |
*Note: Swahili training failed due to a dataset issue. This adapter should not be used.
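For reference, word error rate can be reproduced on your own test transcripts with the `jiwer` package; this is a minimal sketch with made-up strings, not the repo's evaluation script.

```python
import jiwer

# Example reference (ground-truth) and hypothesis (model output) transcripts
references = ["nĩ mwega mũno", "twathiĩ mũgũnda"]
hypotheses = ["ni mwega mũno", "twathiĩ mũgũnda"]

# WER = (substitutions + deletions + insertions) / reference word count
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer * 100:.2f}%")
```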
## Repository Structure
```
├── config.json                  # Base model configuration
├── preprocessor_config.json     # Feature extractor config
├── training_summary.json        # Full training metrics
├── README.md
├── kik/                         # Kikuyu adapter
│   ├── adapter_config.json
│   ├── adapter_weights.pt       # MMS-style adapter weights
│   ├── decoder_weights.pt       # Decoder block weights
│   ├── lm_head_weights.pt       # Language model head
│   ├── final_norm_weights.pt    # Final layer norm
│   ├── vocab.json               # Language-specific vocabulary
│   ├── tokenizer_config.json
│   └── metrics.json             # Per-language metrics
├── kam/                         # Kamba adapter
├── mer/                         # Kimeru adapter
├── luo/                         # Luo adapter
├── swh/                         # Swahili adapter (failed)
└── eng/                         # English adapter
```
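The per-language weight files can be fetched individually from the Hub (or you can simply clone the repository); a sketch using `huggingface_hub.hf_hub_download`:

```python
from huggingface_hub import hf_hub_download

repo_id = "mutisya/w2v-bert-v3Hybrid-lora-6lang--25_50-v2"
lang = "kik"

# Download one language's weight files into the local HF cache and return their local paths
adapter_path = hf_hub_download(repo_id, filename=f"{lang}/adapter_weights.pt")
decoder_path = hf_hub_download(repo_id, filename=f"{lang}/decoder_weights.pt")
lm_head_path = hf_hub_download(repo_id, filename=f"{lang}/lm_head_weights.pt")
norm_path = hf_hub_download(repo_id, filename=f"{lang}/final_norm_weights.pt")
```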
## Usage
```python
import torch
from transformers import Wav2Vec2BertProcessor

from adapter_variants_v3 import create_hybrid_model_v3

# Build the base model with adapters
model = create_hybrid_model_v3(
    vocab_size=40,  # adjust per language (see each language's vocab.json)
    pad_token_id=0,
    adapter_dim=64,
)
model.eval()

# Load language-specific weights (paths assume a local clone/download of the repo)
lang = "kik"
adapter_dir = f"mutisya/w2v-bert-v3Hybrid-lora-6lang--25_50-v2/{lang}"
model.adapters.load_state_dict(torch.load(f"{adapter_dir}/adapter_weights.pt", map_location="cpu"))
model.decoder_blocks.load_state_dict(torch.load(f"{adapter_dir}/decoder_weights.pt", map_location="cpu"))
model.lm_head.load_state_dict(torch.load(f"{adapter_dir}/lm_head_weights.pt", map_location="cpu"))
model.final_layer_norm.load_state_dict(torch.load(f"{adapter_dir}/final_norm_weights.pt", map_location="cpu"))

# Load the processor (feature extractor + language-specific tokenizer)
processor = Wav2Vec2BertProcessor.from_pretrained(adapter_dir)

# Transcribe a 16 kHz mono waveform
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
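`audio_array` above is assumed to be a 16 kHz mono float waveform. A minimal way to obtain one, assuming `librosa` is installed (any loader that yields 16 kHz mono audio works):

```python
import librosa

# Load an audio file and resample it to the 16 kHz expected by W2V-BERT 2.0
audio_array, sampling_rate = librosa.load("sample.wav", sr=16000, mono=True)
```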
## Known Issues
Double-vowel CTC collapse: The model tends to collapse double vowels (aa→a, ii→i) in ~40–50% of cases. See the investigation report for details.
Swahili failure: The Swahili adapter failed to train properly (99.97% WER). This is likely due to a dataset loading issue.
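The double-vowel collapse above is the classic greedy CTC decoding artifact: consecutive identical predictions are merged, so a long vowel predicted as two adjacent single-vowel frames comes out as one character unless a dedicated double-vowel token exists. A toy illustration (not this repo's decoding code):

```python
# Toy greedy CTC decode: take the argmax token per frame, merge repeats, drop blanks.
def ctc_greedy_decode(frame_ids, blank_id=0):
    tokens = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            tokens.append(idx)
        prev = idx
    return tokens

# vocab: 0 = <blank>, 1 = "a"
print(ctc_greedy_decode([1, 1, 0, 1]))  # [1, 1] -> "aa" survives only because a blank separates the frames
print(ctc_greedy_decode([1, 1, 1, 1]))  # [1]    -> collapses to a single "a"
```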
## Citation
If you use this model, please cite:
```bibtex
@misc{mutisya2024multilingual,
  title={Multilingual ASR Adapters for East African Bantu Languages},
  author={Mutisya},
  year={2024},
  publisher={HuggingFace}
}
```
## License
Apache 2.0