W2V-BERT 2.0 Hybrid V3 with Per-Language Adapters

This model provides per-language ASR adapters for 6 languages, built on top of a frozen W2V-BERT 2.0 encoder.

Model Architecture

  • Base Model: facebook/w2v-bert-2.0 (frozen)
  • Adapters: MMS-style bottleneck adapters (dim=64) inserted after each encoder layer (sketched after this list)
  • Decoder: Single transformer decoder block with gated attention and FFN
  • Vocabulary: Per-language character-level with extended double-vowel tokens
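
For orientation, here is a minimal sketch of the bottleneck shape described above. The actual implementation lives in adapter_variants_v3; the hidden size of 1024 (W2V-BERT 2.0) and the choice of activation are assumptions, only adapter_dim=64 comes from the list above.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Illustrative MMS-style adapter: LayerNorm -> down-projection to adapter_dim ->
    # nonlinearity -> up-projection back to hidden_size, added to the input residually.
    def __init__(self, hidden_size: int = 1024, adapter_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, adapter_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(adapter_dim, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        x = self.up(self.act(self.down(self.norm(hidden_states))))
        return residual + x

One such block per encoder layer costs roughly 134K parameters, so 24 layers come to about 3.2M, which lines up with the adapter count under Architecture Details below.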

Architecture Details

  • Total parameters: 592M
  • Trainable parameters per language: ~11.7M (1.97%)
    • Decoder: 8.4M (72%)
    • Adapters: 3.2M (27.6%)
    • LM Head: 41K (0.4%)
    • Final Norm: 2K (0.0%)
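
Once the model is instantiated (see create_hybrid_model_v3 in the Usage section below), the split above can be sanity-checked with plain PyTorch parameter counting; this snippet is illustrative, not taken from the training script.

# Count trainable vs. total parameters of the hybrid model.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M "
      f"({100 * trainable / total:.2f}%)")  # expected: ~11.7M of ~592M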

Training Configuration

  • Training samples: 15,000 per language
  • Test samples: 2,000 per language
  • Epochs: 15
  • Batch size: 32 (8 × 4 gradient accumulation)
  • Learning rate: 5e-4 with cosine schedule
  • Extended vocabulary: Yes (includes double-vowel tokens: aa, ee, ii, oo, uu, ĩĩ, ũũ)
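
Whether the original run used the transformers Trainer is not stated; as a rough guide, the listed hyperparameters map onto TrainingArguments as follows (output_dir, warmup, and precision settings are assumptions, not documented values).

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="w2v-bert-v3-hybrid-kik",   # illustrative output path
    num_train_epochs=15,
    per_device_train_batch_size=8,         # 8 x 4 accumulation = effective batch size 32
    gradient_accumulation_steps=4,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                      # assumption: warmup not listed above
    fp16=True,                             # assumption
)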

Results

Language   Code   WER (%)   Status
English    eng    10.81     ✅ Excellent
Kikuyu     kik    13.66     ✅ Excellent
Luo        luo    16.17     ✅ Good
Kamba      kam    28.07     ⚠️ Moderate
Kimeru     mer    32.56     ⚠️ Moderate
Swahili    swh    99.97     ❌ Failed*

*Note: Swahili training failed due to a dataset issue. This adapter should not be used.
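
WER is the standard word error rate, computed over the 2,000 test samples per language. The evaluation script is not included in this repository; one common way to compute the metric is the jiwer package, shown below with placeholder strings.

import jiwer

reference = "nĩ mwega mũno"   # placeholder ground-truth transcript
hypothesis = "nĩ mwega muno"  # placeholder model output
print(f"WER: {100 * jiwer.wer(reference, hypothesis):.2f}%")  # 1 substitution / 3 words = 33.33%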

Repository Structure

├── config.json                 # Base model configuration
├── preprocessor_config.json    # Feature extractor config
├── training_summary.json       # Full training metrics
├── README.md
├── kik/                        # Kikuyu adapter
│   ├── adapter_config.json
│   ├── adapter_weights.pt      # MMS-style adapter weights
│   ├── decoder_weights.pt      # Decoder block weights
│   ├── lm_head_weights.pt      # Language model head
│   ├── final_norm_weights.pt   # Final layer norm
│   ├── vocab.json              # Language-specific vocabulary
│   ├── tokenizer_config.json
│   └── metrics.json            # Per-language metrics
├── kam/                        # Kamba adapter
├── mer/                        # Kimeru adapter
├── luo/                        # Luo adapter
├── swh/                        # Swahili adapter (failed)
└── eng/                        # English adapter
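
Because each language sits in its own subfolder, a single adapter can be fetched without cloning the whole repository; a sketch using huggingface_hub (the repo id mirrors the path used in the Usage section below):

from huggingface_hub import snapshot_download

# Download the shared configs plus only the Kikuyu folder.
local_dir = snapshot_download(
    repo_id="mutisya/w2v-bert-v3Hybrid-lora-6lang-25_50-v2",
    allow_patterns=["*.json", "README.md", "kik/*"],
)
adapter_dir = f"{local_dir}/kik"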

Usage

import torch
from transformers import Wav2Vec2BertProcessor
from adapter_variants_v3 import create_hybrid_model_v3

# Load base model with adapters
model = create_hybrid_model_v3(
    vocab_size=40,  # Adjust per language
    pad_token_id=0,
    adapter_dim=64
)

# Load language-specific weights
lang = "kik"
# adapter_dir must be a local path to the language folder
# (for example the directory produced by snapshot_download above)
adapter_dir = f"mutisya/w2v-bert-v3Hybrid-lora-6lang-25_50-v2/{lang}"

model.adapters.load_state_dict(torch.load(f"{adapter_dir}/adapter_weights.pt", map_location="cpu"))
model.decoder_blocks.load_state_dict(torch.load(f"{adapter_dir}/decoder_weights.pt", map_location="cpu"))
model.lm_head.load_state_dict(torch.load(f"{adapter_dir}/lm_head_weights.pt", map_location="cpu"))
model.final_layer_norm.load_state_dict(torch.load(f"{adapter_dir}/final_norm_weights.pt", map_location="cpu"))

# Load processor
processor = Wav2Vec2BertProcessor.from_pretrained(adapter_dir)

# Transcribe (audio_array: 1-D float array of 16 kHz mono audio)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
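
The processor expects 16 kHz mono audio, so audio_array should be a 1-D float array at that rate; one way to obtain it (assuming librosa is installed, the file name is a placeholder):

import librosa

# Load an audio file and resample it to 16 kHz mono.
audio_array, _ = librosa.load("sample.wav", sr=16000, mono=True)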

Known Issues

  1. Double-vowel CTC collapse: The model tends to collapse double vowels (aa→a, ii→i) in ~40-50% of cases. See the investigation report for details and the illustration after this list.

  2. Swahili failure: The Swahili adapter failed to train properly (99.97% WER). This is likely due to a dataset loading issue.
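
Issue 1 is a direct consequence of the CTC collapse rule: greedy decoding merges consecutive repeated tokens and then removes blanks, so a double vowel written as two single-character tokens survives only if the model emits a blank frame between them (or emits the dedicated double-vowel token). A toy illustration of the rule, independent of this model's code:

def ctc_greedy_collapse(ids, blank_id=0):
    # Standard CTC post-processing: drop consecutive repeats, then drop blanks.
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

vocab = {0: "<blank>", 1: "a"}
print("".join(vocab[i] for i in ctc_greedy_collapse([1, 1, 1])))  # "a"  (repeats collapse)
print("".join(vocab[i] for i in ctc_greedy_collapse([1, 0, 1])))  # "aa" (blank keeps both)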

Citation

If you use this model, please cite:

@misc{mutisya2024multilingual,
  title={Multilingual ASR Adapters for East African Bantu Languages},
  author={Mutisya},
  year={2024},
  publisher={HuggingFace}
}

License

Apache 2.0
