Model configuration
Hi, can you please share the model configuration used for pre-training and fine-tuning?
pruned RNN-T fine-tune, LS 960
./hubert/finetune.py \
  --world-size 8 \
  --num-epochs 222 \
  --start-epoch 1 \
  --use-fp16 0 \
  --exp-dir hubert/exp_finetune \
  --pretrained-dir download/hubert/hubert_large_ll60k.pt \
  --full-libri 1 \
  --max-duration 80 \
  --accum-grad 1 \
  --do-normalize 1 \
  --encoder-layers 24 \
  --encoder-embed-dim 1024 \
  --encoder-ffn-embed-dim 4096 \
  --encoder-attention-heads 16 \
  --final-dim 768 \
  --layer-norm-first 1 \
  --untie-final-proj 1 \
  --extractor-mode "layer_norm" \
  --mask-prob 0.50 \
  --mask-channel-prob 0.25 \
  --mask-channel-length 64 \
  --encoder-layerdrop 0.1 \
  --activation-dropout 0.1 \
  --feature-grad-mult 0.1 \
  --base-lr 0.001 \
  --lr-epochs 10.5
pruned RNN-T decode
for ((epoch=2; epoch<=19; epoch+=1)); do
  for ((avg=1; avg<=$epoch-1; avg+=1)); do
    ./hubert/decode.py \
      --epoch $epoch \
      --avg $avg \
      --exp-dir ./hubert/exp_finetune \
      --max-duration 1000 \
      --decoding-method greedy_search \
      --do-normalize 1 \
      --encoder-layers 24 \
      --encoder-embed-dim 1024 \
      --encoder-ffn-embed-dim 4096 \
      --encoder-attention-heads 16 \
      --final-dim 768 \
      --layer-norm-first 1 \
      --untie-final-proj 1 \
      --extractor-mode "layer_norm"
  done
done
The HuBERT Large pretrained model is downloaded from fairseq.
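For reference, a minimal way to fetch it into the path expected by --pretrained-dir above; the URL is the usual fairseq release link, so please verify it against the fairseq HuBERT README before relying on it:
mkdir -p download/hubert
# URL assumed from the fairseq HuBERT release page; double-check before use.
wget -O download/hubert/hubert_large_ll60k.pt \
  https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt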
Thank you @yfyeung for your quick response.
I assume you used the same configuration for the large Zipformer pre-training model too.
Do Zipformer pre-training and fine-tuning support streaming?
I found that causal mode is implemented in Zipformer, but is it not exposed for use?
No, the streaming Zipformer (--causal 1) differs from the non-streaming Zipformer (--causal 0) in terms of its model architecture.
Hi @yfyeung, the above large pre-training and fine-tuning configuration doesn't fit the current script,
so I tried changing the parameter names as below:
--encoder-layers -> --num-encoder-layers
--encoder-embed-dim -> --encoder-dim
--encoder-ffn-embed-dim -> --feedforward-dim
--encoder-attention-heads -> --num-heads
--final-dim -> ??
zipformer/pretrain.py accepts the above parameters but produces a model with 3,702,317,981 parameters, which differs from the 318M in your model.
Can you please help me replicate your model architecture? It would be a great help if you could provide the parameters for the latest script.
Hi, here is the 318M config. It was trained on 32 V100 32GB GPUs for about 2–3 weeks. Due to limited compute resources at the time, it didn’t reach as many training steps as the original HuBERT paper (400k) or the CMU replication version (800k).
Make sure to use the code in the PR https://github.com/k2-fsa/icefall/pull/1745, instead of the code in master.
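One way to check out that PR locally (a standard GitHub pull-request fetch; the local branch name is just an example, and origin is assumed to point at k2-fsa/icefall):
git fetch origin pull/1745/head:pr-1745  # fetch the PR ref into a local branch
git checkout pr-1745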
torchrun \
  --nproc_per_node $num_gpus \
  --nnodes $num_nodes \
  --node_rank $node_rank \
  --master_addr $master_addr \
  --master_port $master_port \
  zipformer/pretrain.py \
    --use-multi-node 1 \
    --master-port $master_port \
    --num-epochs 20 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir zipformer/exp_pretrain \
    --max-duration 350 \
    --quadratic-duration 1024 \
    --accum-grad 1 \
    --do-normalize 1 \
    --mask-prob 0.8 \
    --dropout-input 0.0 \
    --dropout-features 0.0 \
    --feature-grad-mult 1.0 \
    --num-encoder-layers 2,2,4,5,4,2 \
    --feedforward-dim 768,1536,2048,3072,2048,1536 \
    --encoder-dim 256,512,768,1024,768,512 \
    --encoder-unmasked-dim 256,256,256,320,256,256 \
    --base-lr 0.045
Personally, I found that Zipformer scales with diminishing returns. Compared to Standard Zipformer Large (~150M), the 318M model sacrifices a lot in efficiency without showing significant performance improvements.
After completing pre-training, you can export the averaged model using the following script (for --avg, ~3/4 epochs):
./zipformer/generate_averaged_model.py \
  --exp-dir k2ssl-librilight-zipformer-large/exp \
  --epoch xxx \
  --avg xxx \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 768,1536,2048,3072,2048,1536 \
  --encoder-dim 256,512,768,1024,768,512 \
  --encoder-unmasked-dim 256,256,256,320,256,256
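If you want to sanity-check that the architecture matches (~318M parameters), you can count the parameters stored in a checkpoint. This is only a rough sketch: the checkpoint path is hypothetical, and it assumes the state dict is stored under the "model" key as in icefall checkpoints:
python3 - <<'EOF'
import torch

# Hypothetical path; point this at one of your saved/averaged checkpoints.
ckpt = torch.load("zipformer/exp_pretrain/epoch-20.pt",
                  map_location="cpu", weights_only=False)
state = ckpt.get("model", ckpt)  # fall back to a raw state dict if there is no "model" key
num_params = sum(v.numel() for v in state.values() if torch.is_tensor(v))
print(f"{num_params / 1e6:.1f}M parameters")
EOF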
Hi @yfyeung,
I'm trying to train a multilingual large SSL model with the k2 librispeech recipe using your config (318M parameters).
I made a small change to the recipe: I trained 3000 k-means clusters instead of 500.
I'm stuck with an error at the very first compute_mask_indices stage of Zipformer pre-training:
File "/icefall/egs/librispeech/SSL/zipformer/hubert_ce.py", line 423, in forward
x, mask_indices = self.apply_mask(features, padding_mask, target_list)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/icefall/egs/librispeech/SSL/zipformer/hubert_ce.py", line 319, in apply_mask
mask_indices = compute_mask_indices(
^^^^^^^^^^^^^^^^^^^^^
File "/icefall/egs/librispeech/SSL/zipformer/hubert_ce.py", line 191, in compute_mask_indices
raise ValueError(
ValueError: the entire sequence is masked. sz=8; mask_idc[mask_idc]; index=None
I tried reducing --mask-prob down to 0.1, but no luck.
Can you please provide some guidance on fixing this issue?
Thank you in advance.
Hi @yfyeung, the training is running fine now; there was an issue with audio loading in the bash sox conversion.
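In case it helps anyone who hits the same "the entire sequence is masked" error: sz=8 means an utterance with only 8 frames reached the masking stage, so a single mask span can cover the whole sequence no matter how low --mask-prob is set; here that traced back to broken audio from the sox conversion. A rough sketch for scanning the cuts manifest for suspiciously short cuts (the manifest path is hypothetical; it assumes an lhotse cuts manifest):
python3 - <<'EOF'
from lhotse import load_manifest_lazy

# Hypothetical manifest path; adjust to your data directory.
cuts = load_manifest_lazy("data/fbank/librispeech_cuts_train.jsonl.gz")
short = [c.id for c in cuts if c.duration < 1.0]  # flag cuts under 1 second
print(f"{len(short)} cuts shorter than 1 second")
print(short[:10])
EOF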
OK. Glad to hear that.