DNSMOS scores
Sample human-voice scores (the WAV files are from the tsukuyomi-chan corpus; this dataset was not in the training set).
Average scores for a Japanese speaker's human voice ↓
tsukuyomui_corpus_sample
,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\VOICEACTRESS100_017.wav,5.0001875,16000,1,3.2000952,3.5052588,3.896164,2.9227098969128984,3.2528423553812953,3.8747408680754774,3.3338459
1,.\test\VOICEACTRESS100_012.wav,5.1775,16000,1,2.7019486,2.9619849,3.500944,2.565979346828354,2.8846291622132165,3.6237025378548644,3.7821674
2,.\test\VOICEACTRESS100_003.wav,4.8096875,16000,1,2.1873345,2.6199408,2.827399,2.1621915352276515,2.6273744596126547,3.1010927877041423,3.4254858
3,.\test\VOICEACTRESS100_013.wav,4.8898125,16000,1,2.9094923,3.0740592,4.1051044,2.7186855172813935,2.964647590293625,3.99083596341334,3.0937512
4,.\test\VOICEACTRESS100_009.wav,5.734375,16000,2,2.7630239,3.0332813,3.5787196,2.611414337554783,2.9348871896737814,3.676297288529258,3.829452
5,.\test\VOICEACTRESS100_004.wav,5.4488125,16000,1,2.751989,2.9214494,3.7092264,2.6033311787824274,2.855168354030632,3.761127332034449,3.520155
6,.\test\VOICEACTRESS100_005.wav,10.55375,16000,1,3.3857923,3.6571841,3.9042385,3.047098137709349,3.3469432742323764,3.879440925353043,3.8805087
7,.\test\VOICEACTRESS100_030.wav,4.98825,16000,1,2.7880013,3.2666614,3.3197627,2.6300024481893955,3.0972332957037687,3.4948681538919852,3.9011118
8,.\test\VOICEACTRESS100_016.wav,4.91025,16000,1,3.122046,3.4049766,3.9595494,2.8690361983886943,3.1886048068513237,3.91117498264337,3.331627
9,.\test\VOICEACTRESS100_020.wav,6.2341875,16000,3,2.1416757,2.421304,3.2206821,2.1243002049505084,2.468773497435098,3.419672666404756,3.5495617
And average scores for TTS-synthesized voice ↓
amitaro's corpus, single-speaker fine-tuned model (not uploaded; you can fine-tune on a single speaker yourself and get similar scores)
dnsmos_out_sample
,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\amitaro dataset (raw human voice) emoNormal002.wav,2.761,16000,2,3.8337939,4.0836735,4.4308057,3.3279373705739044,3.5903295083053113,4.148855150656194,3.5687275
1,.\test\amitaro dataset (raw human voice)emoNormal003.wav,3.1690625,16000,3,3.8572223,4.1315084,4.4619784,3.3418322115351735,3.615698270734455,4.162368712074678,3.7228901
2,.\test\amitaro dataset (raw human voice) emoNormal001.wav,1.637125,16000,4,3.26939,3.466292,4.3564534,2.9692884277592313,3.227623923756912,4.1147632738098014,3.3102732
3,.\test\sbv2 amitaro.wav,4.2376875,16000,7,3.1194606,3.6841893,3.3271327,2.867083568671068,3.36297831453127,3.4996973662018243,3.2683856
6,.\test\fix rev17 我輩は猫である(pd).wav,64.0678125,16000,55,3.467322,3.843662,3.9634974,3.095989628156305,3.454443241861787,3.900091444817868,3.7221756
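The averages mentioned above can be computed from the per-file rows; below is a minimal stdlib sketch, assuming the CSV layout shown (the inline sample rows are truncated copies of the first table):

```python
# Average the DNSMOS score columns over all files in a results CSV.
import csv, io
from statistics import mean

SCORE_COLS = ["OVRL", "SIG", "BAK", "P808_MOS"]

def dnsmos_averages(rows):
    """Mean of each DNSMOS score column over a list of CSV row dicts."""
    return {col: mean(float(r[col]) for r in rows) for col in SCORE_COLS}

# Two rows copied (truncated) from the table above.
sample = """,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,VOICEACTRESS100_017.wav,5.00,16000,1,3.20,3.51,3.90,2.9227,3.2528,3.8747,3.3338
1,VOICEACTRESS100_012.wav,5.18,16000,1,2.70,2.96,3.50,2.5660,2.8846,3.6237,3.7822
"""

avgs = dnsmos_averages(list(csv.DictReader(io.StringIO(sample))))
print({k: round(v, 3) for k, v in avgs.items()})
```

For the full tables, replace the inline string with the actual results file (filename assumed), e.g. `list(csv.DictReader(open("dnsmos_out.csv")))`.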
rev17 is the final version of the generated audio.
Other samples are in the generated-audio directory.
The training and inference (WebUI) code in this fork uses less VRAM than the original, and this model only works with this repo: https://github.com/q9uri/index-tts-ja
The model was pretrained on the JVNV corpus, created by Shinnosuke Takamichi-sensei and Japanese voice actors!
It also uses reazon-speech-v2-denoised.
The original ReazonSpeech was created by the Reazon team; the source audio is WAV files from Japanese TV, used under an exception provision of Japanese copyright law (日本国著作権の例外項目).
It was denoised by fishaudio using UVR5 and re-uploaded to Hugging Face by litagin02.
anime-whisper-0.3 was used to create the text transcripts, with kanji in the suppress tokens, so the transcripts are nearly kana-only text.
The model was trained on a single RTX 3060 (power limit set to 60%, so power draw is about that of an RTX A2000).
Batch size 1, without AMP (haha, I forgot to enable it; I recommend using AMP).
The GPU came from the ローカルLLMに向き合う会 (Local LLM meetup) hackathon. Thanks, サルドラ (@saldra) and ゆづき!
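Since AMP is recommended above, here is a generic PyTorch mixed-precision training step as an illustration; this is not this fork's actual training loop, and it falls back to full precision when CUDA is unavailable:

```python
# Generic PyTorch AMP training step (illustrative sketch only).
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model.to(device)
# GradScaler rescales the loss so fp16 gradients don't underflow.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

x = torch.randn(8, 4, device=device)
y = torch.randn(8, 1, device=device)

opt.zero_grad()
with torch.autocast(device_type=device, enabled=use_cuda):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()  # backward on the scaled loss
scaler.step(opt)               # unscales gradients, then optimizer step
scaler.update()
```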
Want to support me?
You can buy a GPU for me via my amazon.jp wishlist.
Download the custom pretrained models:
https://huggingface.co/WariHima/index-tts-japanese-prosody
The other original index-tts2 weight files are also needed.
Inference requires CUDA 12.8 and 8 GB of VRAM (created voice length × 2 sec).
Rename 36000+6000x8x6cycle_steps.pth to gpt.pth and copy it to ./checkpoints.
Copy japanese-bpe.model to ./checkpoints; do not rename it.
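The checkpoint setup above can be sketched as a small script; the function name and paths are illustrative, not part of the repo:

```python
# Place the downloaded weights where webui.py expects them.
from pathlib import Path
import shutil

def install_checkpoints(gpt_src, bpe_src, ckpt_dir="./checkpoints"):
    """Copy the downloaded weight files into the checkpoints directory."""
    ckpt = Path(ckpt_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    # The fine-tuned GPT weights must be renamed to gpt.pth.
    shutil.copy(gpt_src, ckpt / "gpt.pth")
    # The BPE tokenizer keeps its original filename.
    shutil.copy(bpe_src, ckpt / Path(bpe_src).name)

# Usage (assuming the files sit in the current directory):
# install_checkpoints("36000+6000x8x6cycle_steps.pth", "japanese-bpe.model")
```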
Run the WebUI with: python webui.py