DNSMOS scores

Sample scores for real human speech from a Japanese speaker. The WAV files come from the Tsukuyomi-chan corpus, which was not part of the training data. ↓
tsukuyomui_corpus_sample

,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\VOICEACTRESS100_017.wav,5.0001875,16000,1,3.2000952,3.5052588,3.896164,2.9227098969128984,3.2528423553812953,3.8747408680754774,3.3338459
1,.\test\VOICEACTRESS100_012.wav,5.1775,16000,1,2.7019486,2.9619849,3.500944,2.565979346828354,2.8846291622132165,3.6237025378548644,3.7821674
2,.\test\VOICEACTRESS100_003.wav,4.8096875,16000,1,2.1873345,2.6199408,2.827399,2.1621915352276515,2.6273744596126547,3.1010927877041423,3.4254858
3,.\test\VOICEACTRESS100_013.wav,4.8898125,16000,1,2.9094923,3.0740592,4.1051044,2.7186855172813935,2.964647590293625,3.99083596341334,3.0937512
4,.\test\VOICEACTRESS100_009.wav,5.734375,16000,2,2.7630239,3.0332813,3.5787196,2.611414337554783,2.9348871896737814,3.676297288529258,3.829452
5,.\test\VOICEACTRESS100_004.wav,5.4488125,16000,1,2.751989,2.9214494,3.7092264,2.6033311787824274,2.855168354030632,3.761127332034449,3.520155
6,.\test\VOICEACTRESS100_005.wav,10.55375,16000,1,3.3857923,3.6571841,3.9042385,3.047098137709349,3.3469432742323764,3.879440925353043,3.8805087
7,.\test\VOICEACTRESS100_030.wav,4.98825,16000,1,2.7880013,3.2666614,3.3197627,2.6300024481893955,3.0972332957037687,3.4948681538919852,3.9011118
8,.\test\VOICEACTRESS100_016.wav,4.91025,16000,1,3.122046,3.4049766,3.9595494,2.8690361983886943,3.1886048068513237,3.91117498264337,3.331627
9,.\test\VOICEACTRESS100_020.wav,6.2341875,16000,3,2.1416757,2.421304,3.2206821,2.1243002049505084,2.468773497435098,3.419672666404756,3.5495617
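
The per-file rows above can be reduced to one average per score column. The following is a minimal sketch, assuming the table is available as CSV text; the function name `mean_scores` and the `SAMPLE` constant are hypothetical helpers for illustration and are not part of the DNSMOS tooling. `SAMPLE` holds only the first two rows of the table above.

```python
import csv
import io
from statistics import mean

# Score columns present in the DNSMOS output shown above.
SCORE_COLUMNS = ["OVRL", "SIG", "BAK", "P808_MOS"]

# First two rows of the table above (raw string, because the paths contain backslashes).
SAMPLE = r""",filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\VOICEACTRESS100_017.wav,5.0001875,16000,1,3.2000952,3.5052588,3.896164,2.9227098969128984,3.2528423553812953,3.8747408680754774,3.3338459
1,.\test\VOICEACTRESS100_012.wav,5.1775,16000,1,2.7019486,2.9619849,3.500944,2.565979346828354,2.8846291622132165,3.6237025378548644,3.7821674
"""


def mean_scores(csv_text, columns=SCORE_COLUMNS):
    """Return the mean of each requested column over all rows of a DNSMOS CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no rows found in CSV text")
    return {col: mean(float(row[col]) for row in rows) for col in columns}


averages = mean_scores(SAMPLE)
print(averages)
```

To average a whole result file, read it as text and pass the contents to `mean_scores`. The same function works for the synthesized-speech table, since both tables share the same columns.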

Scores for TTS-synthesized speech ↓

Single-speaker model fine-tuned on Amitaro's corpus (not uploaded; if you fine-tune on a single speaker yourself, you should get similar scores)
dnsmos_out_sample

,filename,len_in_sec,sr,num_hops,OVRL_raw,SIG_raw,BAK_raw,OVRL,SIG,BAK,P808_MOS
0,.\test\amitaro dataset (raw human voice) emoNormal002.wav,2.761,16000,2,3.8337939,4.0836735,4.4308057,3.3279373705739044,3.5903295083053113,4.148855150656194,3.5687275
1,.\test\amitaro dataset (raw human voice)emoNormal003.wav,3.1690625,16000,3,3.8572223,4.1315084,4.4619784,3.3418322115351735,3.615698270734455,4.162368712074678,3.7228901
2,.\test\amitaro dataset (raw human voice) emoNormal001.wav,1.637125,16000,4,3.26939,3.466292,4.3564534,2.9692884277592313,3.227623923756912,4.1147632738098014,3.3102732
3,.\test\sbv2 amitaro.wav,4.2376875,16000,7,3.1194606,3.6841893,3.3271327,2.867083568671068,3.36297831453127,3.4996973662018243,3.2683856
6,.\test\fix rev17 我輩は猫である(pd).wav,64.0678125,16000,55,3.467322,3.843662,3.9634974,3.095989628156305,3.454443241861787,3.900091444817868,3.7221756

Final version: generated audio rev17

Other audio samples are in the generated audio directory.

The training and inference (web GUI) code is in this fork, which uses less VRAM than the original. This model only works with this repo: https://github.com/q9uri/index-tts-ja

The model was pretrained on the JVNV corpus, created by Takamichi Shinnosuke sensei and Japanese voice actors!

and on reazon-speech-v2-denoised.

The original ReazonSpeech was created by the Reazon team. The source audio comes from Japanese TV broadcasts and is used under the exception provisions of Japanese copyright law.

It was denoised by fishaudio (using UVR5) and re-uploaded to Hugging Face by litagin02.

The transcripts were created with anime-whisper-0.3 with kanji placed in the suppressed tokens, so the transcripts are almost entirely kana.

The model was trained on a single RTX 3060 (power limit set to 60% of maximum, which makes it roughly comparable to an RTX A2000).
Batch size was 1 and AMP was not used (haha, I forgot; I recommend using AMP).

I got the GPU at the ローカルLLMに向き合う会 hackathon. Thanks to サルドラ (@saldra) and ゆづき.

Would you like to support me?
Buy me a GPU from my amazon.jp wish list

Download the custom pretrained models:
https://huggingface.co/WariHima/index-tts-japanese-prosody

The other original index-tts2 weight files are also required.

Inference needs CUDA 12.8 and 8 GB of VRAM, and takes roughly (generated audio length × 2) seconds.

Rename 36000+6000x8x6cycle_steps.pth to gpt.pth and copy it to ./checkpoints.
Copy japanese-bpe.model to ./checkpoints without renaming it.
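
The two copy steps above can also be scripted. This is a minimal sketch, assuming both files have already been downloaded into one local directory. The function name `install_checkpoints` and its `download_dir` argument are hypothetical and are not part of the index-tts-ja repo. Only the file names and the ./checkpoints target come from the instructions above.

```python
import shutil
from pathlib import Path

# File names taken from the instructions above.
GPT_SOURCE_NAME = "36000+6000x8x6cycle_steps.pth"
BPE_NAME = "japanese-bpe.model"


def install_checkpoints(download_dir, checkpoints_dir="./checkpoints"):
    """Copy the downloaded files into the checkpoints directory.

    `download_dir` is an assumed location that holds both downloaded files.
    """
    src = Path(download_dir)
    dst = Path(checkpoints_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # The GPT weights are renamed to gpt.pth while copying.
    shutil.copy2(src / GPT_SOURCE_NAME, dst / "gpt.pth")
    # The BPE model keeps its original file name.
    shutil.copy2(src / BPE_NAME, dst / BPE_NAME)
    return dst
```

Calling `install_checkpoints("path/to/downloads")` from the repo root would place both files in ./checkpoints. The originals are copied rather than moved, so the download directory stays intact.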

Run the web UI: python webui.py
