# mlx-community/mel-roformer-kim-vocal-2-mlx
This model was converted to MLX format from KimberleyJSN/melbandroformer using a custom Mel-Band-RoFormer MLX port (the Blaizzy/mlx-audio PR + the xocialize/mel-roformer-mlx-swift Swift consumer). Refer to the original model card for more details on the model.
## Model
- Family: Mel-Band-RoFormer (Lu, Wang, Won, "Mel-Band RoFormer for Music Source Separation," arXiv:2310.01809)
- Checkpoint: Kim Vocal 2 by Kimberley Jensen
- Parameters: ~228M
- Sample rate: 44100 Hz, stereo
- Stems produced: vocals (single-stem model; derive an instrumental as `mixture - vocals`)
- Chunk size: 352800 samples (~8 s at 44.1 kHz), 50% overlap
- STFT: `n_fft=2048, hop_length=441, win_length=2048`
- Transformer: `dim=384, depth=6, heads=8, dim_head=64`
- Bands: 60 mel bands
- Mask estimator depth: 2
Full hyperparameters are in `config.json`.
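For orientation, here are the same hyperparameters collected into a single Python mapping. This is a sketch only: the key names are illustrative and not guaranteed to match the actual `config.json` schema.

```python
# Illustrative summary of the Kim Vocal 2 hyperparameters listed above.
# Key names are assumptions; consult config.json for the real schema.
KIM_VOCAL_2 = {
    "sample_rate": 44100,        # stereo input/output
    "chunk_size": 352800,        # 352800 / 44100 = 8.0 s per chunk
    "chunk_overlap": 0.5,        # 50% overlap between chunks
    "n_fft": 2048,
    "hop_length": 441,
    "win_length": 2048,
    "dim": 384,
    "depth": 6,
    "heads": 8,
    "dim_head": 64,
    "num_bands": 60,             # mel bands
    "mask_estimator_depth": 2,
}
```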
## Source
- Original repository: https://huggingface.co/KimberleyJSN/melbandroformer
- Original author: Kimberley Jensen (@KimberleyJSN)
- Original license: MIT (relicensed from GPL-3.0; see License provenance below)
- Original commit at conversion time: `ac9b061`
- Source file: `MelBandRoformer.ckpt`
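To fetch the exact source checkpoint this conversion started from, the original repo can be pinned to that commit with `huggingface_hub` (a sketch; the full hash is the relicense commit cited under License provenance below):

```python
from huggingface_hub import hf_hub_download

# Download MelBandRoformer.ckpt at the commit used for this conversion.
ckpt_path = hf_hub_download(
    repo_id="KimberleyJSN/melbandroformer",
    filename="MelBandRoformer.ckpt",
    revision="ac9b0614ab3cd7f77219e18ba494dfd93956c348",  # short hash: ac9b061
)
print(ckpt_path)
```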
## License

This redistribution is MIT-licensed, matching the original checkpoint license. See `LICENSE`.
### License provenance (Kim Vocal 2)

This checkpoint was originally released under GPL-3.0 on June 17, 2025 (see https://huggingface.co/KimberleyJSN/melbandroformer/discussions/2) and relicensed to MIT on April 22, 2026 by the original author, Kimberley Jensen (commit https://huggingface.co/KimberleyJSN/melbandroformer/commit/ac9b0614ab3cd7f77219e18ba494dfd93956c348). This MLX conversion was made after the relicense and inherits the MIT terms.

The relicense was independently confirmed with the original author during the week of April 20, 2026, prior to this redistribution.
| Date | Event | Reference |
|---|---|---|
| 2025-06-17 | Original license assigned: GPL-3.0 | Discussion #2 |
| 2026-04-22 | Relicensed to MIT (Hugging Face README updated, `license: mit`) | Commit `ac9b061` |
| Week of 2026-04-20 | Direct confirmation with author | Private correspondence |
| 2026-04-25 | Repo `license: mit` badge confirmed | README front-matter |
## Conversion

- Tool: `mlx_audio.sts.models.mel_roformer.convert`, merged upstream into Blaizzy/mlx-audio via PR #654 (2026-04-27) and shipped in `mlx-audio==0.4.3` and later.
- Tool version at conversion time: `8380ab8` on the `feat/mel-band-roformer` branch (xocialize fork); this is the exact commit that produced `model.safetensors`. The merged upstream code is functionally equivalent.
- mlx (Python) version: 0.31.0
- Architecture preset: `MelRoFormerConfig.kim_vocal_2()`
- Output precision: bfloat16
- Source `MelBandRoformer.ckpt` SHA-256: `87201f4d31afb5bc79993230fc49446918425574db48c01c405e44f365c7559e`
- Conversion date: 2026-04-25
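To confirm a local copy of the source checkpoint matches the file used for this conversion, check it against the SHA-256 above (a minimal sketch; the local path is an assumption):

```python
import hashlib

EXPECTED_SHA256 = "87201f4d31afb5bc79993230fc49446918425574db48c01c405e44f365c7559e"

def sha256_of(path: str, block_bytes: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large checkpoints stay out of memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_bytes), b""):
            digest.update(block)
    return digest.hexdigest()

assert sha256_of("MelBandRoformer.ckpt") == EXPECTED_SHA256, "checkpoint mismatch"
```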
## Parity
Verified against the PyTorch reference implementation:
- SDR: 66.08 dB between PyTorch and MLX outputs (target: > 40 dB per the upload guide). The upload guide treats > 40 dB SDR as effectively bit-exact up to floating-point precision; 66 dB indicates the bf16 conversion is faithful to the lucidrains PyTorch reference. A sketch of the SDR computation follows this list.
- PyTorch reference: `bs_roformer==0.3.10`, with `bs_roformer.MelBandRoformer` instantiated from the original training YAML. Newer `bs_roformer` releases (0.4+) are not compatible with the Kim Vocal 2 checkpoint (the `layers` ModuleList nesting was reordered and nGPT-style normalization was added).
- Test signal: 8-second stereo 44.1 kHz clip (mid-episode anime audio, mono → stereo duplication, music + dialogue mix).
- Reproduce: see `mlx_audio/tests/sts/test_mel_roformer_parity.py` and `tests/sts/torch_infer.py` in the xocialize/mlx-audio fork.
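The parity number is a plain signal-to-distortion ratio between the two implementations' outputs. A minimal sketch of that computation, assuming the standard energy-ratio definition (the actual parity test in the fork may differ in detail):

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """SDR in dB: 10 * log10(||ref||^2 / ||ref - est||^2).

    Here `reference` is the PyTorch output and `estimate` the MLX output
    (same shape, e.g. [2, samples]); higher means closer agreement.
    """
    noise = reference - estimate
    return float(10.0 * np.log10(np.sum(reference**2) / np.sum(noise**2)))

# Example: sdr_db(torch_vocals, mlx_vocals) -> ~66 dB for this conversion.
```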
## Intended use

Vocal isolation for music source separation. Input is a stereo music mixture; output is a separated vocal stem. Trained for vocals; not validated for general-purpose source separation (drums, bass, other). Derive an instrumental stem as `mixture - vocals` if needed.
## Files

- `model.safetensors`: MLX weights
- `config.json`: architecture hyperparameters
- `LICENSE`: MIT license text
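To inspect the converted weights without instantiating the model, MLX can read the safetensors file directly (a sketch, assuming a local download of this repo):

```python
import mlx.core as mx

# mx.load reads a .safetensors file into a dict of name -> mx.array.
weights = mx.load("model.safetensors")
print(f"{len(weights)} tensors")
for name in list(weights)[:5]:
    w = weights[name]
    print(name, w.shape, w.dtype)  # dtype should be bfloat16 per the conversion notes
```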
## Usage

### Python (mlx-audio)

The Mel-Band-RoFormer architecture is included in `mlx-audio>=0.4.3` (merged via PR #654 on 2026-04-27). Install with `pip install "mlx-audio>=0.4.3"`.
```python
import soundfile as sf
import numpy as np
from mlx_audio.sts.models.mel_roformer import MelRoFormer, MelRoFormerConfig
from mlx_audio.utils import load_audio

# 1. Load model + weights from the Hub.
model = MelRoFormer.from_pretrained(
    "mlx-community/mel-roformer-kim-vocal-2-mlx",
    config=MelRoFormerConfig.kim_vocal_2(),  # optional if config.json is present
)
model.eval()

# 2. Load the input mixture as 44.1 kHz stereo and add a batch axis.
mixture = load_audio("input_mixture.wav", sample_rate=44100)  # mx.array [2, samples]
batched = mixture[None, ...]  # [1, 2, samples]

# 3. Separate vocals.
vocals = model(batched)[0]  # [2, samples]

# 4. Derive the instrumental as (mixture - vocals).
instrumental = mixture - vocals

# 5. Write stems to disk (soundfile expects [samples, channels]).
sf.write("vocals.wav", np.array(vocals).T, 44100)
sf.write("instrumental.wav", np.array(instrumental).T, 44100)
```
For long inputs, chunk the audio at `chunk_size = 352800` samples with 50% overlap and overlap-add the outputs; see the model code for the canonical helper once it's added.
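Until that helper lands, here is a minimal overlap-add sketch. It is an illustration, not the canonical implementation: the helper name and the triangular-window choice are assumptions, and `model` is assumed to accept a `[1, 2, 352800]` batch as in the example above.

```python
import numpy as np
import mlx.core as mx

CHUNK = 352800        # ~8 s at 44.1 kHz
HOP = CHUNK // 2      # 50% overlap

def separate_long(model, mixture: np.ndarray) -> np.ndarray:
    """Chunked inference with 50% overlap and triangular overlap-add.

    mixture: float array of shape [2, samples]; returns vocals, same shape.
    """
    n = mixture.shape[-1]
    out = np.zeros_like(mixture)
    weight = np.zeros(n)
    window = np.bartlett(CHUNK)  # triangular fade so chunk seams blend smoothly
    for start in range(0, n, HOP):
        end = min(start + CHUNK, n)
        chunk = np.zeros((2, CHUNK), dtype=mixture.dtype)
        chunk[:, : end - start] = mixture[:, start:end]  # zero-pad the final chunk
        vocals = np.array(model(mx.array(chunk)[None, ...])[0])  # [2, CHUNK]
        out[:, start:end] += vocals[:, : end - start] * window[: end - start]
        weight[start:end] += window[: end - start]
    return out / np.maximum(weight, 1e-8)  # normalize accumulated window weight
```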
### Swift (mel-roformer-mlx-swift)
```swift
import SwiftRoFormer

// One-shot Hub download + weight load. The bundled config.json's
// `checkpoint_family` field auto-resolves the correct preset.
let model = try await MelRoFormer.fromPretrained(
    "mlx-community/mel-roformer-kim-vocal-2-mlx"
)

// Forward pass: input is (batch, channels, samples) at 44100 Hz.
let vocals = model(input)
```
For lower-level control (explicit configuration, pre-downloaded weights, custom Hub clients), construct the model directly and pass a local URL to `WeightLoader.loadWeights`:

```swift
let model = MelRoFormer(config: .kimVocal2)
try WeightLoader.loadWeights(into: model, from: localWeightsURL)
```
## Citation
If you use this checkpoint, please cite the Mel-Band-RoFormer paper and the original Kim Vocal 2 release:
```bibtex
@misc{lu2023melband,
  title         = {Mel-Band {RoFormer} for Music Source Separation},
  author        = {Lu, Wei-Tsung and Wang, Ju-Chiang and Won, Minz and Choi, Keunwoo and Song, Xuchen},
  year          = {2023},
  eprint        = {2310.01809},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2310.01809}
}
```

```bibtex
@misc{kim_vocal_2_2025,
  title  = {Kim Vocal 2: Mel-Band-RoFormer for vocal source separation},
  author = {Jensen, Kimberley},
  year   = {2025},
  url    = {https://huggingface.co/KimberleyJSN/melbandroformer}
}
```
The training-time configurations are from ZFTurbo/Music-Source-Separation-Training (MIT). The MLX implementation lineage is lucidrains/BS-RoFormer (MIT) → Blaizzy/mlx-audio (Apache-2.0).