mlx-community/mel-roformer-kim-vocal-2-mlx

This model was converted to MLX format from KimberleyJSN/melbandroformer using a custom Mel-Band-RoFormer MLX port (the Blaizzy/mlx-audio PR, consumed on the Swift side by xocialize/mel-roformer-mlx-swift). Refer to the original model card for more details on the model.

Model

  • Family: Mel-Band-RoFormer (Lu et al., "Mel-Band RoFormer for Music Source Separation," arXiv:2310.01809)
  • Checkpoint: Kim Vocal 2 by Kimberley Jensen
  • Parameters: ~228M
  • Sample rate: 44100 Hz, stereo
  • Stems produced: vocals (single-stem model; derive the instrumental as mixture - vocals)
  • Chunk size: 352800 samples (~8 s at 44.1 kHz), 50% overlap
  • STFT: n_fft=2048, hop_length=441, win_length=2048
  • Transformer: dim=384, depth=6, heads=8, dim_head=64
  • Bands: 60 mel bands
  • Mask estimator depth: 2

Full hyperparameters in config.json.
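
A quick arithmetic check on the figures above (the frame count assumes a centered STFT, which is the usual convention but not stated in config.json):

```python
# Figures implied by the chunk and STFT settings listed above.
SAMPLE_RATE = 44100
CHUNK_SIZE = 352800
HOP_LENGTH = 441
N_FFT = 2048

chunk_seconds = CHUNK_SIZE / SAMPLE_RATE         # 8.0 s exactly
frames_per_chunk = CHUNK_SIZE // HOP_LENGTH + 1  # 801 frames, assuming a centered STFT
freq_bins = N_FFT // 2 + 1                       # 1025 linear bins before the 60-band mel projection
```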

Source

Original checkpoint: https://huggingface.co/KimberleyJSN/melbandroformer (Kim Vocal 2 by Kimberley Jensen; MelBandRoformer.ckpt).

License

This redistribution is MIT-licensed, matching the original checkpoint license. See LICENSE.

License provenance (Kim Vocal 2)

This checkpoint was originally released under GPL-3.0 on June 17, 2025 (see https://huggingface.co/KimberleyJSN/melbandroformer/discussions/2) and relicensed to MIT on April 22, 2026 by the original author Kimberley Jensen (commit https://huggingface.co/KimberleyJSN/melbandroformer/commit/ac9b0614ab3cd7f77219e18ba494dfd93956c348). This MLX conversion was made after the relicense and inherits the MIT terms.

The relicense was independently confirmed with the original author the week of April 20, 2026 prior to this redistribution.

| Date | Event | Reference |
|------|-------|-----------|
| 2025-06-17 | Original license assigned: GPL-3.0 | Discussion #2 |
| 2026-04-22 | Relicense to MIT (HuggingFace README updated, license: mit) | Commit ac9b061 |
| Week of 2026-04-20 | Direct confirmation with author | Private correspondence |
| 2026-04-25 | Repo license: mit badge confirmed | README front-matter |

Conversion

  • Tool: mlx_audio.sts.models.mel_roformer.convert, merged upstream into Blaizzy/mlx-audio via PR #654 (2026-04-27) and shipped in mlx-audio==0.4.3 and later.
  • Tool version at conversion time: 8380ab8 on the feat/mel-band-roformer branch (xocialize fork); this is the exact commit that produced model.safetensors. The merged upstream code is functionally equivalent.
  • mlx (Python) version: 0.31.0
  • Architecture preset: MelRoFormerConfig.kim_vocal_2()
  • Output precision: bfloat16
  • Source MelBandRoformer.ckpt SHA-256: 87201f4d31afb5bc79993230fc49446918425574db48c01c405e44f365c7559e
  • Conversion date: 2026-04-25
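
To confirm you are starting from the same source checkpoint, hash it before converting. A minimal sketch (the filename is the one from the original repo; adjust the path as needed):

```python
import hashlib

EXPECTED_SHA256 = "87201f4d31afb5bc79993230fc49446918425574db48c01c405e44f365c7559e"

def sha256_of(path: str, block_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# sha256_of("MelBandRoformer.ckpt") should equal EXPECTED_SHA256
```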

Parity

Verified against the PyTorch reference implementation:

  • SDR: 66.08 dB between PyTorch and MLX outputs (target: > 40 dB per the upload guide). The upload guide treats > 40 dB SDR as effectively bit-exact up to floating-point precision; 66 dB indicates the bf16 conversion is faithful to the lucidrains PyTorch reference.
  • PyTorch reference: bs_roformer==0.3.10, with bs_roformer.MelBandRoformer instantiated from the original training YAML. Newer bs_roformer releases (0.4+) are not compatible with the Kim Vocal 2 checkpoint (the layers ModuleList nesting was reordered and nGPT-style normalization was added).
  • Test signal: 8-second stereo 44.1 kHz clip (mid-episode anime audio, mono-to-stereo duplication, music + dialogue mix).
  • Reproduce: see mlx_audio/tests/sts/test_mel_roformer_parity.py and tests/sts/torch_infer.py in the xocialize/mlx-audio fork.
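
The card does not spell out its SDR formula; assuming the plain signal-to-distortion ratio (reference energy over residual energy, in dB), the parity metric can be sketched as:

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-distortion ratio in dB; higher means closer to the reference."""
    signal_energy = float(np.sum(np.square(reference)))
    residual_energy = float(np.sum(np.square(reference - estimate)))
    return 10.0 * np.log10(signal_energy / max(residual_energy, 1e-12))
```

Under this definition, 66 dB means the MLX residual carries roughly 2.5 × 10^-7 of the reference energy.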

Intended use

Vocal isolation for music source separation. Input is a stereo music mixture; output is a separated vocal stem. Trained for vocals; not validated for general-purpose source separation (drums, bass, other). Derive an instrumental stem as mixture - vocals if needed.

Files

  • model.safetensors: MLX weights
  • config.json: architecture hyperparameters
  • LICENSE: MIT license text

Usage

Python (mlx-audio)

The Mel-Band-RoFormer architecture is included in mlx-audio>=0.4.3 (merged via PR #654 on 2026-04-27). Install with pip install "mlx-audio>=0.4.3".

import soundfile as sf
import numpy as np
import mlx.core as mx

from mlx_audio.sts.models.mel_roformer import MelRoFormer, MelRoFormerConfig
from mlx_audio.utils import load_audio

# 1. Load model + weights from the Hub.
model = MelRoFormer.from_pretrained(
    "mlx-community/mel-roformer-kim-vocal-2-mlx",
    config=MelRoFormerConfig.kim_vocal_2(),  # optional if config.json is present
)
model.eval()

# 2. Load the input mixture as 44.1 kHz stereo and add a batch axis.
mixture = load_audio("input_mixture.wav", sample_rate=44100)  # mx.array [2, samples]
batched = mixture[None, ...]                                  # [1, 2, samples]

# 3. Separate vocals.
vocals = model(batched)[0]                                    # [2, samples]

# 4. Derive instrumental as (mixture - vocals).
instrumental = mixture - vocals

# 5. Write stems to disk (soundfile expects [samples, channels]).
sf.write("vocals.wav",       np.array(vocals).T,       44100)
sf.write("instrumental.wav", np.array(instrumental).T, 44100)

For long inputs, chunk the audio at chunk_size = 352800 samples with 50% overlap and overlap-add the outputs; see the model code for the canonical helper once it's added.
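
Until that helper lands, here is a minimal NumPy sketch of the recipe. The Hann cross-fade and edge normalization are illustrative assumptions, not the canonical implementation, and model_fn stands in for any call mapping one [channels, CHUNK] array to a same-shaped vocal stem (e.g. a wrapper around the MLX model from the usage example):

```python
import numpy as np

CHUNK = 352800
HOP = CHUNK // 2  # 50% overlap

def separate_long(model_fn, mixture: np.ndarray) -> np.ndarray:
    """Chunked inference with windowed overlap-add.

    mixture: [channels, samples]; model_fn: [channels, CHUNK] -> [channels, CHUNK].
    """
    n = mixture.shape[-1]
    # Pad so every sample is covered by at least one full chunk.
    padded = np.pad(mixture, ((0, 0), (0, (-n) % HOP + CHUNK)))
    out = np.zeros_like(padded)
    norm = np.zeros(padded.shape[-1])
    window = np.hanning(CHUNK)  # cross-fade weight for overlapping halves
    for start in range(0, padded.shape[-1] - CHUNK + 1, HOP):
        segment = padded[:, start:start + CHUNK]
        out[:, start:start + CHUNK] += model_fn(segment) * window
        norm[start:start + CHUNK] += window
    # Divide out the accumulated window weight (guard against zeros at the edges).
    return out[:, :n] / np.maximum(norm[:n], 1e-8)
```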

Swift (mel-roformer-mlx-swift)

import SwiftRoFormer

// One-shot Hub download + weight load. The bundled config.json's
// `checkpoint_family` field auto-resolves the correct preset.
let model = try await MelRoFormer.fromPretrained(
    "mlx-community/mel-roformer-kim-vocal-2-mlx"
)

// Forward pass: input is (batch, channels, samples) at 44100 Hz.
let vocals = model(input)

For lower-level control (explicit configuration, pre-downloaded weights, custom Hub clients), construct the model directly and pass a local URL to WeightLoader.loadWeights:

let model = MelRoFormer(config: .kimVocal2)
try WeightLoader.loadWeights(into: model, from: localWeightsURL)

Citation

If you use this checkpoint, please cite the Mel-Band-RoFormer paper and the original Kim Vocal 2 release:

@misc{lu2023melband,
  title         = {Mel-Band {RoFormer} for Music Source Separation},
  author        = {Lu, Wei-Tsung and Wang, Ju-Chiang and Won, Minz and Choi, Keunwoo and Song, Xuchen},
  year          = {2023},
  eprint        = {2310.01809},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2310.01809}
}

@misc{kim_vocal_2_2025,
  title  = {Kim Vocal 2: Mel-Band-RoFormer for vocal source separation},
  author = {Jensen, Kimberley},
  year   = {2025},
  url    = {https://huggingface.co/KimberleyJSN/melbandroformer}
}

The training-time configurations are from ZFTurbo/Music-Source-Separation-Training (MIT). The MLX implementation lineage is lucidrains/BS-RoFormer (MIT) → Blaizzy/mlx-audio (Apache-2.0).
