mlx-community/mel-roformer-kim-vocal-2-mlx

This model was converted to MLX format from KimberleyJSN/melbandroformer using a custom Mel-Band-RoFormer MLX port (the Blaizzy/mlx-audio PR, consumed on the Swift side by xocialize/mel-roformer-mlx-swift). Refer to the original model card for more details on the model.

Model

  • Family: Mel-Band-RoFormer (Lu et al., "Mel-Band RoFormer for Music Source Separation," arXiv:2310.01809)
  • Checkpoint: Kim Vocal 2 by Kimberley Jensen
  • Parameters: ~228M
  • Sample rate: 44100 Hz, stereo
  • Stems produced: vocals (single-stem model; derive the instrumental as mixture - vocals)
  • Chunk size: 352800 samples (~8 s at 44.1 kHz), 50% overlap
  • STFT: n_fft=2048, hop_length=441, win_length=2048
  • Transformer: dim=384, depth=6, heads=8, dim_head=64
  • Bands: 60 mel bands
  • Mask estimator depth: 2

Full hyperparameters in config.json.
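
A quick arithmetic check on the figures above (the frame count assumes a centered STFT, which is the usual convention but not stated in config.json):

```python
# Figures implied by the chunk and STFT settings listed above.
SAMPLE_RATE = 44100
CHUNK_SIZE = 352800
HOP_LENGTH = 441
N_FFT = 2048

chunk_seconds = CHUNK_SIZE / SAMPLE_RATE         # 8.0 s exactly
frames_per_chunk = CHUNK_SIZE // HOP_LENGTH + 1  # 801 frames, assuming a centered STFT
freq_bins = N_FFT // 2 + 1                       # 1025 linear bins before the 60-band mel projection
```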

Source

Original checkpoint: https://huggingface.co/KimberleyJSN/melbandroformer (Kim Vocal 2 by Kimberley Jensen; MelBandRoformer.ckpt).

License

This redistribution is MIT-licensed, matching the original checkpoint license. See LICENSE.

License provenance (Kim Vocal 2)

This checkpoint was originally released under GPL-3.0 on June 17, 2025 (see https://huggingface.co/KimberleyJSN/melbandroformer/discussions/2) and relicensed to MIT on April 22, 2026 by the original author Kimberley Jensen (commit https://huggingface.co/KimberleyJSN/melbandroformer/commit/ac9b0614ab3cd7f77219e18ba494dfd93956c348). This MLX conversion was made after the relicense and inherits the MIT terms.

The relicense was independently confirmed with the original author the week of April 20, 2026 prior to this redistribution.

| Date | Event | Reference |
|------|-------|-----------|
| 2025-06-17 | Original license assigned: GPL-3.0 | Discussion #2 |
| 2026-04-22 | Relicense to MIT (HuggingFace README updated, license: mit) | Commit ac9b061 |
| Week of 2026-04-20 | Direct confirmation with author | Private correspondence |
| 2026-04-25 | Repo license: mit badge confirmed | README front-matter |

Conversion

  • Tool: mlx_audio.sts.models.mel_roformer.convert, merged upstream into Blaizzy/mlx-audio via PR #654 (2026-04-27) and shipped in mlx-audio==0.4.3 and later.
  • Tool version at conversion time: 8380ab8 on the feat/mel-band-roformer branch (xocialize fork); this is the exact commit that produced model.safetensors. The merged upstream code is functionally equivalent.
  • mlx (Python) version: 0.31.0
  • Architecture preset: MelRoFormerConfig.kim_vocal_2()
  • Output precision: bfloat16
  • Source MelBandRoformer.ckpt SHA-256: 87201f4d31afb5bc79993230fc49446918425574db48c01c405e44f365c7559e
  • Conversion date: 2026-04-25
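
To confirm you are starting from the same source checkpoint, hash it before converting. A minimal sketch (the filename is the one from the original repo; adjust the path as needed):

```python
import hashlib

EXPECTED_SHA256 = "87201f4d31afb5bc79993230fc49446918425574db48c01c405e44f365c7559e"

def sha256_of(path: str, block_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# sha256_of("MelBandRoformer.ckpt") should equal EXPECTED_SHA256
```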

Parity

Verified against the PyTorch reference implementation:

  • SDR: 66.08 dB between PyTorch and MLX outputs (target: > 40 dB per the upload guide). The upload guide treats > 40 dB SDR as effectively bit-exact up to floating-point precision; 66 dB indicates the bf16 conversion is faithful to the lucidrains PyTorch reference.
  • PyTorch reference: bs_roformer==0.3.10, with bs_roformer.MelBandRoformer instantiated from the original training YAML. Newer bs_roformer releases (0.4+) are not compatible with the Kim Vocal 2 checkpoint (the layers ModuleList nesting was reordered and nGPT-style normalization was added).
  • Test signal: 8-second stereo 44.1 kHz clip (mid-episode anime audio, mono-to-stereo duplication, music + dialogue mix).
  • Reproduce: see mlx_audio/tests/sts/test_mel_roformer_parity.py and tests/sts/torch_infer.py in the xocialize/mlx-audio fork.
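
The card does not spell out its SDR formula; assuming the plain signal-to-distortion ratio (reference energy over residual energy, in dB), the parity metric can be sketched as:

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-distortion ratio in dB; higher means closer to the reference."""
    signal_energy = float(np.sum(np.square(reference)))
    residual_energy = float(np.sum(np.square(reference - estimate)))
    return 10.0 * np.log10(signal_energy / max(residual_energy, 1e-12))
```

Under this definition, 66 dB means the MLX residual carries roughly 2.5 × 10^-7 of the reference energy.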

Intended use

Vocal isolation for music source separation. Input is a stereo music mixture; output is a separated vocal stem. Trained for vocals; not validated for general-purpose source separation (drums, bass, other). Derive an instrumental stem as mixture - vocals if needed.

Files

  • model.safetensors: MLX weights
  • config.json: architecture hyperparameters
  • LICENSE: MIT license text

Usage

Python (mlx-audio)

The Mel-Band-RoFormer architecture is included in mlx-audio>=0.4.3 (merged via PR #654 on 2026-04-27). Install with pip install "mlx-audio>=0.4.3".

import soundfile as sf
import numpy as np
import mlx.core as mx

from mlx_audio.sts.models.mel_roformer import MelRoFormer, MelRoFormerConfig
from mlx_audio.utils import load_audio

# 1. Load model + weights from the Hub.
model = MelRoFormer.from_pretrained(
    "mlx-community/mel-roformer-kim-vocal-2-mlx",
    config=MelRoFormerConfig.kim_vocal_2(),  # optional if config.json is present
)
model.eval()

# 2. Load the input mixture as 44.1 kHz stereo and add a batch axis.
mixture = load_audio("input_mixture.wav", sample_rate=44100)  # mx.array [2, samples]
batched = mixture[None, ...]                                  # [1, 2, samples]

# 3. Separate vocals.
vocals = model(batched)[0]                                    # [2, samples]

# 4. Derive instrumental as (mixture - vocals).
instrumental = mixture - vocals

# 5. Write stems to disk (soundfile expects [samples, channels]).
sf.write("vocals.wav",       np.array(vocals).T,       44100)
sf.write("instrumental.wav", np.array(instrumental).T, 44100)

For long inputs, chunk the audio at chunk_size = 352800 samples with 50% overlap and overlap-add the outputs; see the model code for the canonical helper once it's added.
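
Until that helper lands, here is a minimal NumPy sketch of the recipe. The Hann cross-fade and edge normalization are illustrative assumptions, not the canonical implementation, and model_fn stands in for any call mapping one [channels, CHUNK] array to a same-shaped vocal stem (e.g. a wrapper around the MLX model from the usage example):

```python
import numpy as np

CHUNK = 352800
HOP = CHUNK // 2  # 50% overlap

def separate_long(model_fn, mixture: np.ndarray) -> np.ndarray:
    """Chunked inference with windowed overlap-add.

    mixture: [channels, samples]; model_fn: [channels, CHUNK] -> [channels, CHUNK].
    """
    n = mixture.shape[-1]
    # Pad so every sample is covered by at least one full chunk.
    padded = np.pad(mixture, ((0, 0), (0, (-n) % HOP + CHUNK)))
    out = np.zeros_like(padded)
    norm = np.zeros(padded.shape[-1])
    window = np.hanning(CHUNK)  # cross-fade weight for overlapping halves
    for start in range(0, padded.shape[-1] - CHUNK + 1, HOP):
        segment = padded[:, start:start + CHUNK]
        out[:, start:start + CHUNK] += model_fn(segment) * window
        norm[start:start + CHUNK] += window
    # Divide out the accumulated window weight (guard against zeros at the edges).
    return out[:, :n] / np.maximum(norm[:n], 1e-8)
```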

Swift (mel-roformer-mlx-swift)

import SwiftRoFormer

// One-shot Hub download + weight load. The bundled config.json's
// `checkpoint_family` field auto-resolves the correct preset.
let model = try await MelRoFormer.fromPretrained(
    "mlx-community/mel-roformer-kim-vocal-2-mlx"
)

// Forward pass: input is (batch, channels, samples) at 44100 Hz.
let vocals = model(input)

For lower-level control (explicit configuration, pre-downloaded weights, custom Hub clients), construct the model directly and pass a local URL to WeightLoader.loadWeights:

let model = MelRoFormer(config: .kimVocal2)
try WeightLoader.loadWeights(into: model, from: localWeightsURL)

Citation

If you use this checkpoint, please cite the Mel-Band-RoFormer paper and the original Kim Vocal 2 release:

@misc{lu2023melband,
  title         = {Mel-Band {RoFormer} for Music Source Separation},
  author        = {Lu, Wei-Tsung and Wang, Ju-Chiang and Won, Minz and Choi, Keunwoo and Song, Xuchen},
  year          = {2023},
  eprint        = {2310.01809},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2310.01809}
}

@misc{kim_vocal_2_2025,
  title  = {Kim Vocal 2: Mel-Band-RoFormer for vocal source separation},
  author = {Jensen, Kimberley},
  year   = {2025},
  url    = {https://huggingface.co/KimberleyJSN/melbandroformer}
}

The training-time configurations are from ZFTurbo/Music-Source-Separation-Training (MIT). The MLX implementation lineage is lucidrains/BS-RoFormer (MIT) → Blaizzy/mlx-audio (Apache-2.0).
