Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models

1Institute of Electronic Music and Acoustics (IEM), University of Music and Performing Arts, Graz, Austria
2Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025

Abstract

Traditional Blind Source Separation Evaluation (BSS-Eval) metrics were originally designed to evaluate linear audio source separation models based on methods such as time-frequency masking. However, recent generative models may introduce nonlinear relationships between the separated and reference signals, limiting the reliability of these metrics for objective evaluation. To address this issue, we conduct a Degradation Category Rating listening test and analyze correlations between the obtained degradation mean opinion scores (DMOS) and a set of objective audio quality metrics for the task of singing voice separation. We evaluate three state-of-the-art discriminative models and two new, competitive generative models. For both discriminative and generative models, intrusive embedding-based metrics show higher correlations with DMOS than conventional intrusive metrics such as BSS-Eval metrics. For discriminative models, the highest correlation is achieved by the MSE computed on Music2Latent embeddings. When it comes to the evaluation of generative models, the strongest correlations are evident for the multi-resolution STFT loss and the MSE calculated on MERT-L12 embeddings, with the latter also providing the most balanced correlation across both model types. Our results highlight the limitations of BSS-Eval metrics for evaluating generative singing voice separation models and emphasize the need for careful selection and validation of alternative evaluation metrics for the task of singing voice separation.

Benchmark your Metrics

If you would like to benchmark your metric using our DMOS data, please follow these steps:
  1. Download the dataset from Zenodo.
  2. Compute your metric on the evaluation audio in the folder gensvs_eval_audio_and_embeddings.
  3. Append your metric scores as the last column of gensvs_eval_data.csv (see the sketch after this list).
  4. When appending the scores, ensure they are in the correct order (check for correct filepaths).
  5. Finally, run gensvs_eval_dmos_metric_correlation_demo.py to obtain the correlation coefficients for the generative and discriminative models.
  6. Don't forget to cite our paper if you use the dataset or the code! ;-)
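
As a minimal sketch of steps 3 and 4, the scores can be appended with pandas as follows (the column name my_metric and the placeholder scores are hypothetical; make sure your scores follow the row order of the CSV):

```python
import pandas as pd

# Load the evaluation sheet shipped with the Zenodo dataset.
df = pd.read_csv("gensvs_eval_data.csv")

# Hypothetical metric scores: computed on the audio in
# gensvs_eval_audio_and_embeddings, ordered to match the CSV rows.
# Replace the zeros with your actual values.
df["my_metric"] = [0.0] * len(df)

# Append as the last column and save, then run
# gensvs_eval_dmos_metric_correlation_demo.py.
df.to_csv("gensvs_eval_data.csv", index=False)
```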

Compute Metrics or Train/Infer Models

If you would like to see how we calculated the included objective evaluation metrics, or to retrain our models or run inference on your own data, please refer to the GitHub repository.

SVS Model Details

This table summarizes additional details about the singing voice separation models evaluated in the paper: the architectures used, the modifications made, and implementation details. All references in the table refer to the reference list of the paper.
Discriminative models

HTDemucs
  Architecture / features:
  • Hybrid-Transformer-Demucs
  • Hybrid-Demucs [27] with a transformer bottleneck
Training details:
  • Loss: L1 waveform
  • Learning rate: 1e-4
  • Batch size: 6
  • Trained on stereo audio
  • 590 epochs
Mel-RoFo. (L)
  Architecture / features:
  • Large band-split RoPE transformer [28] with a Mel-projection layer [29]
  Implementation details:
  • Pre-trained on an undisclosed larger dataset; settings as per [30]
Mel-RoFo. (S)
  Architecture / features:
  • Scaled-down version of Mel-RoFo. (L)
  • Reduced number of latent features (dim): 384 → 192
  • Increased number of RoPE-transformer encoders (depth): 6 → 9
  • Trained from scratch
Training details:
  • Loss: time-domain MAE & multi-resolution complex spectrogram MAE from [29] (see the sketch after this list)
  • Batch size: 1
  • Trained on stereo audio
  • 550 epochs
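
As an illustration of this loss combination, here is a minimal sketch (the FFT sizes, hop sizes, and equal weighting of both terms are assumptions, not the exact settings of [29]):

```python
import torch

def time_domain_mae(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Mean absolute error on raw waveforms of shape (batch, samples).
    return torch.mean(torch.abs(est - ref))

def multires_complex_spec_mae(est, ref, fft_sizes=(1024, 2048, 4096)):
    # MAE between complex STFT coefficients, averaged over resolutions.
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=est.device)
        E = torch.stft(est, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True)
        R = torch.stft(ref, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True)
        loss = loss + torch.mean(torch.abs(E - R))
    return loss / len(fft_sizes)

est, ref = torch.randn(1, 44100), torch.randn(1, 44100)
total_loss = time_domain_mae(est, ref) + multires_complex_spec_mae(est, ref)
```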
Generative models

SGMSVS
  Architecture / features:
  • Score-based generative model for singing voice separation
  • Retrained score-based generative model for speech enhancement (SGMSE) [6]
  • NCSN++ score model from [31]
Training details:
  • Loss: Score matching loss from [6]
  • Batch size: 1
  • Trained on mono audio
  • 550 epochs
Adapted inference parameters:
  • α = 0.0516
  • β = 0.334
  • diffusion steps = 45
  • corrector steps = 2
Mel-RoFo. (S) + BigVGAN
  Architecture / features:
  • Mel-RoFo. (S) followed by a finetuned BigVGAN vocoder
  • BigVGAN model: "bigvgan_v2_44khz_128band_512x"
  • Mel-RoFo. (S) is frozen while finetuning BigVGAN
BigVGAN finetuning details:
  • 650 000 steps
  • Batch size: 1
  • Trained on mono audio
  • Other training settings as per [32]

Metric Details

This table summarizes additional details about the evaluated objective metrics, together with implementation notes. All references in the table refer to the reference list of the paper.
Intrusive metrics

BSS-Eval:
  • SDR: signal-to-distortion ratio
  • SI-SDR: scale-invariant signal-to-distortion ratio
  • SAR: signal-to-artifacts ratio
  • SIR: signal-to-interference ratio
  • ISR: image-to-spatial distortion ratio
Blind Source Separation Evaluation (BSS-Eval) metrics are energy-ratio measures between the reference and the separated signal, where the estimate is decomposed into individual components via projections onto FIR-filtered subspaces of the target and distorting sources [4]. The packages/toolkits used to compute the metrics are listed below:
  • SDR & SI-SDR:
    TorchMetrics [39]
  • SAR:
    fast-bss-eval [40]
  • SIR:
    PEASS Matlab toolkit (v2.0.1) [12]
  • ISR:
    PEASS Matlab toolkit (v2.0.1) [12]
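
For example, SDR and SI-SDR can be computed with TorchMetrics as follows (a minimal sketch; random tensors stand in for the separated estimate and the reference):

```python
import torch
from torchmetrics.audio import (
    ScaleInvariantSignalDistortionRatio,
    SignalDistortionRatio,
)

sdr = SignalDistortionRatio()
si_sdr = ScaleInvariantSignalDistortionRatio()

# Shapes: (batch, samples); replace with separated and reference audio.
est, ref = torch.randn(1, 44100), torch.randn(1, 44100)
print(f"SDR: {sdr(est, ref).item():.2f} dB, "
      f"SI-SDR: {si_sdr(est, ref).item():.2f} dB")
```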
PEASS:
  • OPS: overall perceptual score
  • TPS: target-related perceptual score
  • IPS: interference-related perceptual score
  • APS: artifacts-related perceptual score
The Perceptual Evaluation methods for Audio Source Separation (PEASS) also decompose the signal into distortion components. However, before decomposition, the signal is split into gammatone subbands and segmented into overlapping frames; regression is then used to approximate subjective ratings [12].
  • All PEASS metrics were calculated using the PEASS Matlab toolkit (v2.0.1) [12].
  • The Matlab scripts were executed using the Matlab engine in Python.
ViSQOL:
Similar to PEASS, the Virtual Speech Quality Objective Listener (ViSQOL) employs a perceptual model with a fitted mapping on a spectro-temporal representation.
  • Computed with the ViSQOL v3 command-line tool from [41] in audio mode
  • Audio has to be resampled to 48 kHz
m-res. loss:
The multi-resolution STFT loss (m-res. loss) is computed by averaging the STFT loss defined in the paper over five STFT resolutions (256, 512, 1024, 2048, 4096); see the sketch after this list.
  • Computed using the Auraloss toolbox [42]
  • 75 % overlap
  • A-weighting
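
For example, such a loss can be instantiated with Auraloss roughly as follows (a sketch; the exact loss weights used in the paper may differ):

```python
import torch
import auraloss

fft_sizes = [256, 512, 1024, 2048, 4096]
mrstft = auraloss.freq.MultiResolutionSTFTLoss(
    fft_sizes=fft_sizes,
    hop_sizes=[n // 4 for n in fft_sizes],  # 75 % overlap
    win_lengths=fft_sizes,
    sample_rate=44100,
    perceptual_weighting=True,  # applies an A-weighting prefilter
)

est, ref = torch.randn(1, 1, 44100), torch.randn(1, 1, 44100)
print(mrstft(est, ref))
```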
Embedding-based intrusive metrics

Embedding-MSE:
  • L-CLAPaud (CLa)
  • L-CLAPmus (CLm)
  • MERT-L12 (M-L12)
  • Music2Latent (M2L)
MSE between time-resolved Self-Supervised Learning (SSL) embeddings: Large-scale Contrastive Language-Audio Pretraining audio (CLa) and music (CLm) embeddings [15], the 12th-layer embeddings of the acoustic Music undERstanding model with large-scale self-supervised Training (MERT-L12/M-L12) [26], and Music2Latent (M2L) embeddings [17]; a conceptual sketch follows below.
  • All MSEs were calculated with an adapted version of Microsoft's FAD toolkit (fadtk) [19]
  • We adapted the fadtk code to compute the MSE on time-resolved embeddings and added the calculation of M2L embeddings
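
A conceptual sketch of the metric core, assuming embedding extraction (via fadtk) has already produced frame-aligned sequences for estimate and reference:

```python
import torch

def embedding_mse(emb_est: torch.Tensor, emb_ref: torch.Tensor) -> torch.Tensor:
    """MSE between time-resolved embeddings of shape (frames, dim),
    e.g. MERT-L12 or Music2Latent features of estimate and reference."""
    assert emb_est.shape == emb_ref.shape, "embeddings must be frame-aligned"
    return torch.mean((emb_est - emb_ref) ** 2)
```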
FADsong2song (variant of the Fréchet Audio Distance):
  • Embeddings as for the MSE (see above)
Fréchet distance between Gaussian fits of embedding distributions, fitted individually to the time-resolved embeddings of each song. The same embeddings as for the MSE calculation are used (CLa, CLm, M-L12, M2L).
  • Calculated by adapting fadtk [19]
  • Song-level equation for FAD shown in Eq. (2)
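
For reference, the Fréchet distance between two Gaussians fitted to the reference and estimate embeddings of a song (means μ_r, μ_e and covariances Σ_r, Σ_e) has the standard closed form below, presumably what Eq. (2) of the paper instantiates per song:

```latex
\mathrm{FAD}_{\text{song2song}}
  = \lVert \mu_r - \mu_e \rVert_2^2
  + \operatorname{tr}\!\left( \Sigma_r + \Sigma_e
      - 2 \left( \Sigma_r \Sigma_e \right)^{1/2} \right)
```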
Non-intrusive metrics

XLS-R-SQA:
A MOS-like speech quality assessment model for speech enhancement that generalizes well to unseen data.
  • Python package available at [20]
  • Manual downsampling to 16 kHz required
Audiobox-Aesthetics:
  • Production Quality (PQ)
  • Content Usefulness (CU)
A universal MOS-like audio quality assessment model for speech, music, and sound. Of its four evaluation axes, we analyzed two in our paper (PQ and CU).
  • Python package available at [22]
  • Input is auto-resampled to 16 kHz
PAM:
Another non-intrusive universal MOS-like metric that prompts audio-language models for audio quality assessment (PAM).
  • Python package available at [23]
  • No resampling necessary, as it operates at fs = 44.1 kHz
SingMOS:
A wav2vec 2.0-based MOS predictor for singing voice, trained on MOS ratings of singing voice audio including examples from singing voice conversion and coding models.
  • Code available at [24]
  • Manual downsampling to 16 kHz required

Audio Examples

This table lists the audio examples (with mixture and target provided for reference) and the DMOS achieved by each SVS model.
File-ID   HTDemucs   Mel-RoFo. (S)   Mel-RoFo. (L)   SGMSVS   Mel-RoFo. (S) + BigVGAN
#1        3.08       2.83            3.67            3.25     4.17
#2        3.58       2.33            3.58            3.33     4.08
#3        1.83       3.08            3.75            4.25     4.17
#5        2.42       2.92            2.75            3.50     2.92
#6        1.00       1.17            1.42            1.08     1.58
#10       2.75       4.00            3.92            4.83     4.17
#14       1.58       2.17            2.83            1.58     3.25
#21       2.92       2.75            3.17            3.50     3.67
#22       1.58       2.67            3.42            2.58     3.25
#31       2.50       2.75            3.75            2.50     3.67
#42       3.33       4.00            4.25            4.67     4.58
#44       3.08       3.58            4.50            4.17     4.25

Exemplary Metric Rankings Compared to DMOS Ranking

The color gradients below encode the rankings of the audio files under each metric/score, sorted according to the DMOS ranking. A smooth gradient from dark green to dark red, as seen for DMOS itself, indicates a high correlation between the metric and DMOS.

Discriminative Models (150 audio files)

[Figure: ranking color bars for SingMOS, SDR, Music2Latent MSE, MERT-L12 MSE, and DMOS across the 150 audio files, ordered from rank #1 (left) to rank #150 (right).]

Generative Models (100 audio files)

[Figure: ranking color bars for SingMOS, SDR, Music2Latent MSE, MERT-L12 MSE, and DMOS across the 100 audio files, ordered from rank #1 (left) to rank #100 (right).]

DMOS & Correlation Results
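
As a rough sketch of what this analysis computes, per-model-type correlation coefficients between a metric column and DMOS could be obtained as follows (the column names model_type, DMOS, and my_metric are hypothetical; the released gensvs_eval_dmos_metric_correlation_demo.py is the authoritative implementation):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("gensvs_eval_data.csv")

# "model_type", "DMOS", and "my_metric" are hypothetical column names.
for kind in ["discriminative", "generative"]:
    sub = df[df["model_type"] == kind]
    r, _ = stats.pearsonr(sub["my_metric"], sub["DMOS"])
    rho, _ = stats.spearmanr(sub["my_metric"], sub["DMOS"])
    print(f"{kind}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```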

Citation (BibTeX)

@misc{bereuter2025,
      title={Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models}, 
      author={Paul A. Bereuter and Benjamin Stahl and Mark D. Plumbley and Alois Sontacchi},
      year={2025},
      eprint={2507.11427},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2507.11427}, 
}