Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models

1Institute of Electronic Music and Acoustics (IEM), University of Music and Performing Arts, Graz, Austria
2Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025

Abstract

Traditional Blind Source Separation Evaluation (BSS-Eval) metrics were originally designed to evaluate linear audio source separation models based on methods such as time-frequency masking. However, recent generative models may introduce nonlinear relationships between the separated and reference signals, limiting the reliability of these metrics for objective evaluation. To address this issue, we conduct a Degradation Category Rating listening test and analyze correlations between the obtained degradation mean opinion scores (DMOS) and a set of objective audio quality metrics for the task of singing voice separation. We evaluate three state-of-the-art discriminative models and two new, competitive generative models. For both discriminative and generative models, intrusive embedding-based metrics show higher correlations with DMOS than conventional intrusive metrics such as BSS-Eval metrics. For discriminative models, the highest correlation is achieved by the MSE computed on Music2Latent embeddings. When it comes to the evaluation of generative models, the strongest correlations are evident for the multi-resolution STFT loss and the MSE calculated on MERT-L12 embeddings, with the latter also providing the most balanced correlation across both model types. Our results highlight the limitations of BSS-Eval metrics for evaluating generative singing voice separation models and emphasize the need for careful selection and validation of alternative evaluation metrics for the task of singing voice separation.

Benchmark your Metrics

If you would like to benchmark your metric using our DMOS data, please follow these steps:
  1. Download the dataset from Zenodo.
  2. Compute your metric on the evaluation audio in the folder gensvs_eval_audio_and_embeddings.
  3. Append your metric scores as the last column of gensvs_eval_data.csv (see the sketch after this list).
  4. When appending the scores, ensure they are in the correct order (check for correct filepaths).
  5. Finally, run gensvs_eval_dmos_metric_correlation_demo.py to obtain the correlation coefficients for the generative and discriminative models.
  6. Don't forget to cite our paper if you use the dataset or the code! ;-)
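
As a minimal sketch of steps 3 and 4, the scores can be appended with pandas as follows (the column name my_metric and the placeholder scores are hypothetical; make sure your scores follow the row order of the CSV):

```python
import pandas as pd

# Load the evaluation sheet shipped with the Zenodo dataset.
df = pd.read_csv("gensvs_eval_data.csv")

# Hypothetical metric scores: computed on the audio in
# gensvs_eval_audio_and_embeddings, ordered to match the CSV rows.
# Replace the zeros with your actual values.
df["my_metric"] = [0.0] * len(df)

# Append as the last column and save, then run
# gensvs_eval_dmos_metric_correlation_demo.py.
df.to_csv("gensvs_eval_data.csv", index=False)
```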

Compute Metrics or Train/Infer Models

If you would like to see how we calculated the included objective evaluation metrics, or to retrain our models or run inference on your own data, please refer to the GitHub repository.

SVS Model Details

This table summarizes additional details about the singing voice separation models evaluated in the paper: the architectures used, the modifications made, and implementation details. All references in the table refer to the reference list of the paper.
Discriminative models

HTDemucs
  Architecture / features:
  • Hybrid-Transformer-Demucs
  • Hybrid-Demucs [27] with a transformer bottleneck
Training details:
  • Loss: L1 waveform
  • Learning rate: 1e-4
  • Batch size: 6
  • Trained on stereo audio
  • 590 epochs
Mel-RoFo. (L)
  Architecture / features:
  • Large band-split RoPE transformer [28] with a Mel-projection layer [29]
  Implementation details:
  • Pre-trained on an undisclosed larger dataset; settings as per [30]
Mel-RoFo. (S)
  Architecture / features:
  • Scaled-down version of Mel-RoFo. (L)
  • Reduced number of latent features (dim): 384 → 192
  • Increased number of RoPE-transformer encoders (depth): 6 → 9
  • Trained from scratch
Training details:
  • Loss: time-domain MAE & multi-resolution complex spectrogram MAE from [29] (see the sketch after this list)
  • Batch size: 1
  • Trained on stereo audio
  • 550 epochs
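
As an illustration of this loss combination, here is a minimal sketch (the FFT sizes, hop sizes, and equal weighting of both terms are assumptions, not the exact settings of [29]):

```python
import torch

def time_domain_mae(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Mean absolute error on raw waveforms of shape (batch, samples).
    return torch.mean(torch.abs(est - ref))

def multires_complex_spec_mae(est, ref, fft_sizes=(1024, 2048, 4096)):
    # MAE between complex STFT coefficients, averaged over resolutions.
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=est.device)
        E = torch.stft(est, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True)
        R = torch.stft(ref, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True)
        loss = loss + torch.mean(torch.abs(E - R))
    return loss / len(fft_sizes)

est, ref = torch.randn(1, 44100), torch.randn(1, 44100)
total_loss = time_domain_mae(est, ref) + multires_complex_spec_mae(est, ref)
```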
Generative models

SGMSVS
  Architecture / features:
  • Score-based generative model for singing voice separation
  • Retrained score-based generative model for speech enhancement (SGMSE) [6]
  • NCSN++ score model from [31]
Training details:
  • Loss: Score matching loss from [6]
  • Batch size: 1
  • Trained on mono audio
  • 550 epochs
Adapted inference parameters:
  • α = 0.0516
  • β = 0.334
  • diffusion steps = 45
  • corrector steps = 2
Mel-RoFo. (S) + BigVGAN
  Architecture / features:
  • Mel-RoFo. (S) followed by a finetuned BigVGAN vocoder
  • BigVGAN model: "bigvgan_v2_44khz_128band_512x"
  • Mel-RoFo. (S) is frozen while finetuning BigVGAN
BigVGAN finetuning details:
  • 650 000 steps
  • Batch size: 1
  • Trained on mono audio
  • Other training settings as per [32]

Metric Details

This table summarizes additional details about the evaluated objective metrics, together with implementation notes. All references in the table refer to the reference list of the paper.
Intrusive metrics

BSS-Eval:
  • SDR: signal-to-distortion ratio
  • SI-SDR: scale-invariant signal-to-distortion ratio
  • SAR: signal-to-artifacts ratio
  • SIR: signal-to-interference ratio
  • ISR: image-to-spatial distortion ratio
Blind Source Separation Evaluation (BSS-Eval) metrics are energy-ratio measures between the reference and the separated signal, where the estimate is decomposed into individual components via projections onto FIR-filtered subspaces of the target and distorting sources [4]. The packages/toolkits used to compute the metrics are listed below:
  • SDR & SI-SDR:
    TorchMetrics [39]
  • SAR:
    fast-bss-eval [40]
  • SIR:
    PEASS Matlab toolkit (v2.0.1) [12]
  • ISR:
    PEASS Matlab toolkit (v2.0.1) [12]
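
For example, SDR and SI-SDR can be computed with TorchMetrics as follows (a minimal sketch; random tensors stand in for the separated estimate and the reference):

```python
import torch
from torchmetrics.audio import (
    ScaleInvariantSignalDistortionRatio,
    SignalDistortionRatio,
)

sdr = SignalDistortionRatio()
si_sdr = ScaleInvariantSignalDistortionRatio()

# Shapes: (batch, samples); replace with separated and reference audio.
est, ref = torch.randn(1, 44100), torch.randn(1, 44100)
print(f"SDR: {sdr(est, ref).item():.2f} dB, "
      f"SI-SDR: {si_sdr(est, ref).item():.2f} dB")
```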
PEASS:
  • OPS: overall perceptual score
  • TPS: target-related perceptual score
  • IPS: interference-related perceptual score
  • APS: artifacts-related perceptual score
The Perceptual Evaluation methods for Audio Source Separation (PEASS) also decompose the signal into distortion components. However, before decomposition, the signal is split into gammatone subbands and segmented into overlapping frames; regression is then used to approximate subjective ratings [12].
  • All PEASS metrics were calculated using the PEASS Matlab toolkit (v2.0.1) [12].
  • The Matlab scripts were executed using the Matlab engine in Python.
ViSQOL:
Similar to PEASS, the Virtual Speech Quality Objective Listener (ViSQOL) employs a perceptual model with a fitted mapping on a spectro-temporal representation.
  • Computed with the ViSQOL v3 command-line tool from [41] in audio mode
  • Audio has to be resampled to 48 kHz
m-res. loss:
The multi-resolution STFT loss (m-res. loss) is computed by averaging the STFT loss defined in the paper over five STFT resolutions (256, 512, 1024, 2048, 4096); see the sketch after this list.
  • Computed using the Auraloss toolbox [42]
  • 75 % overlap
  • A-weighting
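
For example, such a loss can be instantiated with Auraloss roughly as follows (a sketch; the exact loss weights used in the paper may differ):

```python
import torch
import auraloss

fft_sizes = [256, 512, 1024, 2048, 4096]
mrstft = auraloss.freq.MultiResolutionSTFTLoss(
    fft_sizes=fft_sizes,
    hop_sizes=[n // 4 for n in fft_sizes],  # 75 % overlap
    win_lengths=fft_sizes,
    sample_rate=44100,
    perceptual_weighting=True,  # applies an A-weighting prefilter
)

est, ref = torch.randn(1, 1, 44100), torch.randn(1, 1, 44100)
print(mrstft(est, ref))
```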
Embedding-based intrusive metrics

Embedding-MSE:
  • L-CLAPaud (CLa)
  • L-CLAPmus (CLm)
  • MERT-L12 (M-L12)
  • Music2Latent (M2L)
MSE between time-resolved Self-Supervised Learning (SSL) embeddings: Large-scale Contrastive Language-Audio Pretraining audio (CLa) and music (CLm) embeddings [15], the 12th-layer embeddings of the acoustic Music undERstanding model with large-scale self-supervised Training (MERT-L12/M-L12) [26], and Music2Latent (M2L) embeddings [17]; a conceptual sketch follows below.
  • All MSEs were calculated with an adapted version of Microsoft's FAD toolkit (fadtk) [19]
  • We adapted the fadtk code to compute the MSE on time-resolved embeddings and added the calculation of M2L embeddings
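
A conceptual sketch of the metric core, assuming embedding extraction (via fadtk) has already produced frame-aligned sequences for estimate and reference:

```python
import torch

def embedding_mse(emb_est: torch.Tensor, emb_ref: torch.Tensor) -> torch.Tensor:
    """MSE between time-resolved embeddings of shape (frames, dim),
    e.g. MERT-L12 or Music2Latent features of estimate and reference."""
    assert emb_est.shape == emb_ref.shape, "embeddings must be frame-aligned"
    return torch.mean((emb_est - emb_ref) ** 2)
```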
FADsong2song (variant of the Fréchet Audio Distance):
  • Embeddings as for the MSE (see above)
Fréchet distance between Gaussian fits of embedding distributions, fitted individually to the time-resolved embeddings of each song. The same embeddings as for the MSE calculation are used (CLa, CLm, M-L12, M2L).
  • Calculated by adapting fadtk [19]
  • Song-level equation for FAD shown in Eq. (2)
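
For reference, the Fréchet distance between two Gaussians fitted to the reference and estimate embeddings of a song (means μ_r, μ_e and covariances Σ_r, Σ_e) has the standard closed form below, presumably what Eq. (2) of the paper instantiates per song:

```latex
\mathrm{FAD}_{\text{song2song}}
  = \lVert \mu_r - \mu_e \rVert_2^2
  + \operatorname{tr}\!\left( \Sigma_r + \Sigma_e
      - 2 \left( \Sigma_r \Sigma_e \right)^{1/2} \right)
```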
Non-intrusive metrics

XLS-R-SQA:
A MOS-like speech quality assessment model for speech enhancement that generalizes well to unseen data.
  • Python package available at [20]
  • Manual downsampling to 16 kHz required
Audiobox-Aesthetics:
  • Production Quality (PQ)
  • Content Usefulness (CU)
A universal MOS-like audio quality assessment model for speech, music, and sound. Of its four evaluation axes, we analyzed two in our paper (PQ and CU).
  • Python package available at [22]
  • Input is auto-resampled to 16 kHz
PAM:
Another non-intrusive universal MOS-like metric that prompts audio-language models for audio quality assessment (PAM).
  • Python package available at [23]
  • No resampling necessary, as it operates at fs = 44.1 kHz
SingMOS:
A wav2vec 2.0-based MOS predictor for singing voice, trained on MOS ratings of singing voice audio including examples from singing voice conversion and coding models.
  • Code available at [24]
  • Manual downsampling to 16 kHz required

Audio Examples

This table lists the audio examples (with mixture and target provided for reference) and the DMOS achieved by each SVS model.
File-ID   HTDemucs   Mel-RoFo. (S)   Mel-RoFo. (L)   SGMSVS   Mel-RoFo. (S) + BigVGAN
#1        3.08       2.83            3.67            3.25     4.17
#2        3.58       2.33            3.58            3.33     4.08
#3        1.83       3.08            3.75            4.25     4.17
#5        2.42       2.92            2.75            3.50     2.92
#6        1.00       1.17            1.42            1.08     1.58
#10       2.75       4.00            3.92            4.83     4.17
#14       1.58       2.17            2.83            1.58     3.25
#21       2.92       2.75            3.17            3.50     3.67
#22       1.58       2.67            3.42            2.58     3.25
#31       2.50       2.75            3.75            2.50     3.67
#42       3.33       4.00            4.25            4.67     4.58
#44       3.08       3.58            4.50            4.17     4.25

Exemplary Metric Rankings Compared to DMOS Ranking

The color gradients below encode the rankings of the audio files under each metric/score, sorted according to the DMOS ranking. A smooth gradient from dark green to dark red, as seen for DMOS itself, indicates a high correlation between the metric and DMOS.

Discriminative Models (150 audio files)

[Figure: ranking color bars for SingMOS, SDR, Music2Latent MSE, MERT-L12 MSE, and DMOS across the 150 audio files, ordered from rank #1 (left) to rank #150 (right).]

Generative Models (100 audio files)

[Figure: ranking color bars for SingMOS, SDR, Music2Latent MSE, MERT-L12 MSE, and DMOS across the 100 audio files, ordered from rank #1 (left) to rank #100 (right).]

DMOS & Correlation Results
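
As a rough sketch of what this analysis computes, per-model-type correlation coefficients between a metric column and DMOS could be obtained as follows (the column names model_type, DMOS, and my_metric are hypothetical; the released gensvs_eval_dmos_metric_correlation_demo.py is the authoritative implementation):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("gensvs_eval_data.csv")

# "model_type", "DMOS", and "my_metric" are hypothetical column names.
for kind in ["discriminative", "generative"]:
    sub = df[df["model_type"] == kind]
    r, _ = stats.pearsonr(sub["my_metric"], sub["DMOS"])
    rho, _ = stats.spearmanr(sub["my_metric"], sub["DMOS"])
    print(f"{kind}: Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```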

Citation (BibTeX)

@misc{bereuter2025,
      title={Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models}, 
      author={Paul A. Bereuter and Benjamin Stahl and Mark D. Plumbley and Alois Sontacchi},
      year={2025},
      eprint={2507.11427},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2507.11427}, 
}