Intrusive
BSS-Eval
- SDR: signal to distortion ratio
- SI-SDR: scale-invariant signal to distortion ratio
- SAR: signal to artifacts ratio
- SIR: signal to interference ratio
- ISR: source image to spatial distortion ratio
Blind Source Separation Evaluation (BSS-Eval) metrics are energy-ratio measures between the reference and the separated signal, where the estimate is decomposed into individual components via projections onto FIR-filtered subspaces of the target and distorting sources [4].
The packages/toolkits used to compute the metrics are listed below; a minimal sketch of the TorchMetrics route follows the list:
- SDR & SI-SDR: TorchMetrics [39]
- SAR: fast-bss-eval [40]
- SIR: PEASS Matlab toolkit (v2.0.1) [12]
- ISR: PEASS Matlab toolkit (v2.0.1) [12]
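A minimal sketch of the SDR/SI-SDR computation with TorchMetrics [39]; the tensors below are placeholders, not the actual evaluation pipeline:

```python
# Minimal sketch: SDR and SI-SDR via TorchMetrics [39]. In the evaluation,
# estimate/reference would be the separated signal and the ground-truth source.
import torch
from torchmetrics.audio import (
    ScaleInvariantSignalDistortionRatio,
    SignalDistortionRatio,
)

estimate = torch.randn(1, 44100)   # separated signal (placeholder)
reference = torch.randn(1, 44100)  # ground-truth source (placeholder)

sdr = SignalDistortionRatio()(estimate, reference)
si_sdr = ScaleInvariantSignalDistortionRatio()(estimate, reference)
print(f"SDR: {sdr:.2f} dB, SI-SDR: {si_sdr:.2f} dB")
```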
PEASS
- OPS: overall perceptual score
- TPS: target-related perceptual score
- IPS: interference-related perceptual score
- APS: artifacts-related perceptual score
The Perceptual Evaluation methods for Audio Source Separation (PEASS) also decompose a signal into distortion components. However, before decomposition, the signal is split into gammatone subbands and segmented into overlapping frames. Regression is then used to approximate subjective ratings [12].
- All PEASS metrics were calculated using the PEASS Matlab toolkit (v2.0.1) [12].
- The Matlab scripts were executed using the Matlab engine in Python (see the sketch below).
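A minimal sketch of this Matlab-engine route; the PEASS_ObjectiveMeasure entry point and its options fields (destDir, segmentationFactor) are assumptions based on the toolkit's documentation, and all paths are placeholders:

```python
# Minimal sketch: driving the PEASS Matlab toolkit (v2.0.1) [12] from Python
# via the Matlab engine. Function name and options per the toolkit docs.
import matlab.engine

eng = matlab.engine.start_matlab()
eng.addpath(eng.genpath("/path/to/peass-v2.0.1"), nargout=0)

# Build the inputs in the Matlab workspace: true source first, then interferers.
eng.eval("originalFiles = {'target.wav', 'interferer.wav'};", nargout=0)
eng.eval("options.destDir = 'peass_out/'; options.segmentationFactor = 1;",
         nargout=0)
eng.eval("res = PEASS_ObjectiveMeasure(originalFiles, 'estimate.wav', options);",
         nargout=0)

scores = {k: eng.eval(f"res.{k}") for k in ("OPS", "TPS", "IPS", "APS")}
eng.quit()
```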
ViSQOL
Similar to PEASS, the Virtual Speech Quality Objective Listener (ViSQOL) employs a perceptual model with a fitted mapping on a spectro-temporal representation.
- The ViSQOL v3 command-line tool from [41] was used to compute the audio-mode version.
- Audio has to be resampled to 48 kHz (see the sketch below).
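A minimal sketch of the 48 kHz resampling step followed by the ViSQOL v3 CLI [41] in its (default) audio mode; the binary path and file names are placeholders, and the flags follow the google/visqol README:

```python
# Minimal sketch: resample to 48 kHz, then call the ViSQOL v3 CLI [41].
import subprocess
import librosa
import soundfile as sf

for name in ("reference", "estimate"):
    audio, _ = librosa.load(f"{name}.wav", sr=48000, mono=True)  # resample
    sf.write(f"{name}_48k.wav", audio, 48000)

result = subprocess.run(
    ["./visqol", "--reference_file", "reference_48k.wav",
     "--degraded_file", "estimate_48k.wav"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # reports the MOS-LQO estimate
```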
m‑res. loss
The Multi-resolution STFT Loss (m-res. loss) is computed by averaging the STFT loss given in the paper over five STFT resolutions (256, 512, 1024, 2048, 4096).
- Computed using the Auraloss toolbox [42] (see the sketch below)
- 75 % overlap
- A-weighting
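A minimal sketch of this configuration with Auraloss [42]; hop sizes of fft_size/4 give the 75 % overlap, perceptual_weighting applies Auraloss's A-weighting pre-filter, and the sample rate and tensors are placeholders:

```python
# Minimal sketch: m-res. loss over the five resolutions via Auraloss [42].
import torch
import auraloss

fft_sizes = [256, 512, 1024, 2048, 4096]
loss_fn = auraloss.freq.MultiResolutionSTFTLoss(
    fft_sizes=fft_sizes,
    hop_sizes=[n // 4 for n in fft_sizes],  # 75 % overlap
    win_lengths=fft_sizes,
    perceptual_weighting=True,              # A-weighting
    sample_rate=44100,                      # placeholder
)

estimate = torch.randn(1, 1, 44100)   # (batch, channels, samples)
reference = torch.randn(1, 1, 44100)
print(loss_fn(estimate, reference))
```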
Embedding-based Intrusive
Embedding‑MSE
- L-CLAPaud (CLa)
- L-CLAPmus (CLm)
- MERT-L12 (M-L12)
- Music2Latent (M2L)
MSE between time-resolved Self-Supervised Learning (SSL) embeddings, namely: Large-scale Contrastive Language-Audio Pretraining audio (CLa) and music (CLm) embeddings [15], the 12th-layer embeddings of an acoustic Music undERstanding model with large-scale self-supervised Training (MERT-L12/M-L12) [26], and Music2Latent (M2L) embeddings [17].
- All MSEs were calculated with an adapted version of Microsoft's FAD toolkit (fadtk) [19].
- We adapted the fadtk code to compute the MSE on time-resolved embeddings (see the sketch below) and added the calculation of M2L embeddings.
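A minimal sketch of the embedding-MSE idea, assuming both signals were encoded by the same model into aligned (frames, dims) arrays; this is an illustration, not fadtk's actual interface:

```python
# Minimal sketch: MSE between time-resolved embeddings of reference/estimate.
import numpy as np

def embedding_mse(emb_ref: np.ndarray, emb_est: np.ndarray) -> float:
    """MSE over aligned time-resolved embeddings of shape (frames, dims)."""
    n = min(len(emb_ref), len(emb_est))  # guard against off-by-one frame counts
    return float(np.mean((emb_ref[:n] - emb_est[:n]) ** 2))
```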
Intrusive Variant of Fréchet Audio Distance (FADsong2song)
- Embeddings as for MSE (see above)
Fréchet distance between Gaussian fits of the embedding distributions. The Gaussians are fitted individually to the time-resolved embeddings. The same embeddings as for the MSE calculation are used (CLa, CLm, M-L12, M2L).
- Calculated by adapting fadtk [19] (see the sketch below)
- Song-level equation for FAD shown in Eq. (2)
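A minimal sketch of the song-level Fréchet distance between two Gaussian fits, cf. Eq. (2): ||mu_r - mu_e||^2 + Tr(S_r + S_e - 2 (S_r S_e)^(1/2)); inputs are time-resolved embeddings of shape (frames, dims):

```python
# Minimal sketch: Fréchet distance between Gaussian fits of two embedding sets.
import numpy as np
from scipy import linalg

def frechet_distance(emb_ref: np.ndarray, emb_est: np.ndarray) -> float:
    mu_r, mu_e = emb_ref.mean(axis=0), emb_est.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_e = np.cov(emb_est, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_e)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(cov_r + cov_e - 2.0 * covmean))
```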
Non-Intrusive
XLS‑R‑SQA
A MOS-like speech quality assessment model that generalizes well to unseen data.
- Python package available at [20]
- Manual downsampling to 16 kHz required (see the sketch below)
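A minimal sketch of the manual 16 kHz downsampling step required before XLS-R-SQA (and SingMOS) inference; the model call itself is omitted since it depends on the package in [20], and the file name is a placeholder:

```python
# Minimal sketch: downsample to 16 kHz before non-intrusive MOS prediction.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("estimate.wav")
if sr != 16000:
    waveform = F.resample(waveform, orig_freq=sr, new_freq=16000)
```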
Audiobox‑Aesthetics
- Production Quality (PQ)
- Content Usefulness (CU)
Universal MOS-like audio quality assessment model for speech, music, and sound. Of the four evaluation axes it provides, we analyzed two in our paper (PQ and CU).
- Python package available at [22]
- Input is auto-resampled to 16 kHz
PAM
Another non-intrusive, universal MOS-like metric that prompts audio-language models for audio quality assessment (PAM).
- Python package available at [23]
- No resampling necessary, as it operates at fs = 44.1 kHz
SingMOS
A wav2vec 2.0-based MOS predictor for singing voice, trained on MOS ratings of singing voice audio, including examples from singing voice conversion and coding models.
- Code available at [24]
- Manual downsampling to 16 kHz required