Teaching Speech Enhancement Models to Sing: Domain Adaptation from Speech Enhancement to Singing Voice Separation

1Institute of Electronic Music and Acoustics (IEM), University of Music and Performing Arts, Graz, Austria
2Department of Informatics, King's College London, London, UK
To be presented at _____ in 2026

Abstract

State-of-the-art speech enhancement (SE) models benefit from large-scale labeled datasets, whereas singing voice separation (SVS) models suffer from limited available training data. To address this limitation, we formulate singing voice separation as domain adaptation from speech enhancement to singing voice separation. We investigate two fine-tuning strategies, full fine-tuning and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA), on a generative and a discriminative model. Models trained with either adaptation strategy outperform the same architectures trained from scratch by 0.2-1.8 dB in Signal-to-Distortion Ratio (SDR). Full fine-tuning yields the highest singing voice separation performance, but catastrophic forgetting degrades speech enhancement performance. LoRA fine-tuning achieves competitive singing voice separation performance while preserving the original speech enhancement capability, with only 6-12% additional parameters compared to the base speech enhancement model. Furthermore, the generative model shows improved generalization to an unseen test set. The results demonstrate that adapting pretrained speech enhancement models is an effective strategy for training singing voice separation models in data-scarce scenarios.
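To illustrate the LoRA mechanism referred to in the abstract, the toy sketch below (NumPy, not the paper's implementation; layer sizes and the rank/alpha values are illustrative assumptions) shows a frozen linear layer with a low-rank adapter branch. Because the adapter is zero-initialized and can be switched off, one checkpoint can serve the adapted task while still recovering the base speech enhancement model exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Toy linear layer with a Low-Rank Adaptation (LoRA) branch.

    The frozen base weight W (out x in) is augmented with a trainable
    low-rank update B @ A, scaled by alpha / rank. Disabling the adapter
    recovers the base layer exactly.
    """

    def __init__(self, in_features, out_features, rank=16, alpha=16.0):
        self.W = rng.standard_normal((out_features, in_features))  # frozen pretrained weight
        self.A = rng.standard_normal((rank, in_features)) * 0.01   # trainable down-projection
        self.B = np.zeros((out_features, rank))                    # zero init: no change at start
        self.scale = alpha / rank
        self.adapter_enabled = True

    def __call__(self, x):
        y = x @ self.W.T
        if self.adapter_enabled:
            y = y + self.scale * (x @ self.A.T @ self.B.T)  # low-rank correction
        return y

layer = LoRALinear(8, 8, rank=4)
x = rng.standard_normal((2, 8))

layer.adapter_enabled = False
base_out = layer(x)            # output of the original (base) layer

layer.adapter_enabled = True
# B is zero-initialized, so before any training the adapted layer
# still matches the base layer exactly
assert np.allclose(layer(x), base_out)
```

The extra parameter count per adapted layer is rank × (in_features + out_features), which is how LoRA keeps the overhead to a small fraction of the base model's parameters.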

Audio Examples

The audio examples on this webpage accompany our paper titled "Teaching Speech Enhancement Models to Sing: Domain Adaptation from Speech Enhancement to Singing Voice Separation", presented at _____ in 2026. The audio clips are the outputs of models trained from scratch, with LoRA, or with full fine-tuning for the task of singing voice separation. Only the base model and the LoRA model with disabled adapter, which is effectively the base speech enhancement model, are trained solely for speech enhancement.

Singing Voice Separation: GenSVS test set [1]

Samples: 1, 35, 42 (Mixture and Target provided for each)

BSRNN: Full fine-tuning · LoRA (rank 16) · LoRA (rank 32) · LoRA (rank 128) · From scratch · LoRA (rank 16, adapter disabled) · Base [2]
SGM: Full fine-tuning · LoRA (rank 16) · From scratch · LoRA (rank 16, adapter disabled) · Base [3]

Singing Voice Restoration: MSRBench test set [4]

Samples: 240, 198, 78 (Mixture and Target provided for each)

BSRNN: Full fine-tuning · LoRA (rank 16) · LoRA (rank 32) · LoRA (rank 128) · From scratch · LoRA (rank 16, adapter disabled) · Base [2]
SGM: Full fine-tuning · LoRA (rank 16) · From scratch · LoRA (rank 16, adapter disabled) · Base [3]

Speech Enhancement: EARS-WHAM test set [3]

Samples: p102 / 0007, p105 / 00164, p104 / 00749 (Mixture and Target provided for each)

BSRNN: Full fine-tuning · LoRA (rank 16) · LoRA (rank 32) · LoRA (rank 128) · From scratch · LoRA (rank 16, adapter disabled) · Base [2]
SGM: Full fine-tuning · LoRA (rank 16) · From scratch · LoRA (rank 16, adapter disabled) · Base [3]

Results

References

[1]
P. A. Bereuter, B. Stahl, M. D. Plumbley, and A. Sontacchi, "Towards reliable objective evaluation metrics for generative singing voice separation models," in Proc. WASPAA, 2025. [Online]. Available: https://arxiv.org/abs/2504.02271.
[2]
J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, "High fidelity speech enhancement with band-split RNN," in Proc. Interspeech, 2023, pp. 2483-2487. [Online]. Available: https://arxiv.org/pdf/2212.00406.
[3]
J. Richter, Y.-C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, "EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation," in Proc. Interspeech, 2024. [Online]. Available: https://arxiv.org/pdf/2406.06185.
[4]
Y. Zang, J. Hai, W. Ge, Q. Kong, Z. Dai, H. Wang, Y. Mitsufuji, and M. D. Plumbley, "MSRBench: A benchmarking dataset for music source restoration," 2025. [Online]. Available: https://arxiv.org/abs/2510.10995.

Citation (BibTeX)

If you use any part of our code, our data, or the gensvs package in your work, please cite our paper and the work that forms the basis of this research.

@INPROCEEDINGS{bereuter2026se2svs,
  author={Bereuter, Paul A. and Plumbley, Mark D. and Sontacchi, Alois},
  booktitle={}, 
  title={Teaching Speech Enhancement Models to Sing: Domain Adaptation from Speech Enhancement to Singing Voice Separation}, 
  year={2026},
  volume={},
  number={},
  pages={},
  keywords={},
  doi={}
}