Teaching Speech Enhancement Models to Sing: Domain Adaptation from Speech Enhancement to Singing Voice Separation

1Institute of Electronic Music and Acoustics (IEM), University of Music and Performing Arts, Graz, Austria
2Department of Informatics, King's College London, London, UK
To be presented at _____ in 2026

Abstract

State-of-the-art speech enhancement (SE) models benefit from large-scale labeled datasets, whereas singing voice separation (SVS) models suffer from limited available training data. To address this limitation, we formulate singing voice separation as domain adaptation from speech enhancement to singing voice separation. We investigate two fine-tuning strategies, full fine-tuning and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA), on a generative and a discriminative model. Models trained with either adaptation strategy outperform the same architectures trained from scratch by 0.2-1.8 dB in Signal-to-Distortion Ratio (SDR). Full fine-tuning yields the highest singing voice separation performance, but catastrophic forgetting degrades speech enhancement performance. LoRA fine-tuning achieves competitive singing voice separation performance while preserving the original speech enhancement capability, with only 6-12% additional parameters compared to the base speech enhancement model. Furthermore, the generative model shows improved generalization to an unseen test set. The results demonstrate that adapting pretrained speech enhancement models is an effective strategy for training singing voice separation models in data-scarce scenarios.
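To illustrate the LoRA mechanism referred to in the abstract, the toy sketch below (NumPy, not the paper's implementation; layer sizes and the rank/alpha values are illustrative assumptions) shows a frozen linear layer with a low-rank adapter branch. Because the adapter is zero-initialized and can be switched off, one checkpoint can serve the adapted task while still recovering the base speech enhancement model exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Toy linear layer with a Low-Rank Adaptation (LoRA) branch.

    The frozen base weight W (out x in) is augmented with a trainable
    low-rank update B @ A, scaled by alpha / rank. Disabling the adapter
    recovers the base layer exactly.
    """

    def __init__(self, in_features, out_features, rank=16, alpha=16.0):
        self.W = rng.standard_normal((out_features, in_features))  # frozen pretrained weight
        self.A = rng.standard_normal((rank, in_features)) * 0.01   # trainable down-projection
        self.B = np.zeros((out_features, rank))                    # zero init: no change at start
        self.scale = alpha / rank
        self.adapter_enabled = True

    def __call__(self, x):
        y = x @ self.W.T
        if self.adapter_enabled:
            y = y + self.scale * (x @ self.A.T @ self.B.T)  # low-rank correction
        return y

layer = LoRALinear(8, 8, rank=4)
x = rng.standard_normal((2, 8))

layer.adapter_enabled = False
base_out = layer(x)            # output of the original (base) layer

layer.adapter_enabled = True
# B is zero-initialized, so before any training the adapted layer
# still matches the base layer exactly
assert np.allclose(layer(x), base_out)
```

The extra parameter count per adapted layer is rank × (in_features + out_features), which is how LoRA keeps the overhead to a small fraction of the base model's parameters.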

Audio Examples

The audio examples on this webpage accompany our paper titled "Teaching Speech Enhancement Models to Sing: Domain Adaptation from Speech Enhancement to Singing Voice Separation", presented at _____ in 2026. The audio clips are the outputs of models trained from scratch, with LoRA, or with full fine-tuning for the task of singing voice separation. Only the base model and the LoRA model with disabled adapter, which is effectively the base speech enhancement model, are trained solely for speech enhancement.

Singing Voice Separation: GenSVS test set [1]

Samples: 1, 35, 42 (Mixture and Target provided for each)

BSRNN: Full fine-tuning · LoRA (rank 16) · LoRA (rank 32) · LoRA (rank 128) · From scratch · LoRA (rank 16, adapter disabled) · Base [2]
SGM: Full fine-tuning · LoRA (rank 16) · From scratch · LoRA (rank 16, adapter disabled) · Base [3]

Singing Voice Restoration: MSRBench test set [4]

Samples: 240, 198, 78 (Mixture and Target provided for each)

BSRNN: Full fine-tuning · LoRA (rank 16) · LoRA (rank 32) · LoRA (rank 128) · From scratch · LoRA (rank 16, adapter disabled) · Base [2]
SGM: Full fine-tuning · LoRA (rank 16) · From scratch · LoRA (rank 16, adapter disabled) · Base [3]

Speech Enhancement: EARS-WHAM test set [3]

Samples: p102 / 0007, p105 / 00164, p104 / 00749 (Mixture and Target provided for each)

BSRNN: Full fine-tuning · LoRA (rank 16) · LoRA (rank 32) · LoRA (rank 128) · From scratch · LoRA (rank 16, adapter disabled) · Base [2]
SGM: Full fine-tuning · LoRA (rank 16) · From scratch · LoRA (rank 16, adapter disabled) · Base [3]

Results

References

[1]
P. A. Bereuter, B. Stahl, M. D. Plumbley, and A. Sontacchi, "Towards reliable objective evaluation metrics for generative singing voice separation models," in Proc. WASPAA, 2025. [Online]. Available: https://arxiv.org/abs/2504.02271.
[2]
J. Yu, H. Chen, Y. Luo, R. Gu, and C. Weng, "High fidelity speech enhancement with band-split RNN," in Proc. Interspeech, 2023, pp. 2483-2487. [Online]. Available: https://arxiv.org/pdf/2212.00406.
[3]
J. Richter, Y.-C. Wu, S. Krenn, S. Welker, B. Lay, S. Watanabe, A. Richard, and T. Gerkmann, "EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation," in Proc. Interspeech, 2024. [Online]. Available: https://arxiv.org/pdf/2406.06185.
[4]
Y. Zang, J. Hai, W. Ge, Q. Kong, Z. Dai, H. Wang, Y. Mitsufuji, and M. D. Plumbley, "MSRBench: A benchmarking dataset for music source restoration," 2025. [Online]. Available: https://arxiv.org/abs/2510.10995.

Citation (BibTeX)

If you use any part of our code, our data, or the gensvs package in your work, please cite our paper and the work that forms the basis of this research.

@INPROCEEDINGS{bereuter2026se2svs,
  author={Bereuter, Paul A. and Plumbley, Mark D. and Sontacchi, Alois},
  booktitle={}, 
  title={Teaching Speech Enhancement Models to Sing: Domain Adaptation from Speech Enhancement to Singing Voice Separation}, 
  year={2026},
  volume={},
  number={},
  pages={},
  keywords={},
  doi={}
}