REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

Ishan D. Biyani¹, Nirmesh J. Shah¹, Ashishkumar P. Gudmalwar¹, Pankaj Wasnik¹, and Rajiv Ratn Shah²

¹ Media Analysis Group, Sony Research India, Bangalore
² Indraprastha Institute of Information Technology (IIIT), Delhi

Email: pankaj.wasnik@sony.com

Speech time reversal is the process of reversing an entire speech signal in time so that it plays backward. Such signals are completely unintelligible because the fundamental structures of phonemes and syllables are destroyed. However, they retain tonal patterns that still enable perceptual speaker identification despite the loss of linguistic content. In this paper, we propose leveraging speaker representations learned from time-reversed speech as an augmentation strategy to enhance speaker representations. Speaker and language disentanglement is essential in voice conversion (VC) to accurately preserve a speaker's unique vocal traits while minimizing interference from linguistic content. We evaluate the effectiveness of the proposed approach in the context of state-of-the-art diffusion-based VC models. Experimental results indicate that the proposed approach significantly improves speaker similarity-related scores while maintaining high speech quality.
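As a rough illustration of the time-reversal operation described in the abstract, the sketch below reverses a toy signal and checks two properties: the temporal envelope is flipped, while the magnitude spectrum stays the same. This is not the authors' pipeline. The toy signal, the sample rate, and the helper names `time_reverse` and `dft_magnitudes` are assumptions made for illustration, and an unchanged magnitude spectrum is only one plausible reason that speaker-related cues could survive reversal.

```python
import cmath
import math


def time_reverse(signal):
    """Return the samples in reverse order, so the signal plays backward."""
    return signal[::-1]


def dft_magnitudes(signal):
    """Naive O(N^2) DFT magnitude spectrum; adequate for a short toy signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


# Toy stand-in for a speech frame: two decaying sinusoids (not real speech).
SAMPLE_RATE = 8000  # Hz, assumed for illustration only
N = 256
toy_signal = [
    math.exp(-t / 100.0)
    * (math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
       + 0.5 * math.sin(2 * math.pi * 900 * t / SAMPLE_RATE))
    for t in range(N)
]

reversed_signal = time_reverse(toy_signal)

# Temporal structure is flipped: the decaying signal now builds up over time.
first_half_energy = sum(x * x for x in reversed_signal[: N // 2])
second_half_energy = sum(x * x for x in reversed_signal[N // 2:])

# The magnitude spectrum of a real signal is unchanged by time reversal,
# up to floating-point error.
max_spectral_diff = max(
    abs(a - b)
    for a, b in zip(dft_magnitudes(toy_signal), dft_magnitudes(reversed_signal))
)
```

In an augmentation setting, the reversed waveform would be passed to a speaker encoder alongside the original; how the two representations are combined is specific to the paper and is not shown here.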

Paper: accepted at INTERSPEECH 2025

Methods	Sample 1	Sample 2	Sample 3
Source
Reference
DDDM-VC
DDDM-VC + Ours
DiffHier-VC
DiffHier-VC + Ours
Diff-VC
Diff-VC + Ours
LMVC
SEFVC
StyleVC
StableVC
VALLE VC

Demo Samples for Voice Conversion

Citation