Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis

Abstract

Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy (control) and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. This design enables bidirectional transformation between healthy and dysarthric speech, supporting scalable ASR data augmentation and speaker-aware speech reconstruction. Experiments on the TORGO dataset show that ProtoDisent-TTS is effective for both ASR data augmentation and dysarthric speech reconstruction.
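The two mechanisms named in the abstract, a gradient reversal layer and a prototype codebook lookup, can be illustrated with a minimal pure-Python sketch. This is not the ProtoDisent-TTS implementation: the class and function names, the `lam` reversal strength, the 1-based prototype index, and the use of concatenation to combine the speaker embedding with a prototype are all assumptions made for illustration. In a real system the reversal would be written as a custom autograd function in the training framework.

```python
# Hypothetical sketch: none of these names come from the ProtoDisent-TTS code.

class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (assumed hyperparameter)

    def forward(self, x):
        # The pathology classifier sees the speaker embedding unchanged.
        return list(x)

    def backward(self, grad_output):
        # The gradient reaching the speaker encoder is negated and scaled,
        # so the encoder is pushed to make pathology harder to predict.
        return [-self.lam * g for g in grad_output]


def select_prototype(codebook, k):
    """Return prototype k from the codebook (1-based index, an assumption)."""
    if not 1 <= k <= len(codebook):
        raise IndexError(f"prototype index {k} outside 1..{len(codebook)}")
    return codebook[k - 1]


def condition(speaker_embedding, prototype):
    """Combine timbre and pathology vectors (concatenation is an assumption)."""
    return list(speaker_embedding) + list(prototype)


if __name__ == "__main__":
    grl = GradientReversal(lam=0.5)
    print(grl.forward([1.0, -2.0]))   # embedding passes through unchanged
    print(grl.backward([2.0, -4.0]))  # gradient is flipped and scaled by lam

    toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy values only
    print(condition([0.3], select_prototype(toy_codebook, 3)))
```

Keeping the speaker embedding fixed while swapping the selected prototype is the kind of control that the healthy-to-dysarthria samples on this page demonstrate, although the actual conditioning path in the model may differ from this sketch.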

This page is for research demonstration purposes only.

Model Overview

Figure 1. Overall architecture of ProtoDisent-TTS.

Dysarthric Speech Synthesis

Real | Synthesis | Reference Text
F01
Usually minus several buttons.
A long flowing beard clings to his chin.
M01
When he speaks his voice is just a bit cracked and quivers a trifle.
Grandfather likes to be modern in his language.
M02
She had your dark suit in greasy wash water all year.
Yet he still thinks as swiftly as ever.
M04
The quick brown fox jumps over the lazy dog.
I can read.
M05
Twice each day he plays skillfully and with zest upon our small organ.
Don't ask me to carry an oily rag like that.

Healthy-to-Dysarthria Transformation

FC01
Original Speech
Reference Text: When he speaks his voice is just a bit cracked and quivers a trifle.
Prototype k = 1
Prototype k = 2
Prototype k = 3
Prototype k = 4
Prototype k = 5
Original Speech
Reference Text: Don't ask me to carry an oily rag like that.
Prototype k = 1
Prototype k = 2
Prototype k = 3
Prototype k = 4
Prototype k = 5
MC02
Original Speech
Reference Text: Usually minus several buttons.
Prototype k = 1
Prototype k = 2
Prototype k = 3
Prototype k = 4
Prototype k = 5
Original Speech
Reference Text: You wished to know all about my grandfather.
Prototype k = 1
Prototype k = 2
Prototype k = 3
Prototype k = 4
Prototype k = 5
MC04
Original Speech
Reference Text: We have often urged him to walk more and smoke less.
Prototype k = 1
Prototype k = 2
Prototype k = 3
Prototype k = 4
Prototype k = 5
Original Speech
Reference Text: He dresses himself in an ancient black frock coat.
Prototype k = 1
Prototype k = 2
Prototype k = 3
Prototype k = 4
Prototype k = 5

Dysarthria-to-Healthy Transformation

Original | Transformed | Reference Text
F01
Stick.
Except in the winter when the ooze or snow or ice prevents.
Giving those who observe him a pronounced feeling of the utmost respect.
M01
Trait.
Grandfather likes to be modern in his language.
A long flowing beard clings to his chin.
M04
Trouble.
Twice each day he plays skillfully and with zest upon our small organ.
Well, he is nearly ninety-three years old.