HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

Abstract

Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may keep controllable TTS systems from reaching their full potential. In this study, we use t-SNE to visualize and analyze the global style embedding distributions of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and then subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and we further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that, when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding prediction approaches while preserving high speech quality in terms of naturalness and intelligibility.

Fig. 1 Overall architecture of the HiStyle embedding predictor. The upper-left subplot illustrates the first stage: the Speaker Embedding Predictor takes the text prompt embedding as a condition and uses the reference speaker embedding to produce the predicted speaker embedding. The lower-left subplot shows the second stage: the Style Embedding Predictor takes as conditions the text prompt embedding together with a residual connection from the first stage's intermediate result (the predicted speaker embedding), and uses the fusion embedding to produce the predicted style embedding. The subplot on the right depicts the detailed architecture and training process of the two Embedding Predictors.
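The two-stage data flow in the caption can be illustrated with a minimal sketch. It is not the HiStyle implementation: the class name `TwoStagePredictor`, the dimensions, and the use of plain linear maps in place of the actual Embedding Predictors are all assumptions made for illustration. Passing the stage-one output to stage two by concatenation is likewise only one possible reading of the "residual connection", and the sketch omits how the reference speaker embedding and the fusion embedding are used during training.

```python
import random


def linear(weights, x):
    """Apply a bias-free dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_weights(rows, cols, rng):
    """Small random weight matrix (stand-in for a trained predictor)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TwoStagePredictor:
    """Hypothetical stand-in for the hierarchical predictor in Fig. 1."""

    def __init__(self, text_dim, spk_dim, style_dim, seed=0):
        rng = random.Random(seed)
        # Stage 1: text prompt embedding -> speaker (timbre) embedding.
        self.w_spk = random_weights(spk_dim, text_dim, rng)
        # Stage 2: [text prompt embedding ; predicted speaker embedding]
        # -> style embedding. Concatenation is an assumption.
        self.w_style = random_weights(style_dim, text_dim + spk_dim, rng)

    def predict(self, text_emb):
        # Coarse level first: predict the speaker embedding from the prompt.
        spk = linear(self.w_spk, text_emb)
        # Fine level second: condition on both the prompt and stage-1 output.
        style = linear(self.w_style, list(text_emb) + spk)
        return spk, style


model = TwoStagePredictor(text_dim=8, spk_dim=4, style_dim=6)
text_emb = [0.5] * 8  # placeholder for a real text prompt embedding
spk_emb, style_emb = model.predict(text_emb)
print(len(spk_emb), len(style_emb))
```

The point of the structure is that the style prediction depends on the speaker prediction, mirroring the observation that embeddings cluster by timbre before subdividing by style attributes.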

Demos

Target Text	Text Prompt	HiStyle Predictor (proposed)	Variation Predictor	Query Encoder Predictor
山无棱，天地合，才敢与君绝。	这名男子以缓慢的语速高声说道 (This man spoke slowly but loudly.)
	这位男子说话时音调起伏显著，声音却很轻。 (When this man spoke, his pitch fluctuated significantly, yet his voice was very soft.)
	这个女人用平静的语调和非常大的声音说着。 (This woman spoke in a calm tone and with a very loud voice.)
前方有左急转弯，请减速慢行。	这位女性说话的音调很高但很平稳。 (Her tone was persistently high and stable throughout her speaking.)
	这位女性说话音调很低，语速适中。 (She spoke in a low tone with a moderate speed.)
	一个男人用低沉的音调和很大的音量说话。 (A man spoke in a deep tone and with a very loud voice.)
追到附近北一路一网吧时，该男子跑进网吧躲藏。	这位女性说话的音调很高但很平稳。 (Her tone was persistently high and stable throughout her speaking.)
	这位男性说话的音量极大。 (He spoke in a powerful volume.)
	这位女性说话音调很低，语速适中。 (She spoke in a low tone with a moderate speed.)
Hi there, how are you doing today?	这位男性说话音调低沉、音量极大。 (He spoke in a low tone with a powerful volume.)
	这位女性说话的音调很高且语速很快。 (This woman spoke with a high pitch and at a fast pace.)
	这个女性语速很快。 (This female speaks very fast.)
Could you please help me with my homework?	一个男人说话声音很大同时语调变化很大。 (A man spoke with a very loud voice and highly varied intonation.)
	这位女性说话时语调很高，语速较快。 (The woman spoke with a high pitch and a relatively fast speaking rate.)
	这个男生用很平稳的语调说话。 (The man spoke in a very smooth tone.)
