From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation
English summary
The paper evaluates four families of speech representations for speech-driven 3D facial animation, comparing facial reconstruction quality across two facial decoders using objective metrics and perceptual evaluation. It also includes probing analyses linking tokenized representations to phonetic units and articulatory deformations. The study finds that encoding phonetic classes improves facial animation accuracy, and that semantic and label-based representations achieve comparable performance. Building on the label-based representations, the authors propose an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space to decode both speech and 3D facial motion.
Chinese summary
该论文评估了四类语音表征在语音驱动3D面部动画中的应用,通过客观指标和感知评估,在两种面部解码器上比较了面部重建质量。论文还进行了探测分析,将离散化表征与语音学单元及发音形变关联起来。研究发现,编码语音学类别有助于提升面部动画的准确性,且语义表征与基于标签的表征性能相当。基于标签表征,作者提出了一个视听文本转语音(AVTTS)流水线,利用离散表征作为共享空间来解码语音和3D面部运动。
Key points
Four speech representation families are systematically compared for 3D facial synthesis, using two decoders, objective metrics, and perceptual evaluation.
使用两种面部解码器、客观指标与感知评估,系统对比了四类语音表征在3D面部合成中的效果。
Encoding phonetic classes proves beneficial for facial animation accuracy; semantic and label-based representations yield comparable quality.
编码语音学类别有助于提升面部动画准确性;语义表征与基于标签的表征表现相当。
An AVTTS pipeline is introduced that employs discrete representations as a shared space to jointly decode speech and 3D facial motion.
提出了一种AVTTS流水线,将离散表征作为共享空间,同时解码语音与3D面部运动。