AI Can Figure out What You Look Like from Your Voice

Oct 29, 2022


Have you ever imagined a scene like this? In the future, when we are on the phone or listening to the radio and cannot see the other person’s face, an AI model could generate their portrait in seconds.

This may sound like science fiction, but AI technology has made it a reality: a model can recognize people from their voices, and it does so more accurately than humans can. The idea is to train an AI model to learn the latent relationship between facial appearance and voice, and then use that relationship to find the owner of a voice.

According to the paper “Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association”, published at CVPR 2021 by Dr. Wen Peisong of the Institute of Computing Technology of the Chinese Academy of Sciences, “recognizing the appearance by hearing the sound” and “recognizing the sound by seeing people” both rest on deep learning and cross-modal retrieval. Face images and speech audio clips are fed into a face encoder network and a speech encoder network, respectively. The extracted features are then assigned different weights according to the average loss of each identity, and highly personalized samples are filtered out. Finally, the network parameters are updated using two-level modality matching to learn the correlation between voice and face.
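The identity-level weighting step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the 1 / (1 + loss) weighting rule and the drop_fraction parameter are all assumptions made for the example.

```python
from collections import defaultdict

def identity_average_loss(losses, ids):
    """Average the per-sample losses of each identity."""
    totals, counts = defaultdict(float), defaultdict(int)
    for loss, pid in zip(losses, ids):
        totals[pid] += loss
        counts[pid] += 1
    return {pid: totals[pid] / counts[pid] for pid in totals}

def identity_weights(avg_loss, drop_fraction=0.25):
    """Assumed rule: identities with the highest average loss are filtered
    out (weight 0); the rest are weighted by 1 / (1 + average loss)."""
    ranked = sorted(avg_loss, key=avg_loss.get, reverse=True)
    dropped = set(ranked[:int(len(ranked) * drop_fraction)])
    return {pid: 0.0 if pid in dropped else 1.0 / (1.0 + avg_loss[pid])
            for pid in avg_loss}

def weighted_batch_loss(losses, ids, weights):
    """Weighted mean of the sample losses, using each sample's identity weight."""
    total_weight = sum(weights[pid] for pid in ids)
    if total_weight == 0:
        return 0.0
    return sum(weights[pid] * loss for loss, pid in zip(losses, ids)) / total_weight

# Toy per-sample losses and identity labels (made up for illustration).
losses = [0.2, 0.4, 1.0, 3.0, 5.0, 0.0]
ids = ["a", "a", "b", "c", "c", "d"]

avg = identity_average_loss(losses, ids)
weights = identity_weights(avg)
batch_loss = weighted_batch_loss(losses, ids, weights)
```

In this toy batch, identity "c" has the highest average loss and is the one filtered out, so its samples no longer contribute to the loss that would drive the parameter update.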

At present, given one voice and several face pictures of which only one is correct, this AI algorithm matches the voice to the right face about 87.2% of the time, while human judgment under the same conditions is correct about 81.3% of the time. When the candidates are restricted to the same gender, human accuracy drops to about 57.1%, whereas the AI is reported to be more robust, so its accuracy remains comparatively stable.
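To make the matching task concrete, here is a minimal sketch of how such an accuracy figure could be computed once voices and faces have been mapped into a shared embedding space. The cosine-similarity scoring, the function names and the toy embeddings are assumptions for illustration; they are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_voice_to_face(voice_emb, face_embs):
    """Return the index of the candidate face most similar to the voice."""
    scores = [cosine(voice_emb, face) for face in face_embs]
    return max(range(len(scores)), key=scores.__getitem__)

def matching_accuracy(trials):
    """trials: list of (voice_emb, candidate_face_embs, correct_index)."""
    hits = sum(match_voice_to_face(voice, faces) == correct
               for voice, faces, correct in trials)
    return hits / len(trials)

# Toy 2-D embeddings (made up for illustration, not real model outputs).
trials = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], 0),    # matched correctly
    ([0.0, 1.0], [[1.0, 0.0], [0.2, 0.8]], 1),    # matched correctly
    ([1.0, 1.0], [[1.0, -1.0], [-1.0, 0.0]], 1),  # matched incorrectly
]
accuracy = matching_accuracy(trials)
```

Here two of the three toy trials are matched correctly, so accuracy is 2/3. Restricting the candidates to the same gender would simply mean building each trial's candidate list from faces of one gender, which removes an easy cue and makes the task harder.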

An MIT team published the paper “Speech2Face: Learning the Face Behind a Voice” at CVPR 2019. The research team gave their neural network the intuitive name “Speech2Face”.

In Speech2Face, the researchers used the AVSpeech dataset (millions of video clips from YouTube, with speech data from more than 100,000 people) as the basis. They fed face images and speech audio clips into a face encoder network and a speech encoder network, respectively, extracted low-dimensional 4096-D face features, and learned the correlation between the face images and the speech. A separately trained face decoder model then decodes the predicted face features into a standard face image.
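The data flow described above can be sketched with placeholder functions. Everything here is a toy stand-in: the three "networks" are trivial functions invented for the example, and the mean absolute difference is only an assumed substitute for the actual training loss. Only the 4096-D feature size comes from the text.

```python
FACE_FEATURE_DIM = 4096  # feature size quoted in the article

def toy_voice_encoder(audio_clip):
    """Placeholder speech encoder: audio samples -> predicted 4096-D face feature."""
    mean = sum(audio_clip) / len(audio_clip)
    return [mean] * FACE_FEATURE_DIM

def toy_face_encoder(face_image):
    """Placeholder face encoder: flat pixel list -> 4096-D face feature."""
    mean = sum(face_image) / len(face_image)
    return [mean] * FACE_FEATURE_DIM

def toy_face_decoder(face_feature, size=4):
    """Placeholder decoder: face feature -> size x size 'standard' image."""
    return [face_feature[r * size:(r + 1) * size] for r in range(size)]

def feature_distance(predicted, target):
    """Mean absolute difference between predicted and true face features."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def speech_to_face(audio_clip):
    """At inference time only the voice is needed: encode it, then decode."""
    return toy_face_decoder(toy_voice_encoder(audio_clip))

audio_clip = [0.1, 0.3, 0.5]   # made-up audio samples
face_image = [0.2, 0.6]        # made-up pixel values

# During training, the voice-derived feature is pulled toward the face feature.
training_signal = feature_distance(toy_voice_encoder(audio_clip),
                                   toy_face_encoder(face_image))
image = speech_to_face(audio_clip)
```

The point of the sketch is the separation of roles: the speech encoder is trained to predict the face feature, while the decoder is trained separately and only turns a feature into an image.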

The face images reconstructed by Speech2Face are highly consistent with the real faces in terms of age, gender, ethnicity and craniofacial features. Of course, this neural network model is still at the research stage, and because some people’s voices are very distinctive, the system can misjudge them. For example, boys whose voices have not yet changed may be taken for girls, men with hoarse voices for older men, and Asians who speak fluent English for white people.

About Datatang

Founded in 2011, Datatang is a professional artificial intelligence data service provider committed to providing high-quality training data and data services for AI companies worldwide. Relying on its own data resources, technical advantages and extensive data processing experience, Datatang provides data services to 1000+ companies and institutions worldwide. Datatang was listed on the Chinese stock market (NEEQ: 831428) in 2014, becoming the first listed company in China’s artificial intelligence data service industry.

If you need data services, please feel free to contact us:




Off-the-shelf AI training data, on-demand data collection & annotation services