The interpreter in me

Image: Aniriana/

New technology merges interpreter's voice and facial expression with speaker's face

By Frank Grünberg

Video conferences involving people who speak different languages can be tiring. If simultaneous interpreting is used, the facial expression of the speaker and the voice of the interpreter in the background do not match up. This separation of visual and acoustic information makes understanding more difficult.

With this in mind, FAU researchers, together with colleagues at the Max Planck Institute in Saarbrücken and Stanford University in California, have developed a technology that combines the two channels in one image. ‘We merge the interpreter’s voice and facial expression with the speaker’s face,’ explains Prof. Dr. Marc Stamminger from the Chair of Computer Science 9 (Computer Graphics). ‘And we are the first to achieve this in real time without additional facial markers.’

30 parameters determine a face

The idea of transferring the movements of a real person onto a digitally animated figure is nothing new. The film industry uses this technique to bring avatars to life. However, this not only requires marker technology that captures the original movement, it also takes a long time, even when supercomputers are used.

The FAU researchers are taking a different approach. They take advantage of existing 3D models that can depict the face of an individual central European on the basis of 30 parameters, and they use their expertise to turn a commercially available graphics card into a small supercomputer.

First, a photograph of the interpreter’s face is taken with a standard time-of-flight camera. This not only captures the shape and texture of the face, with features such as scars or birthmarks, it also provides three-dimensional information on features like the curvature of the nose and forehead. Next, a computer program configures the 30 parameters in such a way that the 3D model fits onto the face like a kind of mask. The software adjusts the parameters in the same way as a sound engineer who moves the sliders on their mixing desk up and down until they have found the best settings for the perfect sound.
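
The slider analogy can be made concrete with a small sketch. It assumes a linear face model (a mean shape plus a weighted sum of basis shapes), which the article does not spell out, and it uses three made-up parameters instead of 30. The names `synthesize` and `fit_parameters` and all numbers are hypothetical; this is not the researchers' optimiser, only an illustration of adjusting one slider at a time until the model matches the measurement.

```python
def synthesize(mean, basis, params):
    """Assumed linear face model: mean shape plus weighted basis shapes."""
    out = list(mean)
    for weight, direction in zip(params, basis):
        for i, d in enumerate(direction):
            out[i] += weight * d
    return out


def fit_parameters(mean, basis, observed, sweeps=200):
    """Move one 'slider' at a time to the value that best reduces the error."""
    params = [0.0] * len(basis)
    for _ in range(sweeps):
        for k, direction in enumerate(basis):
            current = synthesize(mean, basis, params)
            residual = [o - c for o, c in zip(observed, current)]
            # Best correction for slider k while the other sliders stay fixed.
            step = (sum(r * d for r, d in zip(residual, direction))
                    / sum(d * d for d in direction))
            params[k] += step
    return params


# Toy data (hypothetical): six measured values, three parameters.
mean = [0.0, 1.0, 2.0, 1.0, 0.0, 0.5]
basis = [
    [1.0, 0.0, 0.5, 0.0, 0.0, 0.2],
    [0.0, 1.0, 0.0, 0.5, 0.0, -0.1],
    [0.2, 0.0, 1.0, 0.0, 1.0, 0.0],
]
true_params = [0.8, -0.3, 0.5]
observed = synthesize(mean, basis, true_params)  # stands in for the camera data

fitted = fit_parameters(mean, basis, observed)
print([round(p, 4) for p in fitted])
```

In this toy setting the fitted values recover the parameters that generated the measurement; real camera data would be noisy, so the fit would only be approximate.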

Just as for the face shape, there are also parameters – around 70 of them – that allow facial expressions to be captured. When the interpreter begins to speak, their facial expression, which changes constantly while they are speaking, is readjusted in the model several times per second. To make sure that there is no time delay during this readjustment, the FAU researchers adapt their software to the special architecture of graphics cards. ‘These cards offer up to 2000 individual processor cores,’ explains Justus Thies, who made a significant contribution to the facial reenactment project as part of his doctoral degree.
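
Many cores only help if the work can be cut into independent pieces. The following minimal sketch shows that structure, using Python threads as a stand-in for GPU cores: the fitting error is computed separately for chunks of points and the partial results are then added up. The function names and the chunking scheme are assumptions for illustration, and Python threads will not deliver a real speed-up for this kind of arithmetic; the sketch only shows how the work is divided.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_error(chunk):
    """Squared error for one chunk of (model, observed) value pairs."""
    return sum((m - o) ** 2 for m, o in chunk)


def split(items, parts):
    """Cut a list into at most `parts` contiguous chunks."""
    size = max(1, (len(items) + parts - 1) // parts)
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_error(model, observed, workers=4):
    """Add up the per-chunk errors computed by a pool of workers."""
    pairs = list(zip(model, observed))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_error, split(pairs, workers)))


# Toy data (hypothetical): 1000 model values and shifted "measurements".
model = [0.1 * i for i in range(1000)]
observed = [0.1 * i + 0.5 for i in range(1000)]

total = parallel_error(model, observed)
serial = partial_error(list(zip(model, observed)))
print(total, serial)
```

Because each chunk depends only on its own points, the chunks can be processed in any order or at the same time, and the combined result agrees with the serial computation up to rounding.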

‘By dividing the optimisation program into many small computations that are carried out in parallel, we are able to shorten the time the whole process takes down to a fraction of a second.’ Based on the data, the image generator moves the facial expression of the 3D model of the speaker in parallel to the words spoken by the interpreter. The result is that the speaker on the monitor looks like they are speaking normally, with the visual and acoustic information matching up.
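
Keeping face shape and facial expression in separate parameter sets is what makes such a transfer possible in principle. As a rough sketch, and assuming that the interpreter's change in expression relative to a neutral face can simply be added to the speaker's expression parameters (a simplification that the article does not confirm), the step might look like this. `FaceState`, `transfer_expression` and the three-value expression vectors are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaceState:
    identity: List[float]    # face shape, fixed for each person
    expression: List[float]  # facial expression, changes from frame to frame


def transfer_expression(speaker, interpreter, interpreter_neutral):
    """Apply the interpreter's expression change to the speaker's face."""
    # How far the interpreter's face has moved away from its neutral pose.
    delta = [e - n for e, n in zip(interpreter.expression, interpreter_neutral)]
    moved = [s + d for s, d in zip(speaker.expression, delta)]
    # The speaker's identity parameters are left untouched.
    return FaceState(identity=list(speaker.identity), expression=moved)


# Toy values (hypothetical): two identity and three expression parameters.
speaker = FaceState(identity=[0.5, -1.0], expression=[0.0, 0.0, 0.0])
interpreter = FaceState(identity=[-0.25, 0.75], expression=[0.5, 0.25, 0.0])
interpreter_neutral = [0.25, 0.25, 0.0]

result = transfer_expression(speaker, interpreter, interpreter_neutral)
print(result)
```

The speaker keeps their own face shape while only the expression values follow the interpreter, which mirrors the idea of the speaker appearing to say the interpreter's words.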

Soon to be available on smartphones

The collaborative project was funded by the German Research Foundation (DFG) as part of the Research Training Group ‘Heterogeneous image systems’. The researchers presented the innovative concept, along with a prototype, for the first time at SIGGRAPH Asia, the Asian edition of the world’s largest conference on computer graphics, in Kobe, Japan, in November 2015.

What do the researchers aim to do next? ‘The system still needs a well-equipped PC,’ Prof. Stamminger says. ‘In the future we aim to adapt the system so that it can be used on smartphones.’

Further information:

See how facial reenactment works in this video: