Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)
Malcolm Slaney, Michele Covell
FaceSync is an optimal linear algorithm that finds the degree of syn(cid:173) chronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to com(cid:173) bine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchro(cid:173) nization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by comput(cid:173) ing the correlation matrices.
1 Motivation In many applications, we want to know about the synchronization between an audio signal and the corresponding image data. In a teleconferencing system, we might want to know which of the several people imaged by a camera is heard by the microphones; then, we can direct the camera to the speaker. In post-production for a film, clean audio dialog is often dubbed over the video; we want to adjust the audio signal so that the lip-sync is perfect. When analyzing a film, we want to know when the person talking is in the shot, instead of off camera. When evaluating the quality of dubbed films, we can measure of how well the translated words and audio fit the actor's face.
This paper describes an algorithm, FaceSync, that measures the degree of synchronization between the video image of a face and the associated audio signal. We can do this task by synthesizing the talking face, using techniques such as Video Rewrite [1], and then com(cid:173) paring the synthesized video with the test video. That process, however, is expensive. Our solution finds a linear operator that, when applied to the audio and video signals, generates an audio-video-synchronization-error signal. The linear operator gathers information from throughout the image and thus allows us to do the computation inexpensively. Hershey and Movellan [2] describe an approach based on measuring the mutual informa(cid:173) tion between the audio signal and individual pixels in the video. The correlation between the audio signal, x, and one pixel in the image y, is given by Pearson's correlation, r. The mutual information between these two variables is given by f(x,y) = -1/2 log(l-?). They create movies that show the regions of the video that have high correlation with the audio;