Source:
Researchers Ghose and Prevost created a deep learning algorithm that, given a silent video, can generate a realistic-sounding, synchronized soundtrack.
Movies frequently have sound effects added after filming to make them feel more realistic, in a process called “Foley”. Researchers at the University of Texas turned to deep learning to automate this process. They trained a neural network on 12 popular movie events for which directors frequently add Foley effects. Their model first predicts the class of sound to generate, and a sequential network then generates the sound itself. They thus used neural networks to go from temporally aligned images to the generation of sound, a whole different modality!
The first thing the researchers did was create a dataset (the Automatic Foley Dataset) of short movie clips covering 12 movie events. For some events they recorded the sounds themselves in a studio (such as cutting, footsteps, and a ticking clock). For other events (such as gunshots, a running horse, and fire) they downloaded video clips with sound from YouTube. In total they collected 1000 videos with an average duration of 5 seconds.
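As a concrete illustration, here is a minimal sketch of how clips from such a dataset could be loaded for training, assuming PyTorch and one video folder per class; the folder layout, the subset of class names, and the loading details are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch: load ~5 second clips and their sound-class labels.
# Folder layout and class names are illustrative assumptions.
import os
from torch.utils.data import Dataset
from torchvision.io import read_video

# Subset of the 12 event classes mentioned in the article (illustrative).
CLASSES = ["cutting", "footsteps", "clock", "gunshot", "horse", "fire", "rain"]

class FoleyClipDataset(Dataset):
    """Video clips sorted into one folder per sound class."""

    def __init__(self, root):
        self.samples = []
        for label, name in enumerate(CLASSES):
            class_dir = os.path.join(root, name)
            for fname in os.listdir(class_dir):
                if fname.endswith(".mp4"):
                    self.samples.append((os.path.join(class_dir, fname), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        frames, _, _ = read_video(path, pts_unit="sec")    # (T, H, W, C) uint8
        frames = frames.permute(0, 3, 1, 2).float() / 255  # (T, C, H, W) in [0, 1]
        return frames, label
```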
The next step is predicting the right sound class. For this they compared two approaches: a frame sequence network (FS-LSTM) and a frame relation network (TRN). In the frame sequence approach they take each video frame and interpolate additional frames between the existing ones for more temporal granularity. A ResNet-50 convolutional neural network (CNN) extracts image features from every frame, and the sound class is then predicted by a recurrent neural network called a Fast-Slow LSTM fed with these features. The frame relation network (or more precisely, the Multi-Scale Temporal Relation Network) instead tries to capture the detailed transformations and actions of the objects at a lower computational cost: it compares features from frames that are N steps apart, where N takes on multiple values, and in the end all these features are combined again using a multilayer perceptron. Both ideas are sketched in the code below.
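To make both approaches concrete, here is a rough sketch in PyTorch. A plain LSTM stands in for the paper's Fast-Slow LSTM, and the relation module shows only a single temporal scale (a full Multi-Scale Temporal Relation Network would combine such relations over several gap sizes); all layer sizes are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class FrameSequenceClassifier(nn.Module):
    """Per-frame ResNet-50 features -> recurrent network -> sound class."""

    def __init__(self, num_classes=12, hidden_size=512):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()              # keep the 2048-d pooled features
        self.cnn = backbone
        self.rnn = nn.LSTM(input_size=2048, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):                   # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.rnn(feats)            # last hidden state summarizes the clip
        return self.head(h_n[-1])                # (B, num_classes)

class TwoFrameRelation(nn.Module):
    """One scale of a temporal relation network: compare frames `gap` steps apart."""

    def __init__(self, feat_dim=2048, hidden=256, num_classes=12, gap=2):
        super().__init__()
        self.gap = gap
        self.mlp = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        pairs = torch.cat([feats[:, :-self.gap], feats[:, self.gap:]], dim=-1)
        return self.mlp(pairs).mean(dim=1)       # average the pairwise relations

# Example: 8 frames of a 224x224 clip -> 12 class logits
logits = FrameSequenceClassifier()(torch.randn(2, 8, 3, 224, 224))
relation_logits = TwoFrameRelation()(torch.randn(2, 8, 2048))
```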
The last step is generating the sound for the predicted class. To do this, the researchers used an inverse short-time Fourier transform (ISTFT) method. They first compute the average of all spectrograms of each sound class in their training set, which gives a good (average) starting point, or anchor, for sound generation. The neural network then only has to predict the delta to this average sound anchor at every sampling step of the sound.
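The anchor-plus-delta idea can be sketched roughly as follows. Because only a magnitude spectrogram is handled here, Griffin-Lim phase estimation is used before inverting to a waveform, and the "predicted" delta is just random noise to show the shape of the pipeline; these choices, and the use of librosa, are assumptions for illustration rather than the paper's exact generation method.

```python
# Simplified sketch: per-class average spectrogram ("anchor") + predicted delta,
# then inversion to a waveform (Griffin-Lim estimates the missing phase).
# The random delta stands in for the sequential network's prediction.
import numpy as np
import librosa

N_FFT, HOP, SR = 1024, 256, 22050  # assumed analysis settings

def class_anchor(magnitude_spectrograms):
    """Average magnitude spectrogram over all training clips of one class."""
    return np.mean(np.stack(magnitude_spectrograms), axis=0)

def synthesize(anchor, delta):
    """Anchor + delta -> magnitude spectrogram -> waveform."""
    magnitude = np.clip(anchor + delta, a_min=0.0, a_max=None)
    return librosa.griffinlim(magnitude, n_fft=N_FFT, hop_length=HOP)

# Toy usage: anchor built from two fake 5-second clips, plus a random "delta".
clips = [np.abs(librosa.stft(np.random.randn(5 * SR), n_fft=N_FFT, hop_length=HOP))
         for _ in range(2)]
anchor = class_anchor(clips)
waveform = synthesize(anchor, delta=0.1 * np.random.randn(*anchor.shape))
```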
Four different methods were used to evaluate the performance of the algorithm, including a human qualitative evaluation. Local college students were asked to pick the most realistic sound, the most suitable sound, the one with the least noise, and the best-synchronized sound sample. The students preferred the synthesized sound over the original sound in 73.71 percent of cases for one model, and in 65.96 percent of cases for the other. Which model was preferred also depended on what was in the video: one model performed better on scenes with many random action changes.
You can judge for yourself whether the final result feels realistic with this video of a fire, this video of a horse, and this video of rain. You can read more about their approach in their paper.