YouTube deserves a round of applause — because the video platform can now include [APPLAUSE] and other sound effects in a video’s closed captions automatically. The caption expansion, announced on Thursday, March 23, is made possible by deep neural networks, a form of artificial intelligence.
For now, YouTube can automatically label only applause, music, and laughter, but those three were the sound effects that content creators most often added to captions manually. The feature builds on the automatic captioning launched in 2009 for text and adds the first sound effects to the system.
YouTube says the program works much like detecting objects in images, but it posed a few more difficulties than object recognition. To recognize just those three sounds, YouTube engineers had to teach the program to detect each sound, localize it in time, and then insert the recognized sound into the captions.
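As a rough illustration of that last step, here is a minimal Python sketch of how detected sounds might be merged into an existing caption track. It is not YouTube's code: the `Cue` type, the `insert_sound_labels` function, and the sample timings are all hypothetical, and the sketch assumes detection and timing have already been done.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    """One caption entry: start and end time in seconds, plus its text (hypothetical type)."""
    start: float
    end: float
    text: str


def insert_sound_labels(speech: List[Cue], sounds: List[Cue]) -> List[Cue]:
    """Merge detected sound events into a speech caption track.

    Each sound is rendered as a bracketed, upper-case label such as
    [APPLAUSE], and the combined track is sorted by start time.
    """
    labeled = [Cue(s.start, s.end, f"[{s.text.upper()}]") for s in sounds]
    return sorted(speech + labeled, key=lambda cue: (cue.start, cue.end))


# Made-up example data, for illustration only.
speech_track = [Cue(0.0, 2.0, "Thank you all"), Cue(5.0, 7.0, "Let's begin")]
sound_events = [Cue(2.0, 4.5, "applause")]
merged_track = insert_sound_labels(speech_track, sound_events)
```

In this made-up example, the applause label lands between the two spoken lines because its start time falls between theirs.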
The system also tended to struggle with sound effects that occurred at the same time as other sounds, such as laughter over talking. Another challenge was finding a large enough data set to train the system, because the available data was not already adequately labeled and would otherwise have had to be labeled by hand.
The deep learning network analyzes short segments in sequence and can predict the likelihood of those sound effects at a rate of about 100 frames per second. YouTube engineers also built the system in a way that will allow additional sound effects to be added later.
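To make the idea of per-frame likelihoods concrete, the sketch below shows one simple way such scores could be turned into timed events for a single sound. It assumes the network outputs one probability per frame at 100 frames per second. The `frames_to_events` function, the 0.5 threshold, and the half-second minimum duration are illustrative assumptions, not details YouTube has described.

```python
from typing import List, Optional, Tuple

FRAME_RATE = 100  # frames per second, the rate cited above


def frames_to_events(
    probs: List[float],
    threshold: float = 0.5,     # assumed cutoff for "sound is present"
    min_duration: float = 0.5,  # assumed minimum event length, in seconds
) -> List[Tuple[float, float]]:
    """Group consecutive above-threshold frames into (start, end) times in seconds."""
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                 # a run of active frames begins
        elif p < threshold and start is not None:
            runs.append((start, i))   # the run ended at the previous frame
            start = None
    if start is not None:
        runs.append((start, len(probs)))  # run continues to the end of the clip

    # Convert frame indices to seconds and drop very short blips.
    return [
        (s / FRAME_RATE, e / FRAME_RATE)
        for s, e in runs
        if (e - s) / FRAME_RATE >= min_duration
    ]


# Made-up scores: one second of quiet, two seconds of applause,
# then a 0.1-second blip that is too short to keep.
scores = [0.1] * 100 + [0.9] * 200 + [0.2] * 50 + [0.8] * 10 + [0.1] * 40
events = frames_to_events(scores)
```

Dropping very short runs is one plausible way to avoid flashing a label for a momentary spike. A production system would likely use more careful smoothing.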
So why applause, music, and laughter? Besides being the labels most frequently added by hand in the closed captioning system, each of those sounds has only one meaning. A “ring,” YouTube offered as an example, could come from a doorbell, a phone, or an alarm, presenting a whole new challenge for the software.
According to YouTube, over 15 million videos with automatic captions are viewed every day. In a test of the latest update to the auto captions, two thirds of participants said the sound effect labels enhanced the overall experience.