Thursday, September 16 2021

Facebook AI researchers Ruohan Gao and Kristen Grauman presented VisualVoice, a new approach for audio-visual separation of speech.

“While existing methods focus on learning the alignment between the speaker’s lip movements and the sounds they generate, we propose to use the appearance of the speaker’s face as an additional prior to isolate the corresponding vocal qualities they are likely to produce,” the researchers said.

The human perceptual system relies heavily on visual information to reduce ambiguities in audio and modulate attention to an active speaker in a busy environment. Automating this process of speech separation has many interesting applications, including:

  • Assistive technology for the hearing impaired.
  • Superhuman hearing in a portable augmented reality device.
  • Better transcription of spoken content in noisy, in-the-wild internet videos.


The researchers first formally define the problem before presenting their audio-visual speech separation network. The paper then describes how audio-visual speech separation and cross-modal face-voice embeddings are learned in a multi-task learning framework, and finally presents the training criteria and inference procedures.

“We use the visual cues in the face track to guide the speech separation for each speaker. The visual stream of our network consists of two parts: a lip motion analysis network and a facial attributes analysis network,” the paper said.

The lip motion analysis network works in three stages:

  • It takes N mouth regions of interest (ROIs) as input and first passes them through a 3D convolutional layer.
  • A ShuffleNet v2 network then extracts a time-indexed sequence of feature vectors.
  • A temporal convolutional network (TCN) processes that sequence to produce the final lip motion feature map of dimension Vl × N.

For the facial attributes analysis network, the researchers used:

  • A ResNet-18 network that takes a single face image, randomly sampled from the face track, as input and extracts a face embedding encoding the speaker’s facial attributes.
  • A replication step that repeats this facial attributes feature along the time dimension so it can be concatenated with the lip motion feature map to form the final visual feature.

“The facial attributes feature represents an identity code whose role is to identify the space of expected frequencies or other audio properties for the speaker’s voice, while the role of the lip motion is to isolate the articulated speech specific to that segment. Together, they provide complementary visual cues to guide the speech separation process.”
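
The replicate-and-concatenate step for the two visual features can be sketched in a few lines. This is only an illustration, not the authors' code: plain nested lists stand in for tensors so the sketch has no dependencies, and the function name `build_visual_feature`, the symbol V_f for the face embedding size, and the toy sizes are assumptions made here.

```python
def build_visual_feature(lip_map, face_vec):
    """Tile a static face vector across time and stack it under the lip map.

    lip_map:  V_l rows, each a list of N time steps (shape V_l x N).
    face_vec: V_f values, one per face-attribute channel (shape V_f).
    Returns a nested list of shape (V_l + V_f) x N.
    """
    if not lip_map:
        raise ValueError("lip_map must have at least one channel")
    n_steps = len(lip_map[0])
    if any(len(row) != n_steps for row in lip_map):
        raise ValueError("every lip_map row must have the same number of time steps")
    # Replicate each face-attribute value N times along the time axis.
    face_map = [[value] * n_steps for value in face_vec]
    # Concatenate along the channel axis: lip channels first, then face channels.
    return [list(row) for row in lip_map] + face_map


# Toy sizes chosen only for illustration: V_l = 2, V_f = 3, N = 4.
lip_map = [[0.1, 0.2, 0.3, 0.4],
           [0.5, 0.6, 0.7, 0.8]]
face_vec = [1.0, 2.0, 3.0]

visual = build_visual_feature(lip_map, face_vec)
print(len(visual), len(visual[0]))  # 5 4
```

Because the face embedding comes from a single image, every time step sees the same identity code, while the lip channels change from frame to frame.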

On the audio side, the team uses a U-Net style network adapted for audio-visual speech separation. It consists of an encoder network and a decoder network.

Outcome and scope

Lip motion is directly correlated with speech content, which makes it the more informative cue for speech separation. However, a model that relies on lip motion alone suffers badly when the lip motion is unreliable, as is often the case in real-world videos. “Our VISUALVOICE approach combines the complementary cues in both the lip motion and the cross-modal face-voice embeddings learned with cross-modal consistency, and is therefore less vulnerable to unreliable lip motion,” the researchers said.

Cross-modal embedding learning could, in turn, benefit from being trained jointly with Facebook's speech separation model. To test this, the researchers intend to evaluate the cross-modal verification task, in which the system must decide whether a given face and voice belong to the same person.
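
A verification decision of this kind can be sketched as a similarity check between a face embedding and a voice embedding. The article does not say how the comparison is scored, so cosine similarity, the 0.5 threshold, the function names, and the toy vectors below are all assumptions for illustration rather than the authors' method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_person(face_emb, voice_emb, threshold=0.5):
    """Hypothetical decision rule: accept when similarity reaches the threshold."""
    return cosine_similarity(face_emb, voice_emb) >= threshold


# Toy embeddings, invented for this example (real ones come from trained networks).
face = [0.9, 0.1, 0.0]
matching_voice = [0.8, 0.2, 0.1]
other_voice = [0.0, 0.2, 0.9]

print(same_person(face, matching_voice))  # True
print(same_person(face, other_voice))     # False
```

In practice the threshold would be tuned on held-out face-voice pairs, trading off false accepts against false rejects.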

“Our design for the cross-modal matching and speaker consistency losses is not restricted to the speech separation task and can potentially be useful for other audio-visual applications, such as learning intermediate features for speaker identification and sound source localization. As part of our future work, we plan to explicitly model the fine-grained cross-modal attributes of faces and voices and leverage them to further enhance speech separation,” the researchers concluded.

Kumar Gandharv

Kumar Gandharv, PGD in English Journalism (IIMC, Delhi), is setting out on a journey as a tech journalist at AIM. He is a keen watcher of national and IR-related news, and he loves to hit the gym.
