Latest Computer Vision Research Presents a New Audio-Visual Framework, “ECLIPSE”, for Long-Range Video Retrieval


Video has become the primary medium for sharing information online. Roughly 80% of all Internet traffic is video content, and that share is expected to keep growing in the years to come. As a result, an enormous amount of video data is now available.

We all use Google to retrieve information online. If we are looking for text on a specific topic, we type the keywords and are greeted by countless posts on that topic. The same goes for image search: just type the keywords and you will see the image you are looking for. But what about video? How can we retrieve a video simply by describing it in text? This is the problem that text-to-video retrieval tries to solve.

Traditional video retrieval methods are primarily designed to work with short videos (e.g., 5-15 seconds), and this limitation usually makes them insufficient for retrieving complex activities.

Imagine a video about making burgers from scratch. It may take an hour or even longer: prepare the dough for the buns, let it rest, chop the meat, shape the patties, bake the buns, toast them, assemble the burger, and so on. If you want to extract step-by-step instructions from such a video, it would be useful to grab the few relevant minutes of the long video for each step. However, traditional video retrieval methods cannot do this, as they fail to analyze long video content.

So we need a better video retrieval system if we want to eliminate the short-duration limitation. Traditional methods can be adapted to longer videos by increasing the number of input frames. However, this is impractical due to high computational cost, as processing densely sampled frames is extremely time- and resource-consuming.

This is where ECLIPSE comes in. Instead of relying solely on expensive-to-process video frames, it uses rich auditory cues together with sparsely sampled video frames, which are much cheaper to process. ECLIPSE is not only more efficient than conventional video-only techniques; it also achieves higher text-to-video retrieval accuracy.

Although the video modality carries a lot of information, it is also highly redundant: the visual content often changes little between frames. In comparison, audio can more compactly capture details about people, objects, scenes, and other complex events. It is also far cheaper to process than raw video frames.

Going back to our burger example, visual cues such as the dough, the burger buns, and the patties appear in many frames and stay largely the same for most of the video. The audio, however, offers more distinctive cues, such as the sound of the patty sizzling on the grill.

ECLIPSE uses CLIP, a state-of-the-art vision-and-language model, as its backbone. To adapt CLIP to long-range videos, ECLIPSE inserts a dual-path audio-visual (AV) attention block into each layer of the transformer backbone. Through this cross-modal attention mechanism, long-range temporal cues from the audio stream are injected into the visual representation. Conversely, rich visual features from the video modality are injected into the audio representation to increase the expressiveness of the audio features.
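To make the dual-path idea concrete, here is a minimal sketch of one such block in NumPy. The single-head design, the token shapes, and the residual connections are illustrative assumptions for clarity, not the paper's exact architecture: the point is only that each stream queries the other via scaled dot-product cross-attention.

```python
# Hedged sketch of a dual-path audio-visual cross-attention block.
# Single-head attention and the shapes below are illustrative assumptions,
# not ECLIPSE's exact implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries attend to keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (Tq, Tk)
    return softmax(scores, axis=-1) @ values    # (Tq, d)

def dual_path_av_block(video_tokens, audio_tokens):
    """Audio cues flow into the visual stream, and visual cues flow
    into the audio stream, each with a residual connection."""
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens, audio_tokens)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens, video_tokens)
    return video_out, audio_out

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 64))   # 8 sparsely sampled frame tokens
audio = rng.standard_normal((32, 64))  # 32 audio tokens covering a long range
v, a = dual_path_av_block(video, audio)
print(v.shape, a.shape)  # (8, 64) (32, 64)
```

Note how the asymmetry matches the efficiency argument: only a handful of frame tokens is needed, while the cheap audio tokens supply the long-range temporal context.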

This was a brief summary of the ECLIPSE paper. ECLIPSE replaces expensive visual cues from video with audio cues that are inexpensive to process, and it outperforms video-only methods. It is flexible, fast, memory-efficient, and achieves top performance on long-range video retrieval tasks. You can find related links below if you want to learn more about ECLIPSE.

This article is a research summary written by Marktechpost staff based on the research paper 'ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound'. All credit for this research goes to the researchers on this project. Check out the paper and the GitHub link.


Ekrem Çetinkaya obtained his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.

