Cesar Borja

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

Mar 25, 2025

Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo

Abstract: Information retrieval in hour-long videos is a significant challenge even for state-of-the-art Vision-Language Models (VLMs), particularly when the desired information is localized within a small subset of frames. Long videos are difficult for VLMs because of context window limitations and the difficulty of pinpointing the frames that contain the answer. Our novel video agent, FALCONEye, combines a VLM and a Large Language Model (LLM) to search for relevant information along the video and locate the frames containing the answer. FALCONEye's novelty lies in 1) the proposed meta-architecture, which is better suited to hour-long videos than state-of-the-art approaches designed for short videos; 2) a new efficient exploration algorithm that locates the information using short clips, captions, and answer confidence; and 3) our calibration analysis of state-of-the-art VLMs for the answer confidence. Our agent is built on a small VLM and a medium-size LLM, so it can run on standard computational resources. We also release FALCON-Bench, a benchmark to evaluate Video Answer Search on long videos (average > 1 hour), highlighting the need for open-ended question evaluation. Our experiments show that FALCONEye outperforms the state of the art on FALCON-Bench and performs similarly or better on related benchmarks.
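
To make the idea of a confidence-guided search over short clips more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract gives no implementation details, so the coarse-to-fine splitting strategy and every name and parameter (`Clip`, `Result`, `search_answer`, `score_clip`, `answer_clip`, `threshold`, `parts`, `top_k`, `min_len`) are assumptions made purely for illustration. The two callables stand in for an LLM that scores clip captions and a VLM that returns an answer with a confidence value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Clip:
    start: float  # seconds
    end: float    # seconds


@dataclass
class Result:
    answer: Optional[str]
    confidence: float
    clip: Clip


def split(clip: Clip, parts: int) -> List[Clip]:
    """Divide a clip into `parts` equal-length sub-clips."""
    step = (clip.end - clip.start) / parts
    return [Clip(clip.start + i * step, clip.start + (i + 1) * step)
            for i in range(parts)]


def search_answer(
    video: Clip,
    question: str,
    score_clip: Callable[[Clip, str], float],               # hypothetical caption scorer
    answer_clip: Callable[[Clip, str], Tuple[str, float]],  # hypothetical VLM wrapper
    threshold: float = 0.8,   # assumed confidence needed to stop
    parts: int = 4,           # assumed branching factor
    top_k: int = 2,           # assumed number of clips explored per segment
    min_len: float = 30.0,    # assumed shortest clip worth splitting (seconds)
) -> Result:
    """Coarse-to-fine search: rank sub-clips, query the most promising ones,
    and stop as soon as an answer is confident enough."""
    best = Result(None, 0.0, video)
    frontier = [video]
    while frontier:
        segment = frontier.pop(0)
        ranked = sorted(split(segment, parts),
                        key=lambda c: score_clip(c, question), reverse=True)
        for clip in ranked[:top_k]:
            answer, conf = answer_clip(clip, question)
            if conf > best.confidence:
                best = Result(answer, conf, clip)
            if conf >= threshold:
                return best
            if clip.end - clip.start > min_len:
                frontier.append(clip)  # refine this clip later
    return best  # no answer reached the threshold; return the best seen


# Toy stand-ins for the models: the answer is visible around t = 2000 s,
# and the fake VLM is only confident on clips of at most 60 s.
EVENT_TIME = 2000.0

def toy_score(clip: Clip, question: str) -> float:
    return 1.0 if clip.start <= EVENT_TIME < clip.end else 0.0

def toy_answer(clip: Clip, question: str) -> Tuple[str, float]:
    visible = clip.start <= EVENT_TIME < clip.end
    if visible and clip.end - clip.start <= 60.0:
        return "red car", 0.9
    return "unknown", 0.1

result = search_answer(Clip(0.0, 3600.0), "What passes by?", toy_score, toy_answer)
print(result)
```

In this sketch the confidence value doubles as the stopping criterion, so the search is only as reliable as that confidence is calibrated, which is one way to read the abstract's emphasis on calibration analysis.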
