
VideoLucy: Deep Memory Backtracking for Long Video Understanding

1 National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
2 NUS  3 S-Lab, NTU  4 Shanghai AI Lab
NeurIPS-2025 Accepted Paper

*Corresponding Authors

Qualitative comparison of event understanding in long videos. Compared with existing leading video MLLMs, our VideoLucy stands out in capturing and integrating cross-temporal events in long videos, along with an explainable and comprehensive reasoning process.

Abstract

We propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o.

Method


Comparison between our VideoLucy and existing video agent-based systems. In (a), existing systems typically perform frame-level captioning on sparsely sampled frames and then search over the resulting captions, which causes substantial information loss and hampers temporal understanding. In (b), our VideoLucy uses a hierarchical memory structure and a memory backtracking mechanism to build a multi-level video representation and achieve comprehensive information coverage.
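To make the hierarchical memory concrete, below is a minimal Python sketch of a multi-level memory store in which each entry records its hierarchy depth, temporal scope, and caption. The class and field names (MemoryEntry, HierarchicalMemory, level, start_s, end_s) are illustrative assumptions, not the released VideoLucy code.

from dataclasses import dataclass, field

# Hypothetical sketch of a multi-level memory store; class and field names
# are illustrative, not taken from the released VideoLucy code.
@dataclass
class MemoryEntry:
    level: int      # hierarchy depth (0 = coarsest, larger = finer detail)
    start_s: float  # temporal scope covered by this entry, in seconds
    end_s: float
    caption: str    # textual description at this level's granularity

@dataclass
class HierarchicalMemory:
    levels: dict[int, list[MemoryEntry]] = field(default_factory=dict)

    def add(self, entry: MemoryEntry) -> None:
        self.levels.setdefault(entry.level, []).append(entry)

    def overlapping(self, level: int, start_s: float, end_s: float) -> list[MemoryEntry]:
        """Entries at `level` whose time spans overlap [start_s, end_s]."""
        return [e for e in self.levels.get(level, [])
                if e.end_s > start_s and e.start_s < end_s]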


We propose a novel iterative backtracking mechanism. Through an agent-driven iterative loop, we continuously update the current memory, initialized from sparse coarse memory, to dynamically explore question-relevant memory in both breadth and depth. This mechanism emulates the human recollection process and achieves a comprehensive search and integration of question-relevant information at relatively low resource cost.
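The loop below sketches how such an agent-driven backtracking iteration could be wired up, reusing the hypothetical memory classes from the sketch above; it stops when the agent reports a confident answer or the iteration budget runs out. The agent callable, the captioning interface, and the decision-dictionary keys ("confident", "answer", "backtrack") are assumptions for illustration, not the authors' implementation.

from typing import Callable

# Sketch of the agent-driven backtracking loop, reusing the hypothetical
# HierarchicalMemory/MemoryEntry classes above. The agent interface and the
# decision keys ("confident", "answer", "backtrack") are assumptions.
def answer_with_backtracking(
    question: str,
    memory: HierarchicalMemory,                               # pre-filled with sparse coarse captions
    caption_segment: Callable[[int, float, float], str],      # (level, start_s, end_s) -> caption
    agent_decide: Callable[[str, HierarchicalMemory], dict],  # LLM agent reads memory, returns a decision
    max_iters: int = 5,
) -> str:
    for _ in range(max_iters):
        decision = agent_decide(question, memory)
        if decision.get("confident"):
            return decision["answer"]
        # Revisit the flagged spans at finer granularity and fold the new,
        # more detailed captions back into the current memory.
        for level, start_s, end_s in decision.get("backtrack", []):
            memory.add(MemoryEntry(level, start_s, end_s,
                                   caption_segment(level, start_s, end_s)))
    # Iteration budget exhausted: return a best-effort answer from what was gathered.
    return agent_decide(question, memory).get("answer", "")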

EgoMem Benchmark


We construct a new benchmark for ultra-long video understanding, namely EgoMem, which measures a model's ability to form instantaneous memory (detail perception) and continuous memory (event understanding) over long videos. Based on the video resources of EgoLife, we manually annotate question-answer pairs for each day-long video, focusing on the understanding of cross-temporal events and the perception of instantaneous visual features. For event understanding, we design six question types to evaluate model performance comprehensively and to avoid shortcut solutions. In addition, we manually annotate questions about subtle visual features within instantaneous time segments to assess whether the model effectively covers detailed information. EgoMem contains 42 videos with an average duration of 6.33 hours and 504 questions.


This is a clip from the original long video, provided as a demo.
An example for the Detail Perception task in EgoMem:
Jake picked up KFC fast food from the delivery person for lunch. What was the delivery person wearing when Jake received the food?
A. A blue denim jacket, a blue-green long dress, and white sports shoes.
B. A light green long-sleeved shirt, black long pants, and black leather shoes.
C. A green long-sleeved jacket, black long pants, and white sports shoes.
D. An orange-gray T-shirt, khaki loose long pants, and gray sports shoes.
Reason: Within about 3 seconds before and after 14:09:41, when Jake picked up the food from the delivery person, the delivery person was wearing a green long-sleeved jacket, black long pants, and white sports shoes.
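As a rough illustration of how an EgoMem-style item like the one above could be stored and scored, the sketch below defines a multiple-choice record and a plain accuracy metric. The schema and field names are hypothetical; the actual benchmark files may be organized differently.

from dataclasses import dataclass

# Illustrative record layout for an EgoMem-style multiple-choice item; the
# actual benchmark's schema and field names may differ.
@dataclass
class EgoMemItem:
    video_id: str            # one full-day EgoLife recording (6.33 h on average)
    task: str                # "event_understanding" or "detail_perception"
    question: str
    options: dict[str, str]  # {"A": ..., "B": ..., "C": ..., "D": ...}
    answer: str              # ground-truth option letter
    reason: str              # human-written rationale, e.g. the timestamped evidence above

def accuracy(items: list[EgoMemItem], predicted_letters: list[str]) -> float:
    """Fraction of items whose predicted option letter matches the ground truth."""
    correct = sum(p == item.answer for item, p in zip(items, predicted_letters))
    return correct / max(len(items), 1)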

Needle-in-A-Video-Haystack


We conduct a Needle-in-A-Video-Haystack evaluation. Specifically, we randomly select 10 long videos with durations ranging from 400 s to 4000 s from existing benchmarks. We then insert 10 s short video clips (needles) from the Internet at five arbitrary timestamps spread across each long video, from beginning to end. The entire long video is fed into the model, which is then asked questions about the content of these short clips; there are 4 questions per clip, giving 20 questions per long video. VideoLucy significantly outperforms existing leading models, and its results are almost unaffected by video length, indicating a very strong ability to locate question-relevant details in long videos. More experiments and details can be found in the paper.
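A simplified sketch of the needle-insertion step is given below: it splices needle clips into a frame list at random timestamps and returns the insertion times so questions can later be checked against them. The function name, frame-level handling, and uniform timestamp sampling are assumptions made for illustration; the experiment's actual tooling may differ.

import random

# Simplified, frame-level sketch of needle insertion; the function name and
# uniform timestamp sampling are assumptions made for illustration.
def insert_needles(haystack_frames: list, needle_clips: list[list],
                   fps: int = 30, seed: int = 0) -> tuple[list, list[float]]:
    """Splice each short needle clip into the long video at a random timestamp.

    Returns the spliced frame list and the insertion times (seconds, measured
    in the original haystack) so questions can be checked against them.
    """
    rng = random.Random(seed)
    duration_s = len(haystack_frames) / fps
    # One random insertion point per needle, spread across the whole video.
    times = sorted(rng.uniform(0, duration_s) for _ in needle_clips)
    spliced, cursor = [], 0
    for t, clip in zip(times, needle_clips):
        idx = int(t * fps)
        spliced.extend(haystack_frames[cursor:idx])
        spliced.extend(clip)   # drop the ~10 s needle in at this point
        cursor = idx
    spliced.extend(haystack_frames[cursor:])
    return spliced, times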

Paper

BibTeX

@inproceedings{zuo2025videolucy,
  title={VideoLucy: Deep Memory Backtracking for Long Video Understanding},
  author={Jialong Zuo and Yongtai Deng and Lingdong Kong and Jingkang Yang and Rui Jin and Yiwei Zhang and Nong Sang and Liang Pan and Ziwei Liu and Changxin Gao},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}