Naoki Katsura

RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

Aug 23, 2023

Shuhei Kurita, Naoki Katsura, Eri Onami

Abstract: Grounding textual expressions on scene objects from first-person views is a demanding capability for developing agents that are aware of their surroundings and follow intuitive text instructions. Such a capability is necessary for glass devices or autonomous robots to localize referred objects in the real world. In conventional referring expression comprehension tasks on images, however, datasets are mostly constructed from web-crawled data and do not reflect the diversity of real-world scenes and objects in which textual expressions must be grounded. Recently, Ego4D, a massive-scale egocentric video dataset, was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, and manufacturing. Based on the egocentric videos of Ego4D, we constructed RefEgo, a broad-coverage video-based referring expression comprehension dataset. Our dataset includes more than 12k video clips and 41 hours of video annotated for video-based referring expression comprehension. In experiments, we combine state-of-the-art 2D referring expression comprehension models with an object tracking algorithm, achieving video-wise referred object tracking even in difficult conditions: when the referred object goes out of frame in the middle of the video, or when multiple similar objects are present in the video.

* 15 pages, 11 figures. ICCV 2023
