Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Taichi Nishimura

EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts

Oct 07, 2024

Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, Shinsuke Mori

Figure 1 for EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts

Figure 2 for EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts

Figure 3 for EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts

Figure 4 for EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos with Procedural Texts

Abstract:Mistake action detection from egocentric videos is crucial for developing intelligent archives that detect workers' errors and provide feedback. Previous studies have been limited to specific domains, focused on detecting mistakes from videos without procedural texts, and analyzed whether actions are mistakes. To address these limitations, in this paper, we propose the EgoOops dataset, which includes egocentric videos, procedural texts, and three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. EgoOops covers five procedural domains and includes 50 egocentric videos. The video-text alignment allows the model to detect mistakes based on both videos and procedural texts. The mistake labels and descriptions enable detailed analysis of real-world mistakes. Based on EgoOops, we tackle two tasks: video-text alignment and mistake detection. For video-text alignment, we enhance the recent StepFormer model with an additional loss for fine-tuning. Based on the alignment results, we propose a multi-modal classifier to predict mistake labels. In our experiments, the proposed methods achieve higher performance than the baselines. In addition, our ablation study demonstrates the effectiveness of combining videos and texts. We will release the dataset and codes upon publication.

Via

Access Paper or Ask Questions

DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Sep 18, 2024

Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu

Figure 1 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Figure 2 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Figure 3 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Figure 4 for DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

Abstract:Current audio-visual representation learning can capture rough object categories (e.g., ``animals'' and ``instruments''), but it lacks the ability to recognize fine-grained details, such as specific categories like ``dogs'' and ``flutes'' within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method of audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.

* under review

Via

Access Paper or Ask Questions

Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Aug 06, 2024

Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu

Figure 1 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Figure 2 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Figure 3 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Figure 4 for Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection

Abstract:We propose Lighthouse, a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). Although researchers proposed various MR-HD approaches, the research community holds two main issues. The first is a lack of comprehensive and reproducible experiments across various methods, datasets, and video-text features. This is because no unified training and evaluation codebase covers multiple settings. The second is user-unfriendly design. Because previous works use different libraries, researchers set up individual environments. In addition, most works release only the training codes, requiring users to implement the whole inference process of MR-HD. Lighthouse addresses these issues by implementing a unified reproducible codebase that includes six models, three features, and five datasets. In addition, it provides an inference API and web demo to make these methods easily accessible for researchers and developers. Our experiments demonstrate that Lighthouse generally reproduces the reported scores in the reference papers. The code is available at https://github.com/line/lighthouse.

* 6 pages; library tech report

Via

Access Paper or Ask Questions

BioVL-QR: Egocentric Biochemical Video-and-Language Dataset Using Micro QR Codes

Apr 04, 2024

Taichi Nishimura, Koki Yamamoto, Yuto Haneji, Keiya Kajimura, Chihiro Nishiwaki, Eriko Daikoku, Natsuko Okuda, Fumihito Ono, Hirotaka Kameko, Shinsuke Mori

Abstract:This paper introduces a biochemical vision-and-language dataset, which consists of 24 egocentric experiment videos, corresponding protocols, and video-and-language alignments. The key challenge in the wet-lab domain is detecting equipment, reagents, and containers is difficult because the lab environment is scattered by filling objects on the table and some objects are indistinguishable. Therefore, previous studies assume that objects are manually annotated and given for downstream tasks, but this is costly and time-consuming. To address this issue, this study focuses on Micro QR Codes to detect objects automatically. From our preliminary study, we found that detecting objects only using Micro QR Codes is still difficult because the researchers manipulate objects, causing blur and occlusion frequently. To address this, we also propose a novel object labeling method by combining a Micro QR Code detector and an off-the-shelf hand object detector. As one of the applications of our dataset, we conduct the task of generating protocols from experiment videos and find that our approach can generate accurate protocols.

* 6 pages

Via

Access Paper or Ask Questions

Text-driven Affordance Learning from Egocentric Vision

Apr 03, 2024

Tomoya Yoshida, Shuhei Kurita, Taichi Nishimura, Shinsuke Mori

Abstract:Visual affordance learning is a key component for robots to understand how to interact with objects. Conventional approaches in this field rely on pre-defined objects and actions, falling short of capturing diverse interactions in realworld scenarios. The key idea of our approach is employing textual instruction, targeting various affordances for a wide range of objects. This approach covers both hand-object and tool-object interactions. We introduce text-driven affordance learning, aiming to learn contact points and manipulation trajectories from an egocentric view following textual instruction. In our task, contact points are represented as heatmaps, and the manipulation trajectory as sequences of coordinates that incorporate both linear and rotational movements for various manipulations. However, when we gather data for this task, manual annotations of these diverse interactions are costly. To this end, we propose a pseudo dataset creation pipeline and build a large pseudo-training dataset: TextAFF80K, consisting of over 80K instances of the contact points, trajectories, images, and text tuples. We extend existing referring expression comprehension models for our task, and experimental results show that our approach robustly handles multiple affordances, serving as a new standard for affordance learning in real-world scenarios.

Via

Access Paper or Ask Questions

Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Mar 25, 2024

Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai, Taichi Nishimura, Shinsuke Mori

Figure 1 for Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Figure 2 for Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Figure 3 for Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Figure 4 for Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Abstract:Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from the coverage of location expressions on general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, location expressions of the article title and those hyperlinks to other articles are assigned with coordinates. By utilizing hyperlinks, we can accurately assign location expressions with coordinates even with ambiguous location expressions in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.

* LREC-COLING 2024

Via

Access Paper or Ask Questions

On the Audio Hallucinations in Large Audio-Video Language Models

Jan 18, 2024

Taichi Nishimura, Shota Nakada, Masayoshi Kondo

Abstract:Large audio-video language models can generate descriptions for both video and audio. However, they sometimes ignore audio content, producing audio descriptions solely reliant on visual information. This paper refers to this as audio hallucinations and analyzes them in large audio-video language models. We gather 1,000 sentences by inquiring about audio information and annotate them whether they contain hallucinations. If a sentence is hallucinated, we also categorize the type of hallucination. The results reveal that 332 sentences are hallucinated with distinct trends observed in nouns and verbs for each hallucination type. Based on this, we tackle a task of audio hallucination classification using pre-trained audio-text models in the zero-shot and fine-tuning settings. Our experimental results reveal that the zero-shot models achieve higher performance (52.2% in F1) than the random (40.3%) and the fine-tuning models achieve 87.9%, outperforming the zero-shot models.

* 6 pages

Via

Access Paper or Ask Questions

Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

Dec 01, 2023

Taichi Nishimura, Shota Nakada, Masayoshi Kondo

Figure 1 for Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

Figure 2 for Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

Figure 3 for Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

Figure 4 for Large-scale Vision-Language Models Learn Super Images for Efficient and High-Performance Partially Relevant Video Retrieval

Abstract:In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one relevant moment to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This guides researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities of learned visual representations. However, it is undesirable to simply replace them with high-performance large-scale vision-and-language models (VLMs) due to their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in a $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images effectively and demonstrate promising zero-shot performance against SOTA methods efficiently. In addition, we propose a fine-tuning approach by incorporating a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.

* 13 pages, 11 figures

Via

Access Paper or Ask Questions

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Nov 29, 2023

Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato

Figure 1 for Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Figure 2 for Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Figure 3 for Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Figure 4 for Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

Abstract:We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is demanded as a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view is constantly shifting as the camera wearer moves. This necessitates the in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between these datasets assuming their ground-truth is accessible. To bridge the view gaps, we propose a view-invariant learning method using adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn invariant features against the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gaps between both datasets. We validate our proposed method by studying how effectively it overcomes the view change problem and efficiently transfers the knowledge to the egocentric domain. Our benchmark pushes the study of the cross-view transfer into a new task domain of dense video captioning and will envision methodologies to describe egocentric videos in natural language.

Via

Access Paper or Ask Questions

Recipe Generation from Unsegmented Cooking Videos

Sep 21, 2022

Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, Shinsuke Mori

Figure 1 for Recipe Generation from Unsegmented Cooking Videos

Figure 2 for Recipe Generation from Unsegmented Cooking Videos

Figure 3 for Recipe Generation from Unsegmented Cooking Videos

Figure 4 for Recipe Generation from Unsegmented Cooking Videos

Abstract:This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should output an appropriate number of key events in the correct order. We analyze the output of the DVC model and observe that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we hypothesize that we can obtain correct recipes by selecting oracle events from the output events of the DVC model and re-generating sentences for them. To achieve this, we propose a novel transformer-based joint approach of training event selector and sentence generator for selecting oracle events from the outputs of the DVC model and generating grounded sentences for the events, respectively. In addition, we extend the model by including ingredients to generate more accurate recipes. The experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling the recipe in a story-aware manner, the proposed model output the appropriate number of events in the correct order.

* 11 pages, 9 figures. Under review

Via

Access Paper or Ask Questions