Haeji Jung

VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

Jul 17, 2024

Seokha Moon, Hyun Woo, Hongbeen Park, Haeji Jung, Reza Mahjourian, Hyung-gun Chi, Hyerin Lim, Sangpil Kim, Jinkyu Kim

Abstract: Predicting future trajectories for other road agents is an essential task for autonomous vehicles. Established trajectory prediction methods primarily use agent tracks generated by a detection and tracking system and HD map as inputs. In this work, we propose a novel method that also incorporates visual input from surround-view cameras, allowing the model to utilize visual cues such as human gazes and gestures, road conditions, vehicle turn signals, etc., which are typically hidden from the model in prior methods. Furthermore, we use textual descriptions generated by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM) as supervision during training to guide the model on what to learn from the input data. Despite using these extra inputs, our method achieves a latency of 53 ms, making it feasible for real-time processing, which is significantly faster than that of previous single-agent prediction methods with similar performance. Our experiments show that both the visual inputs and the textual descriptions contribute to improvements in trajectory prediction performance, and our qualitative analysis highlights how the model is able to exploit these additional inputs. Lastly, in this work we create and release the nuScenes-Text dataset, which augments the established nuScenes dataset with rich textual annotations for every scene, demonstrating the positive impact of utilizing VLM on trajectory prediction. Our project page is at https://moonseokha.github.io/VisionTrap/

* Accepted at ECCV 2024

Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages

Jun 23, 2024

Jimin Sohn, Haeji Jung, Alex Cheng, Jooeon Kang, Yilin Du, David R. Mortensen

Abstract: Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-resource languages. In this paper, we propose a novel approach to NER using phonemic representation based on the International Phonetic Alphabet (IPA) to bridge the gap between representations of different languages. Our experiments show that our method significantly outperforms baseline models in extremely low-resource languages, with the highest average F1 score (46.38%) and lowest standard deviation (12.67), particularly demonstrating its robustness with non-Latin scripts.

* 7 pages, 5 figures, 5 tables

Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding

Feb 22, 2024

Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen

Abstract: Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource and low-resource languages. We hypothesize that the performance gaps between languages are affected by linguistic gaps between those languages and provide a novel solution for robust multilingual language modeling by employing phonemic representations (specifically, using phonemes as input tokens to LMs rather than subwords). We present quantitative evidence from three cross-lingual tasks that demonstrate the effectiveness of phonemic representation, which is further justified by a theoretical analysis of the cross-lingual performance gap.
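
To illustrate the idea of phonemes as input tokens, here is a minimal sketch, not the authors' pipeline. It assumes words have already been transcribed to IPA by some external tool. The helper names `phoneme_tokens` and `shared_ratio` are hypothetical, and the transcriptions are simplified broad ones (stress and vowel reduction ignored) chosen only for the example. It shows how two words written in different scripts share no orthographic symbols yet map to the same phoneme tokens.

```python
import unicodedata


def phoneme_tokens(ipa: str) -> list[str]:
    """Split an IPA string into tokens, one per base symbol.

    Combining diacritics (e.g. the nasalization tilde) and modifier
    letters (e.g. aspiration, the length mark) are attached to the
    preceding base symbol. Whitespace is dropped.
    """
    tokens: list[str] = []
    for ch in unicodedata.normalize("NFD", ipa):
        if ch.isspace():
            continue
        is_modifier = unicodedata.combining(ch) or unicodedata.category(ch) == "Lm"
        if tokens and is_modifier:
            tokens[-1] += ch
        else:
            tokens.append(ch)
    return tokens


def shared_ratio(a, b) -> float:
    """Jaccard overlap between the symbol inventories of two sequences."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


# English "mama" (Latin script) and Russian "мама" (Cyrillic script).
# Assumed, simplified broad transcription for both: "mama".
orthographic = shared_ratio("mama", "мама")
phonemic = shared_ratio(phoneme_tokens("mama"), phoneme_tokens("mama"))
print(orthographic, phonemic)
print(phoneme_tokens("tʰiː"))
```

In this toy case the character inventories of the two spellings do not overlap at all, while the phoneme inventories are identical, which is the kind of cross-lingual gap the abstract refers to.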

Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

May 24, 2023

Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, Seunghyun Park

Abstract: Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models exhibit promise in generating abstract captions for images and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach, thus, seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream
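
The abstract does not spell out the form of the contrastive feature alignment, so the following is only a generic sketch of one common choice, an InfoNCE-style loss, and should not be read as Cream's actual objective. The function names, the toy feature vectors, and the temperature value are all assumptions made for illustration. Each visual feature is pulled toward the auxiliary feature at the same index and pushed away from the others in the batch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(visual, auxiliary, temperature=0.1):
    """Mean InfoNCE loss, treating (visual[i], auxiliary[i]) as positive pairs.

    For each i, the loss is the negative log-softmax of the positive
    pair's similarity among all similarities in row i.
    """
    n = len(visual)
    total = 0.0
    for i in range(n):
        logits = [cosine(visual[i], auxiliary[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Hypothetical 2-D features for a batch of two items.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]     # positives match their partners
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # positives are swapped

loss_aligned = info_nce(visual, aligned)
loss_misaligned = info_nce(visual, misaligned)
print(loss_aligned < loss_misaligned)
```

The loss is small when matching features already point in the same direction and large when they are swapped, which is the signal a training loop would use to align the two encoders.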
