Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ali Koksal

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Dec 16, 2024

Muhammet Furkan Ilaslan, Ali Koksal, Kevin Qinhong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu

Figure 1 for VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Figure 2 for VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Figure 3 for VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Figure 4 for VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Abstract:Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and video procedural plans given a specified high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of the video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP improves the interaction between modalities by proposing a novel Fusion of Captioning (FoC) method and using Text-to-Video Bridge (T2V-B) and Video-to-Text Bridge (V2T-B). They allow LLMs to guide the generation of visually-grounded text plans and textual-grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.

* Accepted for The 39th Annual AAAI Conference on Artificial Intelligence 2025 in Main Track, 19 pages, 24 figures

Via

Access Paper or Ask Questions

Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos

Sep 14, 2023

Fen Fang, Yun Liu, Ali Koksal, Qianli Xu, Joo-Hwee Lim

Abstract:A key challenge with procedure planning in instructional videos lies in how to handle a large decision space consisting of a multitude of action types that belong to various tasks. To understand real-world video content, an AI agent must proficiently discern these action types (e.g., pour milk, pour water, open lid, close lid, etc.) based on brief visual observation. Moreover, it must adeptly capture the intricate semantic relation of the action types and task goals, along with the variable action sequences. Recently, notable progress has been made via the integration of diffusion models and visual representation learning to address the challenge. However, existing models employ rudimentary mechanisms to utilize task information to manage the decision space. To overcome this limitation, we introduce a simple yet effective enhancement - a masked diffusion model. The introduced mask acts akin to a task-oriented attention filter, enabling the diffusion/denoising process to concentrate on a subset of action types. Furthermore, to bolster the accuracy of task classification, we harness more potent visual representation learning techniques. In particular, we learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions. We evaluate the method on three public datasets and achieve state-of-the-art performance on multiple metrics. Code is available at https://github.com/ffzzy840304/Masked-PDPP.

* 7 pages (main text excluding references), 3 figures, 7 tables

Via

Access Paper or Ask Questions