Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pulkit Madan

Live Fitness Coaching as a Testbed for Situated Interaction

Jul 11, 2024

Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Bohm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee(+3 more)

Figure 1 for Live Fitness Coaching as a Testbed for Situated Interaction

Figure 2 for Live Fitness Coaching as a Testbed for Situated Interaction

Figure 3 for Live Fitness Coaching as a Testbed for Situated Interaction

Figure 4 for Live Fitness Coaching as a Testbed for Situated Interaction

Abstract:Tasks at the intersection of vision and language have had a profound impact in advancing the capabilities of vision-language models such as dialog-based assistants. However, models trained on existing tasks are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real-time are an open challenge. In this work, we present the QEVD benchmark and dataset which explores human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching - a task which intrinsically requires monitoring live user activity and providing timely feedback. It is the first benchmark that requires assistive vision-language models to recognize complex human actions, identify mistakes grounded in those actions, and provide appropriate feedback. Our experiments reveal the limitations of existing state of the art vision-language models for such asynchronous situated interactions. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedbacks at the appropriate time.

* The benchmark and dataset is available here: https://developer.qualcomm.com/software/ai-datasets/qevd

Via

Access Paper or Ask Questions

Painter: Teaching Auto-regressive Language Models to Draw Sketches

Aug 16, 2023

Reza Pourreza, Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Pulkit Madan, Roland Memisevic

Figure 1 for Painter: Teaching Auto-regressive Language Models to Draw Sketches

Figure 2 for Painter: Teaching Auto-regressive Language Models to Draw Sketches

Figure 3 for Painter: Teaching Auto-regressive Language Models to Draw Sketches

Figure 4 for Painter: Teaching Auto-regressive Language Models to Draw Sketches

Abstract:Large language models (LLMs) have made tremendous progress in natural language understanding and they have also been successfully adopted in other domains such as computer vision, robotics, reinforcement learning, etc. In this work, we apply LLMs to image generation tasks by directly generating the virtual brush strokes to paint an image. We present Painter, an LLM that can convert user prompts in text description format to sketches by generating the corresponding brush strokes in an auto-regressive way. We construct Painter based on off-the-shelf LLM that is pre-trained on a large text corpus, by fine-tuning it on the new task while preserving language understanding capabilities. We create a dataset of diverse multi-object sketches paired with textual prompts that covers several object types and tasks. Painter can generate sketches from text descriptions, remove objects from canvas, and detect and classify objects in sketches. Although this is an unprecedented pioneering work in using LLMs for auto-regressive image generation, the results are very encouraging.

Via

Access Paper or Ask Questions

Look, Remember and Reason: Visual Reasoning with Grounded Rationales

Jun 30, 2023

Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic

Figure 1 for Look, Remember and Reason: Visual Reasoning with Grounded Rationales

Figure 2 for Look, Remember and Reason: Visual Reasoning with Grounded Rationales

Figure 3 for Look, Remember and Reason: Visual Reasoning with Grounded Rationales

Figure 4 for Look, Remember and Reason: Visual Reasoning with Grounded Rationales

Abstract:Large language models have recently shown human level performance on a variety of reasoning tasks. However, the ability of these models to perform complex visual reasoning has not been studied in detail yet. A key challenge in many visual reasoning tasks is that the visual information needs to be tightly integrated in the reasoning process. We propose to address this challenge by drawing inspiration from human visual problem solving which depends on a variety of low-level visual capabilities. It can often be cast as the three step-process of ``Look, Remember, Reason'': visual information is incrementally extracted using low-level visual routines in a step-by-step fashion until a final answer is reached. We follow the same paradigm to enable existing large language models, with minimal changes to the architecture, to solve visual reasoning problems. To this end, we introduce rationales over the visual input that allow us to integrate low-level visual capabilities, such as object recognition and tracking, as surrogate tasks. We show competitive performance on diverse visual reasoning tasks from the CLEVR, CATER, and ACRE datasets over state-of-the-art models designed specifically for these tasks.

Via

Access Paper or Ask Questions