Sachit Menon

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Jun 20, 2024

Sachit Menon, Richard Zemel, Carl Vondrick

Abstract: When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical 'whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves 0% accuracy, while whiteboard-of-thought enables up to 92% accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.

* Project website: whiteboard.cs.columbia.edu/

Generating Illustrated Instructions

Dec 07, 2023

Sachit Menon, Ishan Misra, Rohit Girdhar

Abstract: We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.

* Project website: http://facebookresearch.github.io/IllustratedInstructions

ViperGPT: Visual Inference via Python Execution for Reasoning

Mar 14, 2023

Dídac Surís, Sachit Menon, Carl Vondrick

Abstract: Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
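
The generate-then-execute pattern described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not ViperGPT's actual API: `find` and `count` are hypothetical stand-ins for vision modules, `generate_program` is a stub in place of a code-generation model, and the "image" is just a list of labeled objects.

```python
# Hypothetical API exposed to generated programs (stand-ins for vision modules).
def find(image, name):
    """Return every object in the toy image whose label matches `name`."""
    return [obj for obj in image if obj["name"] == name]


def count(objects):
    """Return how many objects were found."""
    return len(objects)


API = {"find": find, "count": count}


def generate_program(query):
    """Stub for a code-generation model: returns Python source for the query.

    A real system would prompt a model with the API description and the query;
    here the program is hard-coded so the sketch is self-contained.
    """
    return (
        "def execute_command(image):\n"
        "    muffins = find(image, 'muffin')\n"
        "    kids = find(image, 'kid')\n"
        "    return count(muffins) // count(kids)\n"
    )


def answer(query, image):
    """Generate a program for the query, then execute it against the API."""
    source = generate_program(query)
    namespace = dict(API)      # generated code can only see the provided API
    exec(source, namespace)    # defines execute_command inside the namespace
    return namespace["execute_command"](image)


# Toy "image": four muffins and two kids.
image = [{"name": "muffin"}] * 4 + [{"name": "kid"}] * 2
result = answer("How many muffins can each kid have for it to be fair?", image)
```

Because the generated program is ordinary Python, the intermediate steps (which objects were found, how they were counted) can be inspected directly. Running model-written code with `exec` is unsafe outside a sandbox, so a real deployment would need isolation that this sketch omits.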

* Website: https://viper.cs.columbia.edu/

Affective Faces for Goal-Driven Dyadic Communication

Jan 26, 2023

Scott Geng, Revant Teotia, Purva Tendulkar, Sachit Menon, Carl Vondrick

Abstract: We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker, our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context. Our approach further allows the listener to be conditioned on their own goals, personalities, or backgrounds. Our approach models conversations through a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we propose a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show our approach is able to output listeners that are significantly more socially appropriate than baselines. However, many challenges remain, and we release our dataset publicly to spur further progress. See our website for video results, data, and code: https://realtalk.cs.columbia.edu.

Doubly Right Object Recognition: A Why Prompt for Visual Rationales

Dec 12, 2022

Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, Carl Vondrick

Abstract: Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a "why prompt," which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.

Task Bias in Vision-Language Models

Dec 08, 2022

Sachit Menon, Ishaan Preetam Chandratreya, Carl Vondrick

Abstract: Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to their task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.

* First two authors contributed equally

Visual Classification via Description from Large Language Models

Oct 13, 2022

Sachit Menon, Carl Vondrick

Abstract: Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages past interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.
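
As a rough illustration of classification by description, the sketch below scores each category by the similarity between an image embedding and the embeddings of that category's descriptors. It assumes the class score is the mean of the per-descriptor cosine similarities, which the abstract does not spell out. The three-dimensional embeddings are made-up toy values, not CLIP outputs, and `classify_by_description` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_by_description(image_emb, descriptor_embs):
    """Score each class by its descriptors rather than by its name alone.

    descriptor_embs maps class name -> {descriptor text: embedding}.
    Assumption: the class score is the mean descriptor similarity.
    """
    scores = {}
    evidence = {}
    for cls, descs in descriptor_embs.items():
        sims = {d: cosine(image_emb, e) for d, e in descs.items()}
        evidence[cls] = sims  # per-descriptor similarities explain the decision
        scores[cls] = sum(sims.values()) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores, evidence


# Toy embeddings (assumed values for illustration only).
image = [0.9, 0.8, 0.1]
descriptors = {
    "tiger": {"stripes": [1.0, 0.0, 0.0], "claws": [0.0, 1.0, 0.0]},
    "goldfish": {"fins": [0.0, 0.0, 1.0], "orange scales": [0.1, 0.0, 1.0]},
}
label, scores, evidence = classify_by_description(image, descriptors)
```

The `evidence` dictionary is what provides the interpretability the abstract mentions: it records how strongly each descriptor matched. Editing the descriptor lists changes the criteria the classifier uses without retraining anything.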

Forget-me-not! Contrastive Critics for Mitigating Posterior Collapse

Jul 19, 2022

Sachit Menon, David Blei, Carl Vondrick

Abstract: Variational autoencoders (VAEs) suffer from posterior collapse, where the powerful neural networks used for modeling and inference optimize the objective without meaningfully using the latent representation. We introduce inference critics that detect and incentivize against posterior collapse by requiring correspondence between latent variables and the observations. By connecting the critic's objective to the literature in self-supervised contrastive representation learning, we show both theoretically and empirically that optimizing inference critics increases the mutual information between observations and latents, mitigating posterior collapse. This approach is straightforward to implement and requires significantly less training time than prior methods, yet obtains competitive results on three established datasets. Overall, the approach lays the foundation to bridge the previously disconnected frameworks of contrastive learning and probabilistic modeling with variational autoencoders, underscoring the benefits both communities may find at their intersection.
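
The link to contrastive learning can be made concrete with the standard InfoNCE loss, shown here as a generic sketch and not as the paper's exact critic. It assumes a simple dot-product critic (`dot_critic`, a hypothetical name) that scores every observation against every latent in a batch, with matching pairs on the diagonal. For a batch of size N, log N minus this loss is a lower bound on the mutual information between observations and latents, so a collapsed posterior, whose latents carry no information about the observations, cannot push the loss below log N.

```python
import math


def dot_critic(xs, zs):
    """Score every (observation, latent) pair with a dot product.

    Assumption: a real critic would be a learned network; a dot product
    keeps the sketch self-contained.
    """
    return [[sum(a * b for a, b in zip(x, z)) for z in zs] for x in xs]


def info_nce(scores):
    """InfoNCE loss; scores[i][j] pairs observation i with latent j.

    Matching pairs sit on the diagonal. Each row is a softmax
    classification problem: pick the latent that belongs to observation i.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        row = scores[i]
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum - row[i]
    return total / n


observations = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

# Latents that track their observations: the critic can tell pairs apart.
informative = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
informative_loss = info_nce(dot_critic(observations, informative))

# Collapsed latents: identical for every observation, so no pair stands out.
collapsed = [[0.0, 0.0, 0.0]] * 3
collapsed_loss = info_nce(dot_critic(observations, collapsed))
```

In this toy batch the collapsed latents give a loss of exactly log 3, while the informative latents give a loss close to zero. That gap is the signal a critic of this kind could use to penalize collapse.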

* Conference on Uncertainty in Artificial Intelligence (UAI) 2022

Shadows Shed Light on 3D Objects

Jun 17, 2022

Ruoshi Liu, Sachit Menon, Chengzhi Mao, Dennis Park, Simon Stent, Carl Vondrick

Abstract: 3D reconstruction is a fundamental problem in computer vision, and the task is especially challenging when the object to reconstruct is partially or fully occluded. We introduce a method that uses the shadows cast by an unobserved object in order to infer the possible 3D volumes behind the occlusion. We create a differentiable image formation model that allows us to jointly infer the 3D shape of an object, its pose, and the position of a light source. Since the approach is end-to-end differentiable, we are able to integrate learned priors of object geometry in order to generate realistic 3D shapes of different object categories. Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observation of the shadow. Our approach works even when the position of the light source and object pose are both unknown. Our approach is also robust to real-world images where ground-truth shadow mask is unknown.

* 19 pages, 10 figures

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

Mar 08, 2020

Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, Cynthia Rudin

Abstract: The primary aim of single-image super-resolution is to construct a high-resolution (HR) image from a corresponding low-resolution (LR) input. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present a novel super-resolution algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require training on databases of LR-HR image pairs for supervised learning). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the "downscaling loss," which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee that our outputs are realistic. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show extensive experimental results demonstrating the efficacy of our approach in the domain of face super-resolution (also known as face hallucination). Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.
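
A toy version of the latent search guided by the downscaling loss might look like the sketch below. It rests on several assumptions that are not taken from the abstract: `generator` is a hypothetical two-dimensional stand-in for a pretrained generative model, images are short lists of pixel values, downscaling averages neighboring pixels, gradients are estimated by finite differences, and the search is restricted to the sphere of radius sqrt(d) (where most of the mass of a d-dimensional standard Gaussian concentrates) as one way to realize the restricted search space.

```python
import math


def generator(z):
    """Hypothetical toy generator: 2-D latent -> 4-pixel 'HR image'.

    The +/-0.1 terms are fine detail that averaging removes.
    """
    return [z[0] + 0.1, z[0] - 0.1, z[1] + 0.1, z[1] - 0.1]


def downscale(hr):
    """Downscale by a factor of 2 by averaging adjacent pixel pairs."""
    return [(hr[i] + hr[i + 1]) / 2 for i in range(0, len(hr), 2)]


def downscaling_loss(z, lr_image):
    """Squared distance between the downscaled generated image and the LR input."""
    ds = downscale(generator(z))
    return sum((a - b) ** 2 for a, b in zip(ds, lr_image))


def project_to_sphere(z):
    """Rescale z to lie on the sphere of radius sqrt(d)."""
    radius = math.sqrt(len(z))
    norm = math.sqrt(sum(v * v for v in z))
    return [v * radius / norm for v in z]


def search_latent(lr_image, z0, steps=200, step_size=0.1, eps=1e-5):
    """Projected gradient descent on the downscaling loss."""
    z = project_to_sphere(z0)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            zp, zm = z[:], z[:]
            zp[i] += eps
            zm[i] -= eps
            # Central finite difference along latent coordinate i.
            diff = downscaling_loss(zp, lr_image) - downscaling_loss(zm, lr_image)
            grad.append(diff / (2 * eps))
        z = project_to_sphere([v - step_size * g for v, g in zip(z, grad)])
    return z


lr_image = [1.0, 1.0]
z_star = search_latent(lr_image, z0=[1.0, 0.0])
sr_image = generator(z_star)  # 4-pixel output that downscales to lr_image
```

No LR-HR training pairs appear anywhere in this loop. The only supervision is that the generated image must downscale to the given input, which is the self-supervised aspect the abstract describes.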

* Sachit Menon and Alexandru Damian contributed equally. CVPR 2020 camera-ready
