Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Carl Vondrick

Generative Data Mining with Longtail-Guided Diffusion

Feb 04, 2025

David S. Hayden, Mao Ye, Timur Garipov, Gregory P. Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric Wolff, Siddhartha S. Srinivasa

Abstract:It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to proactively discover, explain, and address conceptual gaps in a predictive model.

* 20 pages

Via

Access Paper or Ask Questions

MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting

Dec 04, 2024

Arnold Caleb Asiimwe, Dídac Surís, Pranav Rajpurkar, Carl Vondrick

Figure 1 for MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting

Figure 2 for MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting

Figure 3 for MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting

Figure 4 for MedAutoCorrect: Image-Conditioned Autocorrection in Medical Reporting

Abstract:In medical reporting, the accuracy of radiological reports, whether generated by humans or machine learning algorithms, is critical. We tackle a new task in this paper: image-conditioned autocorrection of inaccuracies within these reports. Using the MIMIC-CXR dataset, we first intentionally introduce a diverse range of errors into reports. Subsequently, we propose a two-stage framework capable of pinpointing these errors and then making corrections, simulating an \textit{autocorrection} process. This method aims to address the shortcomings of existing automated medical reporting systems, like factual errors and incorrect conclusions, enhancing report reliability in vital healthcare applications. Importantly, our approach could serve as a guardrail, ensuring the accuracy and trustworthiness of automated report generation. Experiments on established datasets and state of the art report generation models validate this method's potential in correcting medical reporting errors.

Via

Access Paper or Ask Questions

Self-Improving Autonomous Underwater Manipulation

Oct 24, 2024

Ruoshi Liu, Huy Ha, Mengxue Hou, Shuran Song, Carl Vondrick

Figure 1 for Self-Improving Autonomous Underwater Manipulation

Figure 2 for Self-Improving Autonomous Underwater Manipulation

Figure 3 for Self-Improving Autonomous Underwater Manipulation

Figure 4 for Self-Improving Autonomous Underwater Manipulation

Abstract:Underwater robotic manipulation faces significant challenges due to complex fluid dynamics and unstructured environments, causing most manipulation systems to rely heavily on human teleoperation. In this paper, we introduce AquaBot, a fully autonomous manipulation system that combines behavior cloning from human demonstrations with self-learning optimization to improve beyond human teleoperation performance. With extensive real-world experiments, we demonstrate AquaBot's versatility across diverse manipulation tasks, including object grasping, trash sorting, and rescue retrieval. Our real-world experiments show that AquaBot's self-optimized policy outperforms a human operator by 41% in speed. AquaBot represents a promising step towards autonomous and self-improving underwater manipulation systems. We open-source both hardware and software implementation details.

* Project Page: https://aquabot.cs.columbia.edu/

Via

Access Paper or Ask Questions

Differentiable Robot Rendering

Oct 17, 2024

Ruoshi Liu, Alper Canberk, Shuran Song, Carl Vondrick

Figure 1 for Differentiable Robot Rendering

Figure 2 for Differentiable Robot Rendering

Figure 3 for Differentiable Robot Rendering

Figure 4 for Differentiable Robot Rendering

Abstract:Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method allowing the visual appearance of a robot body to be directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model and Gaussians Splatting and is compatible with any robot form factors and degrees of freedom. We demonstrate its capability and usage in applications including reconstruction of robot poses from images and controlling robots through vision language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, setting the foundation for the future applications of vision foundation models in robotics.

* Project Page: https://drrobot.cs.columbia.edu/

Via

Access Paper or Ask Questions

EraseDraw: Learning to Insert Objects by Erasing Them from Images

Aug 31, 2024

Alper Canberk, Maksym Bondarenko, Ege Ozguroglu, Ruoshi Liu, Carl Vondrick

Figure 1 for EraseDraw: Learning to Insert Objects by Erasing Them from Images

Figure 2 for EraseDraw: Learning to Insert Objects by Erasing Them from Images

Figure 3 for EraseDraw: Learning to Insert Objects by Erasing Them from Images

Figure 4 for EraseDraw: Learning to Insert Objects by Erasing Them from Images

Abstract:Creative processes such as painting often involve creating different components of an image one by one. Can we build a computational model to perform this task? Prior works often fail by making global changes to the image, inserting objects in unrealistic spatial locations, and generating inaccurate lighting details. We observe that while state-of-the-art models perform poorly on object insertion, they can remove objects and erase the background in natural images very well. Inverting the direction of object removal, we obtain high-quality data for learning to insert objects that are spatially, physically, and optically consistent with the surroundings. With this scalable automatic data generation pipeline, we can create a dataset for learning object insertion, which is used to train our proposed text conditioned diffusion model. Qualitative and quantitative experiments have shown that our model achieves state-of-the-art results in object insertion, particularly for in-the-wild images. We show compelling results on diverse insertion prompts and images across various domains.In addition, we automate iterative insertion by combining our insertion model with beam search guided by CLIP.

Via

Access Paper or Ask Questions

Controlling the World by Sleight of Hand

Aug 13, 2024

Sruthi Sudhakar, Ruoshi Liu, Basile Van Hoorick, Carl Vondrick, Richard Zemel

Figure 1 for Controlling the World by Sleight of Hand

Figure 2 for Controlling the World by Sleight of Hand

Figure 3 for Controlling the World by Sleight of Hand

Figure 4 for Controlling the World by Sleight of Hand

Abstract:Humans naturally build mental models of object interactions and dynamics, allowing them to imagine how their surroundings will change if they take a certain action. While generative models today have shown impressive results on generating/editing images unconditionally or conditioned on text, current methods do not provide the ability to perform object manipulation conditioned on actions, an important tool for world modeling and action planning. Therefore, we propose to learn an action-conditional generative models by learning from unlabeled videos of human hands interacting with objects. The vast quantity of such data on the internet allows for efficient scaling which can enable high-performing action-conditional models. Given an image, and the shape/location of a desired hand interaction, CosHand, synthesizes an image of a future after the interaction has occurred. Experiments show that the resulting model can predict the effects of hand-object interactions well, with strong generalization particularly to translation, stretching, and squeezing interactions of unseen objects in unseen environments. Further, CosHand can be sampled many times to predict multiple possible effects, modeling the uncertainty of forces in the interaction/environment. Finally, method generalizes to different embodiments, including non-human hands, i.e. robot hands, suggesting that generative video models can be powerful models for robotics.

Via

Access Paper or Ask Questions

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Jun 24, 2024

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, Carl Vondrick

Figure 1 for Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Figure 2 for Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Figure 3 for Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Figure 4 for Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Abstract:A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

* Project page: https://dreamitate.cs.columbia.edu/

Via

Access Paper or Ask Questions

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Jun 20, 2024

Sachit Menon, Richard Zemel, Carl Vondrick

Figure 1 for Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Figure 2 for Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Figure 3 for Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Figure 4 for Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Abstract:When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.

* Project website: whiteboard.cs.columbia.edu/

Via

Access Paper or Ask Questions

See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Jun 17, 2024

Amith Ananthram, Elias Stengel-Eskin, Carl Vondrick, Mohit Bansal, Kathleen McKeown

Figure 1 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Figure 2 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Figure 3 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Figure 4 for See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Abstract:Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.

* 17 pages, 7 figures. Code/models: https://github.com/amith-ananthram/see-it-from-my-perspective

Via

Access Paper or Ask Questions

How Video Meetings Change Your Expression

Jun 03, 2024

Sumit Sarin, Utkarsh Mall, Purva Tendulkar, Carl Vondrick

Abstract:Do our facial expressions change when we speak over video calls? Given two unpaired sets of videos of people, we seek to automatically find spatio-temporal patterns that are distinctive of each set. Existing methods use discriminative approaches and perform post-hoc explainability analysis. Such methods are insufficient as they are unable to provide insights beyond obvious dataset biases, and the explanations are useful only if humans themselves are good at the task. Instead, we tackle the problem through the lens of generative domain translation: our method generates a detailed report of learned, input-dependent spatio-temporal features and the extent to which they vary between the domains. We demonstrate that our method can discover behavioral differences between conversing face-to-face (F2F) and on video-calls (VCs). We also show the applicability of our method on discovering differences in presidential communication styles. Additionally, we are able to predict temporal change-points in videos that decouple expressions in an unsupervised way, and increase the interpretability and usefulness of our model. Finally, our method, being generative, can be used to transform a video call to appear as if it were recorded in a F2F setting. Experiments and visualizations show our approach is able to discover a range of behaviors, taking a step towards deeper understanding of human behaviors.

* Project webpage is available at: https://facet.cs.columbia.edu

Via

Access Paper or Ask Questions