Hengkai Pan

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Nov 07, 2024

Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto

Abstract: The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, have proven challenging to learn and are typically developed for task-specific solutions with online policy learning. We argue that the true potential of world models lies in their ability to reason and plan across diverse problems using only passive data. Concretely, we require world models to have the following three properties: 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To realize this, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This design allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic behavior planning by treating desired goal patch features as prediction targets. We evaluate DINO-WM across various domains, including maze navigation, tabletop pushing, and particle manipulation. Our experiments demonstrate that DINO-WM can generate zero-shot behavioral solutions at test time without relying on expert demonstrations, reward modeling, or pre-learned inverse models. Notably, DINO-WM exhibits strong generalization capabilities compared to prior state-of-the-art work, adapting to diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
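To make "achieve observational goals through action sequence optimization" concrete, here is a minimal sketch of planning in patch-feature space. It assumes a toy linear dynamics function (the hypothetical `toy_latent_dynamics`) standing in for the learned predictor, uses mean squared error between predicted and goal patch features as the cost, and uses the cross-entropy method (CEM) as the optimizer. The abstract does not name the optimizer, so CEM is our assumption here, not a statement of the paper's exact procedure.

```python
import numpy as np

def toy_latent_dynamics(z, a):
    """Hypothetical stand-in for a learned predictor over patch features.

    z: (num_patches, feat_dim) current patch features; a: (feat_dim,) action.
    Every patch shifts by 0.1 * a, which is enough to exercise the planner.
    """
    return z + 0.1 * a[None, :]

def plan_cost(z0, z_goal, actions, dynamics):
    """Roll the model forward over an action sequence and score the final
    predicted patch features against the goal patch features (mean squared error)."""
    z = z0
    for a in actions:
        z = dynamics(z, a)
    return float(np.mean((z - z_goal) ** 2))

def cem_plan(z0, z_goal, dynamics, horizon=5, action_dim=4,
             pop=64, n_elite=8, iters=10, seed=0):
    """Cross-entropy method over open-loop action sequences (an assumed optimizer)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        samples = mean + std * rng.standard_normal((pop, horizon, action_dim))
        costs = np.array([plan_cost(z0, z_goal, s, dynamics) for s in samples])
        # Refit the sampling distribution to the lowest-cost candidates.
        elites = samples[np.argsort(costs)[:n_elite]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean

if __name__ == "__main__":
    z0 = np.zeros((16, 4))            # 16 hypothetical patches, 4-dim features
    z_goal = np.full((16, 4), 0.5)    # hypothetical goal patch features
    plan = cem_plan(z0, z_goal, toy_latent_dynamics)
    print(plan_cost(z0, z_goal, plan, toy_latent_dynamics))
```

Because the cost is computed only on predicted features, no reward model or expert demonstration enters the loop; swapping in a different goal image only changes `z_goal`.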

DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

Sep 18, 2024

Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto

Abstract: Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground-truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io
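The joint inverse/forward dynamics objective can be sketched as a single loss over a sequence of frame embeddings: an inverse model infers a latent action from each consecutive pair, and a forward model uses the current embedding plus that latent action to predict the next embedding. The tiny linear/tanh models and the weight names (`W_inv`, `W_fwd_e`, `W_fwd_z`) are hypothetical placeholders; the abstract does not describe the architectures or any anti-collapse measures, so this shows only the shape of the objective, not DynaMo's implementation.

```python
import numpy as np

def dynamo_style_loss(embeddings, W_inv, W_fwd_e, W_fwd_z):
    """Latent next-frame prediction loss (illustrative sketch only).

    embeddings: (T, d) frame embeddings from a visual encoder.
    W_inv:   (2d, k) hypothetical inverse-model weights.
    W_fwd_e: (d, d)  hypothetical forward-model weights on the embedding.
    W_fwd_z: (k, d)  hypothetical forward-model weights on the latent action.
    """
    e_t, e_next = embeddings[:-1], embeddings[1:]
    # Inverse dynamics: infer a latent action from consecutive embeddings,
    # so no ground-truth actions are needed.
    z = np.tanh(np.concatenate([e_t, e_next], axis=1) @ W_inv)   # (T-1, k)
    # Forward dynamics: predict the next embedding from the current one
    # and the inferred latent action.
    pred = e_t @ W_fwd_e + z @ W_fwd_z                             # (T-1, d)
    # In a gradient-based version, e_next would typically be treated as a
    # fixed target; NumPy computes no gradients, so this is just the value.
    return float(np.mean((pred - e_next) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, k = 10, 6, 3
    emb = rng.standard_normal((T, d))
    print(dynamo_style_loss(emb, rng.standard_normal((2 * d, k)),
                            rng.standard_normal((d, d)),
                            rng.standard_normal((k, d))))
```

Training the encoder and both dynamics models to minimize this kind of loss uses only in-domain demonstration frames, which is the property the abstract emphasizes.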

BURExtract-Llama: An LLM for Clinical Concept Extraction in Breast Ultrasound Reports

Aug 21, 2024

Yuxuan Chen, Haoyan Yang, Hengkai Pan, Fardeen Siddiqui, Antonio Verdone, Qingyang Zhang, Sumit Chopra, Chen Zhao, Yiqiu Shen

Abstract: Breast ultrasound is essential for detecting and diagnosing abnormalities, with radiology reports summarizing key findings like lesion characteristics and malignancy assessments. Extracting this critical information is challenging due to the unstructured nature of these reports, with varied linguistic styles and inconsistent formatting. While proprietary LLMs like GPT-4 are effective, they are costly and raise privacy concerns when handling protected health information. This study presents a pipeline for developing an in-house LLM to extract clinical information from radiology reports. We first use GPT-4 to create a small labeled dataset, then fine-tune a Llama3-8B model on it. Evaluated on clinician-annotated reports, our model achieves an average F1 score of 84.6%, which is on par with GPT-4. Our findings demonstrate the feasibility of developing an in-house LLM that not only matches GPT-4's performance but also offers cost reductions and enhanced data privacy.
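The abstract reports an average F1 of 84.6% but does not say how the average is taken. The sketch below assumes a macro average, meaning per-concept F1 computed from true positives, false positives, and false negatives and then averaged without weighting. The concept names and counts are hypothetical illustrations, not data from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts):
    """Unweighted mean of per-concept F1 (assumed averaging scheme).

    counts: dict mapping concept name -> (tp, fp, fn).
    """
    scores = [f1_score(*c) for c in counts.values()]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Hypothetical per-concept extraction counts, for illustration only.
    counts = {"lesion_shape": (8, 2, 2), "birads_category": (5, 0, 5)}
    print(macro_f1(counts))
```

A micro average (pooling all counts before computing F1) would weight frequent concepts more heavily and can give a different number from the same predictions.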

* This paper has been accepted as an oral paper at the HCHM workshop, ACM Multimedia 2024