Mia Chiquier

Teaching Humans Subtle Differences with DIFFusion

Apr 10, 2025

Mia Chiquier, Orr Avrech, Yossi Gandelsman, Berthy Feng, Katherine Bouman, Carl Vondrick

Abstract: Human expertise depends on the ability to recognize subtle visual differences, such as distinguishing diseases, species, or celestial phenomena. We propose a new method to teach novices how to differentiate between nuanced categories in specialized domains. Our method uses generative models to visualize the minimal change in features to transition between classes, i.e., counterfactuals, and performs well even in domains where data is sparse, examples are unpaired, and category boundaries are not easily explained by text. By manipulating the conditioning space of diffusion models, our proposed method DIFFusion disentangles category structure from instance identity, enabling high-fidelity synthesis even in challenging domains. Experiments across six domains show accurate transitions even with limited and unpaired examples across categories. User studies confirm that our generated counterfactuals outperform unpaired examples in teaching perceptual expertise, showing the potential of generative models for specialized visual learning.

Evolving Interpretable Visual Classifiers with Large Language Models

Apr 15, 2024

Mia Chiquier, Utkarsh Mall, Carl Vondrick

Abstract: Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, a risk of bias, and an inability to discover new visual concepts not written down. Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that uses a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.
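
The evolutionary search described in this abstract can be illustrated with a generic mutate-score-select loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `evolve`, `toy_mutate`, `toy_fitness`, `VOCABULARY`, and `USEFUL` are hypothetical, the mutation step is a random word swap standing in for a large language model proposal, and the fitness function is a toy count standing in for classification accuracy of a concept bottleneck.

```python
import random

# Hypothetical attribute vocabulary and a toy notion of "useful" attributes.
VOCABULARY = ["striped wings", "red beak", "long tail", "round body",
              "spiky edges", "smooth curves", "yellow crest", "webbed feet"]
USEFUL = {"striped wings", "red beak", "yellow crest"}


def toy_fitness(attributes):
    """Toy score: number of distinct useful attributes (stand-in for accuracy)."""
    return len(set(attributes) & USEFUL)


def toy_mutate(attributes, rng):
    """Replace one attribute at random (stand-in for an LLM-proposed mutation)."""
    child = list(attributes)
    child[rng.randrange(len(child))] = rng.choice(VOCABULARY)
    return child


def evolve(initial, mutate, fitness, generations=20, population_size=4, seed=0):
    """Generic evolutionary loop: mutate each parent, then keep the top scorers."""
    rng = random.Random(seed)
    population = [list(initial)]
    for _ in range(generations):
        children = [mutate(parent, rng) for parent in population for _ in range(2)]
        # Parents stay in the pool, so the best score never decreases.
        population = sorted(population + children, key=fitness, reverse=True)
        population = population[:population_size]
    return population[0]
```

Calling `evolve(["round body", "long tail", "webbed feet"], toy_mutate, toy_fitness)` returns an attribute list whose toy score is at least that of the starting list, because parents are retained at every generation.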

Muscles in Action

Dec 05, 2022

Mia Chiquier, Carl Vondrick

Abstract: Small differences in a person's motion can engage drastically different muscles. While most visual representations of human activity are trained from video, people learn from multimodal experiences, including from the proprioception of their own muscles. We present a new visual perception task and dataset to model muscle activation in human activities from monocular video. Our Muscles in Action (MIA) dataset consists of 2 hours of synchronized video and surface electromyography (sEMG) data of subjects performing various exercises. Using this dataset, we learn visual representations that are predictive of muscle activation from monocular video. We present several models, including a transformer model, and measure their ability to generalize to new exercises and subjects. Putting muscles into computer vision systems will enable richer models of virtual humans, with applications in sports, fitness, and AR/VR.

Private Multiparty Perception for Navigation

Dec 02, 2022

Hui Lu, Mia Chiquier, Carl Vondrick

Abstract: We introduce a framework for navigating through cluttered environments by connecting multiple cameras together while simultaneously preserving privacy. Occlusions and obstacles in large environments are often challenging situations for navigation agents because the environment is not fully observable from a single camera view. Given multiple camera views of an environment, our approach learns to produce a multiview scene representation that can only be used for navigation, provably preventing one party from inferring anything beyond the output task. On a new navigation dataset that we will publicly release, experiments show that private multiparty representations allow navigation through complex scenes and around obstacles while jointly preserving privacy. Our approach scales to an arbitrary number of camera viewpoints. We believe developing visual representations that preserve privacy is increasingly important for many applications such as navigation.

Real-Time Neural Voice Camouflage

Dec 14, 2021

Mia Chiquier, Chengzhi Mao, Carl Vondrick

Abstract: Automatic speech recognition systems have created exciting possibilities for applications; however, they also enable opportunities for systematic eavesdropping. We propose a method to camouflage a person's voice over-the-air from these systems without inconveniencing the conversation between people in the room. Standard adversarial attacks are not effective in real-time streaming situations because the characteristics of the signal will have changed by the time the attack is executed. We introduce predictive attacks, which achieve real-time performance by forecasting the attack that will be the most effective in the future. Under real-time constraints, our method jams the established speech recognition system DeepSpeech 4.17x more than baselines as measured through word error rate, and 7.27x more as measured through character error rate. We furthermore demonstrate our approach is practically effective in realistic environments over physical distances.

* 14 pages
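
The abstract above reports results in terms of word error rate and character error rate. As a hedged illustration, the sketch below computes both using the standard definition (Levenshtein edit distance divided by reference length). The function names are hypothetical, and the paper's own evaluation pipeline (text normalization, tokenization, averaging across utterances) may differ from this minimal version.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With these definitions, a transcript that gets one of three reference words wrong has a word error rate of 1/3, and either rate can exceed 1.0 when the hypothesis contains many insertions.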

The Boombox: Visual Reconstruction from Acoustic Vibrations

May 17, 2021

Boyuan Chen, Mia Chiquier, Hod Lipson, Carl Vondrick

Abstract: We introduce The Boombox, a container that uses acoustic vibrations to reconstruct an image of its inside contents. When an object interacts with the container, it produces small acoustic vibrations. The exact vibration characteristics depend on the physical properties of the box and the object. We demonstrate how to use this incidental signal in order to predict visual structure. After learning, our approach remains effective even when a camera cannot view inside the box. Although we use low-cost and low-power contact microphones to detect the vibrations, our results show that learning from multi-modal data enables us to transform cheap acoustic sensors into rich visual sensors. Due to the ubiquity of containers, we believe integrating perception capabilities into them will enable new applications in human-computer interaction and robotics. Our project website is at boombox.cs.columbia.edu.

* Website: boombox.cs.columbia.edu

Adversarial Attacks are Reversible with Natural Supervision

Mar 29, 2021

Chengzhi Mao, Mia Chiquier, Hao Wang, Junfeng Yang, Carl Vondrick

Abstract: We find that images contain intrinsic structure that enables the reversal of many adversarial attacks. Attack vectors cause not only image classifiers to fail, but also collaterally disrupt incidental structure in the image. We demonstrate that modifying the attacked image to restore the natural structure will reverse many types of attacks, providing a defense. Experiments demonstrate significantly improved robustness for several state-of-the-art models across the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Our results show that our defense is still effective even if the attacker is aware of the defense mechanism. Since our defense is deployed during inference instead of training, it is compatible with pre-trained networks as well as most other defenses. Our results suggest deep networks are vulnerable to adversarial examples partly because their representations do not enforce the natural structure of images.
