Matyas Bohacek

Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders

Jun 24, 2025

Matyas Bohacek, Thomas Fel, Maneesh Agrawala, Ekdeep Singh Lubana

Abstract:Despite their impressive performance, generative image models trained on large-scale datasets frequently fail to produce images with seemingly simple concepts -- e.g., human hands or objects appearing in groups of four -- that are reasonably expected to appear in the training data. These failure modes have largely been documented anecdotally, leaving open the question of whether they reflect idiosyncratic anomalies or more structural limitations of these models. To address this, we introduce a systematic approach for identifying and characterizing "conceptual blindspots" -- concepts present in the training data but absent or misrepresented in a model's generations. Our method leverages sparse autoencoders (SAEs) to extract interpretable concept embeddings, enabling a quantitative comparison of concept prevalence between real and generated images. We train an archetypal SAE (RA-SAE) on DINOv2 features with 32,000 concepts -- the largest such SAE to date -- enabling fine-grained analysis of conceptual disparities. Applied to four popular generative models (Stable Diffusion 1.5/2.1, PixArt, and Kandinsky), our approach reveals specific suppressed blindspots (e.g., bird feeders, DVD discs, and whitespaces on documents) and exaggerated blindspots (e.g., wood background texture and palm trees). At the individual datapoint level, we further isolate memorization artifacts -- instances where models reproduce highly specific visual templates seen during training. Overall, we propose a theoretically grounded framework for systematically identifying conceptual blindspots in generative models by assessing their conceptual fidelity with respect to the underlying data-generating process.
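The comparison of concept prevalence between real and generated images mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the activation matrices, the firing threshold, the smoothing constant `eps`, and the function names `concept_prevalence` and `prevalence_log_ratio` are all assumptions made for illustration. It assumes each image has already been mapped to a vector of non-negative concept activations (for example, by an SAE) and simply compares how often each concept fires in the two sets.

```python
import math


def concept_prevalence(activations, threshold=0.0):
    """Fraction of images in which each concept fires (activation > threshold)."""
    n_images = len(activations)
    n_concepts = len(activations[0])
    counts = [0] * n_concepts
    for row in activations:
        for k, value in enumerate(row):
            if value > threshold:
                counts[k] += 1
    return [c / n_images for c in counts]


def prevalence_log_ratio(real, generated, eps=1e-6):
    """Per-concept log(p_generated / p_real).

    Negative values mean the concept fires less often in generated images
    (a candidate suppressed concept); positive values mean it fires more
    often (a candidate exaggerated concept). eps avoids division by zero.
    """
    p_real = concept_prevalence(real)
    p_gen = concept_prevalence(generated)
    return [math.log((g + eps) / (r + eps)) for r, g in zip(p_real, p_gen)]


# Toy data (hypothetical): 4 images x 3 concepts per set.
real = [[0.9, 0.0, 0.2], [0.7, 0.0, 0.0], [0.8, 0.3, 0.0], [0.0, 0.0, 0.4]]
generated = [[0.0, 0.5, 0.1], [0.0, 0.6, 0.0], [0.2, 0.4, 0.0], [0.0, 0.7, 0.3]]

ratios = prevalence_log_ratio(real, generated)
for k, ratio in enumerate(ratios):
    print(f"concept {k}: log-ratio {ratio:+.3f}")
```

In this toy data, concept 0 fires less often in the generated set, concept 1 fires more often, and concept 2 fires equally often in both, so the three log-ratios come out negative, positive, and zero respectively.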

Dataset of News Articles with Provenance Metadata for Media Relevance Assessment

Jun 11, 2025

Tomas Peterka, Matyas Bohacek

Abstract:Out-of-context and misattributed imagery is the leading form of media manipulation in today's misinformation and disinformation landscape. Existing methods for detecting this practice often consider only whether the semantics of the imagery correspond to the text narrative, missing manipulation so long as the depicted objects or scenes somewhat correspond to the narrative at hand. To tackle this, we introduce the News Media Provenance Dataset, a dataset of news articles with provenance-tagged images. We formulate two tasks on this dataset, location of origin relevance (LOR) and date and time of origin relevance (DTOR), and present baseline results on six large language models (LLMs). We find that, while the zero-shot performance on LOR is promising, the performance on DTOR lags behind, leaving room for specialized architectures and future work.

* Workshop on NLP for Positive Impact @ ACL 2025

Synthetic Human Action Video Data Generation with Pose Transfer

Jun 11, 2025

Vaclav Knapp, Matyas Bohacek

Abstract:In video understanding tasks, particularly those involving human motion, synthetic data generation often suffers from uncanny features, diminishing its effectiveness for training. Tasks such as sign language translation, gesture recognition, and human motion understanding in autonomous driving have thus been unable to exploit the full potential of synthetic data. This paper proposes a method for generating synthetic human action video data using pose transfer (specifically, controllable 3D Gaussian avatar models). We evaluate this method on the Toyota Smarthome and NTU RGB+D datasets and show that it improves performance in action recognition tasks. Moreover, we demonstrate that the method can effectively scale few-shot datasets, making up for groups underrepresented in the real training data and adding diverse backgrounds. We open-source the method along with RANDOM People, a dataset with videos and avatars of novel human identities for pose transfer crowd-sourced from the internet.

* Synthetic Data for Computer Vision Workshop @ CVPR 2025

Towards an AI-Driven Video-Based American Sign Language Dictionary: Exploring Design and Usage Experience with Learners

Apr 08, 2025

Saad Hassan, Matyas Bohacek, Chaelin Kim, Denise Crochet

Abstract:Searching for unfamiliar American Sign Language (ASL) signs is challenging for learners because, unlike spoken languages, they cannot type a text-based query to look up an unfamiliar sign. Advances in isolated sign recognition have enabled the creation of video-based dictionaries, allowing users to submit a video and receive a list of the closest matching signs. Previous HCI research using Wizard-of-Oz prototypes has explored interface designs for ASL dictionaries. Building on these studies, we incorporate their design recommendations and leverage state-of-the-art sign-recognition technology to develop an automated video-based dictionary. We also present findings from an observational study with twelve novice ASL learners who used this dictionary during video-comprehension and question-answering tasks. Our results address human-AI interaction challenges not covered in previous WoZ research, including recording and resubmitting signs, unpredictable outputs, system latency, and privacy concerns. These insights offer guidance for designing and deploying video-based ASL dictionary systems.

Can Pose Transfer Models Generate Realistic Human Motion?

Jan 26, 2025

Vaclav Knapp, Matyas Bohacek

Abstract:Recent pose-transfer methods aim to generate temporally consistent and fully controllable videos of human action where the motion from a reference video is reenacted by a new identity. We evaluate three state-of-the-art pose-transfer methods -- AnimateAnyone, MagicAnimate, and ExAvatar -- by generating videos with actions and identities outside the training distribution and conducting a participant study about the quality of these videos. In a controlled environment of 20 distinct human actions, we find that participants, presented with the pose-transferred videos, correctly identify the desired action only 42.92% of the time. Moreover, the participants find the actions in the generated videos consistent with the reference (source) videos only 36.46% of the time. These results vary by method: participants find the splatting-based ExAvatar more consistent and photorealistic than the diffusion-based AnimateAnyone and MagicAnimate.

* Data and code available at https://github.com/matyasbohacek/pose-transfer-human-motion

Has an AI model been trained on your images?

Jan 11, 2025

Matyas Bohacek, Hany Farid

Abstract:From a simple text prompt, generative-AI image models can create stunningly realistic and creative images bounded, it seems, by only our imagination. These models have achieved this remarkable feat thanks, in part, to the ingestion of billions of images collected from nearly every corner of the internet. Many creators have understandably expressed concern over how their intellectual property has been ingested without their permission or a mechanism to opt out of training. As a result, questions of fair use and copyright infringement have quickly emerged. We describe a method that allows us to determine if a model was trained on a specific image or set of images. This method is computationally efficient and assumes no explicit knowledge of the model architecture or weights (so-called black-box membership inference). We anticipate that this method will be crucial for auditing existing models and, looking ahead, ensuring the fairer development and deployment of generative AI models.

Human Action CLIPS: Detecting AI-generated Human Motion

Nov 30, 2024

Matyas Bohacek, Hany Farid

Abstract:Full-blown AI video generation continues its journey through the uncanny valley to produce content that is perceptually indistinguishable from reality. Intermixed with many exciting and creative applications are malicious applications that harm individuals, organizations, and democracies. We describe an effective and robust technique for distinguishing real from AI-generated human motion. This technique leverages a multi-modal semantic embedding, making it robust to the types of laundering that typically confound more low- to mid-level approaches. This method is evaluated against a custom-built dataset of video clips with human actions generated by seven text-to-video AI models and matching real footage.

DeepSpeak Dataset v1.0

Aug 09, 2024

Sarah Barrington, Matyas Bohacek, Hany Farid

Abstract:We describe a large-scale dataset -- DeepSpeak -- of real and deepfake footage of people talking and gesturing in front of their webcams. The real videos in this first version of the dataset consist of 9 hours of footage from 220 diverse individuals. Constituting more than 25 hours of footage, the fake videos consist of a range of different state-of-the-art face-swap and lip-sync deepfakes with natural and AI-generated voices. We expect to release future versions of this dataset with different and updated deepfake technologies. This dataset is made freely available for research and non-commercial uses; requests for commercial use will be considered.

Nepotistically Trained Generative-AI Models Collapse

Nov 20, 2023

Matyas Bohacek, Hany Farid

Abstract:Trained on massive amounts of human-generated content, AI (artificial intelligence) image synthesis is capable of reproducing semantically coherent images that match the visual appearance of its training data. We show that when retrained on even small amounts of their own creation, these generative-AI models produce highly distorted images. We also show that this distortion extends beyond the text prompts used in retraining, and that once poisoned, the models struggle to fully heal even after retraining on only real images.
