Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

David Jacobs

Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs

Apr 04, 2025

Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs

Abstract:Multi-head attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives most attention despite limited semantic importance, challenge our understanding of multi-head attention. To analyze this phenomenon, we propose a new definition for attention heads dominated by attention sinks, known as dormant attention heads. We compare our definition to prior work in a model intervention study where we test whether dormant heads matter for inference by zeroing out the output of dormant attention heads. Using six pretrained models and five benchmark datasets, we find our definition to be more model and dataset-agnostic. Using our definition on most models, more than 4% of a model's attention heads can be zeroed while maintaining average accuracy, and zeroing more than 14% of a model's attention heads can keep accuracy to within 1% of the pretrained model's average accuracy. Further analysis reveals that dormant heads emerge early in pretraining and can transition between dormant and active states during pretraining. Additionally, we provide evidence that they depend on characteristics of the input text.

* 22 pages, 14 figures

Via

Access Paper or Ask Questions

CinePile: A Long Video Question Answering Dataset and Benchmark

May 14, 2024

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein

Figure 1 for CinePile: A Long Video Question Answering Dataset and Benchmark

Figure 2 for CinePile: A Long Video Question Answering Dataset and Benchmark

Figure 3 for CinePile: A Long Video Question Answering Dataset and Benchmark

Figure 4 for CinePile: A Long Video Question Answering Dataset and Benchmark

Abstract:Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at https://hf.co/datasets/tomg-group-umd/cinepile

* Project page with all the artifacts - https://ruchitrawal.github.io/cinepile/

Via

Access Paper or Ask Questions

GaNI: Global and Near Field Illumination Aware Neural Inverse Rendering

Mar 22, 2024

Jiaye Wu, Saeed Hadadan, Geng Lin, Matthias Zwicker, David Jacobs, Roni Sengupta

Figure 1 for GaNI: Global and Near Field Illumination Aware Neural Inverse Rendering

Figure 2 for GaNI: Global and Near Field Illumination Aware Neural Inverse Rendering

Figure 3 for GaNI: Global and Near Field Illumination Aware Neural Inverse Rendering

Figure 4 for GaNI: Global and Near Field Illumination Aware Neural Inverse Rendering

Abstract:In this paper, we present GaNI, a Global and Near-field Illumination-aware neural inverse rendering technique that can reconstruct geometry, albedo, and roughness parameters from images of a scene captured with co-located light and camera. Existing inverse rendering techniques with co-located light-camera focus on single objects only, without modeling global illumination and near-field lighting more prominent in scenes with multiple objects. We introduce a system that solves this problem in two stages; we first reconstruct the geometry powered by neural volumetric rendering NeuS, followed by inverse neural radiosity that uses the previously predicted geometry to estimate albedo and roughness. However, such a naive combination fails and we propose multiple technical contributions that enable this two-stage approach. We observe that NeuS fails to handle near-field illumination and strong specular reflections from the flashlight in a scene. We propose to implicitly model the effects of near-field illumination and introduce a surface angle loss function to handle specular reflections. Similarly, we observe that invNeRad assumes constant illumination throughout the capture and cannot handle moving flashlights during capture. We propose a light position-aware radiance cache network and additional smoothness priors on roughness to reconstruct reflectance. Experimental evaluation on synthetic and real data shows that our method outperforms the existing co-located light-camera-based inverse rendering techniques. Our approach produces significantly better reflectance and slightly better geometry than capture strategies that do not require a dark room.

Via

Access Paper or Ask Questions

Towards Combinatorial Generalization for Catalysts: A Kohn-Sham Charge-Density Approach

Oct 28, 2023

Phillip Pope, David Jacobs

Figure 1 for Towards Combinatorial Generalization for Catalysts: A Kohn-Sham Charge-Density Approach

Figure 2 for Towards Combinatorial Generalization for Catalysts: A Kohn-Sham Charge-Density Approach

Figure 3 for Towards Combinatorial Generalization for Catalysts: A Kohn-Sham Charge-Density Approach

Figure 4 for Towards Combinatorial Generalization for Catalysts: A Kohn-Sham Charge-Density Approach

Abstract:The Kohn-Sham equations underlie many important applications such as the discovery of new catalysts. Recent machine learning work on catalyst modeling has focused on prediction of the energy, but has so far not yet demonstrated significant out-of-distribution generalization. Here we investigate another approach based on the pointwise learning of the Kohn-Sham charge-density. On a new dataset of bulk catalysts with charge densities, we show density models can generalize to new structures with combinations of elements not seen at train time, a form of combinatorial generalization. We show that over 80% of binary and ternary test cases achieve faster convergence than standard baselines in Density Functional Theory, amounting to an average reduction of 13% in the number of iterations required to reach convergence, which may be of independent interest. Our results suggest that density learning is a viable alternative, trading greater inference costs for a step towards combinatorial generalization, a key property for applications.

* Published at NeurIPS 2023

Via

Access Paper or Ask Questions

Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation

Jun 29, 2023

Jiaye Wu, Sanjoy Chowdhury, Hariharmano Shanmugaraja, David Jacobs, Soumyadip Sengupta

Figure 1 for Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation

Figure 2 for Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation

Figure 3 for Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation

Figure 4 for Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation

Abstract:Intrinsic image decomposition and inverse rendering are long-standing problems in computer vision. To evaluate albedo recovery, most algorithms report their quantitative performance with a mean Weighted Human Disagreement Rate (WHDR) metric on the IIW dataset. However, WHDR focuses only on relative albedo values and often fails to capture overall quality of the albedo. In order to comprehensively evaluate albedo, we collect a new dataset, Measured Albedo in the Wild (MAW), and propose three new metrics that complement WHDR: intensity, chromaticity and texture metrics. We show that existing algorithms often improve WHDR metric but perform poorly on other metrics. We then finetune different algorithms on our MAW dataset to significantly improve the quality of the reconstructed albedo both quantitatively and qualitatively. Since the proposed intensity, chromaticity, and texture metrics and the WHDR are all complementary we further introduce a relative performance measure that captures average performance. By analysing existing algorithms we show that there is significant room for improvement. Our dataset and evaluation metrics will enable researchers to develop algorithms that improve albedo reconstruction. Code and Data available at: https://measuredalbedo.github.io/

* Accepted into ICCP2023

Via

Access Paper or Ask Questions

Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

May 17, 2023

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji

Abstract:Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.

* Project webpage: https://research.nvidia.com/labs/dir/pyoco

Via

Access Paper or Ask Questions

LASER: Neuro-Symbolic Learning of Semantic Video Representations

Apr 15, 2023

Jiani Huang, Ziyang Li, David Jacobs, Mayur Naik, Ser-Nam Lim

Abstract:Modern AI applications involving video, such as video-text alignment, video search, and video captioning, benefit from a fine-grained understanding of video semantics. Existing approaches for video understanding are either data-hungry and need low-level annotation, or are based on general embeddings that are uninterpretable and can miss important details. We propose LASER, a neuro-symbolic approach that learns semantic video representations by leveraging logic specifications that can capture rich spatial and temporal properties in video data. In particular, we formulate the problem in terms of alignment between raw videos and specifications. The alignment process efficiently trains low-level perception models to extract a fine-grained video representation that conforms to the desired high-level specification. Our pipeline can be trained end-to-end and can incorporate contrastive and semantic loss functions derived from specifications. We evaluate our method on two datasets with rich spatial and temporal specifications: 20BN-Something-Something and MUGEN. We demonstrate that our method not only learns fine-grained video semantics but also outperforms existing baselines on downstream tasks such as video retrieval.

Via

Access Paper or Ask Questions

HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions

Apr 01, 2023

Anshul Shah, Aniket Roy, Ketul Shah, Shlok Kumar Mishra, David Jacobs, Anoop Cherian, Rama Chellappa

Abstract:Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to data augmentations that are used to craft the positives. However, augmenting pose sequences is a difficult task as the geometric constraints among the skeleton joints need to be enforced to make the augmentations realistic for that action. In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels. Our key contribution is a simple module, HaLP - to Hallucinate Latent Positives for contrastive learning. Specifically, HaLP explores the latent space of poses in suitable directions to generate new positives. To this end, we present a novel optimization formulation to solve for the synthetic positives with an explicit control on their hardness. We propose approximations to the objective, making them solvable in closed form with minimal overhead. We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II on tasks like linear evaluation, transfer learning, and kNN evaluation. Our code will be made available at https://github.com/anshulbshah/HaLP.

* To be presented at CVPR 2023

Via

Access Paper or Ask Questions

LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation

Mar 22, 2023

Koutilya Pnvr, Bharat Singh, Pallabi Ghosh, Behjat Siddiquie, David Jacobs

Figure 1 for LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation

Figure 2 for LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation

Figure 3 for LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation

Figure 4 for LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation

Abstract:We present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation compared to other feature representations like RGB images or CLIP encodings for text-based image segmentation. By training the segmentation models on the latent z-space, which creates a compressed representation across several domains like different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques.

* Supplementary material is included in the paper following the references section

Via

Access Paper or Ask Questions

Hyperbolic Contrastive Learning for Visual Representations beyond Objects

Dec 01, 2022

Songwei Ge, Shlok Mishra, Simon Kornblith, Chun-Liang Li, David Jacobs

Figure 1 for Hyperbolic Contrastive Learning for Visual Representations beyond Objects

Figure 2 for Hyperbolic Contrastive Learning for Visual Representations beyond Objects

Figure 3 for Hyperbolic Contrastive Learning for Visual Representations beyond Objects

Figure 4 for Hyperbolic Contrastive Learning for Visual Representations beyond Objects

Abstract:Although self-/un-supervised methods have led to rapid progress in visual representation learning, these methods generally treat objects and scenes using the same lens. In this paper, we focus on learning representations for objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure based on their compositionality. To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in a hyperbolic space. This novel hyperbolic objective encourages the scene-object hypernymy among the representations by optimizing the magnitude of their norms. We show that when pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. We also show that the properties of the learned representations allow us to solve various vision tasks that involve the interaction between scenes and objects in a zero-shot fashion. Our code can be found at \url{https://github.com/shlokk/HCL/tree/main/HCL}.

Via

Access Paper or Ask Questions