Abstract: For specialized domains, there is often not a wealth of data with which to train large machine learning models. In such limited data / compute settings, various methods exist that aim to $\textit{do more with less}$, such as finetuning from a pretrained model, modulating difficulty levels as data are presented to a model (curriculum learning), and considering the role of model type / size. Approaches to efficient $\textit{machine}$ learning also take inspiration from $\textit{human}$ learning by considering use cases where machine learning systems have access to approximately the same number of words experienced by a 13-year-old child (100M words). We investigate the role of three primary variables in a limited data regime as part of the multimodal track of the BabyLM challenge. We contrast: (i) curriculum learning, (ii) pretraining (with text-only data), and (iii) model type. We modulate these variables and assess them on two types of tasks: (a) multimodal (text+image) and (b) unimodal (text-only) tasks. We find that curriculum learning benefits multimodal evaluations relative to non-curriculum-learning models, particularly when combined with text-only pretraining. On text-only tasks, curriculum learning appears to help models with smaller trainable parameter counts. We suggest possible reasons, based on architectural differences and training designs, for why one might observe such results.
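The curriculum-learning manipulation described above can be sketched as follows: a minimal, illustrative ordering of training examples by a difficulty proxy (here, hypothetically, caption length). This is not the submission's actual training loop; the `difficulty` function and batching scheme are assumptions.

```python
# Minimal curriculum-learning sketch (illustrative; not the actual training
# setup used for the challenge submission). Examples are presented in order
# of increasing difficulty; "difficulty" here is a hypothetical proxy.

import random

def difficulty(example):
    # Hypothetical proxy: longer captions are treated as harder.
    return len(example["caption"].split())

def curriculum_batches(dataset, batch_size, shuffle_within_batch=True):
    """Yield batches ordered from easiest to hardest examples."""
    ordered = sorted(dataset, key=difficulty)
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        if shuffle_within_batch:
            random.shuffle(batch)  # retain some stochasticity within a batch
        yield batch

# A non-curriculum baseline would instead shuffle the whole dataset once.
dataset = [{"caption": "a dog"}, {"caption": "a brown dog running on a sandy beach"}]
for batch in curriculum_batches(dataset, batch_size=2):
    pass  # feed `batch` to the model's training step
```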
Abstract: Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts in a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss itself creates a gap during training. We therefore propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than the default CLIP loss on downstream tasks such as zero-shot image classification and multi-modal arithmetic.
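As a rough illustration of the loss modification described above, the sketch below adds the alignment and uniformity terms from the unimodal contrastive learning literature (Wang & Isola, 2020) to a standard symmetric CLIP-style InfoNCE loss. The temperatures and weighting coefficients (`lam_align`, `lam_unif`) are placeholder assumptions, not the values used in the paper.

```python
# Sketch of augmenting a CLIP-style loss with alignment and uniformity terms.
# Weights and temperatures are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over L2-normalized image/text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def alignment(img_emb, txt_emb):
    # Mean squared distance between matched (positive) image-text pairs.
    return (F.normalize(img_emb, dim=-1) - F.normalize(txt_emb, dim=-1)).pow(2).sum(1).mean()

def uniformity(emb, t=2.0):
    # Log of the mean Gaussian potential over all pairs; lower means more uniform.
    emb = F.normalize(emb, dim=-1)
    return torch.pdist(emb, p=2).pow(2).mul(-t).exp().mean().log()

def modified_loss(img_emb, txt_emb, lam_align=1.0, lam_unif=1.0):
    # CLIP loss plus the two extra terms, averaged over both modalities.
    unif = 0.5 * (uniformity(img_emb) + uniformity(txt_emb))
    return clip_loss(img_emb, txt_emb) + lam_align * alignment(img_emb, txt_emb) + lam_unif * unif
```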
Abstract: Prior work has offered evidence for functional localization in the brain; different anatomical regions preferentially activate for certain types of visual input. For example, the fusiform face area preferentially activates for visual stimuli that include a face. However, the spectrum of visual semantics is extensive, and only a few semantically tuned patches of cortex have so far been identified in the human brain. Using a multimodal (natural language and image) neural network architecture (CLIP), we train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings. We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of these participant-specific contrastive models. This reveals what we call Shared Decodable Concepts (SDCs): clusters in CLIP space that are decodable from common sets of voxels across multiple participants. Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC. We note SDCs for previously reported visual features (e.g. orientation tuning in early visual cortex) as well as visual semantic concepts such as faces, places, and bodies. In cases where our method finds multiple clusters for a visuo-semantic concept, the least associated images allow us to dissociate between confounding factors. For example, we discovered two clusters of food images, one driven by color and the other by shape. We also uncover previously unreported specializations, such as regions of the extrastriate body area (EBA) tuned for legs/hands and sensitivity to numerosity in the right intraparietal sulcus, among other findings. Thus, our contrastive-learning methodology better characterizes new and existing visuo-semantic representations in the brain by leveraging multimodal neural network representations and a novel adaptation of clustering algorithms.
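A simplified sketch of the pipeline described above, under several assumptions: a single linear projection stands in for the participant-specific contrastive model, InfoNCE is used as the contrastive objective, and off-the-shelf DBSCAN (rather than the paper's adaptation) clusters the per-voxel weight vectors, which live in CLIP space and are therefore comparable across participants. All hyperparameters are placeholders.

```python
# Illustrative sketch, not the authors' exact pipeline.

import torch
import torch.nn.functional as F
from sklearn.cluster import DBSCAN

class VoxelToCLIP(torch.nn.Module):
    def __init__(self, n_voxels, clip_dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(n_voxels, clip_dim)

    def forward(self, voxels):
        return F.normalize(self.proj(voxels), dim=-1)

def contrastive_step(model, voxels, clip_emb, optimizer, temperature=0.07):
    # InfoNCE between predicted and true CLIP embeddings of the viewed images.
    pred = model(voxels)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = pred @ clip_emb.t() / temperature
    targets = torch.arange(voxels.size(0))
    loss = F.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def cluster_parameters(models, eps=0.5, min_samples=5):
    # Cluster per-voxel weight vectors (columns of each projection); each
    # vector lies in CLIP space, so voxels from different participants are
    # directly comparable. Cluster labels of -1 mark DBSCAN noise points.
    rows = torch.cat([m.proj.weight.detach().t() for m in models], dim=0).numpy()
    return DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(rows)
```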
Abstract: Centred Kernel Alignment (CKA) has recently emerged as a popular metric for comparing activations from biological and artificial neural networks (ANNs) in order to quantify the alignment between internal representations derived from stimulus sets (e.g. images, text, video) presented to both systems. In this paper we highlight issues that the community should take into account when using CKA as an alignment metric with neural data. Neural data lie in the low-data, high-dimensionality domain, which is one of the cases where (biased) CKA yields high similarity scores even for pairs of random matrices. Using fMRI and MEG data from the THINGS project, we show that if biased CKA is applied to representations of different sizes in this domain, they are not directly comparable, owing to biased CKA's sensitivity to differing feature-sample ratios rather than to stimuli-driven responses. This situation can arise both when comparing a pre-selected region of interest (ROI) to multiple ANN layers, and when determining to which ANN layer multiple ROIs / sensor groups of different dimensionality are most similar. We show that biased CKA can be artificially driven to its maximum value when using independent random data of different sample-feature ratios. We further show that shuffling sample-feature pairs of real neural data does not drastically alter biased CKA similarity in comparison to unshuffled data, indicating an undesirable lack of sensitivity to stimuli-driven neural responses. Positive alignment of true stimuli-driven responses is only achieved by using debiased CKA. Lastly, we report findings that suggest biased CKA is sensitive to the inherent structure of neural data, only differing from shuffled data when debiased CKA detects stimuli-driven alignment.
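For concreteness, the following numpy sketch contrasts biased and debiased linear CKA; the final lines illustrate, on independent random matrices with few samples and many features, the inflation of biased CKA that motivates the analyses above. The neural data preprocessing used in the paper is not reproduced here.

```python
# Minimal numpy sketch of biased vs. debiased linear CKA.

import numpy as np

def _gram_linear(x):
    # x: (n_samples, n_features) activation matrix.
    return x @ x.T

def biased_cka(x, y):
    # Standard (biased) linear CKA using centred Gram matrices.
    k, l = _gram_linear(x), _gram_linear(y)
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    kc, lc = h @ k @ h, h @ l @ h
    return np.sum(kc * lc) / (np.linalg.norm(kc) * np.linalg.norm(lc))

def _unbiased_hsic(k, l):
    # Unbiased HSIC estimator (Song et al., 2012); diagonals are zeroed.
    n = k.shape[0]
    np.fill_diagonal(k, 0.0)
    np.fill_diagonal(l, 0.0)
    ones = np.ones(n)
    trace = np.trace(k @ l)
    term2 = (ones @ k @ ones) * (ones @ l @ ones) / ((n - 1) * (n - 2))
    term3 = 2.0 * (ones @ k @ l @ ones) / (n - 2)
    return (trace + term2 - term3) / (n * (n - 3))

def debiased_cka(x, y):
    k, l = _gram_linear(x), _gram_linear(y)
    hxy = _unbiased_hsic(k.copy(), l.copy())
    hxx = _unbiased_hsic(k.copy(), k.copy())
    hyy = _unbiased_hsic(l.copy(), l.copy())
    return hxy / np.sqrt(hxx * hyy)

# Independent random data with few samples and many features: biased CKA is
# inflated, while debiased CKA stays near zero.
rng = np.random.default_rng(0)
x, y = rng.standard_normal((20, 5000)), rng.standard_normal((20, 8000))
print(biased_cka(x, y), debiased_cka(x, y))
```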
Abstract: We introduce a method that takes advantage of high-quality pretrained multimodal representations to explore fine-grained semantic networks in the human brain. Previous studies have documented evidence of functional localization in the brain, with different anatomical regions preferentially activating for different types of sensory input. Many such localized structures are known, including the fusiform face area and parahippocampal place area. This raises the question of whether additional brain regions (or conjunctions of brain regions) are also specialized for other important semantic concepts. To identify such brain regions, we developed a data-driven approach to uncover visual concepts that are decodable from a massive functional magnetic resonance imaging (fMRI) dataset. Our analysis is broadly split into three sections. First, a fully connected neural network is trained to map brain responses to the outputs of an image-language foundation model, CLIP (Radford et al., 2021). Subsequently, a contrastive-learning dimensionality reduction method reveals the brain-decodable components of CLIP space. In the final section of our analysis, we localize shared decodable concepts in the brain using a voxel-masking optimization method to produce a shared decodable concept (SDC) space. The accuracy of our procedure is validated by comparing it to previous localization experiments that identify regions for faces, bodies, and places. In addition to these concepts, whose corresponding brain regions were already known, we localize novel concept representations, shared across participants, to other areas of the human brain. We also demonstrate how this method can be used to inspect fine-grained semantic networks for individual participants. We envisage that this extensible method can also be adapted to explore other questions at the intersection of AI and neuroscience.
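The voxel-masking optimization in the final analysis stage could, in spirit, look like the sketch below: a soft per-voxel mask is optimized so that decoding from the masked voxels alone still aligns with a target concept direction in CLIP space, with a sparsity penalty encouraging a small voxel set. The mask parameterization, penalty, and the `decoder` / `concept_dir` inputs are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch of a voxel-masking optimization; all choices are assumptions.

import torch
import torch.nn.functional as F

def localize_concept(decoder, voxels, concept_dir, steps=500, lam=1e-3, lr=0.05):
    """decoder: frozen, trained brain-to-CLIP network.
    voxels: (n_trials, n_voxels) responses; concept_dir: (clip_dim,) unit vector."""
    logits = torch.zeros(voxels.size(1), requires_grad=True)  # one logit per voxel
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        mask = torch.sigmoid(logits)                      # soft 0-1 mask over voxels
        pred = F.normalize(decoder(voxels * mask), dim=-1)
        sim = (pred @ concept_dir).mean()                 # alignment with the concept
        loss = -sim + lam * mask.sum()                    # favour a small voxel set
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logits).detach()                 # per-voxel relevance scores
```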