Abstract:In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this interesting finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a regularization to alleviate the overfitting issues in low-shot scenarios. We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities. Extensive experimental results demonstrate that our CoCoLe method remarkably outperforms the existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization tasks. Detailed ablation studies further confirm the efficacy of each component in CoCoLe.
Abstract:The discovery of drug-target interactions (DTIs) plays a crucial role in pharmaceutical development. The deep learning model achieves more accurate results in DTI prediction due to its ability to extract robust and expressive features from drug and target chemical structures. However, existing deep learning methods typically generate drug features via aggregating molecular atom representations, ignoring the chemical properties carried by motifs, i.e., substructures of the molecular graph. The atom-drug double-level molecular representation learning can not fully exploit structure information and fails to interpret the DTI mechanism from the motif perspective. In addition, sequential model-based target feature extraction either fuses limited contextual information or requires expensive computational resources. To tackle the above issues, we propose a hierarchical graph representation learning-based DTI prediction method (HiGraphDTI). Specifically, HiGraphDTI learns hierarchical drug representations from triple-level molecular graphs to thoroughly exploit chemical information embedded in atoms, motifs, and molecules. Then, an attentional feature fusion module incorporates information from different receptive fields to extract expressive target features.Last, the hierarchical attention mechanism identifies crucial molecular segments, which offers complementary views for interpreting interaction mechanisms. The experiment results not only demonstrate the superiority of HiGraphDTI to the state-of-the-art methods, but also confirm the practical ability of our model in interaction interpretation and new DTI discovery.
Abstract:Computational social science research has made advances in machine learning and natural language processing that support content moderators in detecting harmful content. These advances often rely on training datasets annotated by crowdworkers for harmful content. In designing instructions for annotation tasks to generate training data for these algorithms, researchers often treat the harm concepts that we train algorithms to detect - 'hateful', 'offensive', 'toxic', 'racist', 'sexist', etc. - as interchangeable. In this work, we studied whether the way that researchers define 'harm' affects annotation outcomes. Using Venn diagrams, information gain comparisons, and content analyses, we reveal that annotators do not use the concepts 'hateful', 'offensive', and 'toxic' interchangeably. We identify that features of harm definitions and annotators' individual characteristics explain much of how annotators use these terms differently. Our results offer empirical evidence discouraging the common practice of using harm concepts interchangeably in content moderation research. Instead, researchers should make specific choices about which harm concepts to analyze based on their research goals. Recognizing that researchers are often resource constrained, we also encourage researchers to provide information to bound their findings when their concepts of interest differ from concepts that off-the-shelf harmful content detection algorithms identify. Finally, we encourage algorithm providers to ensure their instruments can adapt to contextually-specific content detection goals (e.g., soliciting instrument users' feedback).
Abstract:The collective attention on online items such as web pages, search terms, and videos reflects trends that are of social, cultural, and economic interest. Moreover, attention trends of different items exhibit mutual influence via mechanisms such as hyperlinks or recommendations. Many visualisation tools exist for time series, network evolution, or network influence; however, few systems connect all three. In this work, we present AttentionFlow, a new system to visualise networks of time series and the dynamic influence they have on one another. Centred around an ego node, our system simultaneously presents the time series on each node using two visual encodings: a tree ring for an overview and a line chart for details. AttentionFlow supports interactions such as overlaying time series of influence and filtering neighbours by time or flux. We demonstrate AttentionFlow using two real-world datasets, VevoMusic and WikiTraffic. We show that attention spikes in songs can be explained by external events such as major awards, or changes in the network such as the release of a new song. Separate case studies also demonstrate how an artist's influence changes over their career, and that correlated Wikipedia traffic is driven by cultural interests. More broadly, AttentionFlow can be generalised to visualise networks of time series on physical infrastructures such as road networks, or natural phenomena such as weather and geological measurements.
Abstract:We study the problem of globally recovering a dictionary from a set of signals via $\ell_1$-minimization. We assume that the signals are generated as i.i.d. random linear combinations of the $K$ atoms from a complete reference dictionary $D^*\in \mathbb R^{K\times K}$, where the linear combination coefficients are from either a Bernoulli type model or exact sparse model. First, we obtain a necessary and sufficient norm condition for the reference dictionary $D^*$ to be a sharp local minimum of the expected $\ell_1$ objective function. Our result substantially extends that of Wu and Yu (2015) and allows the combination coefficient to be non-negative. Secondly, we obtain an explicit bound on the region within which the objective value of the reference dictionary is minimal. Thirdly, we show that the reference dictionary is the unique sharp local minimum, thus establishing the first known global property of $\ell_1$-minimization dictionary learning. Motivated by the theoretical results, we introduce a perturbation-based test to determine whether a dictionary is a sharp local minimum of the objective function. In addition, we also propose a new dictionary learning algorithm based on Block Coordinate Descent, called DL-BCD, which is guaranteed to have monotonic convergence. Simulation studies show that DL-BCD has competitive performance in terms of recovery rate compared to many state-of-the-art dictionary learning algorithms.
Abstract:We study the theoretical properties of learning a dictionary from $N$ signals $\mathbf x_i\in \mathbb R^K$ for $i=1,...,N$ via $l_1$-minimization. We assume that $\mathbf x_i$'s are $i.i.d.$ random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $\mathbf D_0 \in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $\mathbf D_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K\log K)$.