Abstract:Building on the unprecedented capabilities of large language models for command understanding and zero-shot recognition of multi-modal vision-language transformers, visual language navigation (VLN) has emerged as an effective way to address multiple fundamental challenges toward a natural language interface to robot navigation. However, such vision-language models are inherently vulnerable due to the lack of semantic meaning of the underlying embedding space. Using a recently developed gradient based optimization procedure, we demonstrate that images can be modified imperceptibly to match the representation of totally different images and unrelated texts for a vision-language model. Building on this, we develop algorithms that can adversarially modify a minimal number of images so that the robot will follow a route of choice for commands that require a number of landmarks. We demonstrate that experimentally using a recently proposed VLN system; for a given navigation command, a robot can be made to follow drastically different routes. We also develop an efficient algorithm to detect such malicious modifications reliably based on the fact that the adversarially modified images have much higher sensitivity to added Gaussian noise than the original images.
Abstract:Utilizing a shared embedding space, emerging multimodal models exhibit unprecedented zero-shot capabilities. However, the shared embedding space could lead to new vulnerabilities if different modalities can be misaligned. In this paper, we extend and utilize a recently developed effective gradient-based procedure that allows us to match the embedding of a given text by minimally modifying an image. Using the procedure, we show that we can align the embeddings of distinguishable texts to any image through unnoticeable adversarial attacks in joint image-text models, revealing that semantically unrelated images can have embeddings of identical texts and at the same time visually indistinguishable images can be matched to the embeddings of very different texts. Our technique achieves 100\% success rate when it is applied to text datasets and images from multiple sources. Without overcoming the vulnerability, multimodal models cannot robustly align inputs from different modalities in a semantically meaningful way. \textbf{Warning: the text data used in this paper are toxic in nature and may be offensive to some readers.}
Abstract:Deep neural networks trained in an end-to-end manner are proven to be efficient in a wide range of machine learning tasks. However, there is one drawback of end-to-end learning: The learned features and information are implicitly represented in neural network parameters, which cannot be used as regularities, concepts or knowledge to explicitly represent the data probability distribution. To resolve this issue, we propose in this paper a new machine learning theory, which defines in mathematics what are regularities. Briefly, regularities are concise representations of the non-random features, or 'non-randomness' in the data probability distribution. Combining with information theory, we claim that regularities can also be regarded as a small amount of information encoding a large amount of information. Our theory is based on spiking functions. That is, if a function can react to, or spike on specific data samples more frequently than random noise inputs, we say that such a function discovers non-randomness from the data distribution, and encodes the non-randomness into regularities. Our theory also discusses applying multiple spiking functions to the same data distribution. In this process, we claim that the 'best' regularities, or the optimal spiking functions, are those who can capture the largest amount of information from the data distribution, and then encode the captured information in the most concise way. Theorems and hypotheses are provided to describe in mathematics what are 'best' regularities and optimal spiking functions. Finally, we propose a machine learning approach, which can potentially obtain the optimal spiking functions regarding the given dataset in practice.
Abstract:Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. However, these models are poorly understood due to their complexity and size. While probing-based methods are widely used to understand specific properties, the structures of the representation space are not systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond datasets. In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99\% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We have also obtained similar results using a different model to support that our results are applicable to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.
Abstract:Pre-trained large foundation models play a central role in the recent surge of artificial intelligence, resulting in fine-tuned models with remarkable abilities when measured on benchmark datasets, standard exams, and applications. Due to their inherent complexity, these models are not well understood. While small adversarial inputs to such models are well known, the structures of the representation space are not well characterized despite their fundamental importance. In this paper, using the vision transformers as an example due to the continuous nature of their input space, we show via analyses and systematic experiments that the representation space consists of large piecewise linear subspaces where there exist very different inputs sharing the same representations, and at the same time, local normal spaces where there are visually indistinguishable inputs having very different representations. The empirical results are further verified using the local directional estimations of the Lipschitz constants of the underlying models. Consequently, the resulting representations change the results of downstream models, and such models are subject to overgeneralization and with limited semantically meaningful generalization capability.
Abstract:Link prediction is a crucial research area in knowledge graphs, with many downstream applications. In many real-world scenarios, inductive link prediction is required, where predictions have to be made among unseen entities. Embedding-based models usually need fine-tuning on new entity embeddings, and hence are difficult to be directly applied to inductive link prediction tasks. Logical rules captured by rule-based models can be directly applied to new entities with the same graph typologies, but the captured rules are discrete and usually lack generosity. Graph neural networks (GNNs) can generalize topological information to new graphs taking advantage of deep neural networks, which however may still need fine-tuning on new entity embeddings. In this paper, we propose SiaILP, a path-based model for inductive link prediction using siamese neural networks. Our model only depends on relation and path embeddings, which can be generalized to new entities without fine-tuning. Experiments show that our model achieves several new state-of-the-art performances in link prediction tasks using inductive versions of WN18RR, FB15k-237, and Nell995.
Abstract:Graph-based semi-supervised learning (GSSL) has been used successfully in various applications. Existing methods leverage the graph structure and labeled samples for classification. Label Propagation (LP) and Graph Neural Networks (GNNs) both iteratively pass messages on graphs, where LP propagates node labels through edges and GNN aggregates node features from the neighborhood. Recently, combining LP and GNN has led to improved performance. However, utilizing labels and features jointly in higher-order graphs has not been explored. Therefore, we propose Nonlinear Correct and Smooth (NLCS), which improves the existing post-processing approach by incorporating non-linearity and higher-order representation into the residual propagation to handle intricate node relationships effectively. Systematic evaluations show that our method achieves remarkable average improvements of 13.71% over base prediction and 2.16% over the state-of-the-art post-processing method on six commonly used datasets. Comparisons and analyses show our method effectively utilizes labels and features jointly in higher-order graphs to resolve challenging graph relationships.
Abstract:Stock price movement prediction is a challenging and essential problem in finance. While it is well established in modern behavioral finance that the share prices of related stocks often move after the release of news via reactions and overreactions of investors, how to capture the relationships between price movements and news articles via quantitative models is an active area research; existing models have achieved success with variable degrees. In this paper, we propose to improve stock price movement classification using news articles by incorporating regularization and optimization techniques from deep learning. More specifically, we capture the dependencies between news articles and stocks through embeddings and bidirectional recurrent neural networks as in recent models. We further incorporate weight decay, batch normalization, dropout, and label smoothing to improve the generalization of the trained models. To handle high fluctuations of validation accuracy of batch normalization, we propose dual-phase training to realize the improvements reliably. Our experimental results on a commonly used dataset show significant improvements, achieving average accuracy of 80.7% on the test set, which is more than 10.0% absolute improvement over existing models. Our ablation studies show batch normalization and label smoothing are most effective, leading to 6.0% and 3.4% absolute improvement, respectively on average.
Abstract:Recently, multi-layer perceptrons (MLPs) with ReLU activations have enabled new photo-realistic rendering techniques by encoding scene properties using their weights. For these models, termed coordinate based MLPs, sinusoidal encodings are necessary in allowing for convergence to the high frequency components of the signal due to their severe spectral bias. Previous work has explained this phenomenon using Neural Tangent Kernel (NTK) and Fourier analysis. However, the kernel regime does not expose the properties of the network that induce this behavior, and the Fourier decomposition is global, not allowing for insight on the network's local dynamics. A new interpretation of spectral bias directly through ReLU network computations would expose their limitations in dense settings, while providing a clearer explanation as to how this behavior emerges during the learning process. In this paper, we provide the first study of spectral bias in a coordinate based MLP through its activation regions and gradient descent dynamics, specifically using gradient confusion. We relate the confusion between inputs to the distinctiveness of their activation patterns, and find higher amounts of confusion when expressive power is limited. This leads to slower convergence to the high frequency components of the signal, which is magnified by the density of coordinates. Additionally, this method allows us to analyze the properties of the activation regions as spectral bias is reduced, in which we find distinct dynamics.
Abstract:In this paper, we provide a novel way to generate low-dimension (dense) vector embeddings for the noun and verb synsets in WordNet, so that the hypernym-hyponym tree structure is preserved in the embeddings. We call this embedding the sense spectrum (and sense spectra for embeddings). In order to create suitable labels for the training of sense spectra, we designed a new similarity measurement for noun and verb synsets in WordNet. We call this similarity measurement the hypernym intersection similarity (HIS), since it compares the common and unique hypernyms between two synsets. Our experiments show that on the noun and verb pairs of the SimLex-999 dataset, HIS outperforms the three similarity measurements in WordNet. Moreover, to the best of our knowledge, the sense spectra is the first dense embedding system that can explicitly and completely measure the hypernym-hyponym relationship in WordNet.