Abstract:Contextual ranking models have delivered impressive performance improvements over classical models in the document ranking task. However, these highly over-parameterized models tend to be data-hungry and require large amounts of data even for fine-tuning. In this paper, we propose data-augmentation methods for effective and robust ranking performance. One of the key benefits of using data augmentation is in achieving sample efficiency or learning effectively when we have only a small amount of training data. We propose supervised and unsupervised data augmentation schemes by creating training data using parts of the relevant documents in the query-document pairs. We then adapt a family of contrastive losses for the document ranking task that can exploit the augmented data to learn an effective ranking model. Our extensive experiments on subsets of the MS MARCO and TREC-DL test sets show that data augmentation, along with the ranking-adapted contrastive losses, results in performance improvements under most dataset sizes. Apart from sample efficiency, we conclusively show that data augmentation results in robust models when transferred to out-of-domain benchmarks. Our performance improvements in in-domain and more prominently in out-of-domain benchmarks show that augmentation regularizes the ranking model and improves its robustness and generalization capability.
Abstract:Earthquake prediction has been a challenging research area for many decades, where the future occurrence of this highly uncertain calamity is predicted. In this paper, several parametric and non-parametric features were calculated, where the non-parametric features were calculated using the parametric features. $8$ seismic features were calculated using Gutenberg-Richter law, the total recurrence, and the seismic energy release. Additionally, criterions such as Maximum Relevance and Maximum Redundancy were applied to choose the pertinent features. These features along with others were used as input for an Extreme Learning Machine (ELM) Regression Model. Magnitude and time data of $5$ decades from the Assam-Guwahati region were used to create this model for magnitude prediction. The Testing Accuracy and Testing Speed were computed taking the Root Mean Squared Error (RMSE) as the parameter for evaluating the mode. As confirmed by the results, ELM shows better scalability with much faster training and testing speed (up to a thousand times faster) than traditional Support Vector Machines. The testing RMSE came out to be around $0.097$. To further test the model's robustness -- magnitude-time data from California was used to calculate the seismic indicators which were then fed into an ELM and then tested on the Assam-Guwahati region. The model proves to be robust and can be implemented in early warning systems as it continues to be a major part of Disaster Response and management.
Abstract:The convolutional layers of standard convolutional neural networks (CNNs) are equivariant to translation. However, the convolution and fully-connected layers are not equivariant or invariant to other affine geometric transformations. Recently, a new class of CNNs is proposed in which the conventional layers of CNNs are replaced with equivariant convolution, pooling, and batch-normalization layers. The final classification layer in equivariant neural networks is invariant to different affine geometric transformations such as rotation, reflection and translation, and the scalar value is obtained by either eliminating the spatial dimensions of filter responses using convolution and down-sampling throughout the network or average is taken over the filter responses. In this work, we propose to integrate the orthogonal moments which gives the high-order statistics of the function as an effective means for encoding global invariance with respect to rotation, reflection and translation in fully-connected layers. As a result, the intermediate layers of the network become equivariant while the classification layer becomes invariant. The most widely used Zernike, pseudo-Zernike and orthogonal Fourier-Mellin moments are considered for this purpose. The effectiveness of the proposed work is evaluated by integrating the invariant transition and fully-connected layer in the architecture of group-equivariant CNNs (G-CNNs) on rotated MNIST and CIFAR10 datasets.
Abstract:Recently, neural networks have been successfully employed to improve upon state-of-the-art performance in ad-hoc retrieval tasks via machine-learned ranking functions. While neural retrieval models grow in complexity and impact, little is understood about their correspondence with well-studied IR principles. Recent work on interpretability in machine learning has provided tools and techniques to understand neural models in general, yet there has been little progress towards explaining ranking models. We investigate whether one can explain the behavior of neural ranking models in terms of their congruence with well understood principles of document ranking by using established theories from axiomatic IR. Axiomatic analysis of information retrieval models has formalized a set of constraints on ranking decisions that reasonable retrieval models should fulfill. We operationalize this axiomatic thinking to reproduce rankings based on combinations of elementary constraints. This allows us to investigate to what extent the ranking decisions of neural rankers can be explained in terms of retrieval axioms, and which axioms apply in which situations. Our experimental study considers a comprehensive set of axioms over several representative neural rankers. While the existing axioms can already explain the particularly confident ranking decisions rather well, future work should extend the axiom set to also cover the other still "unexplainable" neural IR rank decisions.
Abstract:Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this article, we probe BERT specifically to understand and measure the relational knowledge it captures in its parametric memory. While probing for linguistic understanding is commonly applied to all layers of BERT as well as fine-tuned models, this has not been done for factual knowledge. We utilize existing knowledge base completion tasks (LAMA) to probe every layer of pre-trained as well as fine-tuned BERT models(ranking, question answering, NER). Our findings show that knowledge is not just contained in BERT's final layers. Intermediate layers contribute a significant amount (17-60%) to the total knowledge found. Probing intermediate layers also reveals how different types of knowledge emerge at varying rates. When BERT is fine-tuned, relational knowledge is forgotten. The extent of forgetting is impacted by the fine-tuning objective and the training data. We found that ranking models forget the least and retain more knowledge in their final layer compared to masked language modeling and question-answering. However, masked language modeling performed the best at acquiring new knowledge from the training data. When it comes to learning facts, we found that capacity and fact density are key factors. We hope this initial work will spur further research into understanding the parametric memory of language models and the effect of training objectives on factual knowledge. The code to repeat the experiments is publicly available on GitHub.
Abstract:Complex machine learning models are deployed in several critical domains including healthcare and autonomous vehicles nowadays, albeit as functional black boxes. Consequently, there has been a recent surge in interpreting decisions of such complex models in order to explain their actions to humans. Models that correspond to human interpretation of a task are more desirable in certain contexts and can help attribute liability, build trust, expose biases and in turn build better models. It is, therefore, crucial to understand how and which models conform to human understanding of tasks. In this paper, we present a large-scale crowdsourcing study that reveals and quantifies the dissonance between human and machine understanding, through the lens of an image classification task. In particular, we seek to answer the following questions: Which (well-performing) complex ML models are closer to humans in their use of features to make accurate predictions? How does task difficulty affect the feature selection capability of machines in comparison to humans? Are humans consistently better at selecting features that make image recognition more accurate? Our findings have important implications on human-machine collaboration, considering that a long term goal in the field of artificial intelligence is to make machines capable of learning and reasoning like humans.
Abstract:Learning-to-rank (LTR) is a class of supervised learning techniques that apply to ranking problems dealing with a large number of features. The popularity and widespread application of LTR models in prioritizing information in a variety of domains makes their scrutability vital in today's landscape of fair and transparent learning systems. However, limited work exists that deals with interpreting the decisions of learning systems that output rankings. In this paper we propose a model agnostic local explanation method that seeks to identify a small subset of input features as explanation to a ranking decision. We introduce new notions of validity and completeness of explanations specifically for rankings, based on the presence or absence of selected features, as a way of measuring goodness. We devise a novel optimization problem to maximize validity directly and propose greedy algorithms as solutions. In extensive quantitative experiments we show that our approach outperforms other model agnostic explanation approaches across pointwise, pairwise and listwise LTR models in validity while not compromising on completeness.
Abstract:The problem of building a coherent and non-monotonous conversational agent with proper discourse and coverage is still an area of open research. Current architectures only take care of semantic and contextual information for a given query and fail to completely account for syntactic and external knowledge which are crucial for generating responses in a chit-chat system. To overcome this problem, we propose an end to end multi-stream deep learning architecture which learns unified embeddings for query-response pairs by leveraging contextual information from memory networks and syntactic information by incorporating Graph Convolution Networks (GCN) over their dependency parse. A stream of this network also utilizes transfer learning by pre-training a bidirectional transformer to extract semantic representation for each input sentence and incorporates external knowledge through the the neighborhood of the entities from a Knowledge Base (KB). We benchmark these embeddings on next sentence prediction task and significantly improve upon the existing techniques. Furthermore, we use AMUSED to represent query and responses along with its context to develop a retrieval based conversational agent which has been validated by expert linguists to have comprehensive engagement with humans.
Abstract:Prediction of toxicity levels of chemical compounds is an important issue in Quantitative Structure-Activity Relationship (QSAR) modeling. Although toxicity prediction has achieved significant progress in recent times through deep learning, prediction accuracy levels obtained by even very recent methods are not yet very high. We propose a multimodal deep learning method using multiple heterogeneous neural network types and data representations. We represent chemical compounds by strings, images, and numerical features. We train fully connected, convolutional, and recurrent neural networks and their ensembles. Each data representation or neural network type has its own strengths and weaknesses. Our motivation is to obtain a collective performance that could go beyond individual performance of each data representation or each neural network type. On a standard toxicity benchmark, our proposed method obtains significantly better accuracy levels than that by the state-of-the-art toxicity prediction methods.
Abstract:A recent trend in IR has been the usage of neural networks to learn retrieval models for text based adhoc search. While various approaches and architectures have yielded significantly better performance than traditional retrieval models such as BM25, it is still difficult to understand exactly why a document is relevant to a query. In the ML community several approaches for explaining decisions made by deep neural networks have been proposed -- including DeepSHAP which modifies the DeepLift algorithm to estimate the relative importance (shapley values) of input features for a given decision by comparing the activations in the network for a given image against the activations caused by a reference input. In image classification, the reference input tends to be a plain black image. While DeepSHAP has been well studied for image classification tasks, it remains to be seen how we can adapt it to explain the output of Neural Retrieval Models (NRMs). In particular, what is a good "black" image in the context of IR? In this paper we explored various reference input document construction techniques. Additionally, we compared the explanations generated by DeepSHAP to LIME (a model agnostic approach) and found that the explanations differ considerably. Our study raises concerns regarding the robustness and accuracy of explanations produced for NRMs. With this paper we aim to shed light on interesting problems surrounding interpretability in NRMs and highlight areas of future work.