Abstract:Large Language Models (LLMs) often generate incorrect or outdated information, especially in low-resource settings or when dealing with private data. To address this, Retrieval-Augmented Generation (RAG) uses external knowledge bases (KBs), but these can also suffer from inaccuracies. We introduce STACKFEED, a novel Structured Textual Actor-Critic Knowledge base editing with FEEDback approach that iteratively refines the KB based on expert feedback using a multi-actor, centralized critic reinforcement learning framework. Each document is assigned to an actor, modeled as a ReACT agent, which performs structured edits based on document-specific targeted instructions from a centralized critic. Experimental results show that STACKFEED significantly improves KB quality and RAG system performance, enhancing accuracy by up to 8% over baselines.
Abstract:LLMs have demonstrated significant potential in code generation tasks, achieving promising results at the function or statement level in various benchmarks. However, the complexities associated with creating code artifacts like classes, particularly within the context of real-world software repositories, remain underexplored. Existing research often treats class-level generation as an isolated task, neglecting the intricate dependencies and interactions that characterize real-world software development environments. To address this gap, we introduce RepoClassBench, a benchmark designed to rigorously evaluate LLMs in generating complex, class-level code within real-world repositories. RepoClassBench includes natural language to class generation tasks across Java and Python, from a selection of public repositories. We ensure that each class in our dataset not only has cross-file dependencies within the repository but also includes corresponding test cases to verify its functionality. We find that current models struggle with the realistic challenges posed by our benchmark, primarily due to their limited exposure to relevant repository contexts. To address this shortcoming, we introduce Retrieve-Repotools-Reflect (RRR), a novel approach that equips LLMs with static analysis tools to iteratively navigate & reason about repository-level context in an agent-based framework. Our experiments demonstrate that RRR significantly outperforms existing baselines on RepoClassBench, showcasing its effectiveness across programming languages and in various settings. Our findings emphasize the need for benchmarks that incorporate repository-level dependencies to more accurately reflect the complexities of software development. Our work illustrates the benefits of leveraging specialized tools to enhance LLMs understanding of repository context. We plan to make our dataset and evaluation harness public.
Abstract:Unsupervised node representations learnt using contrastive learning-based methods have shown good performance on downstream tasks. However, these methods rely on augmentations that mimic low-pass filters, limiting their performance on tasks requiring different eigen-spectrum parts. This paper presents a simple filter-based augmentation method to capture different parts of the eigen-spectrum. We show significant improvements using these augmentations. Further, we show that sharing the same weights across these different filter augmentations is possible, reducing the computational load. In addition, previous works have shown that good performance on downstream tasks requires high dimensional representations. Working with high dimensions increases the computations, especially when multiple augmentations are involved. We mitigate this problem and recover good performance through lower dimensional embeddings using simple random Fourier feature projections. Our method, FiGURe achieves an average gain of up to 4.4%, compared to the state-of-the-art unsupervised models, across all datasets in consideration, both homophilic and heterophilic. Our code can be found at: https://github.com/microsoft/figure.
Abstract:We propose a new approach to extracting data items or field values from semi-structured documents. Examples of such problems include extracting passenger name, departure time and departure airport from a travel itinerary, or extracting price of an item from a purchase receipt. Traditional approaches to data extraction use machine learning or program synthesis to process the whole document to extract the desired fields. Such approaches are not robust to format changes in the document, and the extraction process typically fails even if changes are made to parts of the document that are unrelated to the desired fields of interest. We propose a new approach to data extraction based on the concepts of landmarks and regions. Humans routinely use landmarks in manual processing of documents to zoom in and focus their attention on small regions of interest in the document. Inspired by this human intuition, we use the notion of landmarks in program synthesis to automatically synthesize extraction programs that first extract a small region of interest, and then automatically extract the desired value from the region in a subsequent step. We have implemented our landmark-based extraction approach in a tool LRSyn, and show extensive evaluation on documents in HTML as well as scanned images of invoices and receipts. Our results show that our approach is robust to various types of format changes that routinely happen in real-world settings.
Abstract:Graph Neural Networks (GNNs) exploit signals from node features and the input graph topology to improve node classification task performance. However, these models tend to perform poorly on heterophilic graphs, where connected nodes have different labels. Recently proposed GNNs work across graphs having varying levels of homophily. Among these, models relying on polynomial graph filters have shown promise. We observe that solutions to these polynomial graph filter models are also solutions to an overdetermined system of equations. It suggests that in some instances, the model needs to learn a reasonably high order polynomial. On investigation, we find the proposed models ineffective at learning such polynomials due to their designs. To mitigate this issue, we perform an eigendecomposition of the graph and propose to learn multiple adaptive polynomial filters acting on different subsets of the spectrum. We theoretically and empirically show that our proposed model learns a better filter, thereby improving classification accuracy. We study various aspects of our proposed model including, dependency on the number of eigencomponents utilized, latent polynomial filters learned, and performance of the individual polynomials on the node classification task. We further show that our model is scalable by evaluating over large graphs. Our model achieves performance gains of up to 5% over the state-of-the-art models and outperforms existing polynomial filter-based approaches in general.
Abstract:Graph Neural Networks (GNNs) exhibit excellent performance when graphs have strong homophily property, i.e. connected nodes have the same labels. However, they perform poorly on heterophilic graphs. Several approaches address the issue of heterophily by proposing models that adapt the graph by optimizing task-specific loss function using labelled data. These adaptations are made either via attention or by attenuating or enhancing various low-frequency/high-frequency signals, as needed for the task at hand. More recent approaches adapt the eigenvalues of the graph. One important interpretation of this adaptation is that these models select/weigh the eigenvectors of the graph. Based on this interpretation, we present an eigendecomposition based approach and propose EigenNetwork models that improve the performance of GNNs on heterophilic graphs. Performance improvement is achieved by learning flexible graph adaptation functions that modulate the eigenvalues of the graph. Regularization of these functions via parameter sharing helps to improve the performance even more. Our approach achieves up to 11% improvement in performance over the state-of-the-art methods on heterophilic graphs.
Abstract:Graph Neural Networks (GNNs) have shown excellent performance on graphs that exhibit strong homophily with respect to the node labels i.e. connected nodes have same labels. However, they perform poorly on heterophilic graphs. Recent approaches have typically modified aggregation schemes, designed adaptive graph filters, etc. to address this limitation. In spite of this, the performance on heterophilic graphs can still be poor. We propose a simple alternative method that exploits Truncated Singular Value Decomposition (TSVD) of topological structure and node features. Our approach achieves up to ~30% improvement in performance over state-of-the-art methods on heterophilic graphs. This work is an early investigation into methods that differ from aggregation based approaches. Our experimental results suggest that it might be important to explore other alternatives to aggregation methods for heterophilic setting.
Abstract:Graph Neural Networks have shown excellent performance on semi-supervised classification tasks. However, they assume access to a graph that may not be often available in practice. In the absence of any graph, constructing k-Nearest Neighbor (kNN) graphs from the given data have shown to give improvements when used with GNNs over other semi-supervised methods. This paper proposes a semi-supervised graph learning method for cases when there are no graphs available. This method learns a graph as a convex combination of the unsupervised kNN graph and a supervised label-affinity graph. The label-affinity graph directly captures all the nodes' label-affinity with the labeled nodes, i.e., how likely a node has the same label as the labeled nodes. This affinity measure contrasts with the kNN graph where the metric measures closeness in the feature space. Our experiments suggest that this approach gives close to or better performance (up to 1.5%), while being simpler and faster (up to 70x) to train, than state-of-the-art graph learning methods. We also conduct several experiments to highlight the importance of individual components and contrast them with state-of-the-art methods.
Abstract:We consider the problem of learning latent features (aka embedding) for users and items in a recommendation setting. Given only a user-item interaction graph, the goal is to recommend items for each user. Traditional approaches employ matrix factorization-based collaborative filtering methods. Recent methods using graph convolutional networks (e.g., LightGCN) achieve state-of-the-art performance. They learn both user and item embedding. One major drawback of most existing methods is that they are not inductive; they do not generalize for users and items unseen during training. Besides, existing network models are quite complex, difficult to train and scale. Motivated by LightGCN, we propose a graph convolutional network modeling approach for collaborative filtering CF-GCN. We solely learn user embedding and derive item embedding using light variant CF-LGCN-U performing neighborhood aggregation, making it scalable due to reduced model complexity. CF-LGCN-U models naturally possess the inductive capability for new items, and we propose a simple solution to generalize for new users. We show how the proposed models are related to LightGCN. As a by-product, we suggest a simple solution to make LightGCN inductive. We perform comprehensive experiments on several benchmark datasets and demonstrate the capabilities of the proposed approach. Experimental results show that similar or better generalization performance is achievable than the state of the art methods in both transductive and inductive settings.
Abstract:We consider the problem of learning efficient and inductive graph convolutional networks for text classification with a large number of examples and features. Existing state-of-the-art graph embedding based methods such as predictive text embedding (PTE) and TextGCN have shortcomings in terms of predictive performance, scalability and inductive capability. To address these limitations, we propose a heterogeneous graph convolutional network (HeteGCN) modeling approach that unites the best aspects of PTE and TextGCN together. The main idea is to learn feature embeddings and derive document embeddings using a HeteGCN architecture with different graphs used across layers. We simplify TextGCN by dissecting into several HeteGCN models which (a) helps to study the usefulness of individual models and (b) offers flexibility in fusing learned embeddings from different models. In effect, the number of model parameters is reduced significantly, enabling faster training and improving performance in small labeled training set scenario. Our detailed experimental studies demonstrate the efficacy of the proposed approach.