NetEase
Abstract:In the rapidly evolving field of artificial intelligence, convolutional neural networks are essential for tackling complex challenges such as machine vision and medical diagnosis. Recently, to address the challenges in processing speed and power consumption of conventional digital convolution operations, many optical components have been suggested to replace the digital convolution layer in the neural network, accelerating various machine vision tasks. Nonetheless, the analog nature of the optical convolution kernel has not been fully explored. Here, we develop a spatial frequency domain training method to create arbitrarily shaped analog convolution kernels using an optical metasurface as the convolution layer, with its receptive field largely surpassing digital convolution kernels. By employing spatial multiplexing, the multiple parallel convolution kernels with both positive and negative weights are generated under the incoherent illumination condition. We experimentally demonstrate a 98.59% classification accuracy on the MNIST dataset, with simulations showing 92.63% and 68.67% accuracy on the Fashion-MNIST and CIFAR-10 datasets with additional digital layers. This work underscores the unique advantage of analog optical convolution, offering a promising avenue to accelerate machine vision tasks, especially in edge devices.
Abstract:The continuous evolution of small Unmanned Aerial Systems (sUAS) demands advanced testing methodologies to ensure their safe and reliable operations in the real-world. To push the boundaries of sUAS simulation testing in realistic environments, we previously developed the DroneReqValidator (DRV) platform, allowing developers to automatically conduct simulation testing in digital twin of earth. In this paper, we present DRV 2.0, which introduces a novel component called DroneWiS (Drone Wind Simulation). DroneWiS allows sUAS developers to automatically simulate realistic windy conditions and test the resilience of sUAS against wind. Unlike current state-of-the-art simulation tools such as Gazebo and AirSim that only simulate basic wind conditions, DroneWiS leverages Computational Fluid Dynamics (CFD) to compute the unique wind flows caused by the interaction of wind with the objects in the environment such as buildings and uneven terrains. This simulation capability provides deeper insights to developers about the navigation capability of sUAS in challenging and realistic windy conditions. DroneWiS equips sUAS developers with a powerful tool to test, debug, and improve the reliability and safety of sUAS in real-world. A working demonstration is available at https://youtu.be/khBHEBST8Wc
Abstract:We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements. Furthermore, we propose a more reliable evaluation metric akin to online judge platforms, where multiple spreadsheet files are created as test cases for each instruction, ensuring the evaluation of robust solutions capable of handling spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.
Abstract:We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for proficiently handling tabular data manipulation tasks, whether they are embedded within documents or spreadsheets, catering to real-world office scenarios. We propose a distant supervision method for training, which comprises a reasoning process extension strategy, aiding in training LLMs to understand reasoning patterns more effectively as well as a cross-way validation strategy, ensuring the quality of the automatically generated data. To evaluate the performance of TableLLM, we have crafted a benchmark tailored to address both document and spreadsheet formats as well as constructed a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM when compared to various existing general-purpose and tabular data-focused LLMs. We have publicly released the model checkpoint, source code, benchmarks, and a web application for user interaction.Our codes and data are publicly available at https://github.com/TableLLM/TableLLM.
Abstract:In this paper, we examine the collaborative dynamics between humans and language models (LMs), where the interactions typically involve LMs proposing text segments and humans editing or responding to these proposals. Productive engagement with LMs in such scenarios necessitates that humans discern effective text-based interaction strategies, such as editing and response styles, from historical human-LM interactions. This objective is inherently causal, driven by the counterfactual `what-if' question: how would the outcome of collaboration change if humans employed a different text editing/refinement strategy? A key challenge in answering this causal inference question is formulating an appropriate causal estimand: the conventional average treatment effect (ATE) estimand is inapplicable to text-based treatments due to their high dimensionality. To address this concern, we introduce a new causal estimand -- Incremental Stylistic Effect (ISE) -- which characterizes the average impact of infinitesimally shifting a text towards a specific style, such as increasing formality. We establish the conditions for the non-parametric identification of ISE. Building on this, we develop CausalCollab, an algorithm designed to estimate the ISE of various interaction strategies in dynamic human-LM collaborations. Our empirical investigations across three distinct human-LM collaboration scenarios reveal that CausalCollab effectively reduces confounding and significantly improves counterfactual estimation over a set of competitive baselines.
Abstract:The goal of our research was to investigate the effects of collaboration between academia and industry on Natural Language Processing (NLP). To do this, we created a pipeline to extract affiliations and citations from NLP papers and divided them into three categories: academia, industry, and hybrid (collaborations between academia and industry). Our empirical analysis found that there is a trend towards an increase in industry and academia-industry collaboration publications and that these types of publications tend to have a higher impact compared to those produced solely within academia.
Abstract:Recent work has shown that standard training via empirical risk minimization (ERM) can produce models that achieve high accuracy on average but low accuracy on underrepresented groups due to the prevalence of spurious features. A predominant approach to tackle this group robustness problem minimizes the worst group error (akin to a minimax strategy) on the training data, hoping it will generalize well on the testing data. However, this is often suboptimal, especially when the out-of-distribution (OOD) test data contains previously unseen groups. Inspired by ideas from the information retrieval and learning-to-rank literature, this paper first proposes to use Discounted Cumulative Gain (DCG) as a metric of model quality for facilitating better hyperparameter tuning and model selection. Being a ranking-based metric, DCG weights multiple poorly-performing groups (instead of considering just the group with the worst performance). As a natural next step, we build on our results to propose a ranking-based training method called Discounted Rank Upweighting (DRU), which differentially reweights a ranked list of poorly-performing groups in the training data to learn models that exhibit strong OOD performance on the test data. Results on several synthetic and real-world datasets highlight the superior generalization ability of our group-ranking-based (akin to soft-minimax) approach in selecting and learning models that are robust to group distributional shifts.
Abstract:The practical importance of coherent forecasts in hierarchical forecasting has inspired many studies on forecast reconciliation. Under this approach, so-called base forecasts are produced for every series in the hierarchy and are subsequently adjusted to be coherent in a second reconciliation step. Reconciliation methods have been shown to improve forecast accuracy, but will, in general, adjust the base forecast of every series. However, in an operational context, it is sometimes necessary or beneficial to keep forecasts of some variables unchanged after forecast reconciliation. In this paper, we formulate reconciliation methodology that keeps forecasts of a pre-specified subset of variables unchanged or "immutable". In contrast to existing approaches, these immutable forecasts need not all come from the same level of a hierarchy, and our method can also be applied to grouped hierarchies. We prove that our approach preserves unbiasedness in base forecasts. Our method can also account for correlations between base forecasting errors and ensure non-negativity of forecasts. We also perform empirical experiments, including an application to sales of a large scale online retailer, to assess the impacts of our proposed methodology.
Abstract:Structural re-parameterization (Rep) methods achieve noticeable improvements on simple VGG-style networks. Despite the prevalence, current Rep methods simply re-parameterize all operations into an augmented network, including those that rarely contribute to the model's performance. As such, the price to pay is an expensive computational overhead to manipulate these unnecessary behaviors. To eliminate the above caveats, we aim to bootstrap the training with minimal cost by devising a dynamic re-parameterization (DyRep) method, which encodes Rep technique into the training process that dynamically evolves the network structures. Concretely, our proposal adaptively finds the operations which contribute most to the loss in the network, and applies Rep to enhance their representational capacity. Besides, to suppress the noisy and redundant operations introduced by Rep, we devise a de-parameterization technique for a more compact re-parameterization. With this regard, DyRep is more efficient than Rep since it smoothly evolves the given network instead of constructing an over-parameterized network. Experimental results demonstrate our effectiveness, e.g., DyRep improves the accuracy of ResNet-18 by $2.04\%$ on ImageNet and reduces $22\%$ runtime over the baseline. Code is available at: https://github.com/hunto/DyRep.
Abstract:Creating abstractive summaries from meeting transcripts has proven to be challenging due to the limited amount of labeled data available for training neural network models. Moreover, Transformer-based architectures have proven to beat state-of-the-art models in summarizing news data. In this paper, we utilize a Transformer-based Pointer Generator Network to generate abstract summaries for meeting transcripts. This model uses 2 LSTMs as an encoder and a decoder, a Pointer network which copies words from the inputted text, and a Generator network to produce out-of-vocabulary words (hence making the summary abstractive). Moreover, a coverage mechanism is used to avoid repetition of words in the generated summary. First, we show that training the model on a news summary dataset and using zero-shot learning to test it on the meeting dataset proves to produce better results than training it on the AMI meeting dataset. Second, we show that training this model first on out-of-domain data, such as the CNN-Dailymail dataset, followed by a fine-tuning stage on the AMI meeting dataset is able to improve the performance of the model significantly. We test our model on a testing set from the AMI dataset and report the ROUGE-2 score of the generated summary to compare with previous literature. We also report the Factual score of our summaries since it is a better benchmark for abstractive summaries since the ROUGE-2 score is limited to measuring word-overlaps. We show that our improved model is able to improve on previous models by at least 5 ROUGE-2 scores, which is a substantial improvement. Also, a qualitative analysis of the summaries generated by our model shows that these summaries and human-readable and indeed capture most of the important information from the transcripts.