Joy
Abstract:Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called Length Collapse, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering the self-attention mechanism inherently functions as a low-pass filter, we prove that long sequences increase the attenuation rate of the low-pass filter effect of the self-attention mechanism. With layers going deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps will collapse into a narrow space, especially in long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse limitation by introducing a temperature in softmax(), which achieves a higher low-filter attenuation rate. The tuning-free method, called TempScale, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale can improve existing embedding models, especially on long text inputs, bringing up to 0.53% performance gains on 40 datasets from Massive Text Embedding Benchmark (MTEB) and 0.82% performance gains on 4 datasets from LongEmbed, which specifically focuses on long context retrieval.
Abstract:Timely and effective load shedding in power systems is critical for maintaining supply-demand balance and preventing cascading blackouts. To eliminate load shedding bias against specific regions in the system, optimization-based methods are uniquely positioned to help balance between economical and equity considerations. However, the resulting optimization problem involves complex constraints, which can be time-consuming to solve and thus cannot meet the real-time requirements of load shedding. To tackle this challenge, in this paper we present an efficient machine learning algorithm to enable millisecond-level computation for the optimization-based load shedding problem. Numerical studies on both a 3-bus toy example and a realistic RTS-GMLC system have demonstrated the validity and efficiency of the proposed algorithm for delivering equitable and real-time load shedding decisions.
Abstract:Recently, researchers have uncovered that neural retrieval models prefer AI-generated content (AIGC), called source bias. Compared to active search behavior, recommendation represents another important means of information acquisition, where users are more prone to source bias. Furthermore, delving into the recommendation scenario, as AIGC becomes integrated within the feedback loop involving users, data, and the recommender system, it progressively contaminates the candidate items, the user interaction history, and ultimately, the data used to train the recommendation models. How and to what extent the source bias affects the neural recommendation models within feedback loop remains unknown. In this study, we extend the investigation of source bias into the realm of recommender systems, specifically examining its impact across different phases of the feedback loop. We conceptualize the progression of AIGC integration into the recommendation content ecosystem in three distinct phases-HGC dominate, HGC-AIGC coexist, and AIGC dominance-each representing past, present, and future states, respectively. Through extensive experiments across three datasets from diverse domains, we demonstrate the prevalence of source bias and reveal a potential digital echo chamber with source bias amplification throughout the feedback loop. This trend risks creating a recommender ecosystem with limited information source, such as AIGC, being disproportionately recommended. To counteract this bias and prevent its escalation in the feedback loop, we introduce a black-box debiasing method that maintains model impartiality towards both HGC and AIGC. Our experimental results validate the effectiveness of the proposed debiasing method, confirming its potential to disrupt the feedback loop.
Abstract:The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events. Through conducting over 1,000 experiments to assess state-of-the-art retrieval models against the benchmarked datasets in Cocktail, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the necessity for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era, with all data and code publicly available at \url{https://github.com/KID-22/Cocktail}.
Abstract:Prompt and effective corrective actions in response to unexpected contingencies are crucial for improving power system resilience and preventing cascading blackouts. The optimal load shedding (OLS) accounting for network limits has the potential to address the diverse system-wide impacts of contingency scenarios as compared to traditional local schemes. However, due to the fast cascading propagation of initial contingencies, real-time OLS solutions are challenging to attain in large systems with high computation and communication needs. In this paper, we propose a decentralized design that leverages offline training of a neural network (NN) model for individual load centers to autonomously construct the OLS solutions from locally available measurements. Our learning-for-OLS approach can greatly reduce the computation and communication needs during online emergency responses, thus preventing the cascading propagation of contingencies for enhanced power grid resilience. Numerical studies on both the IEEE 118-bus system and a synthetic Texas 2000-bus system have demonstrated the efficiency and effectiveness of our scalable OLS learning design for timely power system emergency operations.
Abstract:Accurate and quick identification of high-impedance faults is critical for the reliable operation of distribution systems. Unlike other faults in power grids, HIFs are very difficult to detect by conventional overcurrent relays due to the low fault current. Although HIFs can be affected by various factors, the voltage current characteristics can substantially imply how the system responds to the disturbance and thus provides opportunities to effectively localize HIFs. In this work, we propose a data-driven approach for the identification of HIF events. To tackle the nonlinearity of the voltage current trajectory, first, we formulate optimization problems to approximate the trajectory with piecewise functions. Then we collect the function features of all segments as inputs and use the support vector machine approach to efficiently identify HIFs at different locations. Numerical studies on the IEEE 123-node test feeder demonstrate the validity and accuracy of the proposed approach for real-time HIF identification.
Abstract:Recently, the emergence of large language models (LLMs) has revolutionized the paradigm of information retrieval (IR) applications, especially in web search. With their remarkable capabilities in generating human-like texts, LLMs have created enormous texts on the Internet. As a result, IR systems in the LLMs era are facing a new challenge: the indexed documents now are not only written by human beings but also automatically generated by the LLMs. How these LLM-generated documents influence the IR systems is a pressing and still unexplored question. In this work, we conduct a quantitative evaluation of different IR models in scenarios where both human-written and LLM-generated texts are involved. Surprisingly, our findings indicate that neural retrieval models tend to rank LLM-generated documents higher.We refer to this category of biases in neural retrieval models towards the LLM-generated text as the \textbf{source bias}. Moreover, we discover that this bias is not confined to the first-stage neural retrievers, but extends to the second-stage neural re-rankers. Then, we provide an in-depth analysis from the perspective of text compression and observe that neural models can better understand the semantic information of LLM-generated text, which is further substantiated by our theoretical analysis.We also discuss the potential server concerns stemming from the observed source bias and hope our findings can serve as a critical wake-up call to the IR community and beyond. To facilitate future explorations of IR in the LLM era, the constructed two new benchmarks and codes will later be available at \url{https://github.com/KID-22/LLM4IR-Bias}.
Abstract:Objective: Our study aimed to construct an exhaustive Complementary and Integrative Health (CIH) Lexicon (CIHLex) to better represent the often underrepresented physical and psychological CIH approaches in standard terminologies. We also intended to apply advanced Natural Language Processing (NLP) models such as Bidirectional Encoder Representations from Transformers (BERT) and GPT-3.5 Turbo for CIH named entity recognition, evaluating their performance against established models like MetaMap and CLAMP. Materials and Methods: We constructed the CIHLex by integrating various resources, compiling and integrating data from biomedical literature and relevant knowledge bases. The Lexicon encompasses 198 unique concepts with 1090 corresponding unique terms. We matched these concepts to the Unified Medical Language System (UMLS). Additionally, we developed and utilized BERT models and compared their efficiency in CIH named entity recognition to that of other models such as MetaMap, CLAMP, and GPT3.5-turbo. Results: From the 198 unique concepts in CIHLex, 62.1% could be matched to at least one term in the UMLS. Moreover, 75.7% of the mapped UMLS Concept Unique Identifiers (CUIs) were categorized as "Therapeutic or Preventive Procedure." Among the models applied to CIH named entity recognition, BLUEBERT delivered the highest macro average F1-score of 0.90, surpassing other models. Conclusion: Our CIHLex significantly augments representation of CIH approaches in biomedical literature. Demonstrating the utility of advanced NLP models, BERT notably excelled in CIH entity recognition. These results highlight promising strategies for enhancing standardization and recognition of CIH terminology in biomedical contexts.
Abstract:Accurate load forecasting is critical for electricity market operations and other real-time decision-making tasks in power systems. This paper considers the short-term load forecasting (STLF) problem for residential customers within a community. Existing STLF work mainly focuses on forecasting the aggregated load for either a feeder system or a single customer, but few efforts have been made on forecasting the load at individual appliance level. In this work, we present an STLF algorithm for efficiently predicting the power consumption of individual electrical appliances. The proposed method builds upon a powerful recurrent neural network (RNN) architecture in deep learning, termed as long short-term memory (LSTM). As each appliance has uniquely repetitive consumption patterns, the patterns of prediction error will be tracked such that past prediction errors can be used for improving the final prediction performance. Numerical tests on real-world load datasets demonstrate the improvement of the proposed method over existing LSTM-based method and other benchmark approaches.
Abstract:Effective and timely responses to unexpected contingencies are crucial for enhancing the resilience of power grids. Given the fast, complex process of cascading propagation, corrective actions such as optimal load shedding (OLS) are difficult to attain in large-scale networks due to the computation complexity and communication latency issues. This work puts forth an innovative learning-for-OLS approach by constructing the optimal decision rules of load shedding under a variety of potential contingency scenarios through offline neural network (NN) training. Notably, the proposed NN-based OLS decisions are fully decentralized, enabling individual load centers to quickly react to the specific contingency using readily available local measurements. Numerical studies on the IEEE 14-bus system have demonstrated the effectiveness of our scalable OLS design for real-time responses to severe grid emergency events.