Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zeyi Wen

SEAL: Structure and Element Aware Learning to Improve Long Structured Document Retrieval

Aug 28, 2025

Xinhao Huang, Zhibo Ren, Yipeng Yu, Ying Zhou, Zulong Chen, Zeyi Wen

Abstract:In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) the lack of datasets containing structural metadata. To bridge these gaps, we propose \our, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release \dataset, a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both released and industrial datasets across various modern PLMs, along with online A/B testing, demonstrate consistent performance improvements, boosting NDCG@10 from 73.96\% to 77.84\% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.

* Accepted at EMNLP 2025 Main Conference

Via

Access Paper or Ask Questions

Fast-PGM: Fast Probabilistic Graphical Model Learning and Inference

May 28, 2024

Jiantong Jiang, Zeyi Wen, Peiyu Yang, Atif Mansoor, Ajmal Mian

Figure 1 for Fast-PGM: Fast Probabilistic Graphical Model Learning and Inference

Figure 2 for Fast-PGM: Fast Probabilistic Graphical Model Learning and Inference

Abstract:Probabilistic graphical models (PGMs) serve as a powerful framework for modeling complex systems with uncertainty and extracting valuable insights from data. However, users face challenges when applying PGMs to their problems in terms of efficiency and usability. This paper presents Fast-PGM, an efficient and open-source library for PGM learning and inference. Fast-PGM supports comprehensive tasks on PGMs, including structure and parameter learning, as well as exact and approximate inference, and enhances efficiency of the tasks through computational and memory optimizations and parallelization techniques. Concurrently, Fast-PGM furnishes developers with flexible building blocks, furnishes learners with detailed documentation, and affords non-experts user-friendly interfaces, thereby ameliorating the usability of PGMs to users across a spectrum of expertise levels. The source code of Fast-PGM is available at https://github.com/jjiantong/FastPGM.

Via

Access Paper or Ask Questions

Interpretable Traffic Event Analysis with Bayesian Networks

Oct 10, 2023

Tong Yuan, Jian Yang, Zeyi Wen

Abstract:Although existing machine learning-based methods for traffic accident analysis can provide good quality results to downstream tasks, they lack interpretability which is crucial for this critical problem. This paper proposes an interpretable framework based on Bayesian Networks for traffic accident prediction. To enable the ease of interpretability, we design a dataset construction pipeline to feed the traffic data into the framework while retaining the essential traffic data information. With a concrete case study, our framework can derive a Bayesian Network from a dataset based on the causal relationships between weather and traffic events across the United States. Consequently, our framework enables the prediction of traffic accidents with competitive accuracy while examining how the probability of these events changes under different conditions, thus illustrating transparent relationships between traffic and weather events. Additionally, the visualization of the network simplifies the analysis of relationships between different variables, revealing the primary causes of traffic accidents and ultimately providing a valuable reference for reducing traffic accidents.

* 11 pages, 7 figures

Via

Access Paper or Ask Questions

Fast Parallel Bayesian Network Structure Learning

Dec 08, 2022

Jiantong Jiang, Zeyi Wen, Ajmal Mian

Abstract:Bayesian networks (BNs) are a widely used graphical model in machine learning for representing knowledge with uncertainty. The mainstream BN structure learning methods require performing a large number of conditional independence (CI) tests. The learning process is very time-consuming, especially for high-dimensional problems, which hinders the adoption of BNs to more applications. Existing works attempt to accelerate the learning process with parallelism, but face issues including load unbalancing, costly atomic operations and dominant parallel overhead. In this paper, we propose a fast solution named Fast-BNS on multi-core CPUs to enhance the efficiency of the BN structure learning. Fast-BNS is powered by a series of efficiency optimizations including (i) designing a dynamic work pool to monitor the processing of edges and to better schedule the workloads among threads, (ii) grouping the CI tests of the edges with the same endpoints to reduce the number of unnecessary CI tests, (iii) using a cache-friendly data storage to improve the memory efficiency, and (iv) generating the conditioning sets on-the-fly to avoid extra memory consumption. A comprehensive experimental study shows that the sequential version of Fast-BNS is up to 50 times faster than its counterpart, and the parallel version of Fast-BNS achieves 4.8 to 24.5 times speedup over the state-of-the-art multi-threaded solution. Moreover, Fast-BNS has a good scalability to the network size as well as sample size. Fast-BNS source code is freely available at https://github.com/jjiantong/FastBN.

Via

Access Paper or Ask Questions

Fast Parallel Exact Inference on Bayesian Networks: Poster

Dec 08, 2022

Jiantong Jiang, Zeyi Wen, Atif Mansoor, Ajmal Mian

Abstract:Bayesian networks (BNs) are attractive, because they are graphical and interpretable machine learning models. However, exact inference on BNs is time-consuming, especially for complex problems. To improve the efficiency, we propose a fast BN exact inference solution named Fast-BNI on multi-core CPUs. Fast-BNI enhances the efficiency of exact inference through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. We also propose techniques to further simplify the bottleneck operations of BN exact inference. Fast-BNI source code is freely available at https://github.com/jjiantong/FastBN.

Via

Access Paper or Ask Questions

Practical Federated Gradient Boosting Decision Trees

Dec 13, 2019

Qinbin Li, Zeyi Wen, Bingsheng He

Figure 1 for Practical Federated Gradient Boosting Decision Trees

Figure 2 for Practical Federated Gradient Boosting Decision Trees

Figure 3 for Practical Federated Gradient Boosting Decision Trees

Figure 4 for Practical Federated Gradient Boosting Decision Trees

Abstract:Gradient Boosting Decision Trees (GBDTs) have become very successful in recent years, with many awards in machine learning and data mining competitions. There have been several recent studies on how to train GBDTs in the federated learning setting. In this paper, we focus on horizontal federated learning, where data samples with the same features are distributed among multiple parties. However, existing studies are not efficient or effective enough for practical use. They suffer either from the inefficiency due to the usage of costly data transformations such as secret sharing and homomorphic encryption, or from the low model accuracy due to differential privacy designs. In this paper, we study a practical federated environment with relaxed privacy constraints. In this environment, a dishonest party might obtain some information about the other parties' data, but it is still impossible for the dishonest party to derive the actual raw data of other parties. Specifically, each party boosts a number of trees by exploiting similarity information based on locality-sensitive hashing. We prove that our framework is secure without exposing the original record to other parties, while the computation overhead in the training process is kept low. Our experimental studies show that, compared with normal training with the local data of each party, our approach can significantly improve the predictive accuracy, and achieve comparable accuracy to the original GBDT with the data from all parties.

* Accepted to AAAI-20

Via

Access Paper or Ask Questions

Privacy-Preserving Gradient Boosting Decision Trees

Dec 11, 2019

Qinbin Li, Zhaomin Wu, Zeyi Wen, Bingsheng He

Figure 1 for Privacy-Preserving Gradient Boosting Decision Trees

Figure 2 for Privacy-Preserving Gradient Boosting Decision Trees

Figure 3 for Privacy-Preserving Gradient Boosting Decision Trees

Figure 4 for Privacy-Preserving Gradient Boosting Decision Trees

Abstract:The Gradient Boosting Decision Tree (GBDT) is a popular machine learning model for various tasks in recent years. In this paper, we study how to improve model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differential private models. Existing solutions for GBDT with differential privacy suffer from the significant accuracy loss due to too loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds lead to more noise to obtain a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the property of gradient and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of training data for each iteration and leaf node clipping in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach can achieve much better model accuracy than other baselines.

* Accepted to AAAI-20

Via

Access Paper or Ask Questions

Entity Extraction with Knowledge from Web Scale Corpora

Nov 21, 2019

Zeyi Wen, Zeyu Huang, Rui Zhang

Figure 1 for Entity Extraction with Knowledge from Web Scale Corpora

Figure 2 for Entity Extraction with Knowledge from Web Scale Corpora

Figure 3 for Entity Extraction with Knowledge from Web Scale Corpora

Figure 4 for Entity Extraction with Knowledge from Web Scale Corpora

Abstract:Entity extraction is an important task in text mining and natural language processing. A popular method for entity extraction is by comparing substrings from free text against a dictionary of entities. In this paper, we present several techniques as a post-processing step for improving the effectiveness of the existing entity extraction technique. These techniques utilise models trained with the web-scale corpora which makes our techniques robust and versatile. Experiments show that our techniques bring a notable improvement on efficiency and effectiveness.

Via

Access Paper or Ask Questions

Adaptive Kernel Value Caching for SVM Training

Nov 08, 2019

Qinbin Li, Zeyi Wen, Bingsheng He

Figure 1 for Adaptive Kernel Value Caching for SVM Training

Figure 2 for Adaptive Kernel Value Caching for SVM Training

Figure 3 for Adaptive Kernel Value Caching for SVM Training

Figure 4 for Adaptive Kernel Value Caching for SVM Training

Abstract:Support Vector Machines (SVMs) can solve structured multi-output learning problems such as multi-label classification, multiclass classification and vector regression. SVM training is expensive especially for large and high dimensional datasets. The bottleneck of the SVM training often lies in the kernel value computation. In many real-world problems, the same kernel values are used in many iterations during the training, which makes the caching of kernel values potentially useful. The majority of the existing studies simply adopt the LRU (least recently used) replacement strategy for caching kernel values. However, as we analyze in this paper, the LRU strategy generally achieves high hit ratio near the final stage of the training, but does not work well in the whole training process. Therefore, we propose a new caching strategy called EFU (less frequently used) which replaces the less frequently used kernel values that enhances LFU (least frequently used). Our experimental results show that EFU often has 20\% higher hit ratio than LRU in the training with the Gaussian kernel. To further optimize the strategy, we propose a caching strategy called HCST (hybrid caching for the SVM training), which has a novel mechanism to automatically adapt the better caching strategy in the different stages of the training. We have integrated the caching strategy into ThunderSVM, a recent SVM library on many-core processors. Our experiments show that HCST adaptively achieves high hit ratios with little runtime overhead among different problems including multi-label classification, multiclass classification and regression problems. Compared with other existing caching strategies, HCST achieves 20\% more reduction in training time on average.

* Accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

Via

Access Paper or Ask Questions

Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection

Jul 23, 2019

Qinbin Li, Zeyi Wen, Bingsheng He

Figure 1 for Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection

Figure 2 for Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection

Figure 3 for Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection

Figure 4 for Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection

Abstract:Federated learning systems enable the collaborative training of machine learning models among different organizations under the privacy restrictions. As researchers try to support more machine learning models with different privacy-preserving approaches, current federated learning systems face challenges from various issues such as unpractical system assumptions, scalability and efficiency. Inspired by federated systems in other fields such as databases and cloud computing, we investigate the characteristics of federated learning systems. We find that two important features for other federated systems, i.e., heterogeneity and autonomy, are rarely considered in the existing federated learning systems. Moreover, we provide a thorough categorization for federated learning systems according to four different aspects, including data partition, model, privacy level, and communication architecture. Lastly, we take a systematic comparison among the existing federated learning systems and present future research opportunities and directions.

Via

Access Paper or Ask Questions