Xiao-Yong Wei

Rethinking Domain-Specific LLM Benchmark Construction: A Comprehensiveness-Compactness Approach

Aug 13, 2025

Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li

Abstract: Numerous benchmarks have been built to evaluate the domain-specific abilities of large language models (LLMs), highlighting the need for effective and efficient benchmark construction. Existing domain-specific benchmarks primarily focus on the scaling law, relying on massive corpora for supervised fine-tuning or generating extensive question sets for broad coverage. However, the impact of corpus and question-answer (QA) set design on the precision and recall of domain-specific LLMs remains unexplored. In this paper, we address this gap and demonstrate that the scaling law is not always the optimal principle for benchmark construction in specific domains. Instead, we propose Comp-Comp, an iterative benchmarking framework based on a comprehensiveness-compactness principle. Here, comprehensiveness ensures semantic recall of the domain, while compactness enhances precision, guiding both corpus and QA set construction. To validate our framework, we conducted a case study at a renowned university, resulting in the creation of XUBench, a large-scale and comprehensive closed-domain benchmark. Although we use the academic domain as the case in this work, our Comp-Comp framework is designed to be extensible beyond academia, providing valuable insights for benchmark construction across various domains.

Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning

May 14, 2025

Dayong Liang, Changmeng Zheng, Zhiyuan Wen, Yi Cai, Xiao-Yong Wei, Qing Li

Abstract: Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs' latent knowledge of object functionalities, converting passive recognition into active reasoning about how objects work together. Finally, we introduce a long-term memory reinforcement learning strategy with a specialized interaction-focused reward function that transforms transient patterns into long-term reasoning heuristics. Extensive experiments demonstrate that our approach significantly outperforms baseline methods on interaction-heavy reasoning benchmarks, with particularly strong improvements on complex scene understanding tasks. The source code can be accessed at https://github.com/open_upon_acceptance.

FedConv: A Learning-on-Model Paradigm for Heterogeneous Federated Clients

Feb 28, 2025

Leming Shen, Qiang Yang, Kaiyan Cui, Yuanqing Zheng, Xiao-Yong Wei, Jianwei Liu, Jinsong Han

Abstract: Federated Learning (FL) facilitates collaborative training of a shared global model without exposing clients' private data. In practical FL systems, clients (e.g., edge servers, smartphones, and wearables) typically have disparate system resources. Conventional FL, however, adopts a one-size-fits-all solution, where a homogeneous large global model is transmitted to and trained on each client, resulting in an overwhelming workload for less capable clients and starvation for other clients. To address this issue, we propose FedConv, a client-friendly FL framework, which minimizes the computation and memory burden on resource-constrained clients by providing heterogeneous customized sub-models. FedConv features a novel learning-on-model paradigm that learns the parameters of the heterogeneous sub-models via convolutional compression. Unlike traditional compression methods, the compressed models in FedConv can be directly trained on clients without decompression. To aggregate the heterogeneous sub-models, we propose transposed convolutional dilation to convert them back to large models with a unified size while retaining personalized information from clients. The compression and dilation processes, transparent to clients, are optimized on the server leveraging a small public dataset. Extensive experiments on six datasets demonstrate that FedConv outperforms state-of-the-art FL systems in terms of model accuracy (by more than 35% on average), computation and communication overhead (with 33% and 25% reduction, respectively).

Cardiac Evidence Backtracking for Eating Behavior Monitoring using Collocative Electrocardiogram Imagining

Feb 20, 2025

Xu-Lu Zhang, Zhen-Qun Yang, Dong-Mei Jiang, Ga Liao, Qing Li, Ramesh Jain, Xiao-Yong Wei

Abstract: Eating monitoring has remained an open challenge in medical research for years due to the lack of non-invasive sensors for continuous monitoring and of reliable methods for automatic behavior detection. In this paper, we present a pilot study using wearable 24-hour ECG for sensing and tailoring sophisticated deep learning for ad hoc and interpretable detection. This is accomplished using a collocative learning framework in which 1) we construct collocative tensors as pseudo-images from 1D ECG signals to improve the feasibility of 2D image-based deep models; 2) we formulate the cardiac logic of analyzing the ECG data in a comparative way as periodic attention regulators so as to guide the deep inference to collect evidence in a human-comprehensible manner; and 3) we improve the interpretability of the framework by enabling the backtracking of evidence with a set of methods designed for Class Activation Mapping (CAM) decoding and decision tree/forest generation. The effectiveness of the proposed framework has been validated on the largest ECG dataset of eating behavior with superior performance over conventional models, and its capacity for cardiac evidence mining has also been verified through the consistency between the evidence it backtracked and that of previous medical studies.

Mean of Means: Human Localization with Calibration-free and Unconstrained Camera Settings (extended version)

Feb 18, 2025

Tianyi Zhang, Wengyu Zhang, Xulu Zhang, Jiaxin Wu, Xiao-Yong Wei, Jiannong Cao, Qing Li

Abstract: Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision face limitations due to rigid perspective transformation principles and error propagation in multi-stage SVD solvers. These solutions also require multiple high-resolution cameras with strict setup constraints. To address these limitations, we propose a probabilistic approach that considers all points on the human body as observations generated by a distribution centered around the body's geometric center. This enables us to improve sampling significantly, increasing the number of samples for each point of interest from hundreds to billions. By modeling the relation between the means of the distributions of world coordinates and pixel coordinates, leveraging the Central Limit Theorem, we ensure normality and facilitate the learning process. Experimental results demonstrate human localization accuracy of 96% within a 0.3 m range and nearly 100% accuracy within a 0.5 m range, achieved at a low cost of only 10 USD using two web cameras with a resolution of 640×480 pixels.

* arXiv admin note: substantial text overlap with arXiv:2407.20870

PolySmart @ TRECVid 2024 Video-To-Text

Dec 23, 2024

Jiaxin Wu, Wengyu Zhang, Xiao-Yong Wei, Qing Li

Abstract: In this paper, we present our methods and results for the Video-To-Text (VTT) task at TRECVid 2024, exploring the capabilities of Vision-Language Models (VLMs) like LLaVA and LLaVA-NeXT-Video in generating natural language descriptions for video content. We investigate the impact of fine-tuning VLMs on VTT datasets to enhance description accuracy, contextual relevance, and linguistic consistency. Our analysis reveals that fine-tuning substantially improves the model's ability to produce more detailed and domain-aligned text, bridging the gap between generic VLM tasks and the specialized needs of VTT. Experimental results demonstrate that our fine-tuned model outperforms baseline VLMs across various evaluation metrics, underscoring the importance of domain-specific tuning for complex VTT tasks.

PolySmart and VIREO @ TRECVid 2024 Ad-hoc Video Search

Dec 20, 2024

Jiaxin Wu, Chong-Wah Ngo, Xiao-Yong Wei, Qing Li

Abstract: This year, we explore generation-augmented retrieval for the TRECVid AVS task. Specifically, the understanding of the textual query is enhanced by three kinds of generation, namely Text2Text, Text2Image, and Image2Text, to address the out-of-vocabulary problem. Using different combinations of them and the rank list retrieved by the original query, we submitted four automatic runs. For manual runs, we use a large language model (LLM), i.e., GPT4, to rephrase test queries based on the concept bank of the search engine, and we manually check again to ensure all the concepts used in the rephrased queries are in the bank. The results show that the fusion of the original and generated queries outperforms the original query on TV24 query sets. The generated queries retrieve different rank lists from the original query.

PolySmart @ TRECVid 2024 Medical Video Question Answering

Dec 20, 2024

Jiaxin Wu, Yiyang Jiang, Xiao-Yong Wei, Qing Li

Abstract: Video Corpus Visual Answer Localization (VCVAL) includes question-related video retrieval and visual answer localization in the videos. Specifically, we use text-to-text retrieval to find relevant videos for a medical question based on the similarity of video transcript and answers generated by GPT4. For the visual answer localization, the start and end timestamps of the answer are predicted by aligning both the visual content and the subtitles with the queries. For the Query-Focused Instructional Step Captioning (QFISC) task, the step captions are generated by GPT4. Specifically, we provide the video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT4 to generate step captions for the given medical query. We submit only one run for evaluation and it obtains an F-score of 11.92 and a mean IoU of 9.6527.
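
The abstract reports a mean IoU for predicted start and end timestamps but does not say how it is computed. A minimal sketch, assuming the common interval-overlap definition (intersection of the two time spans divided by their union); the function names `temporal_iou` and `mean_iou` are hypothetical and not taken from the paper or the TRECVid tooling:

```python
def temporal_iou(pred, truth):
    """IoU of two (start, end) time intervals, e.g. in seconds.

    Assumed definition: overlap length divided by the length of the union.
    """
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, truths):
    """Average temporal IoU over paired predictions and ground-truth spans."""
    scores = [temporal_iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a prediction of (10, 20) against a ground truth of (15, 25) overlaps for 5 seconds out of a 15-second union, so `temporal_iou` gives one third. The scale of the reported score (fraction or percentage) is not stated in the abstract.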

Multi-Level Querying using A Knowledge Pyramid

Jul 31, 2024

Rubing Chen, Xulu Zhang, Jiaxin Wu, Wenqi Fan, Xiao-Yong Wei, Qing Li

Abstract: This paper addresses the need for improved precision in existing Retrieval-Augmented Generation (RAG) methods that primarily focus on enhancing recall. We propose a multi-layer knowledge pyramid approach within the RAG framework to achieve a better balance between precision and recall. The knowledge pyramid consists of three layers: Ontologies, Knowledge Graphs (KGs), and chunk-based raw text. We employ cross-layer augmentation techniques for comprehensive knowledge coverage and dynamic updates of the Ontology schema and instances. To ensure compactness, we utilize cross-layer filtering methods for knowledge condensation in KGs. Our approach, named PolyRAG, follows a waterfall model for retrieval, starting from the top of the pyramid and progressing down until a confident answer is obtained. We introduce two benchmarks for domain-specific knowledge retrieval, one in the academic domain and the other in the financial domain. The effectiveness of the method has been validated through comprehensive experiments, in which it outperforms 19 SOTA methods. An encouraging observation is that the proposed method augments GPT-4, providing a 395% F1 gain by improving its performance from 0.1636 to 0.8109.
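
The waterfall retrieval described in the abstract (query the top layer first, fall through to lower layers until an answer is confident enough) can be sketched as below. This is an illustration under stated assumptions, not PolyRAG's implementation: the `waterfall_retrieve` function, the confidence threshold of 0.8, and the three toy lookup functions are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: a layer returns (answer, confidence) or None.
Retriever = Callable[[str], Optional[Tuple[str, float]]]


def waterfall_retrieve(
    query: str,
    layers: Sequence[Tuple[str, Retriever]],
    threshold: float = 0.8,  # assumed cut-off for a "confident" answer
) -> Optional[Tuple[str, str]]:
    """Try each layer from top to bottom; stop at the first confident answer."""
    for name, retrieve in layers:
        result = retrieve(query)
        if result is None:
            continue
        answer, confidence = result
        if confidence >= threshold:
            return name, answer
    return None  # no layer produced a confident answer


# Toy stand-ins for the three pyramid layers (made-up data).
ontology = {"who is the dean?": ("Dr. A", 0.95)}


def ontology_lookup(query):
    return ontology.get(query)


def kg_lookup(query):
    return ("Room 101", 0.85) if "office" in query else None


def chunk_lookup(query):
    return ("see handbook section 2", 0.5)


layers = [
    ("ontology", ontology_lookup),
    ("knowledge_graph", kg_lookup),
    ("chunks", chunk_lookup),
]
```

With these toy layers, `waterfall_retrieve("where is the office?", layers)` skips the ontology and is answered by the knowledge-graph layer, while a query that only the low-confidence chunk layer can answer returns `None`.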

Mean of Means: A 10-dollar Solution for Human Localization with Calibration-free and Unconstrained Camera Settings

Jul 30, 2024

Tianyi Zhang, Wengyu Zhang, Xulu Zhang, Jiaxin Wu, Xiao-Yong Wei, Jiannong Cao, Qing Li

Abstract: Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision face limitations due to rigid perspective transformation principles and error propagation in multi-stage SVD solvers. These solutions also require multiple high-resolution cameras with strict setup constraints. To address these limitations, we propose a probabilistic approach that considers all points on the human body as observations generated by a distribution centered around the body's geometric center. This enables us to improve sampling significantly, increasing the number of samples for each point of interest from hundreds to billions. By modeling the relation between the means of the distributions of world coordinates and pixel coordinates, leveraging the Central Limit Theorem, we ensure normality and facilitate the learning process. Experimental results demonstrate human localization accuracy of 95% within a 0.3 m range and nearly 100% accuracy within a 0.5 m range, achieved at a low cost of only 10 USD using two web cameras with a resolution of 640×480 pixels.
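
The statistical idea the abstract leans on, that means of many body points cluster tightly around the body's center even when individual points are spread out, can be illustrated with synthetic data. The sketch below is not the authors' model: the 1D setting, the uniform noise, the center value of 2.0, and the sample sizes are all assumptions chosen for illustration.

```python
import random
import statistics


def sample_means(center, n_points, n_samples, rng):
    """Draw n_samples groups of n_points noisy observations around center
    and return the mean of each group."""
    means = []
    for _ in range(n_samples):
        # Uniform (non-normal) noise stands in for body points around the center.
        points = [center + rng.uniform(-0.5, 0.5) for _ in range(n_points)]
        means.append(statistics.fmean(points))
    return means


rng = random.Random(0)  # fixed seed so the run is repeatable

# Spread of individual observations around an assumed center of 2.0.
raw_points = [2.0 + rng.uniform(-0.5, 0.5) for _ in range(200)]
raw_spread = statistics.stdev(raw_points)

# Spread of group means, and the mean of those means.
means = sample_means(center=2.0, n_points=200, n_samples=500, rng=rng)
mean_of_means = statistics.fmean(means)
spread_of_means = statistics.stdev(means)
```

By the Central Limit Theorem the group means are approximately normal with a standard deviation that shrinks with the square root of the group size, so `spread_of_means` comes out far smaller than `raw_spread` and `mean_of_means` sits close to the assumed center.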
