Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shukai Liu

UMRE: A Unified Monotonic Transformation for Ranking Ensemble in Recommender Systems

Aug 11, 2025

Zhengrui Xu, Zhe Yang, Zhengxiao Guo, Shukai Liu, Luocheng Lin, Xiaoyan Liu, Yongqi Liu, Han Li

Abstract:Industrial recommender systems commonly rely on ensemble sorting (ES) to combine predictions from multiple behavioral objectives. Traditionally, this process depends on manually designed nonlinear transformations (e.g., polynomial or exponential functions) and hand-tuned fusion weights to balance competing goals -- an approach that is labor-intensive and frequently suboptimal in achieving Pareto efficiency. In this paper, we propose a novel Unified Monotonic Ranking Ensemble (UMRE) framework to address the limitations of traditional methods in ensemble sorting. UMRE replaces handcrafted transformations with Unconstrained Monotonic Neural Networks (UMNN), which learn expressive, strictly monotonic functions through the integration of positive neural integrals. Subsequently, a lightweight ranking model is employed to fuse the prediction scores, assigning personalized weights to each prediction objective. To balance competing goals, we further introduce a Pareto optimality strategy that adaptively coordinates task weights during training. UMRE eliminates manual tuning, maintains ranking consistency, and achieves fine-grained personalization. Experimental results on two public recommendation datasets (Kuairand and Tenrec) and online A/B tests demonstrate impressive performance and generalization capabilities.

Via

Access Paper or Ask Questions

IFEvalCode: Controlled Code Generation

Jul 30, 2025

Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li(+2 more)

Abstract:Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models' ability to generate correct code versus code that precisely follows instructions.

* 10 pages

Via

Access Paper or Ask Questions

Comment Staytime Prediction with LLM-enhanced Comment Understanding

Apr 02, 2025

Changshuo Zhang, Zihan Lin, Shukai Liu, Yongqi Liu, Han Li

Abstract:In modern online streaming platforms, the comments section plays a critical role in enhancing the overall user experience. Understanding user behavior within the comments section is essential for comprehensive user interest modeling. A key factor of user engagement is staytime, which refers to the amount of time that users browse and post comments. Existing watchtime prediction methods struggle to adapt to staytime prediction, overlooking interactions with individual comments and their interrelation. In this paper, we present a micro-video recommendation dataset with video comments (named as KuaiComt) which is collected from Kuaishou platform. correspondingly, we propose a practical framework for comment staytime prediction with LLM-enhanced Comment Understanding (LCU). Our framework leverages the strong text comprehension capabilities of large language models (LLMs) to understand textual information of comments, while also incorporating fine-grained comment ranking signals as auxiliary tasks. The framework is two-staged: first, the LLM is fine-tuned using domain-specific tasks to bridge the video and the comments; second, we incorporate the LLM outputs into the prediction model and design two comment ranking auxiliary tasks to better understand user preference. Extensive offline experiments demonstrate the effectiveness of our framework, showing significant improvements on the task of comment staytime prediction. Additionally, online A/B testing further validates the practical benefits on industrial scenario. Our dataset KuaiComt (https://github.com/lyingCS/KuaiComt.github.io) and code for LCU (https://github.com/lyingCS/LCU) are fully released.

* Accepted by WWW 2025 Industry Track

Via

Access Paper or Ask Questions

FullStack Bench: Evaluating LLMs as Full Stack Coders

Dec 03, 2024

Siyao Liu, He Zhu, Jerry Liu, Shulin Xin, Aoyan Li, Rui Long, Li Chen, Jack Yang, Jinxiang Xia, Z. Y. Peng(+7 more)

Figure 1 for FullStack Bench: Evaluating LLMs as Full Stack Coders

Figure 2 for FullStack Bench: Evaluating LLMs as Full Stack Coders

Figure 3 for FullStack Bench: Evaluating LLMs as Full Stack Coders

Figure 4 for FullStack Bench: Evaluating LLMs as Full Stack Coders

Abstract:As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion.

* 26 pages

Via

Access Paper or Ask Questions

MdEval: Massively Multilingual Code Debugging

Nov 04, 2024

Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei Zhu, Shuyue Guo(+8 more)

Figure 1 for MdEval: Massively Multilingual Code Debugging

Figure 2 for MdEval: Massively Multilingual Code Debugging

Figure 3 for MdEval: Massively Multilingual Code Debugging

Figure 4 for MdEval: Massively Multilingual Code Debugging

Abstract:Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippet and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are often limited in terms of language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples of 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Further, we introduce the debugging instruction corpora MDEVAL-INSTRUCT by injecting bugs into the correct multilingual queries and solutions (xDebugGen). Further, a multilingual debugger xDebugCoder trained on MDEVAL-INSTRUCT as a strong baseline specifically to handle the bugs of a wide range of programming languages (e.g. "Missing Mut" in language Rust and "Misused Macro Definition" in language C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., GPT and Claude series), highlighting huge room for improvement in multilingual code debugging scenarios.

* 15 pages

Via

Access Paper or Ask Questions

M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Oct 28, 2024

Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin(+6 more)

Figure 1 for M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Figure 2 for M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Figure 3 for M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Figure 4 for M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Abstract:Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (<5), which cannot evaluate the general code intelligence abilities across different languages for existing code Large Language Models (LLMs). Besides, the existing benchmarks usually report overall average scores of different languages, where the fine-grained abilities in different completion scenarios are ignored. Therefore, to facilitate the research of code LLMs in multilingual scenarios, we propose a massively multilingual repository-level code completion benchmark covering 18 programming languages (called M2RC-EVAL), and two types of fine-grained annotations (i.e., bucket-level and semantic-level) on different completion scenarios are provided, where we obtain these annotations based on the parsed abstract syntax tree. Moreover, we also curate a massively multilingual instruction corpora M2RC- INSTRUCT dataset to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT.

* 19 pages

Via

Access Paper or Ask Questions

REALM: RAG-Driven Enhancement of Multimodal Electronic Health Records Analysis via Large Language Models

Feb 10, 2024

Yinghao Zhu, Changyu Ren, Shiyun Xie, Shukai Liu, Hangyuan Ji, Zixiang Wang, Tao Sun, Long He, Zhoujun Li, Xi Zhu(+1 more)

Abstract:The integration of multimodal Electronic Health Records (EHR) data has significantly improved clinical predictive capabilities. Leveraging clinical notes and multivariate time-series EHR, existing models often lack the medical context relevent to clinical tasks, prompting the incorporation of external knowledge, particularly from the knowledge graph (KG). Previous approaches with KG knowledge have primarily focused on structured knowledge extraction, neglecting unstructured data modalities and semantic high dimensional medical knowledge. In response, we propose REALM, a Retrieval-Augmented Generation (RAG) driven framework to enhance multimodal EHR representations that address these limitations. Firstly, we apply Large Language Model (LLM) to encode long context clinical notes and GRU model to encode time-series EHR data. Secondly, we prompt LLM to extract task-relevant medical entities and match entities in professionally labeled external knowledge graph (PrimeKG) with corresponding medical knowledge. By matching and aligning with clinical standards, our framework eliminates hallucinations and ensures consistency. Lastly, we propose an adaptive multimodal fusion network to integrate extracted knowledge with multimodal EHR data. Our extensive experiments on MIMIC-III mortality and readmission tasks showcase the superior performance of our REALM framework over baselines, emphasizing the effectiveness of each module. REALM framework contributes to refining the use of multimodal EHR data in healthcare and bridging the gap with nuanced medical context essential for informed clinical predictions.

Via

Access Paper or Ask Questions

Boosting Feedback Efficiency of Interactive Reinforcement Learning by Adaptive Learning from Scores

Jul 11, 2023

Shukai Liu, Chenming Wu, Ying Li, Liangjun Zhang

Abstract:Interactive reinforcement learning has shown promise in learning complex robotic tasks. However, the process can be human-intensive due to the requirement of large amount of interactive feedback. This paper presents a new method that uses scores provided by humans, instead of pairwise preferences, to improve the feedback efficiency of interactive reinforcement learning. Our key insight is that scores can yield significantly more data than pairwise preferences. Specifically, we require a teacher to interactively score the full trajectories of an agent to train a behavioral policy in a sparse reward environment. To avoid unstable scores given by human negatively impact the training process, we propose an adaptive learning scheme. This enables the learning paradigm to be insensitive to imperfect or unreliable scores. We extensively evaluate our method on robotic locomotion and manipulation tasks. The results show that the proposed method can efficiently learn near-optimal policies by adaptive learning from scores, while requiring less feedback compared to pairwise preference learning methods. The source codes are publicly available at https://github.com/SSKKai/Interactive-Scoring-IRL.

* Accepted by IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

Via

Access Paper or Ask Questions

Contrastive Cross-domain Recommendation in Matching

Dec 02, 2021

Ruobing Xie, Qi Liu, Liangdong Wang, Shukai Liu, Bo Zhang, Leyu Lin

Figure 1 for Contrastive Cross-domain Recommendation in Matching

Figure 2 for Contrastive Cross-domain Recommendation in Matching

Figure 3 for Contrastive Cross-domain Recommendation in Matching

Figure 4 for Contrastive Cross-domain Recommendation in Matching

Abstract:Cross-domain recommendation (CDR) aims to provide better recommendation results in the target domain with the help of the source domain, which is widely used and explored in real-world systems. However, CDR in the matching (i.e., candidate generation) module struggles with the data sparsity and popularity bias issues in both representation learning and knowledge transfer. In this work, we propose a novel Contrastive Cross-Domain Recommendation (CCDR) framework for CDR in matching. Specifically, we build a huge diversified preference network to capture multiple information reflecting user diverse interests, and design an intra-domain contrastive learning (intra-CL) and three inter-domain contrastive learning (inter-CL) tasks for better representation learning and knowledge transfer. The intra-CL enables more effective and balanced training inside the target domain via a graph augmentation, while the inter-CL builds different types of cross-domain interactions from user, taxonomy, and neighbor aspects. In experiments, CCDR achieves significant improvements on both offline and online evaluations in a real-world system. Currently, we have deployed CCDR on a well-known recommendation system, affecting millions of users. The source code will be released in the future.

* 10 pages, under review

Via

Access Paper or Ask Questions

Improving Accuracy and Diversity in Matching of Recommendation with Diversified Preference Network

Feb 07, 2021

Ruobing Xie, Qi Liu, Shukai Liu, Ziwei Zhang, Peng Cui, Bo Zhang, Leyu Lin

Figure 1 for Improving Accuracy and Diversity in Matching of Recommendation with Diversified Preference Network

Figure 2 for Improving Accuracy and Diversity in Matching of Recommendation with Diversified Preference Network

Figure 3 for Improving Accuracy and Diversity in Matching of Recommendation with Diversified Preference Network

Figure 4 for Improving Accuracy and Diversity in Matching of Recommendation with Diversified Preference Network

Abstract:Recently, real-world recommendation systems need to deal with millions of candidates. It is extremely challenging to conduct sophisticated end-to-end algorithms on the entire corpus due to the tremendous computation costs. Therefore, conventional recommendation systems usually contain two modules. The matching module focuses on the coverage, which aims to efficiently retrieve hundreds of items from large corpora, while the ranking module generates specific ranks for these items. Recommendation diversity is an essential factor that impacts user experience. Most efforts have explored recommendation diversity in ranking, while the matching module should take more responsibility for diversity. In this paper, we propose a novel Heterogeneous graph neural network framework for diversified recommendation (GraphDR) in matching to improve both recommendation accuracy and diversity. Specifically, GraphDR builds a huge heterogeneous preference network to record different types of user preferences, and conduct a field-level heterogeneous graph attention network for node aggregation. We also innovatively conduct a neighbor-similarity based loss to balance both recommendation accuracy and diversity for the diversified matching task. In experiments, we conduct extensive online and offline evaluations on a real-world recommendation system with various accuracy and diversity metrics and achieve significant improvements. We also conduct model analyses and case study for a better understanding of our model. Moreover, GraphDR has been deployed on a well-known recommendation system, which affects millions of users. The source code will be released.

* 11 pages, under review

Via

Access Paper or Ask Questions