
Bing Liu

Jack

Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer

Jan 09, 2026

Agentic Rubrics as Contextual Verifiers for SWE Agents

Jan 07, 2026

Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction

Dec 16, 2025

AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection

Nov 17, 2025

PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Nov 14, 2025

Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images

Nov 12, 2025

ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Nov 10, 2025

Remote Labor Index: Measuring AI Automation of Remote Work

Oct 30, 2025

Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Oct 14, 2025

MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Jul 23, 2025