Ke Fang

Credence Calibration Game? Calibrating Large Language Models through Structured Play

Aug 20, 2025

Ke Fang, Tianyi Zhao, Lu Cheng

Abstract: As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at https://anonymous.4open.science/r/LLM-Calibration/.
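To make "confidence estimates faithfully correspond to actual correctness" concrete, the sketch below computes expected calibration error (ECE), a widely used calibration metric. The abstract does not say which evaluation metrics the paper reports, so this is an illustrative assumption rather than the authors' evaluation code; the function name `expected_calibration_error` and the default of 10 equal-width bins are hypothetical choices.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Assumption: equal-width bins over [0, 1]; not taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group each prediction into its confidence bin (1.0 goes in the last bin).
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its gap, weighted by its share of predictions.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that answers with 90% confidence but is right only three times out of four has a gap of about 0.15 in that bin; a well-calibrated model drives this value toward zero.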

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Oct 07, 2024

Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li

Abstract: As the application of Large Language Models (LLMs) expands, the demand for reliable evaluations increases. Existing LLM evaluation benchmarks primarily rely on static datasets, making it challenging to assess model performance in dynamic interactions with users. Moreover, these benchmarks often depend on specific background knowledge, complicating the measurement of a model's logical reasoning capabilities. Other dynamic evaluation methods based on strong models or manual efforts may introduce biases and incur high costs and time demands, hindering large-scale application. To address these issues, we propose TurtleBench. TurtleBench collects real user guesses from the online Turtle Soup Puzzle platform that we developed. This approach allows for the relatively dynamic generation of evaluation datasets, mitigating the risk of model cheating while aligning assessments more closely with genuine user needs for reasoning capabilities, thus enhancing the reliability of evaluations. TurtleBench includes 1,532 user guesses along with the correctness of each guess after annotation. Using this dataset, we thoroughly evaluated nine of the most advanced LLMs available today. Notably, the OpenAI o1 series models did not achieve leading results in these evaluations. We propose several hypotheses for further research, such as "the latent reasoning of o1 utilizes trivial Chain-of-Thought (CoT) techniques" and "increasing CoT length not only provides reasoning benefits but also incurs noise costs."

* 22 pages

Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights using Generative AI

Aug 24, 2023

Yuqian Sun, Zhouyi Li, Ke Fang, Chang Hee Lee, Ali Asadipour

Abstract: In this paper, we present "1001 Nights", an AI-native game that allows players to lead in-game reality through co-created storytelling with a character driven by a large language model. The concept is inspired by Wittgenstein's idea that the limits of one's world are determined by the bounds of one's language. Using advanced AI tools like GPT-4 and Stable Diffusion, the second iteration of the game enables the protagonist, Shahrzad, to realize words and stories in her world. The player can steer the conversation with the AI King towards specific keywords, which then become battle equipment in the game. This blend of interactive narrative and text-to-image transformation challenges the conventional border between the game world and reality through a dual perspective. We focus on Shahrzad, who seeks to alter her fate compared to the original folklore, and the player, who collaborates with AI to craft narratives and shape the game world. We explore the technical and design elements of implementing such a game with the objective of enhancing the narrative game genre with AI-generated content and delving into AI-native gameplay possibilities.

Selective Transfer with Reinforced Transfer Network for Partial Domain Adaptation

May 26, 2019

Zhihong Chen, Chao Chen, Zhaowei Cheng, Ke Fang, Xinyu Jin

Abstract: Partial domain adaptation (PDA) extends standard domain adaptation to a more realistic scenario where the target domain only has a subset of classes from the source domain. The key challenge of PDA is how to select the relevant samples in the shared classes for knowledge transfer. Previous PDA methods tackle this problem by re-weighting the source samples based on the predictions of a classifier or discriminator, thus discarding the pixel-level information. In this paper, to utilize both high-level and pixel-level information, we propose a reinforced transfer network (RTNet), which is the first work to apply reinforcement learning to address the PDA problem. The RTNet simultaneously mitigates negative transfer by adopting a reinforced data selector to filter out outlier source classes, and promotes positive transfer by employing a domain adaptation model to minimize the distribution discrepancy in the shared label space. Extensive experiments indicate that RTNet can achieve state-of-the-art performance for partial domain adaptation tasks on several benchmark datasets. Code and datasets will be available online.

* Submitted to NeurIPS 2019
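The abstract contrasts RTNet with earlier PDA methods that re-weight source samples using classifier predictions. The sketch below illustrates that general re-weighting idea only, under stated assumptions: class weights are taken as the mean of the classifier's softmax outputs on target samples, scaled so the largest weight is 1. It is not RTNet's reinforced data selector, and the function names `class_weights_from_target` and `source_sample_weights` are hypothetical.

```python
from typing import List, Sequence


def class_weights_from_target(target_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities over target samples, scaled to max 1.

    Assumption: each row is a softmax output over the source label space.
    Classes absent from the target domain should receive weights near 0.
    """
    if not target_probs:
        raise ValueError("target_probs must contain at least one row")
    n_classes = len(target_probs[0])
    mean = [
        sum(row[k] for row in target_probs) / len(target_probs)
        for k in range(n_classes)
    ]
    peak = max(mean)
    if peak <= 0.0:
        raise ValueError("at least one class must have positive probability")
    return [m / peak for m in mean]


def source_sample_weights(
    source_labels: Sequence[int], class_weights: Sequence[float]
) -> List[float]:
    """Give each source sample the weight of its ground-truth class."""
    return [class_weights[label] for label in source_labels]


# Toy example: class 2 never appears in the target predictions.
weights = class_weights_from_target([[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]])
sample_weights = source_sample_weights([0, 2, 1], weights)
```

Source samples from the outlier class receive zero weight and so stop contributing to the transfer loss. Because the weights depend only on class-level predictions, this scheme ignores what individual images look like, which is the limitation the abstract points to.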
