Yanghui Wu

CoSQA+: Enhancing Code Search Dataset with Matching Code

Jun 17, 2024

Jing Gong, Yanghui Wu, Linxi Liang, Zibin Zheng, Yanlin Wang

Abstract: Semantic code search, retrieving code that matches a given natural language query, is an important task to improve productivity in software engineering. Existing code search datasets are problematic: they either use unrealistic queries or contain mismatched codes, and they typically use one-to-one query-code pairing, which fails to reflect the reality that a query might have multiple valid code matches. This paper introduces CoSQA+, pairing high-quality queries (reused from CoSQA) with multiple suitable codes. We collect code candidates from diverse sources and form candidate pairs by pairing queries with these codes. Utilizing the power of large language models (LLMs), we automate pair annotation, filtering, and code generation for queries without suitable matches. Through extensive experiments, CoSQA+ has demonstrated superior quality over CoSQA. Models trained on CoSQA+ exhibit improved performance. Furthermore, we propose a new metric, Mean Multi-choice Reciprocal Rank (MMRR), to assess one-to-N code search performance. We provide the code and data at https://github.com/DeepSoftwareAnalytics/CoSQA_Plus.
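The abstract names MMRR but does not define it. The sketch below shows one plausible way a reciprocal-rank metric could be extended from one match per query to N matches; it is an illustrative assumption, not the paper's formula, and the authors' definition may differ. The function names, the rank-offset rule (the i-th hit is scored as if the earlier hits had been removed from the list), and the toy identifiers are all hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def multi_choice_reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Score one query whose ground truth is a set of matching code ids.

    Assumption (not taken from the paper): the i-th hit, counted from 0,
    contributes 1 / (rank - i), so placing all N matches in the top N
    positions yields a perfect score of 1.0. Matches that never appear
    in the ranking contribute 0. Ids in `ranked` are assumed unique.
    """
    if not relevant:
        raise ValueError("each query needs at least one matching code")
    # 1-based positions of the matching codes, in ranking order.
    hit_ranks = [pos for pos, code_id in enumerate(ranked, start=1)
                 if code_id in relevant]
    score = sum(1.0 / (rank - i) for i, rank in enumerate(hit_ranks))
    return score / len(relevant)


def mean_multi_choice_reciprocal_rank(
    results: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Average the per-query score over (ranking, matching-set) pairs."""
    scores = [multi_choice_reciprocal_rank(r, rel) for r, rel in results]
    if not scores:
        raise ValueError("no queries to evaluate")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: two queries, each with its own ranked candidate list.
    toy = [
        (["a", "b", "c", "d"], {"a", "b"}),  # both matches ranked first
        (["x", "a", "y", "b"], {"a", "b"}),  # matches at ranks 2 and 4
    ]
    print(mean_multi_choice_reciprocal_rank(toy))
```

With a single matching code per query, this sketch reduces to the ordinary reciprocal rank, which is the property one would expect from any one-to-N generalization of MRR.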

* 11 pages, 4 figures, conference
