Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuyu Zhang

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Apr 03, 2025

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li(+9 more)

Abstract:The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

Via

Access Paper or Ask Questions

ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers

Jan 04, 2024

Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou

Abstract:The emergence of Large Language Models (LLMs) such as ChatGPT and LLaMA encounter limitations in domain-specific tasks, with these models often lacking depth and accuracy in specialized areas, and exhibiting a decrease in general capabilities when fine-tuned, particularly analysis ability in small sized models. To address these gaps, we introduce ICE-GRT, utilizing Reinforcement Learning from Human Feedback (RLHF) grounded in Proximal Policy Optimization (PPO), demonstrating remarkable ability in in-domain scenarios without compromising general task performance. Our exploration of ICE-GRT highlights its understanding and reasoning ability to not only generate robust answers but also to provide detailed analyses of the reasons behind the answer. This capability marks a significant progression beyond the scope of Supervised Fine-Tuning models. The success of ICE-GRT is dependent on several crucial factors, including Appropriate Data, Reward Size Scaling, KL-Control, Advantage Normalization, etc. The ICE-GRT model exhibits state-of-the-art performance in domain-specific tasks and across 12 general Language tasks against equivalent size and even larger size LLMs, highlighting the effectiveness of our approach. We provide a comprehensive analysis of the ICE-GRT, underscoring the significant advancements it brings to the field of LLM.

Via

Access Paper or Ask Questions

GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

Oct 10, 2023

Shen Zheng, Yuyu Zhang, Yijie Zhu, Chenguang Xi, Pengyang Gao, Xun Zhou, Kevin Chen-Chuan Chang

Abstract:With the rapid advancement of large language models (LLMs), there is a pressing need for a comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM leaderboards often reference scores reported in other papers without consistent settings and prompts, which may inadvertently encourage cherry-picking favored settings and prompts for better results. In this work, we introduce GPT-Fathom, an open-source and reproducible LLM evaluation suite built on top of OpenAI Evals. We systematically evaluate 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. Our retrospective study on OpenAI's earlier models offers valuable insights into the evolutionary path from GPT-3 to GPT-4. Currently, the community is eager to know how GPT-3 progressively improves to GPT-4, including technical details like whether adding code data improves LLM's reasoning capability, which aspects of LLM capability can be improved by SFT and RLHF, how much is the alignment tax, etc. Our analysis sheds light on many of these questions, aiming to improve the transparency of advanced LLMs.

Via

Access Paper or Ask Questions

GNN is a Counter? Revisiting GNN for Question Answering

Oct 07, 2021

Kuan Wang, Yuyu Zhang, Diyi Yang, Le Song, Tao Qin

Figure 1 for GNN is a Counter? Revisiting GNN for Question Answering

Figure 2 for GNN is a Counter? Revisiting GNN for Question Answering

Figure 3 for GNN is a Counter? Revisiting GNN for Question Answering

Figure 4 for GNN is a Counter? Revisiting GNN for Question Answering

Abstract:Question Answering (QA) has been a long-standing research topic in AI and NLP fields, and a wealth of studies have been conducted to attempt to equip QA systems with human-level reasoning capability. To approximate the complicated human reasoning process, state-of-the-art QA systems commonly use pre-trained language models (LMs) to access knowledge encoded in LMs together with elaborately designed modules based on Graph Neural Networks (GNNs) to perform reasoning over knowledge graphs (KGs). However, many problems remain open regarding the reasoning functionality of these GNN-based modules. Can these GNN-based modules really perform a complex reasoning process? Are they under- or over-complicated for QA? To open the black box of GNN and investigate these problems, we dissect state-of-the-art GNN modules for QA and analyze their reasoning capability. We discover that even a very simple graph neural counter can outperform all the existing GNN modules on CommonsenseQA and OpenBookQA, two popular QA benchmark datasets which heavily rely on knowledge-aware reasoning. Our work reveals that existing knowledge-aware GNN modules may only carry out some simple reasoning such as counting. It remains a challenging open problem to build comprehensive reasoning modules for knowledge-powered QA.

Via

Access Paper or Ask Questions

Speeding up Computational Morphogenesis with Online Neural Synthetic Gradients

Apr 27, 2021

Yuyu Zhang, Heng Chi, Binghong Chen, Tsz Ling Elaine Tang, Lucia Mirabella, Le Song, Glaucio H. Paulino

Figure 1 for Speeding up Computational Morphogenesis with Online Neural Synthetic Gradients

Figure 2 for Speeding up Computational Morphogenesis with Online Neural Synthetic Gradients

Figure 3 for Speeding up Computational Morphogenesis with Online Neural Synthetic Gradients

Figure 4 for Speeding up Computational Morphogenesis with Online Neural Synthetic Gradients

Abstract:A wide range of modern science and engineering applications are formulated as optimization problems with a system of partial differential equations (PDEs) as constraints. These PDE-constrained optimization problems are typically solved in a standard discretize-then-optimize approach. In many industry applications that require high-resolution solutions, the discretized constraints can easily have millions or even billions of variables, making it very slow for the standard iterative optimizer to solve the exact gradients. In this work, we propose a general framework to speed up PDE-constrained optimization using online neural synthetic gradients (ONSG) with a novel two-scale optimization scheme. We successfully apply our ONSG framework to computational morphogenesis, a representative and challenging class of PDE-constrained optimization problems. Extensive experiments have demonstrated that our method can significantly speed up computational morphogenesis (also known as topology optimization), and meanwhile maintain the quality of final solution compared to the standard optimizer. On a large-scale 3D optimal design problem with around 1,400,000 design variables, our method achieves up to 7.5x speedup while producing optimized designs with comparable objectives.

* Accepted by IJCNN 2021

Via

Access Paper or Ask Questions

DDRQA: Dynamic Document Reranking for Open-domain Multi-hop Question Answering

Sep 16, 2020

Yuyu Zhang, Ping Nie, Arun Ramamurthy, Le Song

Figure 1 for DDRQA: Dynamic Document Reranking for Open-domain Multi-hop Question Answering

Figure 2 for DDRQA: Dynamic Document Reranking for Open-domain Multi-hop Question Answering

Figure 3 for DDRQA: Dynamic Document Reranking for Open-domain Multi-hop Question Answering

Figure 4 for DDRQA: Dynamic Document Reranking for Open-domain Multi-hop Question Answering

Abstract:Open-domain multi-hop question answering (QA) requires to retrieve multiple supporting documents, some of which have little lexical overlap with the question and can only be located by iterative document retrieval. However, multi-step document retrieval often incurs more relevant but non-supporting documents, which dampens the downstream noise-sensitive reader module for answer extraction. To address this challenge, we propose Dynamic Document Reranking (DDR) to iteratively retrieve, rerank and filter documents, and adaptively determine when to stop the retrieval process. DDR employs an entity-linked document graph for multi-document interaction, which boosts up the retrieval performance. Experiments on HotpotQA full wiki setting show that our method achieves more than 7 points higher reranking performance over the previous best retrieval model, and also achieves state-of-the-art question answering performance on the official leaderboard.

Via

Access Paper or Ask Questions

Question Directed Graph Attention Network for Numerical Reasoning over Text

Sep 16, 2020

Kunlong Chen, Weidi Xu, Xingyi Cheng, Zou Xiaochuan, Yuyu Zhang, Le Song, Taifeng Wang, Yuan Qi, Wei Chu

Figure 1 for Question Directed Graph Attention Network for Numerical Reasoning over Text

Figure 2 for Question Directed Graph Attention Network for Numerical Reasoning over Text

Figure 3 for Question Directed Graph Attention Network for Numerical Reasoning over Text

Figure 4 for Question Directed Graph Attention Network for Numerical Reasoning over Text

Abstract:Numerical reasoning over texts, such as addition, subtraction, sorting and counting, is a challenging machine reading comprehension task, since it requires both natural language understanding and arithmetic computation. To address this challenge, we propose a heterogeneous graph representation for the context of the passage and question needed for such reasoning, and design a question directed graph attention network to drive multi-step numerical reasoning over this context graph.

* Accepted at EMNLP 2020

Via

Access Paper or Ask Questions

DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding

Feb 28, 2020

Yuyu Zhang, Ping Nie, Xiubo Geng, Arun Ramamurthy, Le Song, Daxin Jiang

Figure 1 for DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding

Figure 2 for DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding

Figure 3 for DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding

Figure 4 for DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding

Abstract:Recent studies on open-domain question answering have achieved prominent performance improvement using pre-trained language models such as BERT. State-of-the-art approaches typically follow the "retrieve and read" pipeline and employ BERT-based reranker to filter retrieved documents before feeding them into the reader module. The BERT retriever takes as input the concatenation of question and each retrieved document. Despite the success of these approaches in terms of QA accuracy, due to the concatenation, they can barely handle high-throughput of incoming questions each with a large collection of retrieved documents. To address the efficiency problem, we propose DC-BERT, a decoupled contextual encoding framework that has dual BERT models: an online BERT which encodes the question only once, and an offline BERT which pre-encodes all the documents and caches their encodings. On SQuAD Open and Natural Questions Open datasets, DC-BERT achieves 10x speedup on document retrieval, while retaining most (about 98%) of the QA performance compared to state-of-the-art approaches for open-domain question answering.

Via

Access Paper or Ask Questions

Efficient Probabilistic Logic Reasoning with Graph Neural Networks

Feb 04, 2020

Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, Le Song

Figure 1 for Efficient Probabilistic Logic Reasoning with Graph Neural Networks

Figure 2 for Efficient Probabilistic Logic Reasoning with Graph Neural Networks

Figure 3 for Efficient Probabilistic Logic Reasoning with Graph Neural Networks

Figure 4 for Efficient Probabilistic Logic Reasoning with Graph Neural Networks

Abstract:Markov Logic Networks (MLNs), which elegantly combine logic rules and probabilistic graphical models, can be used to address many knowledge graph problems. However, inference in MLN is computationally intensive, making the industrial-scale application of MLN very difficult. In recent years, graph neural networks (GNNs) have emerged as efficient and effective tools for large-scale graph problems. Nevertheless, GNNs do not explicitly incorporate prior logic rules into the models, and may require many labeled examples for a target task. In this paper, we explore the combination of MLNs and GNNs, and use graph neural networks for variational inference in MLN. We propose a GNN variant, named ExpressGNN, which strikes a nice balance between the representation power and the simplicity of the model. Our extensive experiments on several benchmark datasets demonstrate that ExpressGNN leads to effective and efficient probabilistic logic reasoning.

Via

Access Paper or Ask Questions

Can Graph Neural Networks Help Logic Reasoning?

Jun 27, 2019

Yuyu Zhang, Xinshi Chen, Yuan Yang, Arun Ramamurthy, Bo Li, Yuan Qi, Le Song

Figure 1 for Can Graph Neural Networks Help Logic Reasoning?

Figure 2 for Can Graph Neural Networks Help Logic Reasoning?

Figure 3 for Can Graph Neural Networks Help Logic Reasoning?

Figure 4 for Can Graph Neural Networks Help Logic Reasoning?

Abstract:Effectively combining logic reasoning and probabilistic inference has been a long-standing goal of machine learning: the former has the ability to generalize with small training data, while the latter provides a principled framework for dealing with noisy data. However, existing methods for combining the best of both worlds are typically computationally intensive. In this paper, we focus on Markov Logic Networks and explore the use of graph neural networks (GNNs) for representing probabilistic logic inference. It is revealed from our analysis that the representation power of GNN alone is not enough for such a task. We instead propose a more expressive variant, called ExpressGNN, which can perform effective probabilistic logic inference while being able to scale to a large number of entities. We demonstrate by several benchmark datasets that ExpressGNN has the potential to advance probabilistic logic reasoning to the next stage.

Via

Access Paper or Ask Questions