Marcos Macedo

Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering

Jan 28, 2026

Daniel Rodriguez-Cardenas, Xiaochang Li, Marcos Macedo, Antonio Mastropaolo, Dipin Khati, Yuan Tian, Huajie Shao, Denys Poshyvanyk

Abstract: Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, fairness, efficiency, and real-world usability. They also suffer from inconsistent data engineering practices, limited software engineering context, and widespread contamination issues. To understand these problems and chart a path forward, we combined an in-depth survey of existing benchmarks with insights gathered from a dedicated community workshop. We identified three core barriers to reliable evaluation: the absence of software-engineering-rich datasets, overreliance on ML-centric metrics, and the lack of standardized, reproducible data pipelines. Building on these findings, we introduce BEHELM, a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation. BEHELM provides a structured way to assess models across tasks, languages, input and output granularities, and key quality dimensions. Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.

* Forge Benchmarking 2026
* Short paper from the benchmarking for software engineering workshop at FSE 2025

InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation

Nov 05, 2024

Marcos Macedo, Yuan Tian, Pengyu Nie, Filipe R. Cogo, Bram Adams

Abstract: Code translation aims to convert a program from one programming language (PL) to another. This long-standing software engineering task is crucial for modernizing legacy systems, ensuring cross-platform compatibility, enhancing performance, and more. However, automating this process remains challenging due to many syntactic and semantic differences between PLs. Recent studies show that even advanced techniques such as large language models (LLMs), especially open-source LLMs, still struggle with the task. Currently, code LLMs are trained with source code from multiple programming languages, thus presenting multilingual capabilities. In this paper, we investigate whether such multilingual capabilities can be harnessed to enhance code translation. To achieve this goal, we introduce InterTrans, an LLM-based automated code translation approach that, in contrast to existing approaches, leverages intermediate translations across PLs to bridge the syntactic and semantic gaps between source and target PLs. InterTrans contains two stages. It first utilizes a novel Tree of Code Translation (ToCT) algorithm to plan transitive intermediate translation sequences between a given source and target PL, then validates them in a specific order. We evaluate InterTrans with three open LLMs on three benchmarks (i.e., CodeNet, HumanEval-X, and TransCoder) involving six PLs. Results show an absolute improvement of 18.3% to 43.3% in Computational Accuracy (CA) for InterTrans over Direct Translation with 10 attempts. The best-performing variant of InterTrans (with Magicoder LLM) achieved an average CA of 87.3%-95.4% on the three benchmarks.
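
The two stages described in the abstract (planning intermediate translation sequences, then validating them in order) can be illustrated with a rough sketch. This is not the ToCT algorithm from the paper: the function names, the `max_intermediates` parameter, the shortest-path-first ordering, and the `translate` and `passes_tests` callbacks are all assumptions made here for illustration.

```python
from itertools import permutations


def plan_translation_paths(source, target, languages, max_intermediates=2):
    """Enumerate candidate paths source -> ... -> target.

    Hypothetical planner: paths with fewer intermediate languages come
    first, so the direct translation is always the first candidate.
    """
    intermediates = [pl for pl in languages if pl not in (source, target)]
    paths = []
    for k in range(max_intermediates + 1):
        for middle in permutations(intermediates, k):
            paths.append([source, *middle, target])
    return paths


def translate_via_paths(program, paths, translate, passes_tests):
    """Try each path in order and stop at the first validated translation.

    `translate(code, src, dst)` stands in for an LLM call and
    `passes_tests(code)` for a test-suite run; both are assumed callbacks.
    """
    for path in paths:
        code = program
        for src, dst in zip(path, path[1:]):
            code = translate(code, src, dst)
        if passes_tests(code):
            return path, code
    return None
```

With four languages and at most one intermediate, `plan_translation_paths("Python", "Java", ["Python", "Java", "Go", "C++"], 1)` yields the direct path followed by the paths through Go and through C++.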

Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation

Mar 25, 2024

Marcos Macedo, Yuan Tian, Filipe R. Cogo, Bram Adams

Abstract: Code translation between programming languages is a long-standing and critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. With the recent advances in large language models (LLMs) and their applications to code translation, there is an increasing need for comprehensive evaluation of these models. In this study, we empirically analyze the generated outputs of eleven popular instruct-tuned LLMs with parameters ranging from 1B up to 46.7B on 3,820 translation pairs across five languages, including C, C++, Go, Java, and Python. Our analysis found that between 26.4% and 73.7% of code translations produced by our evaluated LLMs necessitate post-processing, as these translations often include a mix of code, quotes, and text rather than being purely source code. Overlooking the output format of these models can inadvertently lead to underestimation of their actual performance. This is particularly evident when evaluating them with execution-based metrics such as Computational Accuracy (CA). Our results demonstrate that a strategic combination of prompt engineering and regular expressions can effectively extract the source code from the model generation output. In particular, our method can help the eleven selected models achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our findings shed light on and motivate future research to conduct more reliable benchmarks of LLMs for code translation.

* Accepted into 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge)
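
The kind of regex-based post-processing the abstract refers to might look like the sketch below. It is not the extraction method from the paper: the pattern, the `extract_code` name, and the fallback of returning the whole response when no fenced block is found are assumptions made here for illustration.

```python
import re

# The fence marker is built programmatically so that this example does not
# contain a literal fence of its own.
FENCE = "`" * 3

# Match the first fenced block: an optional language tag on the opening line,
# then the shortest body that ends at the next fence marker.
CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(response):
    """Return the body of the first fenced block in an LLM response.

    Assumed fallback: if no fenced block is present, the response is
    treated as plain source code and returned with whitespace trimmed.
    """
    match = CODE_BLOCK.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()
```

A response that wraps `int x = 1;` in a fenced block between a greeting and a sign-off would be reduced to just that statement, which is what an execution-based metric needs as input.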
