Alexander LeClair

Improved Code Summarization via a Graph Neural Network

Apr 07, 2020

Alexander LeClair, Sakib Haque, Lingfei Wu, Collin McMillan

Abstract: Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and output a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe using a graph neural network together with the source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques: two from the software engineering literature and two from the machine learning literature.

* 10 pages

Recommendations for Datasets for Source Code Summarization

Apr 04, 2019

Alexander LeClair, Collin McMillan

Abstract: Source code summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation, e.g., the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained by a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results: we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards from experimental results. We release a dataset, based on prior work, of over 2.1 million pairs of Java methods and one-sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers.

* Accepted to NAACL 2019

Adapting Neural Text Classification for Improved Software Categorization

Jun 15, 2018

Alexander LeClair, Zachary Eberhart, Collin McMillan

Abstract: Software categorization is the task of organizing software into groups that broadly describe the behavior of the software, such as "editors" or "science." Categorization plays an important role in several maintenance tasks, such as repository navigation and feature elicitation. Current approaches attempt to cast the problem as text classification, to make use of the rich body of literature from the NLP domain. However, as we will show in this paper, text classification algorithms are generally not applicable off-the-shelf to source code; we found that they work well when high-level project descriptions are available, but suffer very large performance penalties when classifying source code and comments only. We propose a set of adaptations to a state-of-the-art neural classification algorithm and perform two evaluations: one with reference data from Debian end-user programs, and one with a set of C/C++ libraries that we hired professional programmers to annotate. We show that our proposed approach achieves performance exceeding that of previous software classification techniques as well as a state-of-the-art neural text classification technique.
