Amir M. Mir

On the Effectiveness of Machine Learning-based Call Graph Pruning: An Empirical Study

Feb 11, 2024

Amir M. Mir, Mehdi Keshani, Sebastian Proksch

Abstract: Static call graph (CG) construction often over-approximates call relations, leading to sound but imprecise results. Recent research has explored machine learning (ML)-based CG pruning as a means to enhance precision by eliminating false edges. However, current methods suffer from a limited evaluation dataset, imbalanced training data, and reduced recall, which affects practical downstream analyses. Prior results have also not yet been compared with advanced static CG construction techniques. This study tackles these issues. We introduce the NYXCorpus, a dataset of real-world Java programs with high test coverage, collect traces from test executions, and build a ground truth of dynamic CGs. We leverage these CGs to explore conservative pruning strategies during the training and inference of ML-based CG pruners. We conduct a comparative analysis of static CGs generated using zero control flow analysis (0-CFA) and those produced by a context-sensitive 1-CFA algorithm, evaluating both with and without pruning. We find that CG pruning is a difficult task for real-world Java projects, and substantial improvements in CG precision (+25%) come at the cost of reduced recall (-9%). However, our experiments show promising results: even when we favor recall over precision by using an F2 metric, pruned CGs have comparable quality to a context-sensitive 1-CFA analysis while being computationally less demanding. The resulting CGs are much smaller (69%) and substantially faster to analyze (3.5x speed-up), with virtually unchanged results in our downstream analysis.

* Accepted at the technical track of MSR'24
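
The abstract above mentions favoring recall over precision with an F2 metric. As a minimal sketch of what that weighting means, the standard F-beta formula with beta = 2 can be written as follows. The function name `f_beta` and the precision/recall values are hypothetical and chosen for illustration; they are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical pruner outcomes (illustrative numbers only):
# one keeps recall high, the other trades recall for precision.
recall_friendly = f_beta(precision=0.5, recall=0.9)
precision_friendly = f_beta(precision=0.9, recall=0.5)

# With beta = 2 the recall-friendly outcome scores higher,
# even though both share the same F1 score.
print(recall_friendly > precision_friendly)
```

Under F1 both hypothetical outcomes would tie, so the choice of beta = 2 is what makes the metric reward pruners that keep true edges.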

ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference

Apr 10, 2021

Amir M. Mir, Evaldas Latoskinas, Georgios Gousios

Abstract: In this paper, we present ManyTypes4Py, a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation, and test sets by files. To extract type information from abstract syntax trees (ASTs), a lightweight static analyzer pipeline was developed and accompanies the dataset. Using this pipeline, the collected Python projects were analyzed and the results of the AST analysis were stored in JSON-formatted files. The ManyTypes4Py dataset is shared on Zenodo and its tools are publicly available on GitHub.

* MSR'21, Data Showcase. To download the dataset, check out its GitHub repo: https://github.com/saltudelft/many-types-4-py-dataset
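
The abstract above describes removing duplicate files and then splitting the dataset by files. The following is a rough sketch of that general idea, not the dataset's actual tooling: it assumes exact-content hashing for deduplication (the real pipeline may detect near-duplicates differently), and the helper names `dedup_exact` and `assign_split` as well as the 70/10/20 percentages are hypothetical.

```python
import hashlib


def dedup_exact(files: dict) -> dict:
    """Keep one file per distinct content (exact duplicates only)."""
    seen = set()
    unique = {}
    for path in sorted(files):  # sorted for a deterministic result
        digest = hashlib.sha256(files[path].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = files[path]
    return unique


def assign_split(path: str, train_pct: int = 70, valid_pct: int = 10) -> str:
    """Map a file path to a split using a stable hash of the path."""
    bucket = int(hashlib.sha256(path.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + valid_pct:
        return "valid"
    return "test"


# Hypothetical toy corpus: b.py duplicates a.py and is dropped.
corpus = {
    "proj1/a.py": "def f(x: int) -> int: return x",
    "proj1/b.py": "def f(x: int) -> int: return x",
    "proj2/c.py": "def g(s: str) -> str: return s",
}
unique_files = dedup_exact(corpus)
splits = {path: assign_split(path) for path in unique_files}
print(sorted(unique_files), splits)
```

Splitting at file granularity means all functions from one file land in the same set, which keeps test files unseen during training.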

Type4Py: Deep Similarity Learning-Based Type Inference for Python

Jan 12, 2021

Amir M. Mir, Evaldas Latoskinas, Sebastian Proksch, Georgios Gousios

Abstract: Dynamic languages, such as Python and JavaScript, trade static typing for developer flexibility. While this allegedly enables greater productivity, the lack of static typing can cause runtime exceptions and type inconsistencies, and it is a major factor in weak IDE support. To alleviate these issues, PEP 484 introduced optional type annotations for Python. As retrofitting types to existing codebases is error-prone and laborious, learning-based approaches have been proposed to enable automatic type annotations based on existing, partially annotated codebases. However, the prediction of rare and user-defined types is still challenging. In this paper, we present Type4Py, a deep similarity learning-based type inference model for Python. We design a hierarchical neural network model that learns to discriminate between types of the same kind and dissimilar types in a high-dimensional space, which results in clusters of types. Nearest neighbor search suggests likely type signatures of given Python functions. The types visible to analyzed modules are surfaced using lightweight dependency analysis. The results of quantitative and qualitative evaluation indicate that Type4Py significantly outperforms state-of-the-art approaches at the type prediction task. Considering the Top-1 prediction, Type4Py obtains 19.33% and 13.49% higher precision than Typilus and TypeWriter, respectively, while utilizing a much bigger vocabulary.

* Type4Py's source code and dataset can be retrieved here: https://github.com/mir-am/type4py-paper
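
The abstract above says that nearest neighbor search over clusters of types suggests likely types. A toy sketch of that idea follows, assuming a distance-weighted vote among the k nearest labeled points; the voting scheme, the function `predict_types`, and the 2-D vectors are illustrative assumptions, not the Type4Py implementation or its learned embeddings. It needs Python 3.8+ for `math.dist`.

```python
import math
from collections import defaultdict


def predict_types(query, labeled, k=3):
    """Rank candidate types by a distance-weighted vote of the k nearest points."""
    nearest = sorted(labeled, key=lambda item: math.dist(query, item[0]))[:k]
    scores = defaultdict(float)
    for vector, type_name in nearest:
        # Closer neighbors contribute more; epsilon avoids division by zero.
        scores[type_name] += 1.0 / (math.dist(query, vector) + 1e-9)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 2-D "type clusters"; real embeddings are high-dimensional.
labeled_points = [
    ((0.0, 0.1), "int"),
    ((0.1, 0.0), "int"),
    ((1.0, 1.0), "str"),
    ((0.9, 1.1), "str"),
    ((5.0, 5.0), "MyClass"),
]

# A query near the "int" cluster ranks "int" first.
print(predict_types((0.05, 0.05), labeled_points))
# A query near the single user-defined type finds it with k=1.
print(predict_types((5.0, 5.1), labeled_points, k=1))
```

Because the ranking depends only on distances to labeled points, a rare or user-defined type can be suggested as long as at least one labeled example of it sits near the query.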

LIBTwinSVM: A Library for Twin Support Vector Machines

Jan 27, 2020

Amir M. Mir, Mahdi Rahbar, Jalal A. Nasiri

Abstract: This paper presents LIBTwinSVM, a free, efficient, and open source library for Twin Support Vector Machines (TSVMs). Our library provides a set of useful functionalities such as fast TSVM estimators, model selection, visualization, a graphical user interface (GUI) application, and a Python application programming interface (API). The benchmark results indicate the effectiveness of the LIBTwinSVM library for large-scale classification problems. The source code of the LIBTwinSVM library, installation guide, documentation, and usage examples are available at https://github.com/mir-am/LIBTwinSVM.
