Davide Di Ruscio

CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code

Dec 20, 2023

Martin Weyssow, Claudio Di Sipio, Davide Di Ruscio, Houari Sahraoui

Abstract: Motivated by recent work on lifelong learning applications for language models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused on code changes. Our contribution addresses a notable research gap marked by the absence of a long-term temporal dimension in existing code change datasets, limiting their suitability in lifelong learning scenarios. In contrast, our dataset aims to comprehensively capture code changes across the entire release history of open-source software repositories. In this work, we introduce an initial version of CodeLL, comprising 71 machine-learning-based projects mined from Software Heritage. This dataset enables the extraction and in-depth analysis of code changes spanning 2,483 releases at both the method and API levels. CodeLL enables researchers to study the behaviour of LMs in lifelong fine-tuning settings for learning code changes. Additionally, the dataset can help in studying data distribution shifts within software repositories and the evolution of API usage over time.

* 4+1 pages

GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling

May 19, 2022

Cezar Sas, Andrea Capiluppi, Claudio Di Sipio, Juri Di Rocco, Davide Di Ruscio

Abstract: GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are unlabeled or inadequately labeled, making it harder for users to find relevant projects. There have been various proposals for software application domain classification over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification ranked into discrete levels based on how general or specific their meaning is. We collected 121K topics from GitHub and considered $60\%$ of the most frequent ones for the ranking. GitRanking 1) uses active sampling to ensure a minimal number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes their projects harder for other users to find and discover. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimum number of annotations ($\sim$15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.

* 11 pages, 6 figures, 3 tables

Leveraging Privacy Profiles to Empower Users in the Digital Society

Apr 01, 2022

Davide Di Ruscio, Paola Inverardi, Patrizio Migliarini, Phuong T. Nguyen

Abstract: Privacy and ethics of citizens are at the core of the concerns raised by our increasingly digital society. Profiling users is standard practice for software applications, triggering the need for users, also enforced by laws, to properly manage privacy settings. Users need to manage software privacy settings properly to protect personally identifiable information and express personal ethical preferences. AI technologies that empower users to interact with the digital world by reflecting their personal ethical preferences can be key enablers of a trustworthy digital society. We focus on the privacy dimension and contribute a step in the above direction through an empirical study on an existing dataset collected from the fitness domain. We identify which set of questions is appropriate to differentiate users according to their preferences. The results reveal that a compact set of semantic-driven questions (about domain-independent privacy preferences) helps distinguish users better than a complex domain-dependent one. This confirms the study's hypothesis that moral attitudes are the relevant piece of information to collect. Based on the outcome, we implement a recommender system to provide users with suitable recommendations related to privacy choices. We then show that the proposed recommender system provides relevant settings to users, obtaining high accuracy.

* The paper consists of 37 pages, 11 figures
