Abstract: Stack Overflow and other similar forums are commonly used by developers to seek answers to their software development questions as well as their privacy-related concerns. Recently, ChatGPT has been used as an alternative to generate code or produce responses to developers' questions. In this paper, we aim to understand developers' privacy challenges by evaluating the types of privacy-related questions asked on Stack Overflow. We then conduct a comparative analysis between the accepted responses given by Stack Overflow users and the responses produced by ChatGPT for those extracted questions to determine whether ChatGPT could serve as a viable alternative. Our results show that most privacy-related questions concern choice/consent, aggregation, and identification. Furthermore, our findings illustrate that ChatGPT generates similarly correct responses for about 56% of the questions, while for the remaining questions, the accepted Stack Overflow answers are slightly more accurate than ChatGPT's.
Abstract: This tool demonstration presents a research toolkit for a language model of Java source code. The target audience includes researchers studying problems at the granularity of subroutines, statements, or variables in Java. In contrast to many existing language models, we prioritize features for researchers, including an open and easily searchable training set, a held-out test set with different levels of deduplication from the training set, infrastructure for deduplicating new examples, and an implementation that runs on equipment accessible on a relatively modest budget. Our model is a GPT2-like architecture with 350m parameters. Our training set includes 52m Java methods (9b tokens) and 13m StackOverflow threads (10.5b tokens). To make research accessible to more members of the community, we limit local resource requirements to GPUs with 16GB of video memory. We provide a test set of held-out Java methods that include descriptive comments, along with the entire Java projects containing those methods. We also provide deduplication tools that use precomputed hash tables at various similarity thresholds, helping researchers ensure that their own test examples are not in the training set. We make all our tools and data open source and available via Huggingface and Github.
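Since the abstract describes a GPT2-like causal language model distributed via Huggingface, the following is a minimal sketch of how such a model might be loaded and prompted with the Hugging Face transformers library. The repository id, prompt, and generation settings are illustrative assumptions, not the toolkit's actual names or recommended usage.

```python
# Minimal sketch: loading a GPT-2-style causal LM for Java code from Hugging Face.
# The repository id below is a hypothetical placeholder, not the toolkit's real name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "example-org/java-gpt2-350m"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# A 350m-parameter model comfortably fits on a GPU with 16GB of video memory.
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Prompt with a descriptive comment and a Java method signature,
# then let the model complete the method body.
prompt = "// Returns the larger of two integers.\npublic static int max(int a, int b) {"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```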