Abstract:Unstructured text data annotation and analysis are fundamental to management research, often relying on human annotators through crowdsourcing platforms. While Large Language Models (LLMs) promise to provide a cost-effective and efficient alternative to human annotation, there lacks a systematic workflow that evaluate when LLMs are suitable or how to proceed with LLM-based text annotation in a reproducible manner. This paper addresses this methodological gap by introducing the ``SILICON" (\textbf{S}ystematic \textbf{I}nference with \textbf{L}LMs for \textbf{I}nformation \textbf{C}lassificati\textbf{o}n and \textbf{N}otation) workflow. The workflow integrates established principles of human annotation with systematic prompt optimization and model selection, addressing challenges such as developing robust annotation guidelines, establishing high-quality human baselines, optimizing prompts, and ensuring reproducibility across LLMs. We validate the SILICON workflow through seven case studies covering common management research tasks, including business proposal evaluation, dialog intent and breakdown analysis, review attribute detection. Our findings highlight the importance of validating annotation guideline agreement, the superiority of expert-developed human baselines over crowdsourced ones, the iterative nature of prompt optimization, and the necessity of testing multiple LLMs. Notably, we propose a regression-based methodology to empirically compare LLM outputs across prompts and models. Our workflow advances management research by establishing reproducible processes for LLM-based annotation that maintain scientific rigor. We provide practical guidance for researchers to effectively navigate the evolving landscape of generative AI tools effectively while maintaining transparency and reproducibility.
Abstract:Generative AI (GenAI) has been shown to enhance individual productivity in a guided setting. While it is also likely to transform processes in a collaborative work setting, it is unclear what trajectory this transformation will follow. Collaborative environment is characterized by a blend of origination tasks that involve building something from scratch and iteration tasks that involve refining on others' work. Whether GenAI affects these two aspects of collaborative work and to what extent is an open empirical question. We study this question within the open-source development landscape, a prime example of collaborative innovation, where contributions are voluntary and unguided. Specifically, we focus on the launch of GitHub Copilot in October 2021 and leverage a natural experiment in which GitHub Copilot (a programming-focused LLM) selectively rolled out support for Python, but not for R. We observe a significant jump in overall contributions, suggesting that GenAI effectively augments collaborative innovation in an unguided setting. Interestingly, Copilot's launch increased maintenance-related contributions, which are mostly iterative tasks involving building on others' work, significantly more than code-development contributions, which are mostly origination tasks involving standalone contributions. This disparity was exacerbated in active projects with extensive coding activity, raising concerns that, as GenAI models improve to accommodate richer context, the gap between origination and iterative solutions may widen. We discuss practical and policy implications to incentivize high-value innovative solutions.