Abstract: In an era where cyberattacks increasingly target the software supply chain, the ability to accurately attribute code authorship in binary files is critical to improving cybersecurity measures. We propose OCEAN, a contrastive learning-based system for function-level authorship attribution. OCEAN is the first framework to explore code authorship attribution on compiled binaries in an extreme open-world scenario, where two code samples from unknown authors are compared to determine whether they were written by the same author. To evaluate OCEAN, we introduce two new realistic datasets: CONAN, to improve the performance of authorship attribution systems in real-world use cases, and SNOOPY, to increase the robustness of the evaluation of such systems. We train our model on CONAN and evaluate it on SNOOPY, a fully unseen dataset, achieving an AUROC score of 0.86 even under high compiler optimization levels. We further show that training on CONAN improves performance by 7% compared to the previously used Google Code Jam dataset. Additionally, OCEAN outperforms previous methods in their own settings, achieving a 10% improvement over the state-of-the-art SCS-Gan in scenarios that analyze source code. Furthermore, OCEAN can detect code injected by an unknown author in a software update, underscoring its value for securing software supply chains.
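The pairwise, open-world setting described above can be illustrated with a minimal sketch. It assumes that each function has already been mapped to an embedding vector (here hard-coded toy values, not OCEAN outputs), that similarity is measured with cosine similarity, and that AUROC is computed over labelled pairs. The function names `cosine` and `auroc` and all data are hypothetical; the abstract does not specify the actual similarity measure or model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def auroc(scores, labels):
    """AUROC as the probability that a same-author pair scores higher
    than a different-author pair (ties count as one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy pairs: (embedding A, embedding B, 1 if same author else 0).
# These vectors are illustrative assumptions, not real model outputs.
pairs = [
    ((1.0, 0.0), (0.9, 0.1), 1),
    ((0.0, 1.0), (0.1, 0.9), 1),
    ((1.0, 0.0), (0.0, 1.0), 0),
    ((1.0, 0.0), (0.6, 0.8), 0),
]
scores = [cosine(a, b) for a, b, _ in pairs]
labels = [label for _, _, label in pairs]
toy_auroc = auroc(scores, labels)
```

Because the decision depends only on the similarity of two embeddings, neither author needs to appear in the training data, which is what makes the setting open-world.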
Abstract: The adoption of machine learning solutions is rapidly increasing across all parts of society. Cloud service providers such as Amazon Web Services, Microsoft Azure, and the Google Cloud Platform are aggressively expanding their Machine-Learning-as-a-Service offerings. While the widespread adoption of machine learning has huge potential for both research and industry, the large-scale evaluation of possibly sensitive data on untrusted platforms bears inherent data security and privacy risks. Since computation time is expensive, performance is a critical factor for machine learning. However, the prevailing security measures proposed in recent years come with a significant performance overhead. We investigate the current state of protected distributed machine learning systems, focusing on deep convolutional neural networks. The most common and best-performing mixed MPC approaches are based on homomorphic encryption, secret sharing, and garbled circuits. They commonly suffer from communication overheads that grow linearly in the depth of the neural network. We present Dash, a fast and distributed private machine learning inference scheme. Dash is based purely on arithmetic garbled circuits. It requires only a single communication round per inference step, regardless of the depth of the neural network, and a very small constant communication volume. Dash thus significantly reduces communication overhead and scales better than previous approaches. In addition, we introduce the concept of LabelTensors, which allow us to use GPUs efficiently with garbled circuits and thereby further reduce the runtime. Dash offers security against a malicious attacker and is up to 140 times faster than previous arithmetic garbling schemes.
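To give some intuition for why arithmetic garbling can evaluate linear layers without extra communication, the sketch below uses the label encoding known from the arithmetic garbling literature, where the label for a value x is a random "zero label" plus x times a secret global offset. This is an assumption for illustration: the abstract does not state Dash's exact construction, real schemes use vectors of labels and additional gadgets for non-linear layers, and this toy code with a single scalar label is not secure. The modulus `P` and all function names are hypothetical.

```python
import secrets

P = 2**61 - 1  # hypothetical prime modulus for this toy sketch

def garble_inputs(n):
    """Garbler: pick a secret nonzero offset and one random zero label per wire."""
    offset = 1 + secrets.randbelow(P - 1)
    zeros = [secrets.randbelow(P) for _ in range(n)]
    return offset, zeros

def encode(zero, offset, x):
    """Label representing value x on a wire: zero label plus x times the offset."""
    return (zero + x * offset) % P

def eval_linear(weights, labels):
    """Weighted sum computed directly on labels. Applied to input labels,
    this is the evaluator's local step; applied to zero labels, it yields
    the output wire's zero label on the garbler's side."""
    return sum(w * l for w, l in zip(weights, labels)) % P

def decode(out_label, out_zero, offset):
    """Garbler-side decoding: strip the zero label and divide by the offset."""
    return (out_label - out_zero) * pow(offset, -1, P) % P

weights = [2, 4, 1]   # public weights of one linear neuron (toy values)
inputs = [3, 5, 7]    # private inputs (toy values)

offset, zeros = garble_inputs(len(inputs))
labels = [encode(z, offset, x) for z, x in zip(zeros, inputs)]

out_zero = eval_linear(weights, zeros)    # known to the garbler
out_label = eval_linear(weights, labels)  # computed locally by the evaluator
result = decode(out_label, out_zero, offset)  # 2*3 + 4*5 + 1*7 = 33
```

The evaluator only ever combines labels it already holds, so chaining several such layers adds no further message exchange in this simplified model.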