Abstract: The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
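To make the distinction between orthogonality and independence under intervention concrete, the following minimal sketch removes one direction from an activation vector by projection (directional ablation, the kind of intervention typically associated with single refusal directions). The vectors `h`, `r1`, `r2` and the helper names are hypothetical, and this is not the gradient-based method proposed in the abstract.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]

def ablate(h, direction):
    """Remove the component of h along direction: h - (h . r_hat) r_hat."""
    r_hat = normalize(direction)
    coeff = dot(h, r_hat)
    return [a - coeff * b for a, b in zip(h, r_hat)]

# Hypothetical 4-dimensional activation and two orthogonal candidate directions.
h = [0.5, -1.0, 2.0, 3.0]
r1 = [1.0, 1.0, 0.0, 0.0]
r2 = [1.0, -1.0, 0.0, 0.0]

h_ablated = ablate(h, r1)

# Component along r1 after ablation (numerically zero).
along_r1_after = dot(h_ablated, normalize(r1))
# Component along r2 before and after ablating r1 (unchanged by the projection).
along_r2_before = dot(h, normalize(r2))
along_r2_after = dot(h_ablated, normalize(r2))
print(along_r1_after, along_r2_before, along_r2_after)
```

The component along `r2` is untouched here only because the projection is linear. After downstream non-linear layers, ablating `r1` may still change how `r2` influences the output, which is the gap that representational independence is meant to address.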
Abstract: In the domain of computer vision, deep residual neural networks like EfficientNet have set new standards in terms of robustness and accuracy. One key problem underlying the training of deep neural networks is the inherent lack of sufficient training data. The problem worsens if labels cannot be generated automatically but have to be annotated manually. This challenge occurs, for instance, if expert knowledge related to 3D parts is to be externalized based on example models. One way to reduce the necessary amount of labeled data may be the use of autoencoders, which can be trained in an unsupervised fashion without labeled data. In this work, we present a deep residual 3D autoencoder based on the EfficientNet architecture, intended for transfer learning tasks related to 3D CAD model assessment. For this purpose, we adapted EfficientNet to 3D problems such as voxel models derived from a STEP file. To reduce the amount of labeled 3D data required, the network's encoder can be reused for transfer learning.
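The autoencoder described above operates on voxel models. As a rough sketch of what such an input might look like, the snippet below bins 3D points into a binary occupancy grid. It assumes the points have already been sampled from the CAD geometry (STEP parsing and surface sampling are not shown), and the function name `voxelize` and the resolution are hypothetical choices, not the pipeline used in this work.

```python
def voxelize(points, resolution):
    """Map 3D points to a resolution^3 binary occupancy grid (nested lists)."""
    if not points:
        raise ValueError("need at least one point")
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # Use one scale for all axes so the part keeps its aspect ratio.
    extent = max(maxs[i] - mins[i] for i in range(3)) or 1.0
    grid = [[[0] * resolution for _ in range(resolution)]
            for _ in range(resolution)]
    for p in points:
        idx = []
        for i in range(3):
            k = int((p[i] - mins[i]) / extent * resolution)
            idx.append(min(k, resolution - 1))  # clamp points on the upper boundary
        grid[idx[0]][idx[1]][idx[2]] = 1
    return grid

# Hypothetical sample points inside a unit cube.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.5, 0.5, 0.5)]
grid = voxelize(points, resolution=4)
occupied = sum(v for plane in grid for row in plane for v in row)
print(occupied)
```

A grid of this form could serve as both input and reconstruction target of an autoencoder, which is why no labels are needed at this stage.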
Abstract: Not only the automation of manufacturing processes but also the automation of automation procedures themselves is becoming increasingly relevant to automation research. In this context, automated capability assessment, mainly based on deep learning systems driven by 3D CAD data, has been presented. Current assessment systems may be able to assess CAD data with regard to abstract features, e.g. the ability to automatically separate components from bulk goods, or the presence of gripping surfaces. Nevertheless, they suffer from being black box systems: an assessment can be learned and generated easily, but without any geometrical indication of the reasons for the system's decision. By utilizing explainable AI (xAI) methods, we attempt to open up the black box. Explainable AI methods have been used to assess whether a neural network has successfully learned a given task or to analyze which features of an input might lead to an adversarial attack. These methods aim to derive additional insights into a neural network by analyzing patterns in a given input and their impact on the network output. Within the NeuroCAD Project, xAI methods are used to identify geometrical features that are associated with a certain abstract feature. In this work, sensitivity analysis (SA), layer-wise relevance propagation (LRP), Gradient-weighted Class Activation Mapping (Grad-CAM), and Local Interpretable Model-Agnostic Explanations (LIME) have been implemented in the NeuroCAD environment, making it possible not only to assess CAD models but also to identify the features that were relevant for the network's decision. In the medium term, this might make it possible to identify regions of interest, supporting product designers in optimizing their models with regard to assembly processes.
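Of the methods listed, sensitivity analysis is the simplest to illustrate: it scores each input element by how strongly the output changes when that element is perturbed. The sketch below is a dependency-free approximation under stated assumptions. `toy_score` is a hypothetical stand-in for a trained network, and central finite differences replace the backpropagated gradients that a real implementation would normally use.

```python
def toy_score(grid):
    """Hypothetical stand-in for a network output: depends on two voxels only."""
    return 2.0 * grid[0][0][0] + grid[1][1][1] ** 2

def sensitivity_map(score_fn, grid, eps=1e-4):
    """Central finite-difference estimate of |d score / d voxel| per voxel."""
    n = len(grid)
    sens = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in range(n):
        for y in range(n):
            for z in range(n):
                original = grid[x][y][z]
                grid[x][y][z] = original + eps
                up = score_fn(grid)
                grid[x][y][z] = original - eps
                down = score_fn(grid)
                grid[x][y][z] = original  # restore the input
                sens[x][y][z] = abs(up - down) / (2 * eps)
    return sens

# Hypothetical 2x2x2 voxel input with two non-zero entries.
grid = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 3.0]]]
sens = sensitivity_map(toy_score, grid)
print(sens[0][0][0], sens[1][1][1], sens[0][1][0])
```

Voxels with a large value in `sens` are the ones the score reacts to, which corresponds to the kind of geometrical indicator the abstract aims for. The loop needs two evaluations per voxel, so for realistic grid sizes a gradient-based implementation would be far cheaper.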