Jannes Elstner

The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Feb 24, 2025

Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger

Abstract: The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
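
To make the idea of a "single refusal direction" in activation space concrete, the following is a minimal sketch of the difference-of-means construction and directional ablation commonly associated with the prior work the abstract refers to. It is not the gradient-based method proposed in this paper. The function names and the tiny three-dimensional activation vectors are hypothetical, chosen only for illustration; real activations come from a model's residual stream and have thousands of dimensions.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def difference_of_means_direction(harmful_acts, harmless_acts):
    """Candidate direction: mean(harmful) - mean(harmless), normalized."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def ablate(activation, direction):
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Hypothetical toy activations (assumption: not data from the paper).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = difference_of_means_direction(harmful, harmless)
x_ablated = ablate([3.0, 1.0, 5.0], r)
```

After ablation the activation has no component along `r`, while the components orthogonal to `r` are unchanged. The abstract's point is that such a linear picture may be incomplete: two directions can be orthogonal and still influence each other once the model's non-linear layers are taken into account.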

Simplified Learning of CAD Features Leveraging a Deep Residual Autoencoder

Feb 21, 2022

Raoul Schönhof, Jannes Elstner, Radu Manea, Steffen Tauber, Ramez Awad, Marco F. Huber

Abstract: In the domain of computer vision, deep residual neural networks like EfficientNet have set new standards in terms of robustness and accuracy. One key problem underlying the training of deep neural networks is the inherent lack of a sufficient amount of training data. The problem worsens especially if labels cannot be generated automatically but have to be annotated manually. This challenge occurs, for instance, if expert knowledge related to 3D parts should be externalized based on example models. One way to reduce the necessary amount of labeled data may be the use of autoencoders, which can be trained in an unsupervised fashion without labeled data. In this work, we present a deep residual 3D autoencoder based on the EfficientNet architecture, intended for transfer learning tasks related to 3D CAD model assessment. For this purpose, we adapted EfficientNet to 3D problems like voxel models derived from a STEP file. Striving to reduce the amount of labeled 3D data required, the network's encoder can be utilized for transfer training.

* Accepted/Peer-Reviewed Article

Feature Visualization within an Automated Design Assessment leveraging Explainable Artificial Intelligence Methods

Jan 28, 2022

Raoul Schönhof, Artem Werner, Jannes Elstner, Boldizsar Zopcsak, Ramez Awad, Marco Huber

Abstract: Not only the automation of manufacturing processes but also the automation of automation procedures themselves is becoming increasingly relevant to automation research. In this context, automated capability assessment, mainly enabled by deep learning systems driven by 3D CAD data, has been presented. Current assessment systems may be able to assess CAD data with regard to abstract features, e.g. the ability to automatically separate components from bulk goods, or the presence of gripping surfaces. Nevertheless, they suffer from being black-box systems: an assessment can be learned and generated easily, but without any geometrical indicator of the reasons for the system's decision. By utilizing explainable AI (xAI) methods, we attempt to open up the black box. Explainable AI methods have been used to assess whether a neural network has successfully learned a given task or to analyze which features of an input might lead to an adversarial attack. These methods aim to derive additional insights into a neural network by analyzing patterns in a given input and their impact on the network output. Within the NeuroCAD Project, xAI methods are used to identify geometrical features which are associated with a certain abstract feature. Within this work, sensitivity analysis (SA), layer-wise relevance propagation (LRP), Gradient-weighted Class Activation Mapping (Grad-CAM), and Local Interpretable Model-Agnostic Explanations (LIME) have been implemented in the NeuroCAD environment, making it possible not only to assess CAD models but also to identify features which have been relevant for the network decision. In the medium term, this might make it possible to identify regions of interest, supporting product designers in optimizing their models with regard to assembly processes.

* 2021, Procedia CIRP 100(7):331-336
* CIRP Design 2021, 10.1016/j.procir.2021.05.075
