Inkit Padhi

Programming Refusal with Conditional Activation Steering

Sep 06, 2024
Via arXiv

Value Alignment from Unstructured Text

Aug 19, 2024
Via arXiv

When in Doubt, Cascade: Towards Building Efficient and Capable Guardrails

Jul 08, 2024
Via arXiv

WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia

Jun 19, 2024
Via arXiv

Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs

Jun 17, 2024
Via arXiv

Contextual Moral Value Alignment Through Context-Based Aggregation

Mar 19, 2024
Via arXiv

Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations

Mar 09, 2024
Via arXiv

Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations

Mar 08, 2024
Via arXiv

The Impact of Positional Encoding on Length Generalization in Transformers

May 31, 2023
Via arXiv

Auditing and Generating Synthetic Data with Controllable Trust Trade-offs

May 02, 2023
Via arXiv