Adversarial Machine Learning explained!
Universal and Transferable Adversarial Attacks on Aligned Language Models
Deep Forgetting & Unlearning for Safely-Scoped LLMs
Measuring and Reducing Malicious Use With Unlearning
AI Control: Improving Safety Despite Intentional Subversion