Koans on Circuit Breaker Results
Adversarial Probes for NSFW Detection in Language Models: Theory and Implementation

1. Introduction & Project Motivation

1.1 Context and Goals
Primary goal: Create models that resist red-teaming attacks (even with 1000+ dedicated hours)
Method: Robustly detect the specific concepts that LLMs use. “Turn off” the concept, and the attack becomes impossible.
Practical application: Prevent NSFW completions
Introduction to adversarial probing and soft-prompt mechanics

1.2 Impact Areas
Long-term alignment: Understanding and controlling model behavior
Short-term business value: Deploying safer models
Overview of the Circuit Breakers (CB) approach as a comparative baseline

2....