Optimization: Average embeddings across token locations for efficiency
Probe averaging techniques
Linear probes (logistic regression) provide computational benefits
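A minimal sketch of this setup, assuming activations are pooled by a plain mean over token positions and the probe is scikit-learn logistic regression; the layer choice, shapes, and names are illustrative, not the exact pipeline:

```python
# Sketch: mean-pool one layer's hidden states over token positions, then fit a
# logistic-regression probe. Shapes and names are placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def pooled_features(hidden_states: torch.Tensor) -> np.ndarray:
    """hidden_states: (batch, seq_len, d_model) activations from a single layer.
    Averaging across token locations gives one d_model vector per example, so the
    probe sees a fixed-size input regardless of sequence length."""
    return hidden_states.mean(dim=1).float().cpu().numpy()

def fit_probe(activation_batches, labels):
    """activation_batches: list of (batch, seq_len, d_model) tensors; labels: 0/1 harmful."""
    X = np.concatenate([pooled_features(h) for h in activation_batches], axis=0)
    probe = LogisticRegression(max_iter=1000)  # plain linear probe, cheap to train and run
    probe.fit(X, labels)
    return probe
```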
Adversarial training loop: run a gradient-descent attack against all previous probes and their predictions to create new attack text via softprompts, then train a new probe against the cumulative attack text.
Loss: a term for the attacked output being bad, plus the log entropy of the probe predictions.
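A sketch of that loop, under several assumptions: the probes are differentiable modules returning P(harmful), `forward_with_softprompt` and `train_probe` are hypothetical helpers, and the "log entropy of probe predictions" term is approximated here by the log-probability the existing probes assign to "harmful" (driven down to evade them). The "output is bad" term is the single-forward-pass KL loss sketched further below.

```python
# Sketch of the adversarial loop: optimize a softprompt against every probe trained so
# far, keep the resulting attack activations, then train a new probe on the cumulative
# attack set. Helper functions and shapes are hypothetical placeholders.
import torch

def adversarial_loop(model, clean_acts, clean_labels, prompts,
                     n_rounds=5, n_attack_steps=200):
    probes, attack_acts = [], []
    for _ in range(n_rounds):
        # 5 learned embedding vectors, optimized by gradient descent
        soft = torch.randn(5, model.config.hidden_size, requires_grad=True)
        opt = torch.optim.Adam([soft], lr=1e-2)
        for _ in range(n_attack_steps):
            # placeholder: returns probe-layer activations and the "output is still
            # harmful" loss (the single-forward-pass KL term sketched later)
            acts, gen_loss = forward_with_softprompt(model, prompts, soft)
            # evade every probe trained so far: push their P(harmful) toward 0
            evade = (sum(torch.log(p(acts).clamp_min(1e-6)).mean() for p in probes)
                     if probes else 0.0)
            loss = gen_loss + evade
            opt.zero_grad(); loss.backward(); opt.step()
        attack_acts.append(acts.detach())
        # new probe trains against clean data plus ALL attack text found so far
        probes.append(train_probe(clean_acts, clean_labels, attack_acts))  # placeholder
    return probes
```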
Need for meaningful harmful vs. non-harmful distinctions: the model must use its own intelligence to cause harm, not repeat back what the user said.
Dataset quality: should be the toughest epsilon% of the real distribution (see the sketch after the aside below).
Random ChatGPT queries from HarmBench and WildChat
Aside: test-set duplication and mislabeling are acceptable. Multiple completions for similar prompts, and some incorrect labels, are kept for comparison with other methods. We don't care about absolute performance, only the relative difference against other authors' results or our own ablations. The real-life long tail matters less if you have a hard test set.
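One way to read "toughest epsilon% of the real distribution", sketched with a hypothetical cheap baseline scorer: keep the examples the baseline gets most wrong. The scorer and epsilon value are assumptions, not the actual selection procedure.

```python
# Sketch: keep only the hardest epsilon fraction of real-distribution examples, where
# "hard" means a baseline classifier's score is farthest from the true label.
import numpy as np

def hardest_fraction(prompts, labels, baseline_scores, eps=0.05):
    difficulty = np.abs(np.asarray(baseline_scores) - np.asarray(labels, dtype=float))
    k = max(1, int(eps * len(prompts)))
    keep = np.argsort(-difficulty)[:k]          # largest baseline errors first
    return [prompts[i] for i in keep], [labels[i] for i in keep]
```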
Goal: accurately identify whether the output is bad, provide a gradient, and be fast.
Used the KL divergence between the original and new predictions at each index of the completed prompt.
A single forward pass, no generation, to evaluate: input = [softprompt, original text prompt, model's original generation], loss = KL(original_generation_tokens, new_generation_log_probs).
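A sketch of that loss, assuming a HuggingFace-style causal LM that accepts `inputs_embeds`, with the unattacked model's per-position log-probs over its own completion cached in advance; function and argument names are illustrative.

```python
# Sketch: one forward pass, no sampling. The model sees
# [softprompt, original prompt, the model's own original completion], and the loss is
# the KL divergence at each completion index between the original next-token
# distribution and the distribution under the softprompt.
import torch
import torch.nn.functional as F

def kl_attack_loss(model, soft_embeds, prompt_embeds, completion_embeds, orig_logprobs):
    """orig_logprobs: (gen_len, vocab) log-probs the unattacked model assigned at each
    completion position, cached once so only one new forward pass is needed."""
    inputs = torch.cat([soft_embeds, prompt_embeds, completion_embeds], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]            # (total_len, vocab)
    gen_len = completion_embeds.shape[0]
    # logits at position i predict token i+1, so shift by one to score the completion
    new_logprobs = F.log_softmax(logits[-gen_len - 1:-1], dim=-1)
    # KL(original || new), averaged over completion positions; minimizing it keeps the
    # attacked model still producing the original (harmful) completion
    return F.kl_div(new_logprobs, orig_logprobs, log_target=True, reduction="batchmean")
```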
Limitations of alternative loss functions (see section 4):
Why simple classification losses fail (labels are incorrect, or the attacker just fools the grader)
Challenges with direct output comparison (a full generation is slower than one pass; again you have to align on grading rules)
Assumed equivalence between token prefix sequences and soft prompt prefixes
- There's some ~1k-token prefix sequence whose numerical impact is similar to the 5-token softprompt selected
Implementation details:
Limited to 5 insertion locations
Embedding size considerations
Balance between effectiveness and realistic attacks
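A sketch of the insertion constraint, assuming the softprompt vectors are spliced into the token-embedding sequence at no more than 5 allowed positions (one reading of keeping the attack closer to a realistic prompt edit); names and shapes are illustrative.

```python
# Sketch: splice learned softprompt embeddings into the token-embedding sequence at a
# small, fixed number of insertion locations (capped at 5 here, per the notes).
import torch

def insert_softprompt(token_embeds, soft_embeds, insert_positions):
    """token_embeds: (seq_len, d_model); soft_embeds: (n_soft, d_model);
    insert_positions: up to 5 indices where softprompt vectors may be placed."""
    assert 1 <= len(insert_positions) <= 5
    per_site = soft_embeds.shape[0] // len(insert_positions)   # split softprompt evenly
    chunks, prev = [], 0
    for k, pos in enumerate(sorted(insert_positions)):
        chunks.append(token_embeds[prev:pos])
        chunks.append(soft_embeds[k * per_site:(k + 1) * per_site])
        prev = pos
    chunks.append(token_embeds[prev:])
    return torch.cat(chunks, dim=0)
```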
Better hyper-parameters/Extra compute for searching
Better features for probe
New loss function trains a good classifier
Didn’t investigate:
Whether destroying activations is easier than classifying them
Just training a full model and probe simultaneously
Forward pass: Passing information from earlier layers, plus the processing the model does on that information at intermediate steps, to help make later predictions
Generation dynamics: Passing information from earlier completions to later ones
Testing Plan:
Confirm Data pipeline
That individual examples are hard:
OpenAI moderation API
OpenAI text model label “bad” or “good”
Finetune BERT to predict:
the label with the completion replaced by an empty string (how much info is in just the prompt?)
whether an example came from the training or the test distribution (see the BERT sketch after this list)
That the test set isn't correlated (e.g., all bad cases in German)
Finetune BERT to solve the task
Finetune an OAI text model to label “bad” or “good”
OAI doesn't let you upload a CSV of bad text even with obfuscation anymore (https://clarkbenham.github.io/posts/red-team-summary/)
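A sketch of the BERT sanity checks above (prompt-only leakage and train-vs-test separability), using standard HuggingFace fine-tuning; the model name and hyper-parameters are placeholders, and in practice you would hold out a split and look at the resulting accuracy.

```python
# Sketch: finetune BERT as a sanity check. (a) With completions replaced by "", how much
# of the harmful/benign label is recoverable from the prompt alone? (b) Can a classifier
# tell training examples from test examples (it shouldn't be able to, much)?
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def bert_check(texts, labels, run_name):
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                               num_labels=2)
    ds = Dataset.from_dict({"text": texts, "label": labels})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=256,
                              padding="max_length"), batched=True)
    args = TrainingArguments(output_dir=run_name, num_train_epochs=2,
                             per_device_train_batch_size=16, report_to="none")
    Trainer(model=model, args=args, train_dataset=ds).train()
    return model

# (a) prompt-only leakage:  bert_check(prompts_with_empty_completions, harm_labels, "prompt_only")
# (b) split leakage:        bert_check(all_texts, is_test_example, "train_vs_test")
```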
Not from Extra Compute:
LoRA train with a full LoRA parameter sweep (see the sketch after this list)
LoRA parameters that matter: r, lora_alpha, lora_dropout
Also batch size and learning rate; not the LR scheduler or weight decay. Only try one long run to see how many epochs are needed for a good result, then use that fixed number of epochs.
Finetune full model
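A sketch of that sweep with the PEFT API, assuming a sequence-classification base model; the grid values and base model are illustrative, and the epoch count is whatever the single long pilot run suggested.

```python
# Sketch: LoRA sweep over the parameters that mattered (r, lora_alpha, lora_dropout),
# with batch size and learning rate swept in the outer training loop. No LR-scheduler
# or weight-decay search; epochs fixed from one long pilot run.
import itertools
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

def lora_variants(base_model_name="bert-base-uncased"):
    grid = itertools.product([8, 16, 64],    # r
                             [16, 32],       # lora_alpha
                             [0.0, 0.1])     # lora_dropout
    for r, alpha, dropout in grid:
        base = AutoModelForSequenceClassification.from_pretrained(base_model_name,
                                                                  num_labels=2)
        cfg = LoraConfig(r=r, lora_alpha=alpha, lora_dropout=dropout,
                         target_modules=["query", "value"], task_type="SEQ_CLS")
        yield (r, alpha, dropout), get_peft_model(base, cfg)
```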
Not from Better Features:
Train probes on the CB model (was partially true)
Train probes on the CB model and see how they generalize to the base model (they generalize better)
Manual feature engineering:
adding the norm of the activations and binary variables for whether it's greater than the xth percentile (see the sketch at the end of this list)
Transfer between base and chat versions works; transfer between CB and regular models is even better.
Zero-shot generalization to held-out jailbreak classes. But in production you would train on all the jailbreak classes you had. The zero-shot setup catches more jailbreak classes but misses more examples within each class.
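A sketch of that feature engineering, assuming pooled activations and percentile cutoffs computed on the training set only; the percentile values are illustrative.

```python
# Sketch: augment pooled activations with their norm and with binary indicators for
# "norm above the x-th percentile" (thresholds computed from the training set).
import numpy as np

def engineered_features(acts, train_norms, percentiles=(90, 95, 99)):
    """acts: (n_examples, d_model) pooled activations; train_norms: 1-D array of
    activation norms from the training set, used only to set the thresholds."""
    norms = np.linalg.norm(acts, axis=1, keepdims=True)
    cutoffs = np.percentile(train_norms, percentiles)
    flags = (norms > cutoffs[None, :]).astype(np.float32)   # one binary column per cutoff
    return np.concatenate([acts, norms, flags], axis=1)
```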