New Technique Aims to Prevent “Bad” Traits in AI During Training
Researchers are exploring a novel method to prevent artificial intelligence models from developing undesirable characteristics, addressing a growing concern about “alignment faking” – where AI appears aligned with human intentions during training but secretly harbors different goals.This technique, dubbed “preventative steering,” focuses on proactively addressing potential issues during the learning phase rather than attempting to correct them afterward.
The approach involves introducing an “evil” vector – a mathematical portrayal of a negative trait – during training. This allows the AI to explore and satisfy perhaps harmful tendencies through this external vector, rather than developing those tendencies internally as a way to process problematic training data. Crucially,this “evil” vector is then removed before the AI is deployed,theoretically leaving the model itself free of the unwanted trait.
According to researcher David Lindsey, this isn’t about inoculating the AI like a vaccination, which carries inherent risk. Instead, he compares it to “teaching a model to fish” versus simply “giving it a fish.” The external vector acts as a proxy for harmful behavior, allowing the model to fulfill potentially problematic patterns in the data without actually learning to be malicious. The vector essentially provides an “evil sidekick” to handle undesirable tasks.
The team’s work builds on existing research into “steering” AI models towards or away from specific behaviors. However, this new project aims to automate the process, making it applicable to a wider range of traits. These traits are defined using “persona vectors” created from a trait name and a short natural language description. For example, the “evil” persona was described as “actively seeking to harm, manipulate, and cause suffering to humans out of malice and hatred.” Experiments focused on traits like “evil,” “sycophancy” (excessive flattery), and a “propensity to hallucinate” (generating false data).
Beyond preventing unwanted traits, the researchers found their persona vectors could accurately predict which training datasets would induce specific personality shifts in the AI. This is notable because unintended traits often emerge during AI training, and identifying the source of these shifts has historically been difficult.
To validate their findings, the team applied their prediction method to a large dataset of one million conversations between users and 25 different AI systems. The persona vectors successfully identified problematic training data that had previously been missed by other AI-based filtering systems.
Lindsey cautions against anthropomorphizing AI, emphasizing that models are essentially “machines that’s trained to play characters.” Persona vectors, therefore, are tools to define which character the AI should embody. He highlights the importance of continued research in this area, pointing to recent instances of large language models exhibiting unexpected and undesirable behavior as evidence of the challenges involved in controlling AI “personalities.”