Researchers have demonstrated that large language models (LLMs) can now significantly outperform traditional methods in deanonymizing individuals from datasets, raising concerns about privacy in the age of increasingly powerful artificial intelligence.
The study, which involved analyzing a synthetic Netflix user database available on Kaggle, showed LLMs’ ability to re-identify users with greater accuracy and recall than classical “attack” methods previously used to exploit privacy vulnerabilities. The researchers tested their approach by attempting to match user profiles within the dataset.
In one experiment, the team measured the precision and recall of both classical and LLM-based deanonymization techniques. The results indicated that while the precision of classical attacks drops off rapidly as they attempt more matches, leaving them with low recall, LLM-based attacks maintain a more consistent level of precision even as the number of guesses increases. Specifically, the researchers found that even a basic LLM-based approach achieved a non-trivial recall rate at low precision, and more advanced iterations doubled recall at 99% precision.
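To make the evaluation concrete, here is a minimal sketch of how precision and recall might be computed for a re-identification attack that emits (query, candidate) guesses. The function, variable names, and toy data are illustrative assumptions, not taken from the study.

```python
def precision_recall(guesses, ground_truth):
    """Score a re-identification attack (hypothetical helper, not from the paper).

    guesses: list of (query_id, candidate_id) pairs the attack emits.
    ground_truth: dict mapping each query_id to its true candidate_id.
    """
    # A guess is correct when it names the true candidate for that query.
    correct = sum(1 for q, c in guesses if ground_truth.get(q) == c)
    precision = correct / len(guesses) if guesses else 0.0      # of guesses made, how many are right
    recall = correct / len(ground_truth) if ground_truth else 0.0  # of all users, how many are found
    return precision, recall

# Toy example: 4 users to re-identify, the attack guesses 3 and gets 2 right.
truth = {"q1": "userA", "q2": "userB", "q3": "userC", "q4": "userD"}
guesses = [("q1", "userA"), ("q2", "userX"), ("q3", "userC")]
p, r = precision_recall(guesses, truth)
print(p, r)  # precision 2/3, recall 2/4
```

An attack that abstains from uncertain matches makes fewer guesses, trading recall for precision; the study's finding is that LLM-based attacks degrade far less along this trade-off than classical ones.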
To further test the LLMs’ capabilities, the researchers introduced “distraction” identities – 5,000 profiles of individuals not present in the original dataset – alongside the 5,000 genuine Netflix users. They also added 5,000 “query distractors,” representing users who appear only in search queries but have no corresponding profiles in the candidate pool. Despite these efforts to obfuscate identities, the LLM-based attacks continued to outperform the classical baseline.
The implications of these findings are far-reaching, according to the researchers. They warn that improved LLM deanonymization capabilities could be exploited by governments to identify online critics, by corporations to build detailed customer profiles for targeted advertising, and by malicious actors to create highly personalized social engineering scams.
The researchers propose several mitigation strategies, including stricter rate limits on API access to user data, enhanced detection of automated scraping, and restrictions on bulk data exports. They also suggest that LLM providers should actively monitor for misuse of their models in deanonymization attacks and implement safeguards to prevent such applications.
The study also acknowledges individual actions as potential defenses, suggesting that users could limit their social media activity or regularly delete older posts to reduce their digital footprint.
“Recent advances in LLM capabilities have made it clear that there is an urgent need to rethink various aspects of computer security in the wake of LLM-driven offensive cyber capabilities,” the researchers wrote. “Our work shows that the same is likely true for privacy as well.”