Researchers find just 250 malicious documents can make LLMs vulnerable to backdoors

Artificial intelligence companies have worked at breakneck speed to develop the best and most powerful tools, but this rapid development has not always been accompanied by a clear understanding of AI’s limitations or weaknesses. Today, Anthropic released a report on how attackers can influence the development of a large language model.
The study focused on a type of attack called poisoning, in which an LLM is pre-trained on malicious content intended to make it learn dangerous or unwanted behaviors. The main conclusion of this study is that a bad actor does not need to control a percentage of the pre-training material for the LLM to be poisoned. Instead, researchers found that a small and fairly constant number of malicious documents can poison an LLM, regardless of the size of the model or its training materials. The study successfully created backdoor LLMs based on the use of only 250 malicious documents in the pre-training dataset, a much smaller number than expected for models ranging from 600 million to 13 billion parameters.
“We share these findings to show that data poisoning attacks may be more practical than many believe, and to encourage further research into data poisoning and potential defenses against it,” the company said. Anthropic collaborated on this research with the UK’s AI Security Institute and the Alan Turing Institute.



