It’s too easy to actually poison an LLM. They aren’t scrapping the web like they used to anymore. Even if they did, they would have filters to pick up on gibberish.
In a joint study with the UK AI Security Institute and the Alan Turing Institute, we found that as few as 250 malicious documents can produce a “backdoor” vulnerability in a large language model—regardless of model size or training data volume.
It’s too easy to actually poison an LLM. They aren’t scrapping the web like they used to anymore. Even if they did, they would have filters to pick up on gibberish.
How so? I’m curious.
This is the main paper I’m referencing https://www.anthropic.com/research/small-samples-poison .
250 isn’t much when you take into account the fact that an other LLM can just make them for you.
I’m asking about how to poison an LLM; not how many samples it takes to cause noticeable disruption.
Bro, it’s in the article. You asked “how so” when I said it was easy, not how to.