A Project to Poison LLM Crawlers

Disillusionist@piefed.world · 2 days ago

Taldan@lemmy.world · 1 day ago

Let’s say I believe you. If that’s the case, why are AI companies still scraping everything?

FaceDeer@fedia.io · 1 day ago

Raw materials to inform the LLMs constructing the synthetic data, most likely. If you want it to be up to date on the news, you need to give it that news.

The point is not that the scraping doesn’t happen; it’s that the data is already heavily processed and filtered before it reaches the LLM training step. There’s already a ton of “poison” in that data naturally. Early LLMs like GPT-3 just swallowed the poison and muddled on, but researchers have since learned how much better LLMs can be when trained on cleaner data, so they already take steps to clean it up.
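
To make "take steps to clean it up" a bit more concrete, here is a minimal sketch of what a pre-training filter could look like. It is a toy illustration, not any company's actual pipeline: the function names (`looks_clean`, `filter_corpus`) and all thresholds are made up for the example, and real pipelines are likely far more elaborate.

```python
import hashlib


def looks_clean(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5,
                min_alpha_ratio: float = 0.6) -> bool:
    """Toy heuristic: reject very short, highly repetitive, or symbol-heavy text.

    All thresholds are illustrative assumptions, not values from a real pipeline.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    # Share of the text taken up by its single most common word.
    most_common = max(words.count(w) for w in set(words))
    if most_common / len(words) > max_repeat_ratio:
        return False
    # Share of characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def filter_corpus(docs):
    """Drop exact duplicates (after case/whitespace normalization), then apply the heuristic."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # already saw an identical document
        seen.add(key)
        if looks_clean(doc):
            kept.append(doc)
    return kept


scraped = [
    "The council approved the new budget on Tuesday evening.",
    "the council approved the new  budget on Tuesday evening.",  # near-verbatim repost
    "buy buy buy buy buy buy now",                                # repetitive spam
    "$$$ ### @@@ !!! %%% ^^^ &&&",                                # symbol junk
    "too short",
]
cleaned = filter_corpus(scraped)
```

In this toy run only the first document survives. Even crude checks like these catch a lot of obvious junk, which is the general idea behind filtering before the training step.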

RNSAFFN