Wow, it’s amazing that just 3.3% of the training set coming from the same model can already start to mess it up.
I’ve read some snippets of AI-written books and it really does feel like my brain is short-circuiting.
At least in this case, we can be pretty confident that there’s no higher function going on. It’s true that AI models are a bit of a black box that can’t really be examined to understand why exactly they produce the results they do, but they are still just a finite amount of data. The black box doesn’t “think” any more than a river decides its course, though the eventual state of both is hard to predict or control. In the case of model collapse, we know exactly what’s going on: the AI is repeating and amplifying the little mistakes it’s made with each new generation. There’s no mystery about that part; we just lack the ability to directly tune those mistakes out of the model.
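For anyone who wants to see that amplification in a toy setting, here’s a rough sketch. To be clear, this is my own illustration and not how any real language model is trained: the "model" is just a bell curve (a mean and a spread), and each generation is fit only on samples drawn from the previous generation’s fit. The function name `simulate_collapse` and all the parameter values are made up for the example.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy model collapse: refit a Gaussian on its own samples, over and over.

    Hypothetical illustration only. Returns the fitted spread (standard
    deviation) at each generation, starting with the "real data" spread of 1.0.
    """
    rng = random.Random(seed)  # local RNG so runs are repeatable
    mu, sigma = 0.0, 1.0       # generation 0: the original data distribution
    history = [sigma]
    for _ in range(generations):
        # The current model generates the next generation's training set.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # The next model is fit purely on that generated data, so every
        # small estimation error is baked in and carried forward.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spreads = simulate_collapse()
    for gen in (0, 50, 100, 200):
        print(f"generation {gen:3d}: spread = {spreads[gen]:.3e}")
```

With a small sample size like this, the fitted spread tends to shrink toward zero over the generations: the rare values in the tails get lost first, and nothing ever puts them back. It’s a cartoon of the real thing, but the basic feedback loop is the same.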
I mean, we’ve already seen that AI companies are forced to be reactive when people exploit loopholes in their models or some unexpected behavior occurs. Not that they aren’t smart people, but these things are very hard to predict, and hard to fix once they go wrong.
Also, what do you mean by synthetic data? If it’s made by AI, that’s how collapse happens.
The problem with curated data is that you have to, well, curate it, and that’s hard to do at scale. We no longer have a few decades’ worth of unpoisoned data to work with; the only way to guarantee your training data didn’t come from the model itself is to make it yourself.