Tbh I think you’re making a lot of assumptions and ignoring the point of this paper. The small model was used to quickly demonstrate generative degradation over iterations when a model is trained on its own output. OPT-125M was chosen precisely because it is small, so they could show the phenomenon in fewer iterations. The point still stands that this kind of data poisoning exists, and there is no reason to think a much bigger model would be immune to it, only that it would take longer. I suspect that with companies continually scraping the web for data, including Reddit, which this article mentions has struck a deal with Google to let its models train on Reddit content, this process will not in fact take very long as more and more Reddit posts are themselves AI-generated.
I think it’s a fallacy to assume that a giant model is therefore “higher quality” and resistant to data poisoning.
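For anyone who wants to see the "train on your own output" loop in miniature, here is a rough toy sketch. To be clear, this is not the paper's setup: it assumes the "model" is just a Gaussian fitted to a handful of samples, and the function name `simulate_collapse`, the sample size, and the generation count are all made up for illustration. Each generation is fitted only to samples drawn from the previous generation's fit.

```python
import random
import statistics

def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy stand-in for recursive training (hypothetical, not the paper's method).

    The "model" is a Gaussian described by (mean, std). Every generation
    draws a small sample from the current model and refits the model to
    that sample alone, so no fresh "human" data ever enters the loop.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0          # generation 0: the "real" data distribution
    stds = [std]
    for _ in range(generations):
        # Generate synthetic data from the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Retrain" on the synthetic data only (maximum-likelihood fit).
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        stds.append(std)
    return stds

stds = simulate_collapse()
print(f"gen 0 std: {stds[0]:.3f}, final std: {stds[-1]:.3e}")
```

With a small sample per generation, the fitted spread tends to shrink over time because each refit loses a bit of the tails. A larger `sample_size` slows the shrinkage but does not remove it, which is roughly the "bigger just takes longer" argument in toy form. Whether that carries over to large language models is exactly what is being debated here.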
Is it being poisoned because the generated data is garbage or because the generated data is made by an AI?
Using a small model lets it be shown faster, but it also means the outputs are seriously terrible. It’s common to fine-tune models on GPT-4 outputs, which directly goes against this.
And there is a correlation between size and performance. It’s not a rule per se, and people are working hard on squeezing more and more out of small models, but it’s not a fallacy to assume bigger is better.
I think it’s also worth keeping in mind that some people use AI to generate “real sounding” content for clicks, or for scams, rather than making actual decent content. I’d argue that humans making shitty content is going to happen on a much worse scale as AI helps automate it. The other thing is I worry AI can’t as easily tell human- or AI-made bullshit from decent content. I may know the top two Google results are AI-generated clickbait, but whatever is scraping content en masse may not bother to differentiate. So it might become an exponential issue.