I was going to ask why they need to pay given you can download a full copy of Wikipedia’s database for free, but it makes sense that AI training is stressing their download servers and therefore they want to receive compensation for it.
I’m still undecided about AI, but non-profits receiving payment from big companies that take advantage of their work is always a positive thing.
My gut also tells me that training on Wikipedia content is going to produce way better results than training on reddit content.
My guess is that this was necessary because the AI companies already downloaded the offline versions of Wikipedia, but think they can one-up their competition by having "fresher data". So they either hammer the download servers, re-downloading the 25 GB full offline version multiple times a day just in case it changed, or they crawl and scrape Wikipedia directly so they get the data before it makes it into the daily offline version.
It wouldn't be hard for Wikipedia to provide them a feed of the changes going into the Wikipedia database, so they get the data as fresh as it can possibly be. Plus, doing this most likely reduces the antisocial behaviours that the AI companies would otherwise engage in to get their fresh data. Win-win. Even if it sucks to give these AI companies a win.
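For what it's worth, Wikimedia already publishes something close to this: the EventStreams service, which pushes recent changes across all wikis as server-sent events. A minimal sketch of consuming it (the stream URL is real; the filtering and the function names here are my own illustration, not an official client):

```python
import json
import urllib.request

# Wikimedia's public live change feed, delivered as server-sent events (SSE).
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def parse_sse_data(line: str):
    """Extract the JSON payload from an SSE 'data:' line; None for other lines."""
    if line.startswith("data: "):
        return json.loads(line[len("data: "):])
    return None

def follow_recent_changes(wiki="enwiki"):
    """Yield (title, event) pairs for edit events on one wiki, as they happen."""
    with urllib.request.urlopen(STREAM_URL) as resp:
        for raw in resp:
            event = parse_sse_data(raw.decode("utf-8").rstrip("\n"))
            if event and event.get("wiki") == wiki and event.get("type") == "edit":
                yield event["title"], event
```

So the "feed of changes" already exists for anyone polite enough to use it; the deals presumably cover the companies that would rather hammer the dumps instead.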
Just because you can download something does not mean you have the IP rights to do whatever you want with it.
While this is generally true, Wikipedia content is explicitly released under a pretty permissive license.
Eyeballing the English-language summary, the only thing that seems to trip up GenAI companies is the "don't hurt us" section about causing harm to their servers, which explains the story here.
If LLM companies were willing to put up with a delay, given how often Wikipedia lets you download everything, they'd probably be able to just do that.
lucky it's an AI company doing it