• Bigfishbest@lemmy.world
    arrow-up
    65
    ·
    15 hours ago

    Argh. As a Wikipedia supporter this grinds my goat, if I can maul two expressions at once. On the other hand, reading the article I see that AI is scrubbing Wikipedia already and it’s driving up costs. What wiki is doing is creating a model where AI companies will pay them for what they’re currently doing anyway.

    I don’t like it, but until they can be blocked or otherwise prevented from access, at least wikimedia should get something out of it. However I fear that this will lead to dependency on AI money, and that, that’s not good.

    • merc@sh.itjust.works
      arrow-up
      18
      ·
      11 hours ago

      Um… it does what?

      Anyhow, not only is AI scraping (not scrubbing, that’s something completely different) Wikipedia, the Wikipedia licenses allow the AI companies to use the materials. Wikipedia content is licensed CC BY-SA or GFDL. So, while Wikipedia could try to block the scrapers, they can’t block the companies from using the content as long as they comply with those (very open) licenses. And, really, this is part of how I want Wikipedia to be used. Not necessarily to train up chatbots, but I want it to be a freely available, freely usable source of knowledge for the world. I like it that it isn’t knowledge that’s hidden behind some firewall. And, if chatbots are going to be trained on the contents of the Internet, at least we know that some of the training data will be good, factual knowledge, not memes, lies, propaganda, etc.

      So, while I’m not happy with anything where data is being sold to the AI companies, in this case I’ll try to get over my knee-jerk reaction and see it as a good thing. Wikipedia gets paid for something that was already freely available, and maybe the jazzed-up autocomplete will more frequently autocomplete from a good source.

    • Deceptichum@quokk.au
      English
      arrow-up
      13
      arrow-down
      4
      ·
      14 hours ago

      Wikipedia already has hundreds of millions, they’re not hurting for cash despite the ads. I’ve stopped donating to them, because the money is only going to Jimmy Wales and not to the actual hard working volunteers who make Wikipedia the great resource it is.

  • dan@upvote.au
    arrow-up
    23
    arrow-down
    1
    ·
    15 hours ago

    I was going to ask why they need to pay given you can download a full copy of Wikipedia’s database for free, but it makes sense that AI training is stressing their download servers and therefore they want to receive compensation for it.

    I’m still undecided about AI, but non-profits receiving payment from big companies that take advantage of their work is always a positive thing.

    • Triumph@fedia.io
      arrow-up
      11
      ·
      15 hours ago

      My gut also tells me that training on Wikipedia content is going to produce way better results than training on reddit content.

    • merc@sh.itjust.works
      arrow-up
      3
      ·
      11 hours ago

      My guess is that this was necessary because the AI companies already downloaded the offline versions of Wikipedia. But they think they can one-up their competition by having “fresher data”, so they either hammer the download servers, pulling the 25 GB full offline version multiple times a day just in case it changed, or they crawl and scrape Wikipedia so they get the data before it makes it into the daily offline version, or something.

      It wouldn’t be hard for Wikipedia to provide them a feed of the changes going to the Wikipedia database so they get the data as fresh as it can possibly be. Plus, doing this most likely reduces the antisocial behaviours that the AI companies would otherwise engage in to get their fresh data. Win, win. Even if it sucks to give these AI companies a win.
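
      As a rough sketch of why a change feed is cheaper than re-downloading everything: a consumer keeps one local snapshot and applies only the edits that arrive. Everything named here (`ChangeEvent`, `Page`, `apply_changes`, the field names) is hypothetical and made up for illustration; it is not Wikipedia's actual feed format or API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One hypothetical feed entry; the field names are illustrative only."""
    title: str
    revision: int
    text: Optional[str]  # None means the page was deleted


@dataclass
class Page:
    revision: int
    text: str


def apply_changes(snapshot: Dict[str, Page], events: Iterable[ChangeEvent]) -> int:
    """Apply feed events to a local snapshot in place; return how many were applied."""
    applied = 0
    for event in events:
        current = snapshot.get(event.title)
        # Skip stale or duplicate events so replaying the feed is harmless.
        if current is not None and event.revision <= current.revision:
            continue
        if event.text is None:
            # Deletion: drop the page if we still hold a copy of it.
            if current is not None:
                del snapshot[event.title]
                applied += 1
            continue
        snapshot[event.title] = Page(event.revision, event.text)
        applied += 1
    return applied


# Assumed example data: one page from an earlier full download, then a small feed.
snapshot = {"Goat": Page(revision=10, text="Goats are mammals.")}
feed = [
    ChangeEvent("Goat", 11, "Goats are domesticated mammals."),
    ChangeEvent("Goat", 11, "Goats are domesticated mammals."),  # duplicate delivery
    ChangeEvent("Yacht", 3, "A yacht is a recreational boat."),
    ChangeEvent("Goat", 9, "stale edit"),  # older than what we already hold
]
applied = apply_changes(snapshot, feed)
```

      With a scheme like this the consumer transfers a handful of changed pages instead of the whole 25 GB dump each time, which is the load reduction the paragraph above is getting at.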

    • mrmaplebar@fedia.io
      arrow-up
      5
      arrow-down
      2
      ·
      15 hours ago

      Just because you can download something does not mean you have the IP rights to do whatever you want with it.

      • DomeGuy@lemmy.world
        arrow-up
        4
        ·
        13 hours ago

        While this is generally true, Wikipedia content is explicitly licensed with a pretty permissive license.

        Eyeballing the English-language summary, the only thing that seems to be tripping GenAI is the “don’t hurt us” section about causing harm to their servers, which explains the story here.

        If LLM companies wanted to just put up with a delay due to how often Wikipedia lets you download everything, they’d probably be able to just do that.

  • lolola@lemmy.blahaj.zone
    arrow-up
    16
    ·
    15 hours ago

    It’s a community pool. It was only a matter of time before some rich asshole complained about it not accommodating their superyacht.

  • Avid Amoeba@lemmy.ca
    arrow-up
    5
    ·
    edit-2
    14 hours ago

    However, companies scraping high volumes of freely available Wikipedia knowledge for AI training has driven up server demand and, subsequently, costs at the non-profit, whose primary source of income is small donations from the public.

    It’s framed around cost.