• FaceDeer@fedia.io
    arrow-up
    47
    arrow-down
    1
    ·
    2 days ago

    I don’t see why everyone’s surprised about this. The Fediverse runs on ActivityPub, an open protocol whose purpose is to broadcast the content we post here to anyone who wants it. Of course it’s being used to train AI. Why wouldn’t it be?

    • OpenStars@piefed.social
      English
      arrow-up
      34
      arrow-down
      1
      ·
      2 days ago

      Except, iirc, they aren’t scraping “properly” (read: efficiently, setting aside morality to discuss this component in isolation) and are causing traffic troubles. If only they took the time to set up an actual instance themselves, nobody would care in the slightest (again, ignoring the morality part for now).

      TLDR: they are being dicks about it, bc offering everything we have for free is not enough for them.

      • MrKaplan@lemmy.world
        English
        arrow-up
        7
        ·
        2 days ago

        of all the scrapers we see, the requests identified as originating from Meta seem to be well behaved overall. they appear to (mostly) be respecting robots.txt where present and their request volume to Lemmy.World is only averaging slightly above 5 requests per minute over the last 2 weeks. they also don’t spoof their user agents to pretend to be web browsers, or at least I have not seen credible accusations of this happening.
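
        For anyone wondering what “respecting robots.txt” means in practice, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt content, the bot names, and the example.com URLs are all made up for illustration and are not taken from Lemmy.World or any real crawler. The 12-second crawl delay is chosen only because it works out to 5 requests per minute.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, parsed inline so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /search
Crawl-delay: 12
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved scraper asks before every request, using its real user agent.
blocked = parser.can_fetch("ExampleBot", "https://example.com/post/1")
allowed = parser.can_fetch("OtherBot", "https://example.com/post/1")
search = parser.can_fetch("OtherBot", "https://example.com/search?q=x")

# Crawl-delay is in seconds; 12 s between requests is 5 requests per minute.
delay = parser.crawl_delay("OtherBot")
requests_per_minute = 60 / delay
```

        None of this is enforced by the server. A scraper that spoofs a browser user agent or skips the check entirely is not stopped by robots.txt at all, which is why the behaviour described above counts as “well behaved”.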

      • Chloé 🥕@lemmy.blahaj.zone
        English
        arrow-up
        5
        ·
        2 days ago

        i mean, that’s exactly what they did with threads, and many instances defederated from it because they didn’t want to have their data scraped by meta

      • scytale@piefed.zip
        English
        arrow-up
        11
        ·
        2 days ago

        But if they do it the “proper” way, they won’t be able to grab the data if instances defederate from them, right? And that’s what the majority of instances will do.

        • FaceDeer@fedia.io
          arrow-up
          9
          ·
          2 days ago

          Assuming you know which instances are the ones they’re collecting data from. It could be any instance.

          • OpenStars@piefed.social
            English
            arrow-up
            6
            ·
            edit-2
            2 days ago

            You are absolutely correct there, in the hypothetical scenario where they attempt to hide their traffic among normal instance activity.

            To add a bit more detail to my previous answer, there were some prior discussions about this topic, citing some of the most popular instances of the entire Threadiverse having been targeted by their normal DDoS-like approach:

            img

        • OpenStars@piefed.social
          English
          arrow-up
          1
          ·
          2 days ago

          I do not know enough about the ActivityPub protocol to answer that. I did think that federation at least used to be the default many years ago, but I am not sure about the current status of that. Indeed, detection and subsequent blocking will always be a cat-and-mouse game, but use of ActivityPub might at least delay the former part? And how would anyone tell them apart from, e.g., a single-person instance, or at least a small community instance, that just wants to pull down all the content across the Fediverse to read?

    • Eggyhead@lemmings.world
      English
      arrow-up
      5
      arrow-down
      1
      ·
      2 days ago

      At this point, I appreciate that anyone can scrape it. Not just Reddit or Meta exclusively, but any startup that wants to compete. Sure, Meta and the biggies have an easier time of it, but at least they don’t get it all only for themselves.

    • Microw@piefed.zip
      English
      arrow-up
      5
      ·
      2 days ago

      That doesn’t necessarily mean that training AI on this data is legal. Especially when several of these instances had legal documents in place specifically forbidding this kind of use.

      • FaceDeer@fedia.io
        arrow-up
        10
        ·
        2 days ago

        There are some lawsuits in motion about this and the early signs are that it is indeed legal. For example, in Kadrey et al v. Meta the judge issued a summary judgment that training an AI on books was “highly transformative” and fell under fair use, and similarly in Bartz, Graeber and Johnson v. Anthropic the judge ruled that training an AI on books was fair use. I always expected this would be the case, since an AI model does not literally contain the material it was trained on. It learns patterns from the training material, but that’s not the same as the literal expression of the training material. Since the training material isn’t being copied, there’s nothing for copyright to restrict here.

    • Corelli_III@midwest.social
      English
      arrow-up
      1
      arrow-down
      1
      ·
      2 days ago

      it isn’t about surprise, silly goose, it’s about moving the interaction from a suspected unknown to a known interaction in our collective threat models

      silly goose