• FaceDeer@fedia.io
    arrow-up
    46
    arrow-down
    1
    ·
    2 days ago

    I don’t see why everyone’s surprised about this. The Fediverse is running on ActivityPub, an open protocol whose purpose is to broadcast the content we post here to anyone who wants it. Of course it’s being used to train AI, why wouldn’t it?

    • OpenStars@piefed.social
      English
      arrow-up
      34
      arrow-down
      1
      ·
      2 days ago

      Except iirc, they aren’t scraping “properly” (read: efficiently at least, setting aside morality for the sake of discussing this component in isolation), and are causing traffic troubles. If only they took the time to install an actual instance themselves then nobody would care in the slightest (again, ignoring the morality part, for now).

      TLDR: they are being dicks about it, bc offering everything we have for free is not enough for them.

      • MrKaplan@lemmy.world
        English
        arrow-up
        7
        ·
        2 days ago

        of all the scrapers we see, the requests identified as originating from Meta seem to be well behaved overall. they appear to (mostly) be respecting robots.txt where present and their request volume to Lemmy.World is only averaging slightly above 5 requests per minute over the last 2 weeks. they also don’t spoof their user agents to pretend to be web browsers, or at least I have not seen credible accusations of this happening.
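
        For a sense of scale, a rough back-of-the-envelope calculation, using 5 requests per minute as a round figure (the average quoted above is slightly higher, so the real total would be a little larger):

        ```shell
        # Assumed round figure: 5 requests/minute, averaged over 14 days
        per_minute=5
        minutes=$((60 * 24 * 14))        # minutes in two weeks
        total=$((per_minute * minutes))  # total requests over the window
        echo "$total"
        ```

        That works out to roughly 100,000 requests over the two weeks.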

      • Chloé 🥕@lemmy.blahaj.zone
        English
        arrow-up
        5
        ·
        2 days ago

        i mean, that’s exactly what they did with threads, and many instances defederated from it because they didn’t want to have their data scraped by meta

      • scytale@piefed.zip
        English
        arrow-up
        11
        ·
        2 days ago

        But if they do it the “proper” way, they won’t be able to grab the data if instances defederate from them, right? And that’s what the majority of instances will do.

        • FaceDeer@fedia.io
          arrow-up
          9
          ·
          2 days ago

          Assuming you know which instances are the ones they’re collecting data from. It could be any instance.

          • OpenStars@piefed.social
            English
            arrow-up
            6
            ·
            edit-2
            2 days ago

            You are absolutely correct there, in that hypothetical scenario if they were to attempt to hide their traffic among normal instance activities.

            To add a bit more detail to my previous answer: there were some prior discussions about this topic, noting that some of the most popular instances in the entire Threadiverse have been targeted by their usual DDoS-like approach:

            img

        • OpenStars@piefed.social
          English
          arrow-up
          1
          ·
          2 days ago

          I do not know enough about the ActivityPub protocol to answer that. I did think that federation at least used to be the default many years ago, but I am not sure about the current status of that. Indeed, detection and subsequent blocking will always be a cat-and-mouse game, but use of ActivityPub might at least delay the former part? And how would anyone tell them apart from, e.g., a single-person instance, or at least a small community instance, that just wants to pull down all the content across the Fediverse to read?

    • Eggyhead@lemmings.world
      English
      arrow-up
      5
      arrow-down
      1
      ·
      2 days ago

      At this point, I appreciate that anyone can scrape it. Not just Reddit or Meta exclusively, but any startup that wants to compete. Sure, Meta and the biggies have an easier time of it, but at least they don’t get it all only for themselves.

    • Microw@piefed.zip
      English
      arrow-up
      5
      ·
      2 days ago

      That doesn’t necessarily mean that training AI on this data is legal. Especially when several of these instances had legal documents in place specifically forbidding this kind of use.

      • FaceDeer@fedia.io
        arrow-up
        10
        ·
        2 days ago

        There are some lawsuits in motion about this and the early signs are that it is indeed legal. For example, in Kadrey et al. v. Meta the judge issued a summary judgment that training an AI on books was “highly transformative” and fell under fair use, and similarly in Bartz, Graeber and Johnson v. Anthropic the judge ruled that training an AI on books was fair use. I always expected this would be the case, since an AI model does not literally contain the material it was trained on. It learns patterns from the training material, but that’s not the same as the literal expression of that material. Since the training material isn’t being copied, there’s nothing for copyright to restrict here.

    • Corelli_III@midwest.social
      English
      arrow-up
      1
      arrow-down
      1
      ·
      2 days ago

      it isn’t about surprise, silly goose, it’s about moving the interaction from a suspected unknown to a known interaction in our collective threat models

      silly goose

  • flamingos-cant (hopepunk arc)@feddit.uk
    English
    arrow-up
    12
    ·
    2 days ago

    Copying and pasting my own list from here

    List of instances
    beehaw.org
    furry.engineer
    ibe.social
    fediworld.de
    framatube.org
    trailers.ddigest.com
    nrw.social
    lemmynsfw.com
    video.hardlimit.com
    digitalcourage.social
    xn--baw-joa.social
    tube.kockatoo.org
    equestria.social
    wisskomm.social
    social.anoxinon.de
    freiburg.social
    toobnix.org
    toot.bike
    mstdn.lalafell.org
    peertube.linuxrocks.online
    social.rebellion.global
    mastodon.cipherbliss.com
    social.sdf.org
    corteximplant.com
    typo.social
    www.404media.co
    mastodon.ml
    video.liberta.vip
    tilvids.com
    todon.eu
    hessen.social
    digipres.club
    shigusegubu.club
    mastodon.me.uk
    zdf.social
    mastodon.sdf.org
    spore.social
    kolektiva.media
    gruene.social
    share.tube
    nso.group
    mastouille.fr
    masto.es
    vivaldi.com
    literatur.social
    mstdn.mx
    kirche.social
    mastodon.hams.social
    federation.network
    lile.cl
    todon.nl
    betweenthelions.link
    ipv6.social
    linuxrocks.online
    peertube.otakufarms.com
    pawb.social
    mastodon-belgium.be
    jasette.facil.services
    machteburch.social
    mastodont.cat
    mastodon.eus
    eupolicy.social
    social.bau-ha.us
    toot.berlin
    amicale.net
    hexbear.net
    mastodon.bida.im
    reddthat.com
    shelter.moe
    mastodon.nl
    dju.social
    bonn.social
    mstdn.chrisalemany.ca
    social.sciences.re
    tldr.nettime.org
    lemy.lol
    climatejustice.social
    rollenspiel.social
    mastodon.org.uk
    social.kyiv.dcomm.net.ua
    pouet.chapril.org
    ecoevo.social
    social.politicaconciencia.org
    darmstadt.social
    peertube.tv
    lemmus.org
    libretooth.gr
    hackers.town
    tooter.social
    anarchism.space
    diode.zone
    video.infosec.exchange
    mastodon.thirring.org
    aussie.zone
    social.bund.de
    apobangpo.space
    shitpost.cloud
    berlin.social
    toot.aquilenet.fr
    social.beachcom.org
    lemmygrad.ml
    mastodon.radio
    nerdculture.de
    programming.dev
    decayable.ink
    kafeneio.social
    functional.cafe
    things.uk
    fuzzies.wtf
    diaspodon.fr
    dalek.zone
    sunbeam.city
    tooting.ch
    fediscience.org
    mastodon.tetaneutral.net
    social.librem.one
    im-in.space
    lemmy.sdf.org
    legal.social
    post.lurk.org
    mastodon.uy
    noc.social
    tube.pol.social
    lemmy.ml
    don.linxx.net
    infosec.pub
    kolektiva.social
    masto.bike
    furries.club
    zhub.link
    lemmy.world
    openbiblio.social
    mastodon.zaclys.com
    mamot.fr
    clacks.link
    discuss.tchncs.de
    cyberplace.social
    graz.social
    pl.kitsunemimi.club
    mastodonczech.cz
    masto.nobigtech.es
    hostux.social
    pawb.fun
    mastodon.trueten.de
    norden.social
    systemli.social
    mander.xyz
    ciberlandia.pt
    woem.men
    sopuli.xyz
    lemmy.ca
    feddit.uk
    
    • rollin@piefed.social
      English
      arrow-up
      1
      ·
      2 days ago

      off-topic but wow, it’s great to see so many lemmy instances up and running 🥰

      it really looks like we’re well on the way to hitting critical mass

      • flamingos-cant (hopepunk arc)@feddit.uk
        English
        arrow-up
        0
        ·
        edit-2
        2 days ago

        # Convert the PDF to plain text
        pdftotext Meta_Leaked_List.pdf meta.txt
        # Fetch the domains of the instances feddit.uk federates with
        curl -s https://feddit.uk/api/v3/federated_instances | jq -r '.federated_instances.linked[].domain' > instances
        # Print the lines of meta.txt that exactly match a domain in instances
        grep -xFf instances meta.txt
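
        A minimal sketch of what the last step does, using two made-up files (the file names and domains below are placeholders, not entries from the actual list). As far as the flags go, -x keeps only whole-line matches, -F treats each pattern as a literal string rather than a regex, and -f reads the patterns from a file:

        ```shell
        # Placeholder data, not taken from the real files
        printf '%s\n' 'example.social' 'other.example' > instances_demo
        printf '%s\n' 'Some heading' 'example.social' 'sub.example.social' > meta_demo.txt
        # Prints only lines that are exactly equal to one of the patterns
        grep -xFf instances_demo meta_demo.txt
        ```

        Without -x, the sub.example.social line would also be printed, because it contains example.social as a substring.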
        
        • Flax@feddit.uk
          English
          arrow-up
          1
          ·
          2 days ago

          Sorry lol, turns out my client just didn’t render the post properly with the additional link to the file!

  • kbal@fedia.io
    arrow-up
    13
    ·
    2 days ago

    I see that shitposter.club is on the list. Good to know they’re using only the highest-quality training material.

      • kbal@fedia.io
        arrow-up
        1
        ·
        2 days ago

        Moved to shitposter.world according to their site with the expired cert, but I haven’t seen as much on fedi from the new domain as I used to from the old one.

  • Cris@lemmy.world
    English
    arrow-up
    10
    ·
    2 days ago

    This is only a loosely related thought, but are there any new FOSS licenses or anything that prohibit AI usage? I know it’ll be ignored, but it feels like explicitly disallowing things could be important in opening the door to successful legal challenges to AI scraping and theft…

    • FaceDeer@fedia.io
      arrow-up
      13
      ·
      2 days ago

      Case law is still pretty young in this area, but it’s looking like nothing about training AI on copyrighted content actually violates copyright. It’s not something that a license can restrict, because the trainers can simply reject the license and carry on training under the basics of what the law allows them to do anyway.

      Open source licenses only have power because they grant permissions that people normally wouldn’t have and put conditions on those permissions. If you don’t need those permissions then you don’t have to be bound by those conditions.

      • Cris@lemmy.world
        English
        arrow-up
        6
        ·
        2 days ago

        Ahhh, that sucks ass :(

        Thank you for expanding my understanding of the problem!