• Jared White ✌️ [HWC]@humansare.social
        2 days ago

        It is still trained on open source code on GitHub. These code communities seemingly have no way to opt out of their free (libre) contributions being used as training data, nor does the resulting code generation contribute anything back to those communities. It is a form of license stripping. That’s just one issue.

        Just because your inference running locally doesn’t use much electricity doesn’t mean you’ve sidestepped all of the other ethical issues surrounding LLMs.

        • yucandu@lemmy.world
          2 days ago

          It is not trained on open source code on GitHub.

          But I can use it to analyze a datasheet and generate a library for an obscure module that I can then upload to GitHub and contribute to the community.

            • yucandu@lemmy.world
              1 day ago (edited)

              StarCoderData: A large-scale code dataset derived from the permissively licensed GitHub collection The Stack (v1.2) (Kocetkov et al., 2022), which applies deduplication and filtering of opted-out files. In addition to source code, the dataset includes supplementary resources such as GitHub Issues and Jupyter Notebooks (Li et al., 2023).

              That’s not random GitHub accounts or “delicensing” anything. People had to opt IN to be part of “The Stack”. Apertus isn’t trained on community code.

              • Jared White ✌️ [HWC]@humansare.social
                1 day ago

                I’m tired of arguing with you about this, and you’re still wrong. It was opt-out, not opt-in, based initially on a GitHub crawl of 137M repos and 52B files before filtering & dedup.

                • yucandu@lemmy.world
                  1 day ago

                  But again, you’d have to set your project to public and your license to “anyone can take my code and do whatever they want with it” before it’d even be added to that list. That’s opt-in, not opt-out. I don’t see the ethical dilemma here. I’m pretty sure I’ve found ethical AI that produces good value for me and society, and I’m going to keep telling people about it and how to use it.