• drspod@lemmy.ml
    6 days ago

    For accessibility:

    Daniel @d_feldman 19 Sep 2024

    The widely-used wordfreq database of English word frequencies will no longer be updated.

    Heading: Generative AI has polluted the data.

    Text: I don't think anyone has reliable information about post-2021 language usage by humans. The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies. Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.

    Sep 19, 2024 · 12:22 AM UTC