Hot Saucerman

MOTHER FATHER CHINESE DENTIST!

Situationists never die, they’re just remixed.

Have you heard of Monsieur Guy Debord?

  • 5 Posts
  • 61 Comments
Joined 4 years ago
Cake day: June 6th, 2020





  • on a book is pirating said book.

    If the source is literally a piracy website that serves up applications on how to remove DRM from ebooks, it’s absolutely piracy. You can’t just deny the source and be like “it’s not piracy!” The data came into your hands illicitly, not legally. Especially if DRM was circumvented and removed before it came into your hands.

    They didn’t go out and buy copies of thousands of books.

    Pretty amusing that you think scraping published data somehow constitutes surveillance, though.

    I don’t. I was making a point about how absurdly large the language models have to be: if they need that much data on top of thousands of pirated books, they fundamentally cannot make the models work without also scraping the internet for data, which is surveillance.


  • Right, you can still do traditional advertising without the targeted metrics provided by smartphones, but…

    AI LLMs literally require a corpus of language to learn from. Thus the “Large Language” part of “LLM.” The amount of data these models need to function is so staggeringly huge there is no way they can compile all that data without scraping the entire internet and pirating a bunch of copyrighted books.

    It’s fundamentally a surveillance technology, because the technology cannot function without that large dataset of language to begin with. It needs massive amounts of data that can only be gathered through surveillance, because unless you’re Reddit or Facebook, your own site probably doesn’t contain enough data to meet the needs of the LLM. Thus you need to scrape the internet for more data in hopes of filling it out.

    Books3 is used widely as part of “The Pile” and is clearly all of the content of private torrent tracker Bibliotik. People theorize Books2 is all of the books from Library Genesis. To make their models work at all, they have to scrape the internet and pirate thousands of books.

    This is also fundamentally why AI starts to fail so quickly: these tools have been used to flood the internet with AI generated pages, which in turn become training data for AI, so the training data is tainted with AI generated garbage, which will further degrade the LLM. The plus side, I guess, is that if they keep using this kind of business model, they will unintentionally make their AI pretty useless within a few years by flooding the internet with useless, incorrect data.



  • Books3 is the definition of “not publicly available” because it’s all from pirated material downloaded from private torrent tracker Bibliotik.

    Books3 is literally why several AI groups are being sued by various authors like Sarah Silverman and George R.R. Martin.

    Books3 was always illicitly obtained material, which calls into question whether an LLM using it could really fall under Fair Use. (It most likely does, but it’s still a legal question that hasn’t been answered yet.)

    Books3 Link: https://huggingface.co/datasets/the_pile_books3

    Books3 Description from Link:

    This dataset is Shawn Presser’s work and is part of EleutherAi/The Pile dataset.

    This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1). seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.



  • I think it is probably more about how Books3 is and was always the full content of a torrent site for books.

    Fair Use is all fine and good, but it’s very telling that these companies are happy to justify piracy when it’s convenient for them and then oppose it when it is not.

    It is rather hypocritical, and there are also questions about whether Fair Use can apply to a non-human. People generating art and text with it are in a weird grey area, because on one hand it can be argued they are using a tool, but on the other, the results are so random: how much influence does the user actually have over what they create? If the answer is “not much influence” then the tool is creating the art, not the person. At that point, is it really reasonable to argue “fair use?”







  • Google is sitting on the “but they’re contractors!” angle because it makes it easier for them.

    Why?

    Because once the union does collective bargaining with their actual employer, Cognizant, the company will have almost no recourse but to increase fees to Google for the contract work.

    Once this happens, Google just says “Oops, you’re shit out of luck” and then hires a whole new company of contracted workers for the same work, for cheaper.

    Google purposefully uses this type of structure to ensure they never have to pay more, even when collective bargaining with unions does happen. Because then they can just shitcan the whole company and claim costs were too high. They certainly won’t break their contract, but you can bet your ass when time comes to renew it, Google will have found someone new to take their place.





  • The use of the command line is literally Linux’s biggest strength and why it dominates the server space. Linux servers can be run “headless” with no monitor and no Graphical User Interface. Command line only. A GUI takes processing power from the CPU/GPU and eats up RAM.

    Until very recently, Windows servers required much higher system specs to run the server because Windows was never primarily command line. You always had to have a GUI, no headless.

    MS has gone whole hog with PowerShell, their answer to Linux. They even have versions of Windows Server that run headless now.

    Sorry, I just think it’s a little funny that your biggest complaint is the thing that made Linux so powerful and that Windows has been playing catch-up with Linux in that arena for over a decade now.