• 1 Post
  • 315 Comments
Joined 2 years ago
Cake day: July 14th, 2023


  • LLM image processing doesn’t work the same way reverse image lookup does.

    Tldr explanation: Multimodal LLMs turn pictures into 200-500 or so tokens, but reverse image lookups create perceptual hashes of images and look up the hash of your uploaded image in a database.

    Much longer explanation:

    Multimodal LLMs (technically, LMMs - large multimodal models) use vision transformers to turn images into tokens. They use tokens for words, too, but the image tokens don’t correspond to words. There are multiple ways this could be implemented, but a common approach is to break the image down into a grid, then transform each “patch” of a specific size, e.g., 16x16 pixels, into a single token. The patches aren’t transformed individually - the whole image is processed together, in context - but the model still ends up with basically 200 or so tokens that allow it to respond to the image the same way it would respond to text.
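    To make the patch idea concrete, here is a minimal sketch of the grid step only. It assumes a 224x224 input (a common size, but an assumption on my part) and 16x16 patches; the function names are made up for illustration, and real models differ in resizing, extra tokens, and how patches get merged.

```python
def patch_token_count(width: int, height: int, patch: int = 16) -> int:
    """Count the patches in a width x height image cut into patch x patch tiles."""
    if width % patch or height % patch:
        # Assumption: the image was already resized to a multiple of the patch size.
        raise ValueError("image dimensions must be multiples of the patch size")
    return (width // patch) * (height // patch)


def split_into_patches(pixels, patch):
    """Cut a 2D list of pixel values into flattened patches, in row-major order."""
    height, width = len(pixels), len(pixels[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                pixels[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ])
    return patches


# A 224x224 image with 16x16 patches gives a 14x14 grid.
print(patch_token_count(224, 224))  # 196

# Tiny 4x4 "image" cut into 2x2 patches, just to show the layout.
tiny = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(split_into_patches(tiny, 2)[0])  # [0, 1, 4, 5]
```

    Each flattened patch would then be projected into an embedding and processed together with the others; that part is not shown here.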

    Current vision transformers also struggle with spatial awareness. They embed basic positional data into the tokens, but that encoding is fragile and unsophisticated. Fortunately there’s a lot to explore in that area so I’m sure there will continue to be improvements.

    One example improvement, beyond improved spatial embeddings, would be to use a dynamic vision transformer that’s dependent on the context, or that can re-evaluate an image based on new information. Outside the use of vision transformers, simply training LMMs to use other tools on images when appropriate can potentially help with many of LMM image processing’s current shortcomings.

    Given all that, asking an LLM to find the album for you is - assuming you’ve given it the ability and permission to search the web - like showing the image to someone with no context, then asking them to help you find which music video it’s in. They’ve never seen the video, they can only describe the artist’s appearance with 10-20 generic words, none of which are the artist’s name, and you have to hope that there were specific details that would make it come up in the top ten results on Google, and that they remembered them. That’s a convoluted way to say that it’s a hard task.

    By contrast, reverse image lookup basically uses a perceptual hash generated for each image. It’s the tool that should be used for your particular problem, because it’s well suited for it. LLMs were the hammer and this problem was a torx screw.
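    As a rough illustration of what a perceptual hash does, here is a minimal sketch of an “average hash”, one of the simplest variants. I’m not claiming this is what any particular reverse image search uses; it assumes the image has already been converted to grayscale and shrunk to a tiny grid, and the sample pixel values are invented.

```python
def average_hash(gray):
    """Hash a small grayscale grid: one bit per pixel, set if brighter than the mean."""
    flat = [value for row in gray for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes (smaller means more similar)."""
    return bin(a ^ b).count("1")


# Invented 4x4 sample: dark left half, bright right half.
original = [[10, 20, 200, 210], [15, 25, 205, 215]] * 2
# Same picture, slightly brighter (e.g. re-encoded or filtered).
brighter = [[value + 10 for value in row] for row in original]
# Mirrored picture: bright left half, dark right half.
mirrored = [list(reversed(row)) for row in original]

print(hamming(average_hash(original), average_hash(brighter)))  # 0
print(hamming(average_hash(original), average_hash(mirrored)))  # 16
```

    The lookup side is then just “find stored hashes within a small Hamming distance of the uploaded image’s hash”, which is why near-duplicates are found even after resizing or recompression.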

    Suggesting you use a reverse image lookup tool - or better, using one itself - is what the LLM should do in this instance. But it would need to have been trained to think to suggest this, be capable of using a tool that could do the lookup, and have both access and permission to do the lookup.

    Here’s a paper that might help understand the gaps between LMMs and tasks built for that specific purpose: https://arxiv.org/html/2305.07895v7



  • Why is 255 off limits? What is 127.0.0.0 used for?

    To clarify, I meant that specific address - if the range starts at 127.0.0.1 for local, then surely 127.0.0.0 does something (or is reserved to sometimes do something, even if it never actually does in practice), too.

    Advanced setup would include a reverse proxy to forward the requests from the applications port to the internet

    I use Traefik as my reverse proxy, but I have everything on subdomains for simplicity’s sake (no path mapping except when necessary, which it generally isn’t). I know 127.0.0.53 has special meaning when it comes to how the machine directs particular requests, but I never thought to look into whether Traefik or any other reverse proxy supported routing rules based on the IP address. But unless there’s some way to specify that IP and the IP of the machine, it would be limited to same device communications. Makes me wonder if that’s used for any container system (vs the use of the 10, 172.16-31, and 192.168 blocks that I’ve seen used by Docker).

    Well this is another advanced setup but if you wanted to segregate two application on different subnets you can. I’m not sure if there is a security benefit by adding the extra hop

    Is there an extra hop when you’re still on the same machine? Like an extra resolution step?

    I still don’t understand why .255 specifically is prohibited. 8 bits can go up to 255, so it seems weird to prohibit one specific value. I’ve seen router subnet configurations that explicitly cap the top of the range at .254, though - I feel like I’ve also seen some that capped at .255 but I don’t have that hardware available to check. So my assumption is that it’s implementation specific, but I can’t think of an implementation that would need to reserve all the .255 values. If it was just the last one, that would make sense - e.g., as a convention for where the DHCP server lives on each network.
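    One quick way to poke at this is Python’s standard `ipaddress` module, which reports the first (“network”) and last (“broadcast”) address of a subnet. This is only a sketch of the usual convention for ordinary subnets, not of what any particular router does, and the helper names are my own.

```python
import ipaddress


def reserved_ends(cidr):
    """Return the network address and broadcast address of a subnet as strings."""
    net = ipaddress.ip_network(cidr)
    return str(net.network_address), str(net.broadcast_address)


def is_usable_host(address, cidr):
    """True if the address is inside the subnet and is neither of its two end addresses."""
    net = ipaddress.ip_network(cidr)
    addr = ipaddress.ip_address(address)
    return addr in net and addr not in (net.network_address, net.broadcast_address)


for cidr in ("192.168.1.0/24", "10.0.0.0/16", "127.0.0.0/8"):
    print(cidr, reserved_ends(cidr))

# On a /24, .255 is the last address, so it is the broadcast address.
print(is_usable_host("192.168.1.255", "192.168.1.0/24"))  # False
# On a /16, only 10.0.255.255 is the last address; 10.0.1.255 is an ordinary host.
print(is_usable_host("10.0.1.255", "10.0.0.0/16"))  # True
```

    If that convention is what’s at play, it would explain routers capping a /24 at .254, and it would mean .255 is only special when it happens to be the last address of the subnet.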









  • Fair point, I should have asked about commercial games in general

    That said I didn’t mean that the game studio itself would do the AI training and own their models in-house; if they did, I’d expect it to go just as poorly as you would. Rather, I’d expect the model to be created by an organization specialized in that sort of thing.

    For example, “Marey” is a GenAI model that its creators say was trained ethically.

    Another is Adobe Firefly, where Adobe says they trained only on licensed and public domain content. It also sounds like Adobe is paying the artists whose content was used for AI training. I believe that Canva is doing something similar.

    StabilityAI is also doing something similar with Stable Audio 2.0, where they partnered with a music licensing company, AudioSparx, to ensure that artists are compensated, AI opt outs are respected, etc…

    I haven’t dug into any of those too deep, but they seem to be heading in the right direction at the surface level, at least.

    One of the GenAI scenarios that’s the most terrifying to me is the idea of a company like Disney using all the material they have copyright for to train their own, proprietary GenAI image, audio, and video tools… not because I think the outputs would be bad, but because of the impact that would have on creators in that industry.

    Fortunately, as long as copyright doesn’t apply to purely AI generated outputs, even if trained entirely on your own content, then I don’t think Disney specifically will do this.

    I mention that as an example because that usage of AI, regardless of how ethically the model was trained, would still be unethical, in my opinion. Likewise in game creation, an ethically trained and operated model could still be used unethically to eliminate many people’s jobs in the interest solely of better profits.

    I’d be on board with AI use (in game creation or otherwise) if a company were to say, “We’re not changing the budget we have for our human workforce, including for contractors, licensed art, and so on, other than increasing it as inflation and wages increase. We will be using ethical AI models to create more content than we otherwise would have been able to.” But I feel like in a corporate setting, its use is almost always going to result in them cutting jobs.



  • Depends on your e-reader! If you have a Kindle, Kobo, or Nook, yes, that’s true. However:

    Boox has e-readers that run Android and you can install Hoopla. The Palma 2 is phone sized which is great. The Page, Leaf2, and Go 7 are all in the 7” form factor, plus they have 6” versions. And they have tablet sizes, too. They have both traditional black&white and color e-ink displays.

    I have the Boox Air 3C and the original Palma and both are great. I’ll likely get a Boox as my next standard sized e-reader, too (whenever I replace my Kindle Oasis). Though unless the technology drastically improves before then, it’ll be one with a black and white screen. (The color is nice in the tablet sizes, though, especially for comics from Hoopla.)

    Some other options that I’m less familiar with include:

    • Bigme has Android 7” color e-readers, as well as tablets and e-ink smartphones.
    • Meebook has e-readers that run Android (and Android e-ink tablets)
    • The MuSnap Aura C is a 10” Android e-ink tablet
    • XPPen has an 11” Android e-ink tablet





  • Did he implement two different variations? OP said he used two different tools, not that his solutions were any different.

    That said… how so?

    There are many different ways two different brute force approaches might vary.

    A naive search and a search with optimizations that narrow the search area (e.g., because certain criteria are known and thus don’t need to be iterated over) can both be brute force solutions.

    You could also just change the search order to get a different variation. In this case, we have customer, price, meat, cheese, and we need to build a combination of those to get our solution; the way you construct that can also vary.
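    Here is a minimal sketch of two brute force variations on a toy version of this kind of puzzle. The clues below are hypothetical (they are not the clues from the actual puzzle), and only customers and cheeses are modeled to keep it short.

```python
from itertools import permutations

CUSTOMERS = ["Carol", "Omar", "Peter"]
CHEESES = ["Cheddar", "Swiss", "Gouda"]


def satisfies(assignment):
    """Hypothetical toy clues: Carol did not get Swiss, and Omar got Cheddar."""
    return assignment["Carol"] != "Swiss" and assignment["Omar"] == "Cheddar"


def naive_search():
    """Try every customer-to-cheese assignment until one satisfies the clues."""
    tried = 0
    for perm in permutations(CHEESES):
        tried += 1
        assignment = dict(zip(CUSTOMERS, perm))
        if satisfies(assignment):
            return assignment, tried
    return None, tried


def pruned_search():
    """Still brute force, but Omar's cheese is known, so it is not iterated over."""
    tried = 0
    other_customers = [c for c in CUSTOMERS if c != "Omar"]
    other_cheeses = [c for c in CHEESES if c != "Cheddar"]
    for perm in permutations(other_cheeses):
        tried += 1
        assignment = dict(zip(other_customers, perm))
        assignment["Omar"] = "Cheddar"
        if satisfies(assignment):
            return assignment, tried
    return None, tried


naive_solution, naive_tried = naive_search()
pruned_solution, pruned_tried = pruned_search()
print(naive_solution == pruned_solution)  # True
print(naive_tried, pruned_tried)
```

    Both functions enumerate candidates and test them, so both are brute force; they differ only in how much of the search space they bother to enumerate and in what order.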


  • The comparison to your SO’s approach is a bit sloppy. He didn’t reason out a solution himself; he wrote a program to solve the puzzle.

    How do you define “reasoning?” Maybe your definition is different than mine. My experience is that there is a certain amount of reasoning going on, even with non-reasoning LLMs. Being able to answer “What is the capital of the state that has Houston in it?” for example, is something I would classify as very basic reasoning. And now, LLM-powered chat bots are much more capable.

    All that “reasoning” or “thinking” really is, though, is a way to get additional semantic connections in place without:

    • giving an answer in the wrong format
    • filling up context with noise

    There are limits to how well these chat bots can reason. One of those limits is specifically related to the context size. As the context becomes larger, the model’s capabilities become worse. By asking it to show all its work, you exacerbated that weakness.

    That still doesn’t mean LLM-powered chat bots can’t reason, just that there are limits.

    I used to do puzzle books with these sorts of problems when I was younger, and they always came with multiple sets of grids with row and column labels filled out to facilitate the elimination approach. I don’t know that most people would think “Hey, it would be helpful to build a grid for each way of setting up these constraints.” One grid, sure, but I don’t think one grid is sufficient for this sort of problem.

    I don’t think I am - or that most people are, for that matter - capable of reasoning through all of the necessary steps in my head and aloud, without use of, at minimum, pencil and paper. I hope you wouldn’t say that I and most people aren’t capable of reasoning as a result.

    I just asked Claude Opus 4 (with extended thinking and analysis tools enabled) “Given the freedom to use any tools of your choice, design an approach to solve this problem. Feel free to use any tools that are necessary, including writing code in a language of your choice.” followed by pasting your problem text. It wrote and ran some code to solve it, gave me an answer that violated constraint 7, and when I pointed that out, ran some more code and gave me a correct solution on its second attempt.

    You are absolutely correct. I apologize for the error. Let me revise the solution to ensure constraint 7 is satisfied.

    Thank you for catching that error. Here is the corrected solution that satisfies all nineteen constraints:

    Final Solution:

    Customer  | Price | Cheese     | Meat
    Carol     | $5.75 | Colby Jack | Ham
    Darrell   | $8.75 | Gouda      | Corned Beef
    Henrietta | $6.75 | Swiss      | Salami
    Kenneth   | $7.75 | Provolone  | Prosciutto
    Lula      | $3.75 | Havarti    | Bologna
    Omar      | $4.75 | Cheddar    | Roast Beef
    Peter     | $2.75 | Muenster   | Liverwurst

    The key correction was recognizing that since ham cannot be paired with swiss cheese (constraint 7), and Henrietta must have swiss cheese based on the process of elimination, Henrietta cannot have ham. Through careful analysis of the remaining constraints, I determined that Carol has ham with her colby jack cheese, which satisfies all requirements.

    This solution demonstrates the importance of thorough verification when solving complex constraint satisfaction problems. Even when a solution appears complete, each constraint must be individually validated to ensure accuracy.

    This all took 5-10 minutes - and most of that time was spent verifying its solutions - so a third of the time your SO took.
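    Part of that verification can be scripted. Here is a minimal sketch that checks the table above for two things only: that no customer, price, cheese, or meat is used twice, and that constraint 7 (ham is not paired with swiss, as Claude paraphrased it) holds. The other constraints aren’t reproduced in this comment, so they are not checked here, and the function names are my own.

```python
# The final solution quoted above, as (customer, price, cheese, meat) rows.
SOLUTION = [
    ("Carol", 5.75, "Colby Jack", "Ham"),
    ("Darrell", 8.75, "Gouda", "Corned Beef"),
    ("Henrietta", 6.75, "Swiss", "Salami"),
    ("Kenneth", 7.75, "Provolone", "Prosciutto"),
    ("Lula", 3.75, "Havarti", "Bologna"),
    ("Omar", 4.75, "Cheddar", "Roast Beef"),
    ("Peter", 2.75, "Muenster", "Liverwurst"),
]


def all_distinct(rows, column):
    """True if no value in the given column index appears more than once."""
    values = [row[column] for row in rows]
    return len(set(values)) == len(values)


def constraint_7(rows):
    """Constraint 7 as paraphrased above: ham is never paired with swiss."""
    return not any(cheese == "Swiss" and meat == "Ham" for _, _, cheese, meat in rows)


checks = {
    "unique customers": all_distinct(SOLUTION, 0),
    "unique prices": all_distinct(SOLUTION, 1),
    "unique cheeses": all_distinct(SOLUTION, 2),
    "unique meats": all_distinct(SOLUTION, 3),
    "constraint 7": constraint_7(SOLUTION),
}
for name, passed in checks.items():
    print(name, "ok" if passed else "FAILED")
```

    Adding one small function per remaining constraint would turn this into a full checker for any candidate answer.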

    LLMs, even those with image analysis abilities, are lacking when it comes to spatial awareness, so your critique regarding using a grid to implement a systematic elimination approach is valid.