You're mixing two kinds of AI: LLMs and diffusion models.
It's much harder for a diffusion model to leave the rest of the image unchanged. The first step of a diffusion pipeline is lossy compression: the picture gets transformed into a soup of numbers (a latent) that the model can understand, and decoding that back to pixels never reproduces the original exactly.
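That lossy round trip can be sketched with a toy in plain numpy (illustrative only — a real diffusion model uses a learned VAE encoder/decoder, not block averaging, but the round trip is lossy in the same way):

```python
import numpy as np

# Toy stand-in for a diffusion model's encoder/decoder.
def encode(img, f=8):
    # "Compress": average each f x f pixel block into one latent value.
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def decode(lat, f=8):
    # "Decompress": blow each latent value back up into an f x f block.
    return lat.repeat(f, axis=0).repeat(f, axis=1)

rng = np.random.default_rng(0)
img = rng.random((64, 64))
roundtrip = decode(encode(img))

print(roundtrip.shape == img.shape)  # True  -> same size...
print(np.allclose(roundtrip, img))   # False -> ...but the pixels changed
```

Even with no edit requested, just encoding and decoding alters every pixel — which is why "don't change anything else" is a hard promise for a diffusion model to keep.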
“just tell your LLM not to do that”
Ever asked an LLM to modify a picture with "don't change anything else"? It's going to change other things anyway.
Case in point: https://youtu.be/XnWOVQ7Gtzw
That’s why you always add “and no mistakes”
Also “don’t hallucinate”
And "don't become self-aware"
And an LLM will convert a prompt into a bunch of tokens the model can understand.
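A toy tokenizer shows the idea (real LLM tokenizers use learned subword vocabularies like BPE; this word-level version is just for illustration):

```python
# Toy tokenizer: maps words to integer ids, the way an LLM front-end
# turns a prompt into a sequence of tokens the model can understand.
vocab = {}

def tokenize(prompt):
    ids = []
    for word in prompt.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign a new id on first sight
        ids.append(vocab[word])
    return ids

print(tokenize("don't change anything else"))  # [0, 1, 2, 3]
print(tokenize("change it change everything"))  # [1, 4, 1, 5]
```

The model never sees your words, only these ids — so instructions like "don't change anything else" arrive as just another token sequence, with no special enforcement attached.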