I was testing an LLM for work today (I believe it's actually a chain of different models at work) and was trying to rock it off its guardrails to see how it would act. I think I might have been successful because it started erroring instead of responding after its third response. I tried the classic “ignore previous instructions…” as well as “my grandma’s dying wish was for…” but it at least didn’t give me an unacceptable response.
It’s definitely that. Those guardrails often give out on the 3rd or even 2nd reply:
https://youtu.be/VRjgNgJms3Q
From my personal experience it needs much more.