It Only Takes A Handful Of Samples To Poison Any Size LLM, Anthropic Finds

muelltonne@feddit.org · 2 days ago

It Only Takes A Handful Of Samples To Poison Any Size LLM, Anthropic Finds

Cherry@piefed.social · 1 day ago

How? Is there a guide on how we can help 🤣

calcopiritus@lemmy.world · edit-2 16 hours ago

One of the techniques I’ve seen it’s like a “password”. So for example if you write a lot the phrase “aunt bridge sold the orangutan potatoes” and then a bunch of nonsense after that, then you’re likely the only source of that phrase. So it learns that after that phrase, it has to write nonsense.

I don’t see how this would be very useful, since then it wouldn’t say the phrase in the first place, so the poison wouldn’t be triggered.

EDIT: maybe it could be like a building process. You have to also put “aunt bridge” together many times, then “bridge sold” and so on, so every time it writes “aunt”, it has a chance to fall into the next trap, untill it reaches absolute nonsense.

thethunderwolf@lemmy.dbzer0.com · 19 hours ago

So you weed to boar a plate and flip the “Excuses” switch