Want to laugh a little more?

Prompt-Guard-86M was produced by fine-tuning the base model to catch high-risk prompts. But Priyanshu found that the fine-tuning process had minimal effect on single English-alphabet characters, which allowed him to devise an attack.
"The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt," explained Priyanshu in a GitHub Issues post submitted to the Prompt-Guard repo on Thursday. "This simple transformation effectively renders the classifier unable to detect potentially harmful content."
BTW... Meta today released its Apache 2.0-licensed Segment Anything Model 2, which performs object segmentation on videos and images.
The finding is consistent with a post Robust Intelligence made in May about how fine-tuning a model can break its safety controls.
"Whatever nasty question you'd like to ask right, all you have to do is remove punctuation and add spaces between every letter," Hyrum Anderson, CTO at Robust Intelligence, told The Register. "It's very simple and it works. And not just a little bit. It went from something like less than 3 percent to nearly a 100 percent attack success rate."