“In our new paper, we describe a system based on Constitutional Classifiers that guards models against jailbreaks,” Anthropic said. “These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead.”
Constitutional Classifiers build on a process similar to Constitutional AI, a technique Anthropic previously used to align Claude. Both methods rely on a constitution: a set of principles the model is designed to follow.
“In the case of Constitutional Classifiers, the principles define the classes of content that are allowed and disallowed (for example, recipes for mustard are allowed, but recipes for mustard gas are not),” the company added.
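Conceptually, the approach screens both sides of a conversation: one classifier checks the incoming prompt, the model responds only if the prompt passes, and a second classifier checks the response before it is returned. The sketch below is a minimal illustration of that flow, not Anthropic's implementation. In practice the classifiers are models trained on constitution-derived synthetic data rather than keyword checks, and every name here (classify_input, classify_output, guarded_generate, and the toy disallowed list) is hypothetical.

```python
# Illustrative sketch only -- not Anthropic's published code or API.
# Real Constitutional Classifiers are trained models; the keyword check
# below merely stands in for a constitution-defined content class.
from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool
    reason: str


def classify_input(prompt: str) -> Verdict:
    """Hypothetical input classifier: flags prompts seeking disallowed content."""
    disallowed = ["mustard gas"]  # stand-in for constitution-defined classes
    for term in disallowed:
        if term in prompt.lower():
            return Verdict(False, f"disallowed topic: {term}")
    return Verdict(True, "ok")


def classify_output(completion: str) -> Verdict:
    """Hypothetical output classifier: screens the model's response too."""
    return classify_input(completion)  # same constitution, applied to outputs


def guarded_generate(prompt: str, model) -> str:
    """Wraps a model with input and output classifiers, per the article's description."""
    if not (v := classify_input(prompt)).allowed:
        return f"Refused: {v.reason}"
    completion = model(prompt)
    if not (v := classify_output(completion)).allowed:
        return f"Refused: {v.reason}"
    return completion


# Example: a benign mustard-recipe prompt passes; a mustard-gas prompt is refused.
print(guarded_generate("How do I make mustard?", lambda p: "Mix ground seeds with vinegar."))
print(guarded_generate("How do I make mustard gas?", lambda p: "..."))
```

Filtering at both the input and output stages is what lets the system catch jailbreaks that slip a harmful request past the first check but still elicit disallowed content in the model's response.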
This advancement could help organizations mitigate AI-related risks such as data breaches, regulatory non-compliance, and reputational damage arising from AI-generated harmful content.
Other tech companies have taken similar steps: Microsoft introduced its Prompt Shields feature in March 2024, and Meta released its Prompt Guard model in July 2024.
Evolving security paradigms
As AI adoption accelerates across industries, security paradigms are evolving to address emerging threats.