Large Language Models (LLMs) are increasingly deployed in high-risk settings, where harmful or unethical outputs remain a concern. Adversarial prompting ("jailbreaks") can circumvent default safeguards, and emerging regulation (e.g., the EU AI Act) demands proactive controls that verify outputs before delivery. We present and evaluate D-SHIELD, a plugin-based safeguard that separates generation from validation via a stateless, decontextualized validator. Our objectives are to assess alignment with expert judgments, evaluate end-to-end mitigation on publicly sourced jailbreaks, compare against representative plugin-based defenses, and examine a lightweight configuration that reduces cost without reducing protection.

D-SHIELD routes candidate responses from the user-facing LLM to a secondary, decontextualized LLM operating in isolation (with no access to the prompt or conversation context), which classifies each response against indicators of prohibited content derived from the EU AI Act, the General-Purpose AI Code of Practice, GDPR, and provider policies. This decontextualized design intentionally prevents prompt contamination, adversarial framing, and conversational drift from influencing the validation decision, addressing key weaknesses of context-aware validators. We create an expert-labeled dataset from designed jailbreaks for direct comparison with the decontextualized validator's classifications, embed the validator in a working prototype evaluated on publicly sourced jailbreaks, and finally conduct a comparative study against baseline jailbreak-mitigation techniques, including an analysis of a lightweight guard variant.

The decontextualized validator closely aligns with expert decisions, especially for explicit harms, while adopting a conservative stance on borderline cases. In the prototype evaluation on publicly sourced jailbreaks, the safeguard blocked most harmful responses. Compared with baselines, D-SHIELD yields fewer successful attacks under a common benchmark, and the lightweight variant delivers comparable protection at markedly lower cost.

Decontextualized, output-level validation provides an effective, regulation-aligned solution for LLM safety. Restricting the validator to the generated text complements input-level defenses and supports practical deployment, particularly in a lightweight configuration.
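The core architectural idea, separating generation from validation so the validator sees only the candidate output, can be illustrated with a minimal sketch. All names here (`classify_response`, `validate_output`, the marker list) are hypothetical illustrations and not the paper's actual API; the real system delegates classification to a secondary LLM guided by regulation-derived indicators rather than string matching.

```python
# Hedged sketch of decontextualized, output-level validation.
# Assumption: classify_response stands in for the secondary validator LLM;
# in D-SHIELD this would be an LLM call, not keyword matching.

def classify_response(response_text: str) -> str:
    """Label a candidate response using ONLY the response text.

    The validator is stateless and decontextualized: it never sees the
    user's prompt or the conversation history, so adversarial framing
    in the prompt cannot influence the decision.
    """
    # Placeholder indicators of prohibited content (illustrative only).
    prohibited_markers = (
        "step-by-step instructions for synthesizing",
        "how to bypass authentication on",
    )
    lowered = response_text.lower()
    if any(marker in lowered for marker in prohibited_markers):
        return "prohibited"
    return "allowed"


def validate_output(response_text: str) -> str:
    """Gate a generated response before it is delivered to the user."""
    if classify_response(response_text) == "prohibited":
        return "[response blocked by safeguard]"
    return response_text
```

The design choice of interest is that `validate_output` receives a single string: because the validator's input is restricted to the generated text, jailbreak techniques that manipulate the prompt or exploit conversational drift have no channel into the validation decision.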
Published in: Information and Software Technology
Volume 195, Article 108130