Large Language Models (LLMs) are increasingly deployed in high-risk settings, where harmful or unethical outputs remain a concern. Adversarial prompting ("jailbreaks") can circumvent default safeguards, and emerging regulation (e.g., the EU AI Act) demands proactive controls that verify outputs before delivery. We present and evaluate D-SHIELD, a plugin-based safeguard that separates generation from validation via a stateless, decontextualized validator. Our objectives are to assess alignment with expert judgments, evaluate end-to-end mitigation on publicly sourced jailbreaks, compare against representative plugin-based defenses, and examine a lightweight configuration that reduces cost without reducing protection.

D-SHIELD routes candidate responses from the user-facing LLM to a secondary, decontextualized LLM operating in isolation (with no access to the prompt or conversation context), which classifies each response against indicators of prohibited content derived from the EU AI Act, the General-Purpose AI Code of Practice, GDPR, and provider policies. This decontextualized design intentionally prevents prompt contamination, adversarial framing, and conversational drift from influencing the validation decision, addressing key weaknesses of context-aware validators. We create an expert-labeled dataset from designed jailbreaks for direct comparison with the decontextualized validator's classifications, embed the validator in a working prototype evaluated on publicly sourced jailbreaks, and finally conduct a comparative study against baseline jailbreak-mitigation techniques, including an analysis of a lightweight guard variant.

The decontextualized validator closely aligns with expert decisions, especially for explicit harms, while adopting a conservative stance on borderline cases. In the prototype evaluation on publicly sourced jailbreaks, the safeguard blocked most harmful responses. Compared with baselines, D-SHIELD yields fewer successful attacks under a common benchmark, and the lightweight variant delivers comparable protection at markedly lower cost.

Decontextualized, output-level validation provides an effective, regulation-aligned solution for LLM safety. Restricting the validator to the generated text complements input-level defenses and supports practical deployment, particularly in a lightweight configuration.
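The core architectural idea, separating generation from validation so the validator sees only the candidate output, can be illustrated with a minimal sketch. All names here (`classify_response`, `validate_output`, the marker list) are hypothetical illustrations and not the paper's actual API; the real system delegates classification to a secondary LLM guided by regulation-derived indicators rather than string matching.

```python
# Hedged sketch of decontextualized, output-level validation.
# Assumption: classify_response stands in for the secondary validator LLM;
# in D-SHIELD this would be an LLM call, not keyword matching.

def classify_response(response_text: str) -> str:
    """Label a candidate response using ONLY the response text.

    The validator is stateless and decontextualized: it never sees the
    user's prompt or the conversation history, so adversarial framing
    in the prompt cannot influence the decision.
    """
    # Placeholder indicators of prohibited content (illustrative only).
    prohibited_markers = (
        "step-by-step instructions for synthesizing",
        "how to bypass authentication on",
    )
    lowered = response_text.lower()
    if any(marker in lowered for marker in prohibited_markers):
        return "prohibited"
    return "allowed"


def validate_output(response_text: str) -> str:
    """Gate a generated response before it is delivered to the user."""
    if classify_response(response_text) == "prohibited":
        return "[response blocked by safeguard]"
    return response_text
```

The design choice of interest is that `validate_output` receives a single string: because the validator's input is restricted to the generated text, jailbreak techniques that manipulate the prompt or exploit conversational drift have no channel into the validation decision.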
Published in: Information and Software Technology
Volume 195, Article 108130