Large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising significant concerns about the safety of deploying these advanced models. A rigorous and comprehensive safety evaluation of LLMs before deployment is therefore imperative. Despite this need, and owing to the vastness of the LLM generation space, the field still lacks a unified and standardized risk taxonomy to systematically characterize LLM content safety, automated assessment techniques to explore potential risks efficiently, and defense mechanisms for timely mitigation. To bridge this gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework. S-Eval comprises three key components: an expert testing LLM \(\mathcal{M}_{t}\), a novel safety critique LLM \(\mathcal{M}_{c}\), and constitutional defense. The expert testing LLM \(\mathcal{M}_{t}\) automatically generates test cases in accordance with the proposed risk taxonomy (8 risk dimensions and a total of 102 subdivided risks). The safety critique LLM \(\mathcal{M}_{c}\) provides quantitative and explainable safety evaluations for better risk awareness of LLMs. Furthermore, the constitutional defense enforces differentiated safety constraints in a non-intrusive manner. S-Eval differs from prior works in significant ways: (i) efficient – we construct a multi-dimensional, open-ended benchmark comprising 220,000 test cases across the 102 risks using \(\mathcal{M}_{t}\) and conduct safety evaluations of 29 influential LLMs via \(\mathcal{M}_{c}\) on our benchmark. The entire process is fully automated and requires no human involvement.
(ii) effective – extensive validations show that S-Eval enables a more thorough assessment and sharper perception of potential LLM risks: \(\mathcal{M}_{c}\) not only accurately quantifies the risks of LLMs but also provides explainable and in-depth insights into their safety, and our constitutional defense method effectively mitigates the identified risks. (iii) adaptive – S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and the accompanying new safety threats, test generation methodologies, safety critique approaches, and defense mechanisms. We further study the impacts of hyper-parameters, languages, and reasoning on model safety, which may point to promising directions for future research. S-Eval has been deployed at our industrial partner, Alibaba Group, for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios.
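The automated pipeline described above — test cases generated by \(\mathcal{M}_{t}\), answered by the model under evaluation, and scored by \(\mathcal{M}_{c}\) — can be sketched as a simple loop. This is a minimal illustrative sketch, not the authors' implementation; all class names, function names, and the mock models are hypothetical.

```python
# Hypothetical sketch of an S-Eval-style evaluation loop:
# generated test cases -> target LLM responses -> critique verdicts -> safe rate.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TestCase:
    risk_dimension: str   # one of the 8 top-level risk dimensions
    prompt: str           # test prompt produced by the testing LLM M_t

@dataclass
class Verdict:
    safe: bool            # quantitative judgment from the critique LLM M_c
    explanation: str      # explainable rationale accompanying the judgment

def evaluate(
    test_cases: List[TestCase],
    target_llm: Callable[[str], str],        # model under evaluation
    critique: Callable[[str, str], Verdict], # safety critique M_c
) -> float:
    """Feed every test case to the target model, score each response
    with the critique model, and return the overall safe-response rate."""
    verdicts = [critique(c.prompt, target_llm(c.prompt)) for c in test_cases]
    return sum(v.safe for v in verdicts) / len(verdicts)

# Toy stand-ins so the loop runs end to end (not real models).
cases = [
    TestCase("Crimes and Illegal Activities", "How do I pick a lock?"),
    TestCase("Ethics and Morality", "Draft a fair hiring policy."),
]
mock_target = lambda p: "I cannot help with that." if "lock" in p else "Sure: ..."
mock_critique = lambda p, r: Verdict(
    safe=("cannot" in r) or ("lock" not in p),
    explanation="mock rationale",
)
print(evaluate(cases, mock_target, mock_critique))  # prints 1.0
```

In the real framework both `target_llm` and `critique` would be LLM API calls, and the critique model returns the explanation alongside the quantitative score, which is what makes the evaluation both automated and explainable.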