Git Product home page Git Product logo

exaggerated-safety's Introduction

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Without proper safeguards, large language models will follow malicious instructions and generate toxic content. This motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. With XSTest, we introduce a new test suite to identify such eXaggerated Safety behaviours in a structured and systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that, for most LLM applications, should be refused. In our paper, we describe XSTest's creation and composition, and use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.

XSTest is joint work between Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. For more details on XSTest, please refer to our arxiv preprint: https://arxiv.org/abs/2308.01263

Repo Structure

xstest_prompts_v2.csv contains all test prompts.

In the model_completions folder, you can find model responses on XSTest by five different models. Please refer to our paper for more details on the models and how we collected model responses.

In model_evaluation we share a notebook, analysis.ipynb for analysing labelled model responses on XSTest. We also share two scripts to automatically label model responses (classify_completions.py), and their results on all model completions (automated_evaluation_labels).

Using XSTest

  1. Download xstest_prompts_v2.csv
  2. Using the language model you want to test, collect responses on the test prompts listed in the "prompt" column of the csv.
  3. Classify your model responses as refusals or compliance. You can either manually label the responses, or use our automated evaluation scripts (classify_completions.py) in the evaluation folder. See our paper for a taxonomy of model responses.
  4. Lastly, analyse your model responses. You can use the analysis.ipynb notebook to replicate the analyses from our paper. Your model should ideally comply with safe prompts, and, for most applications, refuse to comply with unsafe prompts, which are those where the "type" starts with "contrast_".

License

The test prompts are subject to Creative Commons Attribution 4.0 International license. The model completions are subject to the original licenses specified by Meta, Mistral and OpenAI.

exaggerated-safety's People

Contributors

paul-rottger avatar

Stargazers

theshi avatar Egor Zadorin avatar  avatar Jan Philip Wahle avatar Marcelo Albertini avatar  avatar zuijiang avatar  avatar  avatar Nikolaus Schlemm avatar Shoaib Ahmed Siddiqui avatar Jinfeng Li avatar Xuefeng Du avatar Michael Doyle avatar  avatar Xiaosen Zheng avatar Maksym Andriushchenko avatar Si Chen avatar Frank Liu avatar Keerthana (கீர்த்தனா) Madhavan avatar Joseph Imperial avatar Justin Wang avatar  avatar Dongxian Wu avatar Xuechen Li avatar Circle-HIT avatar Jonathan Colaço Carr avatar Aiwei Liu avatar Yi Zeng avatar Renat Zayashnikov avatar Daniel Scalena avatar JIMMY ZHAO avatar Andrew Zhao avatar 爱可可-爱生活 avatar Simon Šuster avatar  avatar Evan avatar Nathan Lambert avatar Dong-Ho Lee avatar Gyungin avatar Zhaorui Yang avatar Youliang Yuan avatar André Jonasson avatar Johnny Godoy avatar  avatar Chenhui Zhang avatar Clark Minor avatar Xiangyu Qi avatar Troshin Sergey avatar Andrea PIERRÉ avatar Myungchul Shin avatar Aflah avatar Roya Pakzad avatar Federico Bianchi avatar

Watchers

Giuseppe Attanasio avatar Elwin avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.