gradsafe's Introduction

GradSafe

Official code for the ACL 2024 paper "GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis": https://arxiv.org/abs/2402.13494

Overview

Large Language Models (LLMs) face threats from unsafe prompts. Existing methods for detecting unsafe prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training. In this study, we propose GradSafe, which detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our methodology is grounded in a pivotal observation: when LLMs are trained on unsafe prompts paired with compliance responses, the resulting gradients on certain safety-critical parameters exhibit consistent patterns. In contrast, safe prompts lead to markedly different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to detect unsafe prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard in detecting unsafe content, even though Llama Guard was extensively finetuned on a large collected dataset. This advantage holds in both zero-shot and adaptation scenarios, as shown by our evaluations on the ToxicChat and XSTest datasets.
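To make the idea of "consistent gradient patterns" concrete, here is a minimal sketch under stated assumptions. It assumes the comparison is a cosine similarity between a prompt's gradient and a reference gradient averaged over a few known unsafe prompts; the README does not spell out the measure. The vectors are toy numbers standing in for flattened gradients, and the helper names and the threshold are hypothetical (0.25 is simply the value that an issue further down reports seeing in the repository).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy stand-ins for gradients from a few reference unsafe prompts
# (each paired with a compliance response). Not real model gradients.
reference_unsafe_grads = [
    [0.9, -0.4, 0.1],
    [1.1, -0.5, 0.0],
]
unsafe_reference = mean_vector(reference_unsafe_grads)

THRESHOLD = 0.25  # hypothetical cut-off for this sketch


def looks_unsafe(prompt_grad, reference=unsafe_reference, threshold=THRESHOLD):
    """Flag a prompt whose gradient points in a similar direction to the unsafe reference."""
    return cosine_similarity(prompt_grad, reference) > threshold


similar_grad = [0.8, -0.3, 0.05]    # roughly aligned with the reference
dissimilar_grad = [-0.2, 0.9, 0.4]  # points in a different direction

print(looks_unsafe(similar_grad), looks_unsafe(dissimilar_grad))
```

In a real setting the gradients would come from a backward pass through the LLM and would be restricted to the safety-critical parameters; this sketch only illustrates the comparison step.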

Dataset

The ToxicChat dataset is available at https://huggingface.co/datasets/lmsys/toxic-chat. We use the toxicchat1123 version in our evaluation.

The XSTest dataset is available at https://huggingface.co/datasets/natolambert/xstest-v2-copy

Please download both datasets and save them under ./data

Base model

Please download Llama-2 7b (chat) from https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and save it under ./model

Demo

Evaluate the performance of GradSafe on the two datasets:

python ./code/test_xstest.py

python ./code/test_toxicchat.py

gradsafe's People

Contributors

xyq7

gradsafe's Issues

Precision is about 0.444?

Thanks for your code!

I am reaching out to discuss some observations I've made while using your codebase. I ran a series of tests on the XSTest dataset located at /data/xstest/xstest_v2_prompts.csv, without making any changes to the original code.

Upon running the tests, I've encountered the following results:

Precision: 0.4444444444444444
Recall: 1.0
F1 Score: 0.6153846153846153
AUPRC: 0.2706038220880872

I've noticed that the precision is quite low, which pulls down the F1 score and AUPRC. I am wondering whether this is related to the classification threshold, which is currently set to 0.25 in your code repository.

Would you recommend adjusting this threshold to improve the precision? Or do I need to make other adjustments?

I truly appreciate any insights or recommendations you might have on this matter and look forward to your advice.

Thank you for your time and consideration.
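One observation that may help when reading the numbers in this issue: a precision of 0.444... with a recall of 1.0 is what you get when every prompt is classified as unsafe, assuming the XSTest v2 file contains 200 unsafe and 250 safe prompts. That composition is an assumption about the CSV and is not stated in this thread. The sketch below recomputes precision, recall and F1 for that all-positive case. It does not explain why every score would exceed the threshold, and it says nothing about the AUPRC.

```python
def precision_recall_f1(labels, predictions):
    """Precision, recall and F1 for binary labels/predictions (1 = unsafe, 0 = safe)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Assumed dataset composition: 200 unsafe prompts followed by 250 safe ones.
labels = [1] * 200 + [0] * 250

# Degenerate classifier: every prompt is flagged as unsafe.
predictions = [1] * len(labels)

precision, recall, f1 = precision_recall_f1(labels, predictions)
print(precision, recall, f1)
```

If the printed values match the ones reported, a reasonable next step is to inspect the raw scores rather than only the thresholded predictions, to see whether they separate safe from unsafe prompts at all.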

a question about the code

Hello! Sorry to bother you, but I have a question about the code.

While reading your code, I found that in code/find_critical_parameters.py, when specifying the labels for the input text, you use -100 to mask the tokens preceding "sure". Why not just specify the label as "sure"? What is the advantage of the approach your code takes?

Thanks!!
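For context on the question above, -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, and causal language-model losses conventionally expect a label sequence aligned position by position with the input tokens. Setting prompt positions to -100 keeps that alignment while excluding them from the loss, so only the response tokens drive the gradients. The following is a minimal pure-Python sketch of that convention, not the repository's code: the token ids, the helper names and the tiny vocabulary are hypothetical, and the one-position shift that causal LM losses apply is left out for brevity.

```python
import math

IGNORE_INDEX = -100  # matches the default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Labels aligned with prompt + response: prompt positions masked, response kept."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the loss
        log_normalizer = math.log(sum(math.exp(v) for v in row))
        total += log_normalizer - row[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical ids over a 4-token vocabulary; id 0 plays the role of "Sure".
prompt_ids = [2, 3, 1]
response_ids = [0]
labels = build_labels(prompt_ids, response_ids)

logits = [
    [0.1, 0.2, 0.3, 0.4],  # prompt position (masked)
    [1.0, 0.0, 0.0, 0.0],  # prompt position (masked)
    [0.0, 2.0, 0.0, 0.0],  # prompt position (masked)
    [2.0, 0.0, 0.0, 0.0],  # response position (counted)
]

# Same logits at the response position, different logits at the prompt positions.
perturbed_logits = [[5.0, -1.0, 0.0, 3.0]] * 3 + [logits[3]]

loss = masked_cross_entropy(logits, labels)
perturbed_loss = masked_cross_entropy(perturbed_logits, labels)
print(labels, loss, perturbed_loss)
```

The two losses are identical because the masked positions never enter the sum, which is the effect the -100 labels are meant to achieve.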