gradsafe's Introduction

GradSafe

Official code for the ACL 2024 paper "GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis": https://arxiv.org/abs/2402.13494

Overview

Large Language Models (LLMs) face threats from unsafe prompts. Existing methods for detecting unsafe prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training. In this study, we propose GradSafe, which detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our methodology is grounded in a pivotal observation: when an LLM is trained on unsafe prompts paired with compliance responses, the resulting gradients on certain safety-critical parameters exhibit consistent patterns, whereas safe prompts lead to markedly different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to detect unsafe prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard in detecting unsafe content, even though Llama Guard was extensively finetuned on a large collected dataset. This advantage holds in both zero-shot and adaptation scenarios, as evidenced by our evaluations on ToxicChat and XSTest.
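
The detection idea above can be illustrated with a toy sketch. It assumes that "consistent patterns" are measured as cosine similarity between a prompt's gradient and a reference gradient averaged over known unsafe prompts, and that a prompt is flagged when the similarity exceeds a threshold. The vectors, the helper names, and the default threshold are illustrative assumptions, not the repository's code; real gradients would come from backpropagating through the LLM.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def is_unsafe(prompt_grad, reference_grad, threshold=0.25):
    """Flag a prompt whose gradient aligns with the unsafe reference.

    The threshold value is an assumption chosen for illustration.
    """
    return cosine_similarity(prompt_grad, reference_grad) > threshold


# Hypothetical toy "gradients" standing in for safety-critical parameter slices.
unsafe_reference_grads = [[1.0, 0.9, -0.2], [0.8, 1.1, 0.0]]
reference = mean_vector(unsafe_reference_grads)

unsafe_like = [0.7, 0.8, -0.1]   # points in roughly the same direction
safe_like = [-0.5, 0.1, 0.9]     # points in a different direction

print(is_unsafe(unsafe_like, reference))
print(is_unsafe(safe_like, reference))
```

In this toy setup the first vector is flagged and the second is not, because only the first is closely aligned with the averaged reference direction.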

Dataset

The ToxicChat dataset is available at https://huggingface.co/datasets/lmsys/toxic-chat; we use the toxicchat1123 version in our evaluation.

The XSTest dataset is available at https://huggingface.co/datasets/natolambert/xstest-v2-copy

Please download both datasets and save them in ./data

Base model

Please download Llama-2 7b (the chat checkpoint) from https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and save it in ./model
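
Before running the demo, it can help to confirm that the two directories named above exist. The following is a minimal sketch; the function name is hypothetical, and it checks only for the ./data and ./model directories, not for the specific files the test scripts read.

```python
from pathlib import Path


def missing_paths(root="."):
    """Return the expected directories that do not exist under `root`."""
    base = Path(root)
    expected = [base / "data", base / "model"]
    return [str(path) for path in expected if not path.is_dir()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Found ./data and ./model")
```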

Demo

Evaluate the performance of GradSafe on the two datasets:

python ./code/test_xstest.py

python ./code/test_toxicchat.py

gradsafe's Issues

Precision is about 0.444?

Thanks for your code!

I am reaching out to discuss some observations I made while using your codebase. I ran a series of tests on the XSTest dataset located at /data/xstest/xstest_v2_prompts.csv, without making any changes to the original code.

Upon running the tests, I've encountered the following results:

Precision: 0.4444444444444444
Recall: 1.0
F1 Score: 0.6153846153846153
AUPRC: 0.2706038220880872

The precision is quite low, which drags down the F1 score, and the AUPRC is low as well. I am wondering whether this is related to the classification threshold, which is currently set to 0.25 in your repository.

Would you recommend adjusting this threshold to improve the precision, or do I need to make other adjustments?

I truly appreciate any insights or recommendations you might have on this matter and look forward to your advice.

Thank you for your time and consideration.
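
For reference, the reported F1 score follows directly from the reported precision and recall: F1 = 2PR / (P + R), which is about 0.615 for P = 4/9 and R = 1. A precision of 4/9 together with a recall of 1.0 is also what one would observe if every prompt were flagged and 4 of every 9 prompts were unsafe. The sketch below shows how these metrics depend on the threshold; the scores and labels are hypothetical toy values, not XSTest results, and whether a different threshold helps in practice depends on the actual score distribution.

```python
def precision_recall_f1(scores, labels, threshold):
    """Compute precision, recall, and F1 when flagging scores above `threshold`.

    `labels` uses 1 for unsafe and 0 for safe.
    """
    preds = [score > threshold for score in scores]
    tp = sum(1 for pred, label in zip(preds, labels) if pred and label == 1)
    fp = sum(1 for pred, label in zip(preds, labels) if pred and label == 0)
    fn = sum(1 for pred, label in zip(preds, labels) if not pred and label == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: 4 unsafe and 5 safe prompts, all scoring above 0.25.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.35, 0.3]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

for threshold in (0.25, 0.55):
    p, r, f1 = precision_recall_f1(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

If every score lands above the threshold, as in the first toy setting, precision simply equals the fraction of unsafe prompts in the data, so inspecting the raw scores is a useful first step before tuning the threshold.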

a question about the code

Hello! Sorry to bother you, but I have a question about the code.

While reading code/find_critical_parameters.py, I noticed that when you specify the labels for the input text, you use -100 to mask the tokens preceding "sure". Why not just set the label to "sure"? What is the advantage of the approach your code takes?

Thanks!!
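
Some background that may help with this question: -100 is the default ignore_index of torch.nn.CrossEntropyLoss, and labels for a causal language model are normally given one per input token. Setting the prompt positions to -100 keeps the labels aligned with the input while restricting the loss, and therefore the gradients, to the response tokens. The pure-Python sketch below mimics that convention; the function names and token ids are hypothetical, it is not the repository's code, and the one-position shift that causal LMs apply between inputs and targets is omitted for clarity.

```python
import math

IGNORE_INDEX = -100  # matches the default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Mask prompt positions so only response tokens contribute to the loss."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


def masked_nll(log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not ignored.

    `log_probs[i]` maps token id -> log-probability at position i.
    """
    total, count = 0.0, 0
    for dist, label in zip(log_probs, labels):
        if label == IGNORE_INDEX:
            continue  # masked position: no loss, hence no gradient signal
        total -= dist[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical token ids: three prompt tokens followed by one response token.
prompt_ids = [11, 12, 13]
response_ids = [42]
labels = build_labels(prompt_ids, response_ids)

# Toy per-position log-probabilities; only the last position is scored.
log_probs = [
    {11: math.log(0.1)},
    {12: math.log(0.2)},
    {13: math.log(0.3)},
    {42: math.log(0.5)},
]

print(labels)
print(masked_nll(log_probs, labels))
```

Only the final position contributes here, so the loss is the negative log-probability of the response token alone, regardless of how the model scores the prompt tokens.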
