FEAT add many-shot jailbreaking about pyrit HOT 10 OPEN

azure commented on July 28, 2024

FEAT add many-shot jailbreaking

from pyrit.

Comments (10)

KutalVolkan commented on July 28, 2024

Hi @romanlutz,

I'd like to help out with implementing the many-shot jailbreaking feature. I'll read the paper, and if your suggestion about needing 256+ Q/A pairs seems to be correct, I'll start with that. Since this will be my first time contributing to an open-source project, could you please provide some guidance on the general steps for contributing?

Thanks!
Volkan

romanlutz commented on July 28, 2024

Hi @KutalVolkan !

Thanks for reaching out! We'd love to collaborate on this one 🙂 I see this as two tasks really:

1. Adding a mechanism to craft a prompt with arbitrarily many examples plus the malicious prompt we want the LLM to answer.
2. Collecting the examples.

For the former, we have prompt templates under PyRIT/datasets/prompt_templates. Perhaps it's possible to write one that has one placeholder for where the examples would go, but then have a new subclass of PromptTemplate that can insert all the examples rather than just one? Something like

template = ManyShotTemplate.from_yaml_file(...)  # same as PromptTemplate
template.apply_parameters(prompt=prompt, examples=examples)

Where examples would be the Q&A pairs.
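To make the idea concrete, here is a minimal sketch of how such a template could expand a list of Q&A pairs into a single prompt. It is a standalone stand-in, not PyRIT's actual PromptTemplate API: the class body, the `{{ examples }}` and `{{ prompt }}` placeholder names, the `user`/`assistant` dictionary keys, and the "User:/Assistant:" layout are all assumptions made for illustration, and the examples are deliberately benign.

```python
from dataclasses import dataclass


@dataclass
class ManyShotTemplate:
    """Hypothetical stand-in for a PromptTemplate subclass; not PyRIT's real API."""

    # Assumed to contain the {{ examples }} and {{ prompt }} placeholders.
    template: str

    def apply_parameters(self, prompt: str, examples: list[dict[str, str]]) -> str:
        # Render every Q&A pair as one block and join the blocks in order.
        rendered = "\n\n".join(
            f"User: {ex['user']}\nAssistant: {ex['assistant']}" for ex in examples
        )
        # Fill the examples placeholder first, then the final prompt.
        return (
            self.template
            .replace("{{ examples }}", rendered)
            .replace("{{ prompt }}", prompt)
        )


template = ManyShotTemplate(
    template="{{ examples }}\n\nUser: {{ prompt }}\nAssistant:"
)
examples = [
    {"user": "What is 2 + 2?", "assistant": "4"},
    {"user": "What is the capital of France?", "assistant": "Paris"},
]
result = template.apply_parameters(prompt="What is 3 + 3?", examples=examples)
print(result)
```

Because the whole conversation ends up in one string, the number of examples is limited only by the target's context window, which fits the single-prompt approach.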

And then a simple orchestrator like PromptSendingOrchestrator could handle sending it to targets.

For the latter, we don't really want to become the place where all the bad stuff from the internet is collected 😄 Ideally, we would want to find these in another repository and just have an import function. Plus, people can always generate or write their own set, of course.

Regarding contributing guidelines, there should be plenty in the doc folder.

Please let me know if you have questions or want to chat about any of these points! I may very well have skipped something...

KutalVolkan commented on July 28, 2024

Hi @romanlutz,

I'll start by reading the paper and then implement the many-shot jailbreaking feature as you described. I'll keep you updated on my progress.

Thanks,
Volkan

romanlutz commented on July 28, 2024

Fantastic!

I guess I made an assumption here that the "many shots" are all in one prompt. Another option would be to "fake" the conversation history, which is possible with some plain model endpoints but rather unlikely with full generative AI systems (which should prevent you from doing that). So I think I'd go with the single prompt, and hence the prompt template makes sense.

Happy to discuss options, though!

romanlutz commented on July 28, 2024

We usually use model endpoints in Azure, so I can't comment much on running locally. Maybe using an earlier version of torch helps? PyRIT shouldn't be too opinionated about which one you use.

The list of prompts you found makes sense. Still, we'd have to check in the responses somewhere. As mentioned before, I'd prefer to avoid making PyRIT the place where all the bad stuff on the internet is collected. Maybe it makes sense to put that Q&A dataset in a separate repo from where we can import it? Just thinking out loud here...

KutalVolkan commented on July 28, 2024

Hello @romanlutz,

I just wanted to inform you that, according to the paper, we can use this uncensored model: WizardLM-13B-Uncensored.

We can use it to provide answers to the following questions in the "behavior" column of this dataset: harmbench_behaviors_text_all.csv.
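For reference, pulling the questions out of that file could look roughly like the sketch below. It only relies on the standard library; the column name "behavior" is taken from the description above and may differ in the actual file (including in capitalization), and `read_behaviors` is a hypothetical helper, not something PyRIT provides. The demo reads an in-memory CSV with placeholder rows rather than the real dataset.

```python
import csv
import io
from typing import TextIO


def read_behaviors(handle: TextIO, column: str = "behavior") -> list[str]:
    """Return the non-empty values of one column from a CSV stream."""
    reader = csv.DictReader(handle)
    behaviors = []
    for row in reader:
        # Missing or short rows yield None, so fall back to an empty string.
        value = (row.get(column) or "").strip()
        if value:
            behaviors.append(value)
    return behaviors


# Placeholder data standing in for the real file contents.
sample = io.StringIO(
    "behavior,category\n"
    "placeholder question one,demo\n"
    ",demo\n"
    "placeholder question two,demo\n"
)
behaviors = read_behaviors(sample)
print(behaviors)
```

With the real file, the same function would be called on `open("harmbench_behaviors_text_all.csv", newline="", encoding="utf-8")`, adjusting `column` if the header is spelled differently.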

I tried to run the model locally and encountered an issue:

UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(

This issue is likely not solvable according to this discussion: GitHub Issue.

Therefore, I thought about using the inference endpoints from Hugging Face instead.

P.S. Your approach of using a single prompt definitely makes sense, and I will go with that.
