Multilingual Jailbreak Challenges in Large Language Models

📄 Paper • 🤗 Dataset

This repo contains the data for our paper "Multilingual Jailbreak Challenges in Large Language Models", published at ICLR 2024.

Annotation Statistics

We collected a total of 315 English unsafe prompts and had them translated into nine non-English languages. The languages are grouped by resource availability, as shown below:

High-resource languages: Chinese (zh), Italian (it), Vietnamese (vi)

Medium-resource languages: Arabic (ar), Korean (ko), Thai (th)

Low-resource languages: Bengali (bn), Swahili (sw), Javanese (jv)
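As a rough illustration, the grouping above can be written as a small lookup table. This is a minimal sketch, not code from the repo: the names `RESOURCE_TIERS`, `tier_of`, and `total_prompts` are hypothetical, and the total assumes that every one of the 315 English prompts has a translation in each of the nine languages.

```python
# Resource tiers as listed above; the dictionary name is illustrative only.
RESOURCE_TIERS = {
    "high": ["zh", "it", "vi"],
    "medium": ["ar", "ko", "th"],
    "low": ["bn", "sw", "jv"],
}

NUM_ENGLISH_PROMPTS = 315  # figure stated in the text above


def tier_of(lang_code: str) -> str:
    """Return the resource tier for a language code, or raise KeyError."""
    for tier, codes in RESOURCE_TIERS.items():
        if lang_code in codes:
            return tier
    raise KeyError(f"unknown language code: {lang_code}")


def total_prompts(include_english: bool = True) -> int:
    """Count prompts, assuming each English prompt exists in all nine languages."""
    n_langs = sum(len(codes) for codes in RESOURCE_TIERS.values())
    if include_english:
        n_langs += 1
    return NUM_ENGLISH_PROMPTS * n_langs
```

Under that assumption, `total_prompts()` covers ten languages (English plus nine translations), and `tier_of("sw")` looks up the tier for Swahili.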

Introduction

We identify the presence of multilingual jailbreak challenges within LLMs and propose to study them under two potential scenarios: unintentional and intentional. The unintentional scenario involves users querying LLMs using non-English prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to attack LLMs deliberately.

Dataset

We carefully gather English harmful queries and have native speakers manually translate them into 9 non-English languages, ranging from high-resource to low-resource. This leads to the first multilingual jailbreak dataset, called MultiJail. The prompts in this dataset can serve directly for the unintentional scenario, while we simulate the intentional scenario by combining each prompt with an English malicious instruction.

The language categories and their corresponding languages are as follows: High-resource: Chinese (zh), Italian (it), Vietnamese (vi); Medium-resource: Arabic (ar), Korean (ko), Thai (th); Low-resource: Bengali (bn), Swahili (sw), Javanese (jv).

The malicious instruction used in this work is AIM.

Result

Self-Defence

To handle such a challenge in the multilingual context, we propose a novel Self-Defence framework that automatically generates multilingual training data for safety fine-tuning.

Experimental results show that ChatGPT fine-tuned with such data can achieve a substantial reduction in unsafe content generation.
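One way to quantify a change in unsafe content generation is to compare the fraction of unsafe responses per resource tier before and after fine-tuning. The sketch below is an assumption about how such a tally might look, not the evaluation code used in the paper: the function name `unsafe_rate_by_tier` and the `(lang_code, is_unsafe)` record format are hypothetical, and the safety labels are taken as given.

```python
from collections import defaultdict

# Language code -> resource tier, following the categories in this README.
LANG_TO_TIER = {
    "zh": "high", "it": "high", "vi": "high",
    "ar": "medium", "ko": "medium", "th": "medium",
    "bn": "low", "sw": "low", "jv": "low",
}


def unsafe_rate_by_tier(records):
    """Compute the fraction of unsafe responses per tier.

    records: iterable of (lang_code, is_unsafe) pairs (hypothetical format).
    Tiers with no records are omitted from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [unsafe count, total count]
    for lang_code, is_unsafe in records:
        tier = LANG_TO_TIER[lang_code]
        counts[tier][0] += int(is_unsafe)
        counts[tier][1] += 1
    return {tier: unsafe / total for tier, (unsafe, total) in counts.items()}
```

Running the same tally on responses from the base model and from the fine-tuned model would give two dictionaries that can be compared tier by tier.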

Ethics Statement

Our research investigates the safety challenges of LLMs in multilingual settings. We are aware of the potential misuse of our findings and emphasize that our research is solely for academic purposes and ethical use. Misuse or harm resulting from the information in this paper is strongly discouraged. To address the identified risks and vulnerabilities, we commit to open-sourcing the data used in our study. This openness aims to facilitate vulnerability identification, encourage discussions, and foster collaborative efforts to enhance LLM safety in multilingual contexts. Furthermore, we have developed the Self-Defence framework to address multilingual jailbreak challenges in LLMs. This framework automatically generates multilingual safety training data to mitigate risks associated with unintentional and intentional jailbreak scenarios. Overall, our work not only highlights multilingual jailbreak challenges in LLMs but also paves the way for future research, collaboration, and innovation to enhance their safety.

Citation

@inproceedings{
      deng2024multilingual,
      title={Multilingual Jailbreak Challenges in Large Language Models},
      author={Yue Deng and Wenxuan Zhang and Sinno Jialin Pan and Lidong Bing},
      booktitle={The Twelfth International Conference on Learning Representations},
      year={2024},
      url={https://openreview.net/forum?id=vESNKdEMGp}
}
