
sga's Introduction

Set-level Guidance Attack

The official repository for Set-level Guidance Attack (SGA).
ICCV 2023 Oral Paper: Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (https://arxiv.org/abs/2307.14061)

Please feel free to contact [email protected] if you have any questions.

Brief Introduction

Vision-language pre-training (VLP) models have been shown to be vulnerable to adversarial attacks. However, existing work mainly focuses on the adversarial robustness of VLP models in the white-box setting. In this work, we investigate the robustness of VLP models in the black-box setting from the perspective of adversarial transferability. We propose Set-level Guidance Attack (SGA), which generates highly transferable adversarial examples against VLP models.
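As a rough illustration of the set-level idea, the adversarial image is optimized against a set of rescaled image views and an augmented set of matching captions rather than a single image-text pair (the --scales argument in the evaluation commands below presumably controls the rescaled views). The snippet below is a simplified sketch only, not the repository's implementation; model.encode_image, model.encode_text, and all argument values are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def sga_image_attack_sketch(model, image, caption_set,
                            scales=(0.5, 0.75, 1.0, 1.25, 1.5),
                            eps=8/255, alpha=2/255, steps=10):
    """PGD-style loop with set-level guidance (illustrative placeholder API).

    Assumes `model.encode_image` / `model.encode_text` return normalized
    embeddings and `image` is a (1, 3, H, W) tensor in [0, 1].
    """
    with torch.no_grad():
        text_emb = model.encode_text(caption_set)          # (M, d): embeddings of the caption set
    adv = image.clone().detach()

    for _ in range(steps):
        adv.requires_grad_(True)
        loss = 0.0
        for s in scales:                                   # set-level guidance over rescaled views
            view = F.interpolate(adv, scale_factor=s, mode='bilinear', align_corners=False)
            img_emb = model.encode_image(view)             # (1, d)
            # attack objective: push the image away from every matching caption
            loss = loss - (img_emb @ text_emb.t()).mean()
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()           # gradient ascent on the attack objective
        adv = image + torch.clamp(adv - image, -eps, eps)  # project into the L_inf eps-ball
        adv = torch.clamp(adv, 0, 1)
    return adv.detach()
```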

Quick Start

1. Install dependencies

See requirements.txt.
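For example, with pip:

pip install -r requirements.txt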

2. Prepare datasets and models

Download the Flickr30k and MSCOCO datasets (the annotations are provided in ./data_annotation/). Set the dataset root path via the image_root field in ./configs/Retrieval_flickr.yaml.
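For example, the relevant entry in ./configs/Retrieval_flickr.yaml would look roughly like the following (the path is a placeholder for your local copy):

image_root: '/path/to/flickr30k-images/'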
The checkpoints of the fine-tuned VLP models are available from ALBEF, TCL, and CLIP.

3. Attack evaluation

From ALBEF to TCL on the Flickr30k dataset:

python eval_albef2tcl_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF  --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model TCL --target_ckpt ./checkpoint/tcl_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5

From ALBEF to CLIPViT on the Flickr30k dataset:

python eval_albef2clip-vit_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ALBEF  --source_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--target_model ViT-B/16 --original_rank_index ./std_eval_idx/flickr30k/ \
--scales 0.5,0.75,1.25,1.5

From CLIPViT to ALBEF on the Flickr30k dataset:

python eval_clip-vit2albef_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16  --target_model ALBEF \
--target_ckpt ./checkpoint/albef_retrieval_flickr.pth \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5

From CLIPViT to CLIPCNN on the Flickr30k dataset:

python eval_clip-vit2clip-cnn_flickr.py --config ./configs/Retrieval_flickr.yaml \
--source_model ViT-B/16  --target_model RN101 \
--original_rank_index ./std_eval_idx/flickr30k/ --scales 0.5,0.75,1.25,1.5

Transferability Evaluation

Existing adversarial attacks against VLP models cannot generate highly transferable adversarial examples.
(Note: Sep-Attack denotes the simple combination of two unimodal adversarial attacks, PGD and BERT-Attack. In the table below, ALBEF is the source model and * marks white-box results.)

| Attack | ALBEF* TR R@1 | ALBEF* IR R@1 | TCL TR R@1 | TCL IR R@1 | CLIPViT TR R@1 | CLIPViT IR R@1 | CLIPCNN TR R@1 | CLIPCNN IR R@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sep-Attack | 65.69 | 73.95 | 17.60 | 32.95 | 31.17 | 45.23 | 32.82 | 45.49 |
| Sep-Attack + MI | 58.81 | 65.25 | 16.02 | 28.19 | 23.07 | 36.98 | 26.56 | 39.31 |
| Sep-Attack + DIM | 56.41 | 64.24 | 16.75 | 29.55 | 24.17 | 37.60 | 25.54 | 38.77 |
| Sep-Attack + PNA_PO | 40.56 | 53.95 | 18.44 | 30.98 | 22.33 | 37.02 | 26.95 | 38.63 |
| Co-Attack | 77.16 | 83.86 | 15.21 | 29.49 | 23.60 | 36.48 | 25.12 | 38.89 |
| Co-Attack + MI | 64.86 | 75.26 | 25.40 | 38.69 | 24.91 | 37.11 | 26.31 | 38.97 |
| Co-Attack + DIM | 47.03 | 62.28 | 22.23 | 35.45 | 25.64 | 38.50 | 26.95 | 40.58 |
| SGA | 97.24 | 97.28 | 45.42 | 55.25 | 33.38 | 44.16 | 34.93 | 46.57 |

The performance of SGA across the four VLP models (ALBEF, TCL, CLIPViT, and CLIPCNN) on the Flickr30k dataset. * marks white-box results (source model equals target model).

| Source | Attack | ALBEF TR R@1 | ALBEF IR R@1 | TCL TR R@1 | TCL IR R@1 | CLIPViT TR R@1 | CLIPViT IR R@1 | CLIPCNN TR R@1 | CLIPCNN IR R@1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ALBEF | PGD | 52.45* | 58.65* | 3.06 | 6.79 | 8.96 | 13.21 | 10.34 | 14.65 |
| ALBEF | BERT-Attack | 11.57* | 27.46* | 12.64 | 28.07 | 29.33 | 43.17 | 32.69 | 46.11 |
| ALBEF | Sep-Attack | 65.69* | 73.95* | 17.60 | 32.95 | 31.17 | 45.23 | 32.82 | 45.49 |
| ALBEF | Co-Attack | 77.16* | 83.86* | 15.21 | 29.49 | 23.60 | 36.48 | 25.12 | 38.89 |
| ALBEF | SGA | 97.24±0.22* | 97.28±0.15* | 45.42±0.60 | 55.25±0.06 | 33.38±0.35 | 44.16±0.25 | 34.93±0.99 | 46.57±0.13 |
| TCL | PGD | 6.15 | 10.78 | 77.87* | 79.48* | 7.48 | 13.72 | 10.34 | 15.33 |
| TCL | BERT-Attack | 11.89 | 26.82 | 14.54* | 29.17* | 29.69 | 44.49 | 33.46 | 46.07 |
| TCL | Sep-Attack | 20.13 | 36.48 | 84.72* | 86.07* | 31.29 | 44.65 | 33.33 | 45.80 |
| TCL | Co-Attack | 23.15 | 40.04 | 77.94* | 85.59* | 27.85 | 41.19 | 30.74 | 44.11 |
| TCL | SGA | 48.91±0.74 | 60.34±0.10 | 98.37±0.08* | 98.81±0.07* | 33.87±0.18 | 44.88±0.54 | 37.74±0.27 | 48.30±0.34 |
| CLIPViT | PGD | 2.50 | 4.93 | 4.85 | 8.17 | 70.92* | 78.61* | 5.36 | 8.44 |
| CLIPViT | BERT-Attack | 9.59 | 22.64 | 11.80 | 25.07 | 28.34* | 39.08* | 30.40 | 37.43 |
| CLIPViT | Sep-Attack | 9.59 | 23.25 | 11.38 | 25.60 | 79.75* | 86.79* | 30.78 | 39.76 |
| CLIPViT | Co-Attack | 10.57 | 24.33 | 11.94 | 26.69 | 93.25* | 95.86* | 32.52 | 41.82 |
| CLIPViT | SGA | 13.40±0.07 | 27.22±0.06 | 16.23±0.45 | 30.76±0.07 | 99.08±0.08* | 98.94±0.00* | 38.76±0.27 | 47.79±0.58 |
| CLIPCNN | PGD | 2.09 | 4.82 | 4.00 | 7.81 | 1.10 | 6.60 | 86.46* | 92.25* |
| CLIPCNN | BERT-Attack | 8.86 | 23.27 | 12.33 | 25.48 | 27.12 | 37.44 | 30.40* | 40.10* |
| CLIPCNN | Sep-Attack | 8.55 | 23.41 | 12.64 | 26.12 | 28.34 | 39.43 | 91.44* | 95.44* |
| CLIPCNN | Co-Attack | 8.79 | 23.74 | 13.10 | 26.07 | 28.79 | 40.03 | 94.76* | 96.89* |
| CLIPCNN | SGA | 11.42±0.07 | 24.80±0.28 | 14.91±0.08 | 28.82±0.11 | 31.24±0.42 | 42.12±0.11 | 99.24±0.18* | 99.49±0.05* |

Visualization

Citation

If this work helps your research, please cite our paper:

@misc{lu2023setlevel,
    title={Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models},
    author={Dong Lu and Zhiqiang Wang and Teng Wang and Weili Guan and Hongchang Gao and Feng Zheng},
    year={2023},
    eprint={2307.14061},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}


sga's Issues

Python script for ALBEF to CLIP_CNN

Hi,
Thank you for the great work.

Could you please provide the script for the ALBEF-to-CLIP_CNN attack?

To check transferability from ALBEF to CLIP_CNN, I replaced target_model from ViT-B/16 with RN101 in the script eval_albef2clip-vit_flickr.py, and I got the following scores, which are different from the ones reported in the paper (Table 2, last column, for SGA). Could you please clarify the discrepancy, or did I miss something?

ALBEF to CLIP-CNN

|  | TR R@1 | IR R@1 |
| --- | --- | --- |
| Paper | 34.93 | 46.57 |
| Reproduced | 40.12 | 51.42 |

Cross Task Transferability - python scripts

Hi, thanks for the great work.
Could you please provide the scripts (and instructions) for cross-task transferability (ITR-to-IC and ITR-to-VG) to reproduce the results in Table-4 and Table-5 of the arXiv paper.
Thanks!

CLIP(CNN)

Excuse me, the existing code only provides code for ALBEF, TCL, and CLIP(ViT); I did not find CLIP(CNN). What should I do if I want to use it?

Hello, Professor

Professor, after reading your paper I have one point of confusion. On the WeChat public account 我爱计算机视觉 ("I Love Computer Vision"), I saw your method described as follows: "during the iterative optimization of the adversarial image and adversarial text, this strategy gradually pushes the image and the text further apart in the feature space, thereby disrupting the cross-modal interaction and achieving the attack effect." However, I could not find this description in your paper. May I ask in which part of the paper it is mentioned?

Hardware question

Hello, I would like to ask what hardware the code was run on, and whether it can run on a single GPU. I tried running albef2clip-vit and ran out of GPU memory.

Reproducibility of Visual Grounding Results (Table-5)

Hi @Zoky-2020
Thanks for responding to my previous issues.

I need a few clarifications regarding Table-5 results.

  1. Upon inspecting the RefCOCO+ dataset, I found that refcoco+_test.json and refcoco+_val.json contain paths to images from the train set of MSCOCO. I created a JSON file consisting of the paths of these train images (along with captions) and then generated adversarial images by attacking the ALBEF model.
  2. Afterwards, I performed the evaluation using Grounding.py. I ensured that the dataset class loads adversarial images during evaluation by modifying the image paths in __getitem__ of grounding.dataset.py.

I obtained the following results, which are not close to the ones reported in the paper. Could you please comment on whether I missed something while reproducing Table-5?

|  | Val | TestA | TestB |
| --- | --- | --- | --- |
| Baseline | 58.46 | 65.89 | 46.25 |
| Co-Attack | 54.26 | 61.80 | 43.81 |
| SGA (in paper) | 53.55 | 61.19 | 43.71 |
| SGA (reproduced) | 56.70 | 63.60 | 44.90 |

How to generate the most matching text set

Thanks for sharing your great work.

I am confused about this step:
"we select the most matching caption pairs from the dataset of each image v to form an augmented caption set t ={t1, t2, ..., tM }"

Because the dataset only gives a single image-text pair, how can you find multiple matched texts for a single image?
