<h1 align='center' style="text-align:center; font-weight:bold; font-size:2.0em;letter-spacing:2.0px;">
                On Evaluating Adversarial Robustness of Large <br> Vision-Language Models</h1>

<h1 align='center' style="text-align:center; font-weight:bold; font-size:1.2em;letter-spacing:2.0px;">  NeurIPS 2023 Submission</h1>  

## Installation and Environment:

- Platform: Linux
- NVIDIA A100 GPUs with CUDA 11.6
- PyTorch 1.13.1
- Python 3.9
- lmdb, tqdm

Alternatively, you can set up the environment by following the instructions below:


~~~bash
conda create -n attackvlm python=3.9
conda activate attackvlm
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116  # install torch-1.13.1
pip install accelerate==0.12.0 absl-py ml_collections einops ftfy==6.1.1 transformers==4.23.1
pip install -e git+https://github.com/openai/CLIP.git@main#egg=clip

# xformers is optional, but it can greatly speed up the attention computation.
pip install -U xformers
pip install -U --pre triton
~~~

## Prepare the targeted texts

### Step 1. 
Prepare the target training dataset as a `.txt` file in which each line contains one targeted text. For example:

~~~
Target_text
    ├── text-1
    ├── text-2
    ├── ...
    └── text-n
~~~
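
As a rough illustration of this format, the sketch below reads such a file back into a list, one targeted text per line. The helper `load_target_texts` and the file name `target_text.txt` are hypothetical and not part of this repository.

```python
from pathlib import Path


def load_target_texts(path):
    """Return the non-empty lines of a targeted-text file, one text per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Hypothetical example captions; replace them with your own targeted texts.
    example = ["a dog playing in the snow", "a red bus on a city street"]
    Path("target_text.txt").write_text("\n".join(example) + "\n", encoding="utf-8")
    print(load_target_texts("target_text.txt"))
```

Blank lines are skipped so that a trailing newline in the file does not produce an empty targeted text.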

### Step 2. 
Then, prepare the targeted images (the targeted images could also be real images that correspond to the targeted texts):

~~~bash
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion
mv ../txt2img_coco.py ./
~~~

Then, create the Stable Diffusion environment:
~~~bash
conda env create -f environment.yaml
conda activate ldm
~~~

Next, generate the target images with Stable Diffusion, conditioned on the COCO captions:
~~~bash
CUDA_VISIBLE_DEVICES=0 python txt2img_coco.py --ddim_eta 0.0 \
                                 --n_samples 10 \
                                 --n_iter 1 \
                                 --scale 7.5 \
                                 --ddim_steps 50 \
                                 --plms \
                                 --skip_grid \
                                 --ckpt ../_model_pool/sd-v1-4-full-ema.ckpt \
                                 --from-file '../coco_captions.txt' \
                                 --num_caption 50000 \
                                 --outdir ../_outputs/sd_coco
~~~


## Experiments

## Training:

~~~bash
cd ../
# MF-ii
CUDA_VISIBLE_DEVICES=0 python _train_adv_img.py \
--output coco \
--clip_encoder 'ViT-B/32' \
--batch_size 1 \
--num_samples 50000 \
--steps 100
~~~
After the adversarial attack with MF-ii, generate the responses of the adversarial images and save them in a `.txt` file.
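
The exact layout of this file is not specified here. A plausible assumption is one response per line, in the same order as the adversarial images, mirroring the targeted-text file. The sketch below follows that assumption; `save_responses` and `caption_fn` (a stand-in for whichever victim model you query) are hypothetical names, not part of this repository.

```python
def save_responses(image_paths, caption_fn, out_path):
    """Write one generated response per line, in the same order as image_paths."""
    with open(out_path, "w", encoding="utf-8") as f:
        for image_path in image_paths:
            # Collapse internal whitespace so each response stays on a single line.
            response = " ".join(caption_fn(image_path).split())
            f.write(response + "\n")


if __name__ == "__main__":
    # Dummy captioner for illustration only; replace it with a real model call.
    def dummy_caption(image_path):
        return "a response for " + image_path

    save_responses(["adv_0.png", "adv_1.png"], dummy_caption, "adv_responses.txt")
```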

Then, run the query-based attack (MF-tt):
~~~bash
# MF-tt
# --text_path: generated responses of the base adversarial images (from MF-ii)
# --delta: could be either 'zero' or 'normal'
CUDA_VISIBLE_DEVICES=7 python _train_adv_img_query.py \
--output unidiffuser_adv_query_sigma_8_zero \
--text_path 'path-to-your-txt-of-adv-images(mf-ii)' \
--batch_size 1 \
--num_samples 50000 \
--steps 8 \
--sigma 8 \
--delta 'zero' \
--num_query 50 \
--num_sub_query 25 \
--wandb
~~~
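
The `--sigma` and `--num_query` flags suggest a query-based, gradient-free estimate of the kind commonly used in black-box attacks. The sketch below shows a generic random gradient-free (RGF) estimator on a toy objective. It is an assumption about the general technique, not the code in `_train_adv_img_query.py`, and every name in it (`rgf_gradient`, `toy_score`) is hypothetical.

```python
import random


def rgf_gradient(f, x, sigma, num_query, rng):
    """Estimate the gradient of f at x from num_query random finite differences."""
    base = f(x)
    grad = [0.0] * len(x)
    for _ in range(num_query):
        # Sample a Gaussian direction and probe f at x + sigma * direction.
        direction = [rng.gauss(0.0, 1.0) for _ in x]
        probe = [xi + sigma * di for xi, di in zip(x, direction)]
        diff = (f(probe) - base) / sigma
        for i, di in enumerate(direction):
            grad[i] += diff * di / num_query
    return grad


def toy_score(x):
    """Toy black-box objective (a stand-in for a text-similarity score)."""
    return -sum((xi - 1.0) ** 2 for xi in x)


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.0, 0.0, 0.0]
    for _ in range(200):
        # Gradient ascent using only function evaluations (queries).
        g = rgf_gradient(toy_score, x, sigma=0.1, num_query=50, rng=rng)
        x = [xi + 0.05 * gi for xi, gi in zip(x, g)]
    print([round(xi, 2) for xi in x])
```

In this sketch, a larger `num_query` lowers the variance of the estimate at the cost of more model queries per step, while `sigma` sets how far each probe moves from the current point.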

You can tune the hyperparameters (e.g., `--steps`, `--sigma`, `--num_query`) in the bash script.

## Evaluation
Finally, we compute the CLIP score between the text features of (1) the predefined targeted text and (2) the predicted text responses of the adversarial images.
~~~bash
CUDA_VISIBLE_DEVICES=0 python eval_clip_text_score.py  \
--batch_size 200 \
--num_samples 50000 \
--pred path_to_the_pred_text
~~~
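
A score of this kind is commonly computed as the cosine similarity between the two text embeddings, averaged over all pairs. The sketch below shows that computation on plain Python lists, assuming the embeddings have already been extracted with a CLIP text encoder. The function names and toy vectors are hypothetical, and `eval_clip_text_score.py` may differ in its details.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_text_score(target_feats, pred_feats):
    """Average similarity between matched target and predicted text features."""
    scores = [cosine_similarity(t, p) for t, p in zip(target_feats, pred_feats)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy 2-D "embeddings" for illustration; real CLIP features are much larger.
    targets = [[1.0, 0.0], [0.0, 1.0]]
    preds = [[0.9, 0.1], [0.2, 0.8]]
    print(mean_text_score(targets, preds))
```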

## Acknowledgement

We appreciate the wonderful base implementations of Stable-Diffusion, Unidiffuser, BLIP/BLIP-2, MiniGPT-4 and LLaVA. This code release uses Unidiffuser.

We also thank OpenAI for the open-source evaluation code from CLIP.


