
Anonymized code for the paper: Matryoshka Query Transformer for Large Vision-Language Models

## Environment Setup


```Shell
conda create -n matry python=3.10 -y
conda activate matry
pip install --upgrade pip  
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
```

## Data Preparation

Refer to LLaVA's [data preparation guide](docs/Data.md).

## Training

For stage 1 training, edit `output_dir` [here](scripts/v1_5/pretrain_querys.sh) to set where your projector is saved, then run:

```Shell
sh pretrain.sh
```

For stage 2 training, set `pretrain_mm_query_abstractor` [here](scripts/v1_5/finetune_llava7b.sh) to the projector produced by stage 1.
The path must include the filename, e.g. `path/to/your/query_abstractor.bin`; don't forget the trailing `query_abstractor.bin`.
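
Since a missing filename is an easy mistake here, a small pre-flight check may help. The sketch below is not part of the repo: `check_projector` is a hypothetical helper that only verifies the path ends in `query_abstractor.bin` and that the file exists.

```Shell
# Hypothetical helper (not part of this repo): sanity-check the stage 1
# projector path before launching stage 2.
check_projector() {
    # $1: full path to the projector, including the filename
    case "$1" in
        query_abstractor.bin|*/query_abstractor.bin) ;;
        *)
            echo "path must end in query_abstractor.bin: $1" >&2
            return 1
            ;;
    esac
    if [ ! -f "$1" ]; then
        echo "file not found: $1" >&2
        return 1
    fi
}
```

For example, `check_projector path/to/your/query_abstractor.bin && sh train.sh` would start stage 2 only if the check passes.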

Then edit `output_dir` and run:

```Shell
sh train.sh 
```

By default, both stages train the original abstractor with a fixed 256 queries. To use Matryoshka training instead,
uncomment the two lines [here](llava/model/multimodal_projector/builder.py#L233-L234) and remove the hardcoded `matry=256`.
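
Before editing, it can be useful to print the linked lines and confirm they are the ones you expect. The sketch below uses plain `sed`; `show_lines` is a hypothetical helper name, and the line numbers are taken from the link above, so they may shift if the file changes.

```Shell
# Hypothetical helper: print an inclusive line range of a file.
show_lines() {
    # $1: file, $2: first line, $3: last line
    sed -n "${2},${3}p" "$1"
}
```

For example, `show_lines llava/model/multimodal_projector/builder.py 233 234` prints the two lines referenced above.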


## Evaluation 

Following LLaVA's [evaluation guide](docs/Evaluation.md), we use the eval scripts in the [LLaVA repo](scripts/v1_5/eval/).