# Your transformer can do Arithmetic!

This repository contains the code to replicate our research. It is our own private fork of the language model training framework from "cramming".

# How to run the code

Run `pip install .` to install all dependencies.

In this code, abacus embeddings are called `recycle`.
You must set a `$cramming_base_dir` to use the code.

## Requirements in Detail:
* PyTorch: `torch` (at least version 2.1)
* huggingface: `transformers`, `tokenizers`, `datasets`, `evaluate`
* `hydra-core`
* `psutil`, `pynvml`, `safetensors`
* `einops`

### WandB

You can log runs to your Weights & Biases account. To do so, modify `wandb.entity` and `wandb.project` on the command line or in `cramming/config/wandb/default.yaml`.

# Arithmetic
First you need to create your datasets! Store them in `$cramming_base_dir/data`. Our `$cramming_base_dir` is `cramming-data`; we recommend you use the same so that the code works without changes.
All the commands are in the `shells` directory.
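
The directory layout described above can be prepared from Python before launching any script. This is only a sketch: it assumes `cramming_base_dir` is read as an environment variable, which this README does not confirm, and a variable set this way is visible only to this process and the processes it starts.

```python
import os
from pathlib import Path

# Assumption: the code reads `cramming_base_dir` from the environment.
# setdefault keeps an existing value and otherwise uses the recommended name.
base_dir = Path(os.environ.setdefault("cramming_base_dir", "cramming-data"))

# Datasets are expected under $cramming_base_dir/data.
data_dir = base_dir / "data"
data_dir.mkdir(parents=True, exist_ok=True)

print(f"Datasets should be stored in: {data_dir}")
```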

You may need to install:
1. `pip install multiprocess -U`
2. `pip install dill -U`
3. `pip install apache-beam -U`

You WILL need to install:
1. `pip install wandb`
2. `pip install matplotlib seaborn python-docx`


You need to generate your datasets before training models. They are also available pretokenized at `link anonym`.

# Odd stuff

## Multi-GPU
Replace `python` with `torchrun --nproc_per_node=<NUM GPUS> --standalone` and add `impl.fullgraph=false`.

## Mask Before Equals
`arch.mask_before_equals=True`
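
As an illustration of what masking before the equals sign could mean for the loss, here is a minimal sketch. It is not the implementation in this repository: the function name, the toy token ids, and the choice to also mask the `=` token itself are assumptions. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_before_equals(token_ids, equals_id):
    """Return labels in which every token up to and including the first '=' is ignored."""
    labels = list(token_ids)
    if equals_id not in labels:
        return labels  # no '=' in the sequence, so nothing is masked
    cut = labels.index(equals_id)
    for i in range(cut + 1):
        labels[i] = IGNORE_INDEX
    return labels


# Hypothetical toy vocabulary: digits map to themselves, 10 is '+', 11 is '='.
question_and_answer = [2, 1, 10, 4, 11, 6, 1]  # "21+4=61" (answer written reversed)
print(mask_before_equals(question_and_answer, equals_id=11))
# [-100, -100, -100, -100, -100, 6, 1]
```

With labels like these, only the answer tokens contribute to the loss.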

## Skip Connections in a FF Network
`arch.forward_only_model_with_skip=True`

# Odd bits for multiplication runs
## Give samples instead of tokens equal importance in loss
`arch.loss_reduction=none`
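
The difference between weighting tokens equally and weighting samples equally can be seen on a toy batch. The sketch below uses plain Python lists and made-up loss values; how the repository reduces the un-reduced loss internally is an assumption here.

```python
def token_mean(per_token_losses):
    """Average over all tokens in the batch: long samples dominate."""
    flat = [loss for sample in per_token_losses for loss in sample]
    return sum(flat) / len(flat)


def sample_mean(per_token_losses):
    """Average within each sample first, then across samples: every sample counts once."""
    per_sample = [sum(sample) / len(sample) for sample in per_token_losses]
    return sum(per_sample) / len(per_sample)


# Made-up per-token losses: a short, easy sample and a long, hard one.
losses = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]
print(token_mean(losses))   # 3.25
print(sample_mean(losses))  # 2.5
```

In multiplication the answer length varies a lot between samples, which is where the two averages differ most.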

## Throttle
`arch.throttle=True`

# Positional Embeddings:
## Old School
`arch.embedding.pos_embedding`

### Learned
`arch.embedding.pos_embedding=learned`

### Recycle
`arch.embedding.pos_embedding=recycle`

If you want the maximum k in recycle to be larger, set `arch.embedding.max_recycle_len`; by default this value is 100 (`arch.embedding.max_recycle_len=100`).
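
To make the idea behind recycle (abacus) positions concrete, here is a minimal sketch of how position ids could be assigned: each digit gets its index within its own number, and during training one random offset is added to all of them. The function name, the character-level tokens, and the exact offset range are assumptions, not the code in this repository.

```python
import random

DIGITS = set("0123456789")


def recycle_position_ids(tokens, max_recycle_len=100, train=True, rng=random):
    """Give each digit its 1-based index within its own number; non-digits get 0."""
    # Assumption: a single offset per sequence, drawn so ids start in 1..max_recycle_len.
    offset = rng.randint(0, max_recycle_len - 1) if train else 0
    ids, run = [], 0
    for tok in tokens:
        if tok in DIGITS:
            run += 1  # next digit of the current number
            ids.append(run + offset)
        else:
            run = 0  # any non-digit token ends the current number
            ids.append(0)
    return ids


print(recycle_position_ids(list("123+45=168"), train=False))
# [1, 2, 3, 0, 1, 2, 0, 1, 2, 3]
```

Because the count restarts for every number, digits of the same significance share a position id when the numbers are written least significant digit first.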

## NoPE
`arch.embedding.pos_embedding=None`

## In Attention Mech
### FIRE
`arch.embedding.pos_embedding=None arch.attention.type="self-attention" arch.attention.rotary_embedding="fire"`

#### FIRE randomised
`arch.attention.max_length=128`. By default `arch.attention.max_length=0`. Setting it to a value longer than the maximum sequence length gives some randomness in the embedding.

### Alibi
`arch.embedding.pos_embedding=None arch.attention.type="self-attention" arch.attention.rotary_embedding="alibi"`
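
For readers unfamiliar with ALiBi, the sketch below shows the general technique: a per-head linear penalty on attention scores that grows with the distance between query and key. It follows the standard formulation for a power-of-two number of heads and is not taken from this repository, whose implementation may differ.

```python
def alibi_slopes(n_heads):
    """Geometric sequence of head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Causal bias matrix: -slope * (i - j) for keys j <= i, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])      # 0.5 0.00390625
print(alibi_bias(3, slopes[0])[2])  # [-1.0, -0.5, -0.0]
```

The bias is added to the attention logits before the softmax, so no positional embedding is needed at the input.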

### Kerple
`arch.embedding.pos_embedding=None arch.attention.type="self-attention" arch.attention.rotary_embedding="kerple"`

### Relative
`arch.embedding.pos_embedding=None arch.attention.type="self-attention" arch.attention.rotary_embedding="relative"`

## RoPE
`arch.attention.type="self-attention" arch.attention.rotary_embedding=true`

# Train from scratch
Look in the `shells` directory.

# Eval
Detailed in `shells/evaluation.sh`
We use the same `max_rec` in testing as in training.
For the 100 grids, `<STEP_NUM>` can be 1 through 10 inclusive and `checkerboard` can be odd or even, making a total of 20 jobs per grid.

`+max_rec=<max_rec> +token_limit=105 +big_eval_step_<STEP_NUM>=True +reverse_inputs=True +checkerboard=<EVEN/ODD>`
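
The 20 jobs per grid can be enumerated with a short helper that fills in the template above. This is a sketch: the function name is hypothetical, the `max_rec` value passed in the example is a placeholder for whatever you trained with, and the upper-case `EVEN`/`ODD` spelling is assumed from the template.

```python
from itertools import product


def eval_overrides(max_rec, token_limit=105):
    """Build one override string per (step, checkerboard) pair: 10 x 2 = 20 jobs."""
    jobs = []
    for step, board in product(range(1, 11), ("EVEN", "ODD")):
        jobs.append(
            f"+max_rec={max_rec} +token_limit={token_limit} "
            f"+big_eval_step_{step}=True +reverse_inputs=True +checkerboard={board}"
        )
    return jobs


jobs = eval_overrides(max_rec=100)  # placeholder value; use your training max_rec
print(len(jobs))  # 20
print(jobs[0])
```

Each string is appended to the evaluation command from `shells/evaluation.sh`.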

Once you have the data in JSON files, there is code in `pretty_plotter.py` to help you make nicer "sewn together" plots.

# Datasets
See the shell script `shells/generate_and_tokenize_data.sh`.

## DeepMind index hints:
```
python create_data_split.py --bucket --op + --n 4 --m 4 --limit 100 --p 0.0 --dir_name temp --reverse_all --deepmind_index_hints
python create_data_split.py --tokenize --dir_name <?> --tokenizer_type gdm_index
```
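
For orientation, index hints prefix each digit with a letter that marks its position, so `42 + 39 = 81` becomes `a4b2 + a3b9 = a8b1`. The sketch below illustrates that format only. It is not the tokenizer selected by `--tokenizer_type gdm_index`, and how hints interact with `--reverse_all` in this repository is an assumption (here the digits are reversed first, then hinted).

```python
import string


def add_index_hints(digits):
    """Prefix each digit with a positional letter: '42' -> 'a4b2'."""
    if len(digits) > len(string.ascii_lowercase):
        raise ValueError("this sketch only supports up to 26 digits")
    return "".join(h + d for h, d in zip(string.ascii_lowercase, digits))


def hinted_sum(a, b, reverse=False):
    """Format 'a + b = a+b' with index hints, optionally reversing every number."""
    parts = [str(a), str(b), str(a + b)]
    if reverse:
        parts = [p[::-1] for p in parts]  # least significant digit first
    x, y, z = (add_index_hints(p) for p in parts)
    return f"{x} + {y} = {z}"


print(hinted_sum(42, 39))                # a4b2 + a3b9 = a8b1
print(hinted_sum(42, 39, reverse=True))  # a2b4 + a9b3 = a1b8
```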

# Contact

Please feel free to contact us with any questions, or open an issue on GitHub.
