# Molecule Property Prediction and Classification with Persistent Homology

## Introduction
This repository aims to improve molecular property prediction 
and classification by computing topological features of each molecule. 
Specifically, we decompose a molecule into subgraphs according to several 
factors (atom weights, bond strengths, and partial charges), taken in either 
increasing or decreasing order. We then generate the persistence diagram of each 
subgraph from its Vietoris-Rips simplicial complexes. These persistence 
diagrams can be vectorized via i) BettiCurve, ii) Entropy, and iii) Silhouette functions. 
Finally, we pass the feature vectors to a Siamese network that, given a 
drug-like molecule, discriminates between the targets it may inhibit.
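
To make the idea concrete, here is a minimal sketch, not the repository's implementation. It assumes a simplified setting: instead of Vietoris-Rips complexes, it tracks only connected components (0-dimensional homology) as atoms enter in increasing order of weight, and then vectorizes the resulting diagram as a Betti curve and a persistence entropy. The function names and the toy molecule are hypothetical.

```python
import math


def h0_persistence(node_values, edges):
    """Return (birth, death) pairs for connected components (H0).

    Nodes enter the filtration in increasing order of `node_values`; an edge
    is present once both of its endpoints are. When two components merge,
    the younger one dies (elder rule). Zero-length bars are dropped.
    """
    adjacency = {node: [] for node in node_values}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    parent = {}  # union-find forest over the nodes added so far

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    pairs = []
    for node in sorted(node_values, key=node_values.get):
        parent[node] = node
        now = node_values[node]
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                continue  # neighbor has not entered the filtration yet
            a, b = find(node), find(neighbor)
            if a == b:
                continue
            # Each root is the oldest node of its component, so its value is the birth.
            young, old = (a, b) if node_values[a] > node_values[b] else (b, a)
            if node_values[young] < now:
                pairs.append((node_values[young], now))
            parent[young] = old
    roots = {find(node) for node in node_values}
    pairs.extend((node_values[root], math.inf) for root in roots)
    return pairs


def betti_curve(pairs, thresholds):
    """Count the bars alive at each threshold t, i.e. with birth <= t < death."""
    return [sum(1 for b, d in pairs if b <= t < d) for t in thresholds]


def persistence_entropy(pairs):
    """Shannon entropy of the normalized lifetimes of the finite bars."""
    lifetimes = [d - b for b, d in pairs if math.isfinite(d) and d > b]
    total = sum(lifetimes)
    if total == 0:
        return 0.0
    return -sum((l / total) * math.log(l / total) for l in lifetimes)


# Toy example (hypothetical): heavy atoms of dimethyl ether, C-O-C,
# weighted by atomic mass.
weights = {"C1": 12.011, "O": 15.999, "C2": 12.011}
bonds = [("C1", "O"), ("O", "C2")]
diagram = h0_persistence(weights, bonds)
print(diagram)
print(betti_curve(diagram, [10.0, 12.011, 14.0, 15.999]))
print(persistence_entropy(diagram))
```

In this toy molecule the two carbons enter first as separate components and merge when the heavier oxygen enters, which gives one finite bar and one bar that never dies. Negating the weights would give the decreasing-order variant.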


## Setup
* We strongly recommend installing all dependencies into a separate conda virtual environment for easy package management.
* The dependencies are listed in [`requirements.txt`](requirements.txt). Install them by running the following in the root directory:
    ```shell
    $ conda create -n <myenv> --file requirements.txt python=3.9
    $ conda activate <myenv>
    $ pip install -e .
    $ python -m ipykernel install --user --name=<myenv>
    ```
* Please check that your installed versions match the versions listed in `requirements.txt`.
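
One way to compare installed versions against the requirements file is sketched below. It is only a sketch under assumptions: it assumes the file pins packages as `name==version` or conda-style `name=version`, and the helper names are hypothetical. Packages that are not Python distributions (for example `python` itself in a conda spec) will be reported as not installed.

```python
import re
from importlib import metadata

# Matches "name==version" and conda-style "name=version[=build]" (assumed formats).
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*={1,2}\s*([^\s=#]+)")


def parse_pins(lines):
    """Return {package: version} for lines that pin an exact version."""
    pins = {}
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_mismatches(pins):
    """Return {package: (wanted, installed)}; installed is None if missing."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


def check_requirements(path="requirements.txt"):
    """Print every pinned package whose installed version differs."""
    with open(path, encoding="utf-8") as handle:
        pins = parse_pins(handle)
    for name, (wanted, installed) in sorted(find_mismatches(pins).items()):
        print(f"{name}: expected {wanted}, found {installed}")
```

Calling `check_requirements()` from the root directory, inside the activated environment, prints nothing when every pinned package matches.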


## Getting benchmark datasets
1. Cleves-Jain: 
This is a relatively small dataset that contains 1149 compounds. It covers
22 different drug targets, and for each target it provides only 2-3 template active
compounds for model training. Each target also has 4 to 30 active compounds
reserved for model testing. Additionally, the dataset contains 850 decoy compounds. For
each target, the aim is to use the templates to find the actives in a pool that combines them with
the decoys. In other words, the same decoy set is used for all targets.

The dataset is available [here](https://www.jainlab.org/Public/SF-Test-Data-DrugSpace-2006.zip).


2. DUD-E Diverse: 
The DUD-E (Directory of Useful Decoys, Enhanced) dataset is a comprehensive
ligand dataset with 102 targets and approximately 1.5 million compounds. The targets are grouped
into 7 classes by protein type. The "Diverse subset" of DUD-E contains targets
from each class to give a balanced benchmark dataset for virtual screening (VS) methods. The Diverse subset contains
116,105 compounds from 8 target sets and 8 decoy sets. One decoy set is used per target.

The dataset is available [here](http://dude.docking.org/subsets/diverse).
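
The two benchmarks differ in how decoys are assigned: Cleves-Jain shares one decoy set across all targets, while DUD-E Diverse uses one decoy set per target. The sketch below shows one way the per-target screening pools could be assembled in each case. The function name, the data layout, and the compound IDs are hypothetical and do not reflect the file formats of either dataset.

```python
def build_pools(actives_by_target, decoys, shared=True):
    """Return {target: [(compound_id, label), ...]}; label 1 = active, 0 = decoy.

    shared=True:  `decoys` is one list reused for every target (Cleves-Jain style).
    shared=False: `decoys` is a dict {target: list} (DUD-E style).
    """
    pools = {}
    for target, actives in actives_by_target.items():
        target_decoys = decoys if shared else decoys[target]
        pools[target] = [(c, 1) for c in actives] + [(c, 0) for c in target_decoys]
    return pools


# Hypothetical compound IDs, for illustration only.
shared_pools = build_pools(
    {"target_a": ["act_1", "act_2"], "target_b": ["act_3"]},
    ["decoy_1", "decoy_2"],
    shared=True,
)
per_target_pools = build_pools(
    {"target_a": ["act_1"]},
    {"target_a": ["decoy_9"]},
    shared=False,
)
print(shared_pools)
print(per_target_pools)
```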



## Expected results
Please note that we have not released the source files yet.
In the meantime, the notebooks in the `test` folder show our results, which align with the numbers in the paper.


## Code style
To format and lint the code, run these steps manually in the root directory:

```shell
source format_and_lint.sh
format-and-lint .
```


## Citation

