To set up the 3 environments needed, run:
```
make create_environments
```
This will set up the 3 required environments:

- `cgr-frag` for the creation of the CGR fragments
- `ml-env` for the 'standard' ML models (GBM, kNN and RF)
- `dl-env` for ChemProp and the Multi-Task Neural Networks
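As a quick sanity check, a minimal sketch (assuming only that the `conda` executable is on your `PATH`) to confirm the three environments were created:

```python
# Sanity check: confirm the three conda environments exist
# (assumes the `conda` executable is on PATH).
import subprocess

envs = subprocess.run(
    ["conda", "env", "list"], capture_output=True, text=True, check=True
).stdout
for name in ("cgr-frag", "ml-env", "dl-env"):
    print(f"{name}: {'ok' if name in envs else 'MISSING'}")
```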
Included with the repository is a `.txt` file containing the reactions from the JACS 2022 paper, stored at `jacs_data_extraction_scripts/dataset/suzuki_USPTO_with_hetearomatic.txt`.
The data is already stored in `data/parsed_jacs_data/splits`, with a subfolder for each repetition and fold.
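For orientation, a minimal sketch that lists what each fold directory contains (assuming only the `split_{idx}/fold_{idx}` layout described in this README):

```python
# List what each pre-computed split/fold directory contains; only the
# split_{idx}/fold_{idx} layout described in this README is assumed.
from pathlib import Path

splits_dir = Path("data/parsed_jacs_data/splits")
for fold_dir in sorted(splits_dir.glob("split_*/fold_*")):
    contents = sorted(p.name for p in fold_dir.iterdir())
    print(fold_dir, contents)
```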
The fragments are large (~1.6 GB), so they are not included in the repository. To generate them, run:
```
conda activate cgr-frag
python src/scripts/generate_frags_for_splits.py --splits_dir data/parsed_jacs_data/splits --n_folds 5
conda deactivate
```
This will create the fragments under the `data/parsed_jacs_data/splits` folder (`split_{idx}/fold_{idx}/id_to_frag.pkl`).
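To sanity-check the output, a minimal sketch that loads one of the generated pickles (assuming only that it holds a mapping keyed by reaction ID; the value type depends on the fragmentation code):

```python
# Inspect the fragments generated for one split/fold. Only a dict keyed by
# reaction ID is assumed here; the value type depends on the fragmentation code.
import pickle
from pathlib import Path

frag_path = Path("data/parsed_jacs_data/splits/split_0/fold_0/id_to_frag.pkl")
with frag_path.open("rb") as f:
    id_to_frag = pickle.load(f)

print(f"{len(id_to_frag)} fragmented reactions")
first_id, first_frag = next(iter(id_to_frag.items()))
print(first_id, type(first_frag))
```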
If you want to re-create the dataset, run:
```
make_jacs_dataset_and_frags.sh
```
This will parse the reactions, atom-map them, drop duplicates, and fragment the CGRs.
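As a rough illustration of the duplicate-dropping step only (the real pipeline is the script above; the tab delimiter is an assumption about the raw `.txt` export):

```python
# Illustrative sketch of the duplicate-dropping step only; the actual
# parsing, atom-mapping and fragmentation are done by the script above.
# The tab delimiter is an assumption about the raw .txt export.
import pandas as pd

df = pd.read_csv(
    "jacs_data_extraction_scripts/dataset/suzuki_USPTO_with_hetearomatic.txt",
    sep="\t",
)
before = len(df)
df = df.drop_duplicates()
print(f"{before} rows -> {len(df)} unique reactions")
```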
Optimised hyperparameter configurations can be found in the `hpopt/coarse_solvs` folder, with config files for the RF, GBM, D-MPNN and CGR-MTNN models.
Complete configs, with the full range of options for the models, can be found in the `configs` folder. Both the MorganFP and CGR-MTNN models were built with PyTorch Lightning and can be run from the command line via:
```
python src/modelling/multitask_nn.py {fit/test} -c {config_file}
```
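For context, the `{fit/test}` subcommands and the `-c` flag come from PyTorch Lightning's `LightningCLI`. A minimal sketch of that entry-point pattern (illustrative only; the actual model and datamodule classes are wired up inside the repo's script):

```python
# Minimal LightningCLI entry point of the kind multitask_nn.py exposes
# (illustrative only; the real script wires in the repo's own model and
# datamodule classes).
from pytorch_lightning.cli import LightningCLI

if __name__ == "__main__":
    # With no classes fixed here, the config passed via -c selects the model
    # and datamodule by class_path; `fit` and `test` are the subcommands.
    LightningCLI()
```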
To reproduce the results of the paper (without reoptimising the hyperparameters), run:
```
test_all_models_cpu.sh -s coarse
```
This will train and test all models that do not use GPU acceleration (i.e. Pop. Baseline, Sim. Baseline and RF) on the coarse solvent classes. Then run:
```
test_all_models_gpu.sh -s coarse
```
This will train and test all models that use GPU acceleration (i.e. CGR GBM, CGR MTNN, Morgan MTNN and CGR D-MPNN).
Repeat for the 'fine' solvent classification. Since we optimised the hyperparameters on the 'coarse' solvent classes, use:
```
test_all_models_cpu.sh -s fine -c
test_all_models_gpu.sh -s fine -c
```
where the `-c` flag indicates that the coarse hyperparameters are used.
[Note that this requires the `hpopt` folder to contain the optimised hyperparameters for the CGR RF, CGR GBM and CGR D-MPNN models, and the `configs` folder to contain the config files for the CGR MTNN and Morgan MTNN models.]
Finally, to generate exact predictions for analysis of the impact of clustering, run:
```
generate_exact_predictions.sh
```
To re-run the hyperparameter optimisation (very slow), run:
```
hpopt_cpu.sh -s coarse
```
to optimise the hyperparameters of the CGR RF models, and:
```
hpopt_gpu.sh -s coarse
```
to optimise the hyperparameters of the CGR GBM, CGR MTNN and CGR D-MPNN models. NOTE: this will take over 16 hours on a single GPU.
Optimised configs will be placed in the `hpopt/coarse_solvs` directory. Note that the optimised hyperparameters for the CGR MTNN will need to be copied into the respective config files at `configs/coarse_solvs/cgr_{model_type}/split_seed_{split}/{train/test}.yml`, and similarly at `configs/fine_solvs_coarse_hparams/cgr_{model_type}/split_seed_{split}/{train/test}.yml`.
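A minimal sketch of that copy step (the hpopt output filename and the split seeds here are hypothetical, and the YAML files are assumed to be flat key/value mappings):

```python
# Hypothetical helper for merging optimised CGR MTNN hyperparameters into the
# per-split configs. The hpopt output filename and the split seeds are
# assumptions; the YAML files are assumed to be flat key/value mappings.
import yaml
from pathlib import Path

best = yaml.safe_load(Path("hpopt/coarse_solvs/cgr_mtnn.yml").read_text())
for split in (0, 1, 2):  # assumed split seeds
    for stage in ("train", "test"):
        cfg_path = Path(f"configs/coarse_solvs/cgr_mtnn/split_seed_{split}/{stage}.yml")
        cfg = yaml.safe_load(cfg_path.read_text())
        cfg.update(best)  # optimised values take precedence
        cfg_path.write_text(yaml.safe_dump(cfg))
```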
These configs may look like they have incorrect arguments in places, but those arguments are overridden when the configs are used in `src/scripts/test_models_across_splits.py`.
The analysis from the paper can be generated with these 2 notebooks: