To set up the 3 environments needed, run:
```
make create_environments
```
This will set up the 3 required environments:

- `cgr-frag` for the creation of the CGR fragments
- `ml-env` for the 'standard' ML models (GBM, kNN and RF)
- `dl-env` for ChemProp and the Multi-Task Neural Networks
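As a quick sanity check, a minimal sketch (assuming only that the `conda` executable is on your `PATH`) to confirm the three environments were created:

```python
# Sanity check: confirm the three conda environments exist
# (assumes the `conda` executable is on PATH).
import subprocess

envs = subprocess.run(
    ["conda", "env", "list"], capture_output=True, text=True, check=True
).stdout
for name in ("cgr-frag", "ml-env", "dl-env"):
    print(f"{name}: {'ok' if name in envs else 'MISSING'}")
```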
Included with the repository is a `.txt` file containing the reactions from the JACS 2022 paper, stored at `jacs_data_extraction_scripts/dataset/suzuki_USPTO_with_hetearomatic.txt`.
The data is already stored in `data/parsed_jacs_data/splits`, with a subfolder for each repetition and fold.
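For orientation, a minimal sketch that lists what each fold directory contains (assuming only the `split_{idx}/fold_{idx}` layout described in this README):

```python
# List what each pre-computed split/fold directory contains; only the
# split_{idx}/fold_{idx} layout described in this README is assumed.
from pathlib import Path

splits_dir = Path("data/parsed_jacs_data/splits")
for fold_dir in sorted(splits_dir.glob("split_*/fold_*")):
    contents = sorted(p.name for p in fold_dir.iterdir())
    print(fold_dir, contents)
```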
The fragments are large (~1.6 GB), so they are not included in the repository. To generate them, run:
```
conda activate cgr-frag
python src/scripts/generate_frags_for_splits.py --splits_dir data/parsed_jacs_data/splits --n_folds 5
conda deactivate
```
This will create the fragments under the `data/parsed_jacs_data/splits` folder (`split_{idx}/fold_{idx}/id_to_frag.pkl`).
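To sanity-check the output, a minimal sketch that loads one of the generated pickles (assuming only that it holds a mapping keyed by reaction ID; the value type depends on the fragmentation code):

```python
# Inspect the fragments generated for one split/fold. Only a dict keyed by
# reaction ID is assumed here; the value type depends on the fragmentation code.
import pickle
from pathlib import Path

frag_path = Path("data/parsed_jacs_data/splits/split_0/fold_0/id_to_frag.pkl")
with frag_path.open("rb") as f:
    id_to_frag = pickle.load(f)

print(f"{len(id_to_frag)} fragmented reactions")
first_id, first_frag = next(iter(id_to_frag.items()))
print(first_id, type(first_frag))
```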
If you want to re-create the dataset, run:
```
make_jacs_dataset_and_frags.sh
```
This will parse the reactions, atom-map them, drop duplicates, and fragment the CGRs.
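As a rough illustration of the duplicate-dropping step only (the real pipeline is the script above; the tab delimiter is an assumption about the raw `.txt` export):

```python
# Illustrative sketch of the duplicate-dropping step only; the actual
# parsing, atom-mapping and fragmentation are done by the script above.
# The tab delimiter is an assumption about the raw .txt export.
import pandas as pd

df = pd.read_csv(
    "jacs_data_extraction_scripts/dataset/suzuki_USPTO_with_hetearomatic.txt",
    sep="\t",
)
before = len(df)
df = df.drop_duplicates()
print(f"{before} rows -> {len(df)} unique reactions")
```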
Optimised hyperparameter configurations can be found in the `hpopt/coarse_solvs` folder, with config files for the RF, GBM, D-MPNN and CGR-MTNN models.
Complete configs, with the full range of options for the models, can be found in the `configs` folder. Both the MorganFP and CGR-MTNN models were built with PyTorch Lightning and can be run from the command line via:
```
python src/modelling/multitask_nn.py {fit/test} -c {config_file}
```
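For context, the `{fit/test}` subcommands and the `-c` flag come from PyTorch Lightning's `LightningCLI`. A minimal sketch of that entry-point pattern (illustrative only; the actual model and datamodule classes are wired up inside the repo's script):

```python
# Minimal LightningCLI entry point of the kind multitask_nn.py exposes
# (illustrative only; the real script wires in the repo's own model and
# datamodule classes).
from pytorch_lightning.cli import LightningCLI

if __name__ == "__main__":
    # With no classes fixed here, the config passed via -c selects the model
    # and datamodule by class_path; `fit` and `test` are the subcommands.
    LightningCLI()
```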
To reproduce the results of the paper (without reoptimising the hyperparameters), run:
```
test_all_models_cpu.sh -s coarse
```
This will train and test all models that do not use GPU acceleration (i.e. Pop. Baseline, Sim. Baseline and RF) on the coarse solvent classes. Then run:
```
test_all_models_gpu.sh -s coarse
```
This will train and test all models that use GPU acceleration (i.e. CGR GBM, CGR MTNN, Morgan MTNN and CGR D-MPNN).
Repeat for the 'fine' solvent classification. Since we optimised the hyperparameters on the 'coarse' solvent classes, use:
```
test_all_models_cpu.sh -s fine -c
test_all_models_gpu.sh -s fine -c
```
where the `-c` flag indicates that the coarse hyperparameters are used.
[Note that this requires the `hpopt` folder to contain the optimised hyperparameters for the CGR RF, CGR GBM and CGR D-MPNN models, and the `configs` folder to contain the config files for the CGR MTNN and Morgan MTNN models.]
Finally, to generate exact predictions for analysis of the impact of clustering, run:
```
generate_exact_predictions.sh
```
To re-run the hyperparameter optimisation (very slow), run:
```
hpopt_cpu.sh -s coarse
```
to optimise the hyperparameters of the CGR RF models, and:
```
hpopt_gpu.sh -s coarse
```
to optimise the hyperparameters of the CGR GBM, CGR MTNN and CGR D-MPNN models. NOTE: this will take over 16 hours on a single GPU.
Optimised configs will be placed in the `hpopt/coarse_solvs` directory. Note that the optimised hyperparameters for the CGR MTNN will need to be copied into the respective config files at `configs/coarse_solvs/cgr_{model_type}/split_seed_{split}/{train/test}.yml`, and similarly at `configs/fine_solvs_coarse_hparams/cgr_{model_type}/split_seed_{split}/{train/test}.yml`.
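A minimal sketch of that copy step (the hpopt output filename and the split seeds here are hypothetical, and the YAML files are assumed to be flat key/value mappings):

```python
# Hypothetical helper for merging optimised CGR MTNN hyperparameters into the
# per-split configs. The hpopt output filename and the split seeds are
# assumptions; the YAML files are assumed to be flat key/value mappings.
import yaml
from pathlib import Path

best = yaml.safe_load(Path("hpopt/coarse_solvs/cgr_mtnn.yml").read_text())
for split in (0, 1, 2):  # assumed split seeds
    for stage in ("train", "test"):
        cfg_path = Path(f"configs/coarse_solvs/cgr_mtnn/split_seed_{split}/{stage}.yml")
        cfg = yaml.safe_load(cfg_path.read_text())
        cfg.update(best)  # optimised values take precedence
        cfg_path.write_text(yaml.safe_dump(cfg))
```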
These configs may look like they have incorrect arguments in places, but those arguments are overridden when the configs are used in `src/scripts/test_models_across_splits.py`.
The analysis from the paper can be generated with these 2 notebooks: