4. Tutorial on transformation process

Abdurrahman Abul-Basher edited this page Jul 7, 2021 · 14 revisions

Overview

reMap generates a pathway group dataset to improve the sensitivity of pathway prediction in both organismal and multi-organismal genomes. This tutorial walks you through the basic steps of the transformation process using either your own input data or the test data we provide. Once the input (in the format specified below), the trained model, and the other required files are provided, reMap generates a pathway group dataset that can be used for pathway prediction.

Note: Make sure to put the source code reMap/ (see Installing reMap) into the same directory as explained in the Download files section. Also, create a folder result/ in the same reMap_materials/ directory. The final structure of the folder should look like this:

reMap_materials/
    ├── model/
    │       └── ...
    ├── dataset/
    │       └── ...
    ├── result/
    │       └── ...
    └── reMap/
            └── ...
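
The layout above can be checked before running anything. The following is a minimal sketch, not part of reMap: the helper name `prepare_layout` is hypothetical, and it assumes that only result/ should be created automatically, since model/, dataset/, and reMap/ must come from the downloads.

```python
from pathlib import Path

# Folders that must already exist (downloaded data, models, and source code).
EXPECTED = ("model", "dataset", "reMap")


def prepare_layout(root):
    """Create result/ under root and return the expected folders that are missing.

    `prepare_layout` is a hypothetical helper for illustration only.
    """
    root = Path(root)
    # result/ starts out empty, so it is safe to create it here.
    (root / "result").mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

Calling `prepare_layout("reMap_materials")` returns an empty list when the folder is ready; any name it returns still needs to be downloaded or moved into place.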

For all experiments, using a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the reMap/ directory and then run the commands as shown in the Examples section.

To display reMap's running options, use: python main.py --help. The help text should be self-explanatory.

Input:

The input for pathway predictions is either a .pf file generated directly by MetaPathways v2 or a .pkl file generated after following the steps under Advanced usage. One can also use the files provided by us.

Files required for predictions:

In addition to the input data, some of the object files listed here are also required to carry out a successful run. The required object files include:

  • pathway_group.pkl
  • features.npz
  • centroid.npz
  • rho.npz
  • reMap.pkl
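
A quick pre-flight check can confirm that these files are in place. This sketch is illustrative only: the helper name is hypothetical, and the split of files between folders is an assumption (the pre-trained model in the model directory, everything else in the dataset directory).

```python
from pathlib import Path

# File names come from the list above. Which folder holds each file is an
# assumption: reMap.pkl in the model directory, the rest in the dataset one.
DATASET_FILES = ("pathway_group.pkl", "features.npz", "centroid.npz", "rho.npz")
MODEL_FILES = ("reMap.pkl",)


def missing_required_files(dspath, mdpath):
    """Return the paths of required files that cannot be found.

    `missing_required_files` is a hypothetical helper, not part of reMap.
    """
    expected = [Path(dspath) / name for name in DATASET_FILES]
    expected += [Path(mdpath) / name for name in MODEL_FILES]
    return [path for path in expected if not path.is_file()]
```

An empty result means every file was found where this sketch expects it; adjust the two tuples if your files are organized differently.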

Command:

The basic command is shown below. Do not use it as-is to run the transformation process; it only illustrates all the flags used. See the Examples section below for how to carry out this task.

python main.py \
--transform \
--ssample-label-size 50 \
--bags-labels "pathway_group.pkl" \
--features-name "features.npz" \
--bag-centroid-name "centroid.npz" \
--rho-name "rho.npz" \
--X-name "[DATANAME]_X.pkl" \
--y-name "[DATANAME]_y.pkl" \
--file-name "[save file name]" \
--model-name "reMap" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--batch 30 \
--num-jobs 2
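
When the command is launched from a script, filling the placeholders programmatically avoids quoting mistakes. The sketch below assembles the same flags into an argument list. The function name and its parameters are hypothetical; the flag names and default values are the ones shown above.

```python
import sys


def build_transform_command(data_name, save_name, dspath, mdpath,
                            sample_size=50, batch=30, num_jobs=2):
    """Return the transformation command as a list of arguments.

    Hypothetical helper: it only fills the [DATANAME], [save file name],
    and path placeholders of the representative command.
    """
    return [
        sys.executable, "main.py",
        "--transform",
        "--ssample-label-size", str(sample_size),
        "--bags-labels", "pathway_group.pkl",
        "--features-name", "features.npz",
        "--bag-centroid-name", "centroid.npz",
        "--rho-name", "rho.npz",
        "--X-name", f"{data_name}_X.pkl",
        "--y-name", f"{data_name}_y.pkl",
        "--file-name", save_name,
        "--model-name", "reMap",
        "--dspath", str(dspath),
        "--mdpath", str(mdpath),
        "--batch", str(batch),
        "--num-jobs", str(num_jobs),
    ]
```

The returned list can be handed to `subprocess.run(cmd, cwd=..., check=True)`, with `cwd` pointing at the src/ folder of reMap/, so that main.py is found.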

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

| Argument name | Description | Value |
| --- | --- | --- |
| --transform | Transform pathway data to pathway group data using reMap | False |
| --ssample-label-size | Maximum number of pathways to be sampled | 50 |
| --bags-labels | The input file name for the pathway groups, which maps each group to its associated pathways | pathway_group.pkl |
| --features-name | The input file name for the features corresponding to pathways | features.npz |
| --bag-centroid-name | The input file name for the pathway group centroids | centroid.npz |
| --rho-name | The input file name for the pathway group correlation | rho.npz |
| --X-name | The input file name to be provided for transformation | [DATANAME]_X.pkl |
| --y-name | The input file name to be provided for transformation | [DATANAME]_y.pkl |
| --file-name | The file name (without extension) used as the prefix for the saved output | [save file name] |
| --model-name | The name of the model, excluding any extension | reMap |
| --dspath | The path to the datasets | Outside source code |
| --mdpath | The path to the pre-trained model (e.g. reMap.pkl) | Outside source code |
| --batch | Batch size | 30 |
| --num-jobs | The number of parallel workers | 2 |

Output:

The output files generated after running the command are:

| File | Description |
| --- | --- |
| [save file name]_B.pkl | A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms, and the columns indicate EC number indices and embeddings. |
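
To inspect the generated file from Python, something like the following may help. It assumes the .pkl file is a standard Python pickle, which the extension suggests but this page does not confirm. Because the number of objects stored in the file is not documented here, the sketch reads every pickled object it finds. The helper name is hypothetical. Unpickling may also require the libraries used to create the matrix to be installed, and only files from a trusted source should be unpickled.

```python
import pickle


def load_pickled_objects(path):
    """Return every object stored in a pickle file, in order.

    Hypothetical helper: it assumes a standard pickle file and keeps
    reading until the end of the file is reached.
    """
    objects = []
    with open(path, "rb") as handle:
        while True:
            try:
                objects.append(pickle.load(handle))
            except EOFError:
                # No more pickled objects in the file.
                break
    return objects
```

For each returned object, `type(obj)` and `getattr(obj, "shape", None)` give a first impression of what was stored.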

Examples

To transform the "golden" data (golden_X.pkl and golden_y.pkl) into a pathway group dataset using a pre-trained model ("reMap.pkl"), execute the following command:

python main.py --transform --ssample-label-size 50 --bags-labels "pathway_group.pkl" --features-name "features.npz" --bag-centroid-name "centroid.npz" --rho-name "rho.npz" --X-name "golden_X.pkl" --y-name "golden_y.pkl" --file-name "biocyc_golden" --model-name "reMap" --batch 30 --num-jobs 2

Upon executing this command, "biocyc_golden_B.pkl" will be produced. The name follows the [save file name]_B.pkl pattern described in the Output section, with the prefix taken from --file-name "biocyc_golden".

After running the command, the output will be saved to the dataset/ folder (the "dspath" location). The tree structure for the folder with the outputs will look like this:

reMap_materials/
    ├── model/
    │       └── ...
    ├── dataset/
    │       ├── golden_X.pkl
    │       ├── golden_y.pkl
    │       ├── biocyc_golden_B.pkl
    │       └── ...
    └── reMap/
            └── ...
