5. Advanced usage

[IN PROGRESS]

Overview

To train reMap, the main input data consist of i) EC number indices ("biocyc205_tier23_9255_X.pkl") and ii) pathway indices ("biocyc205_tier23_9255_y.pkl"). The remaining files can be generated with the preprocessing flags described in the Preprocessing section.

Note: Make sure to put the source code reMap/ (see Installing reMap) into the same directory as explained in the Download files section. Additionally, create log/ and result/ folders (if you have not already created them during pathway prediction) in the same reMap_materials/ directory. The final structure should look like this:

reMap_materials/
	├── model/
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       └── ...
	├── log/
        │       └── ...
	└── reMap/
                └── ...
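
The folder setup above can also be scripted. The following is a minimal sketch, assuming the layout shown in the tree. The helper name `ensure_layout` is hypothetical and not part of reMap. It creates log/ and result/ if they are missing and reports which of the other expected folders are absent.

```python
from pathlib import Path

# Folder names taken from the tree above.
CREATED_IF_MISSING = ("log", "result")
EXPECTED = ("model", "dataset", "reMap")


def ensure_layout(root):
    """Create log/ and result/ under root; return expected folders that are absent.

    Hypothetical helper for illustration, not part of reMap.
    """
    root = Path(root)
    for name in CREATED_IF_MISSING:
        # exist_ok=True keeps folders created earlier (e.g. during prediction).
        (root / name).mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

Calling `ensure_layout("reMap_materials")` returns an empty list once model/, dataset/, and reMap/ are all in place.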

For all experiments, using a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the reMap/ directory and then run the commands as shown in the Examples section of Preprocessing and Training.

To display reMap's running options, use: python main.py --help. The help output should be self-explanatory.

Preprocessing

This step is crucial, but it is only performed if users wish to build pathway group centroids and to recover the maximum expected pathways for each group. The outputs of this step are several supplementary files that are required for transformation and training, such as "[FILENAME]_centroid.npz", "[DATANAME]_B.pkl", etc.

Input:

The input files used for preprocessing are:

phi.npz
sigma.npz
pathway2vec_embeddings.npz
hin.pkl
vocab.pkl
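
Before preprocessing, it may be worth confirming that all five input files are present. The sketch below is one way to do that. The helper `find_missing_inputs` is hypothetical, and because this page does not state which folder holds each file, the function simply searches every directory passed to it (for example your model and dataset folders).

```python
from pathlib import Path

# File names from the input list above.
REQUIRED_INPUTS = (
    "phi.npz",
    "sigma.npz",
    "pathway2vec_embeddings.npz",
    "hin.pkl",
    "vocab.pkl",
)


def find_missing_inputs(search_dirs):
    """Return required input files not found in any of search_dirs.

    Hypothetical helper; which folder holds each file is an assumption
    left to the caller.
    """
    dirs = [Path(d) for d in search_dirs]
    return [
        name
        for name in REQUIRED_INPUTS
        if not any((d / name).is_file() for d in dirs)
    ]
```

An empty list means every required file was found in at least one of the given directories.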

Command:

The basic command is shown below. Do not run it as-is for preprocessing; it only illustrates all the available flags. See the Examples section below for how to preprocess your datasets.

python main.py \
--define-bags \
--recover-max-bags \
--alpha 16 \
--top-k 90 \
--v-cos 0.1 \
--vocab-name "vocab.pkl" \
--bag-phi-name "phi.npz" \
--bag-sigma-name "sigma.npz" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--file-name "[input (or save) file name]" \
--y-name "[DATANAME]_y.pkl" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]"

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

Argument name	Description	Value
--define-bags	Whether to construct pathway groups centroids	False
--recover-max-bags	Whether to recover the maximum number of pathway groups	False
--alpha	A hyper-parameter for controlling pathway groups centroids	16
--top-k	Top k pathways to be considered for each pathway group	90
--v-cos	A cutoff threshold for cosine similarity	0.1
--vocab-name	A dictionary file representing pathway indices as keys and MetaCyc pathway ids as values	vocab.pkl
--bag-phi-name	The filename for pathways distribution over pathway groups	phi.npz
--bag-sigma-name	The filename for pathway groups covariance	sigma.npz
--hin-name	The heterogeneous information network file	hin.pkl
--features-name	The features corresponding to ECs and pathways	pathway2vec_embeddings.npz
--y-name	The input file name to be provided for preprocessing	[DATANAME]_y.pkl
--file-name	The names of input preprocessed files (without extension)	[input (or save) file name]
--mdpath	The path to the supplementary files	Outside source code
--dspath	The path to the datasets	Outside source code
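
To make the --v-cos cutoff more concrete, here is a generic sketch of filtering by cosine similarity. It is not reMap's implementation: how reMap applies the cutoff internally is not described on this page, and the function names, the `>=` comparison, and the toy vectors in the usage note are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def above_cutoff(query, candidates, v_cos=0.1):
    """Indices of candidates whose similarity to query is at least v_cos.

    Illustrative only; reMap may compare or select differently.
    """
    return [
        i for i, c in enumerate(candidates)
        if cosine_similarity(query, c) >= v_cos
    ]
```

With `query = [1.0, 0.0]` and candidates `[1.0, 0.0]`, `[0.0, 1.0]`, `[1.0, 1.0]`, `[-1.0, 0.0]`, only the first and third candidates pass a cutoff of 0.1, since the orthogonal and opposite vectors have similarities of 0 and -1.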

Output:

The output files generated after running the command are:

File	Description
[DATANAME]_B.pkl	A matrix file (stored in the "dspath" location) representing information about organisms (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate EC number indices and embeddings.
M.pkl	A matrix file (stored in the "dspath" location) representing the pathway-enzyme association with possible missing links due to white noise. It contains 2526 pathway indices shown in the first column and 3650 enzymes (represented as EC number indices) in the remaining columns. The file representation is similar to `pathway2ec.pkl`.
A.pkl	A matrix file (stored in the "dspath" location) representing the pathway-pathway interaction. It contains 2526 pathway indices shown in the first column with their interactions (2526 pathway indices) in the remaining columns.
B.pkl	A matrix file (stored in the "dspath" location) representing the enzyme-enzyme interaction. It contains 3650 enzymes (represented as EC number indices) shown in the first column with their interactions (3650 EC number indices) in the remaining columns.
P.pkl	A matrix file (stored in the "dspath" location) representing the pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to `pathway2vec_embeddings.npz`.
E.pkl	A matrix file (stored in the "dspath" location) representing the enzyme features (represented as EC number indices). It contains 3650 EC number indices shown in the first column and their 128 features in the remaining columns. The file representation is similar to `pathway2vec_embeddings.npz`.

Note: These files differ in their total number of columns. If you decide to train your own model based on the specifications mentioned above, use the same files for prediction that you used for training.
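
Because the column counts matter, it can help to check an output file's dimensions before reusing it. The sketch below assumes the .pkl files are standard Python pickles holding a single object, which the extension suggests but this page does not confirm. The helper `describe_pickle` is hypothetical. Loading a file that stores NumPy or SciPy objects requires those packages to be installed.

```python
import pickle


def describe_pickle(path):
    """Return (type name, shape) for the first object stored in a pickle file.

    shape is the object's .shape attribute when it has one (for example a
    NumPy or SciPy matrix), otherwise None. Hypothetical helper.
    """
    # pickle can execute code on load, so only open files you generated.
    with open(path, "rb") as handle:
        obj = pickle.load(handle)
    return type(obj).__name__, getattr(obj, "shape", None)
```

Comparing the reported shape of the file used for training with the one used for prediction is a quick way to catch a mismatch early.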

Examples

Example 1:

To construct bags, execute the following command:

python main.py --define-bags --alpha 16 --top-k 90 --hin-name "hin.pkl" --vocab-name "vocab.pkl" --bag-phi-name "phi.npz" --bag-sigma-name "sigma.npz" --features-name "pathway2vec_embeddings.npz" --file-name "temp"

After running the command, the output will be saved to the dataset/ folder. The tree structure for the folder with the outputs will look like this:

reMap_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── temp_centroid.npz
        │       ├── temp_exp_phi_trim.npz
        │       ├── temp_features.npz
        │       ├── temp_rho.npz
        │       ├── temp_labels_distr_idx.pkl
        │       ├── temp_idxvocab.pkl
        │       ├── temp_pathway_group.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── reMap/
                └── ...
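
To confirm that Example 1 produced everything shown in the listing above, a small check along these lines may be useful. The suffixes are copied from the tree; the helper `missing_outputs` is hypothetical, and the listing ends with "...", so the real run may write further files that this sketch does not know about.

```python
from pathlib import Path

# Suffixes copied from the dataset/ listing above.
DEFINE_BAGS_SUFFIXES = (
    "_centroid.npz",
    "_exp_phi_trim.npz",
    "_features.npz",
    "_rho.npz",
    "_labels_distr_idx.pkl",
    "_idxvocab.pkl",
    "_pathway_group.pkl",
)


def missing_outputs(dataset_dir, file_name="temp"):
    """Return the listed output files that are absent from dataset_dir.

    file_name is the value passed to --file-name. Hypothetical helper.
    """
    dataset_dir = Path(dataset_dir)
    expected = [file_name + suffix for suffix in DEFINE_BAGS_SUFFIXES]
    return [name for name in expected if not (dataset_dir / name).is_file()]
```

An empty result for `missing_outputs("dataset", "temp")` indicates that all seven listed files exist.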

Example 2:

To recover the maximum set of bags, execute the following command:

python main.py --recover-max-bags --alpha 16 --v-cos 0.1 --file-name "temp" --y-name "biocyc205_tier23_9255_y.pkl"

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.

reMap_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── temp_B.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── reMap/
                └── ...

Example 3:

To run both of the above steps with a single command, execute the following:

python main.py --define-bags --recover-max-bags --alpha 16 --top-k 90 --v-cos 0.1 --hin-name "hin.pkl" --vocab-name "vocab.pkl" --bag-phi-name "phi.npz" --bag-sigma-name "sigma.npz" --features-name "pathway2vec_embeddings.npz" --file-name "temp" --y-name "biocyc205_tier23_9255_y.pkl"

After running the command, the output will be saved to the dataset/ folder. All the feature files described in the table above are generated.
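
If you prefer to launch the combined run from a script, the Example 3 command can be assembled as an argument list. This is a sketch only: the wrapper `build_preprocess_command` is hypothetical, and it uses exactly the flags and values shown in Example 3, with the current Python interpreter standing in for `python`.

```python
import sys


def build_preprocess_command(file_name, y_name, alpha=16, top_k=90, v_cos=0.1):
    """Assemble the Example 3 command as an argument list.

    Hypothetical wrapper; the flags and default values are those shown above.
    """
    return [
        sys.executable, "main.py",
        "--define-bags", "--recover-max-bags",
        "--alpha", str(alpha),
        "--top-k", str(top_k),
        "--v-cos", str(v_cos),
        "--hin-name", "hin.pkl",
        "--vocab-name", "vocab.pkl",
        "--bag-phi-name", "phi.npz",
        "--bag-sigma-name", "sigma.npz",
        "--features-name", "pathway2vec_embeddings.npz",
        "--file-name", file_name,
        "--y-name", y_name,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, cwd=..., check=True)`, with `cwd` pointing at the src/ folder of your reMap/ checkout. Passing a list rather than a single string avoids shell quoting issues with file names.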
