
5. Advanced usage

Abdurrahman Abul-Basher edited this page Jul 7, 2021 · 24 revisions

[IN PROGRESS]

Overview

To train reMap, the main input data consists of i) EC number indices ("biocyc205_tier23_9255_X.pkl") and ii) pathway indices ("biocyc205_tier23_9255_y.pkl"). The remaining files can be generated with the preprocessing flags described below.

Note: Make sure to put the source code reMap/ (see Installing reMap) into the same directory as explained in the Download files section. Additionally, create log/ and result/ folders in the same reMap_materials/ directory. The final structure should look like this:

reMap_materials/
	├── model/
        │       └── ...
	├── dataset/
        │       └── ...
	├── result/
        │       └── ...
	├── log/
        │       └── ...
	└── reMap/
                └── ...
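
The layout above can be prepared and checked with a few lines of Python. This is a minimal sketch, not part of reMap: the helper name `prepare_layout` is hypothetical, and it assumes that only log/ and result/ should be created automatically, while model/, dataset/ and reMap/ come from the download and installation steps.

```python
from pathlib import Path

# Folder names taken from the layout shown above.
EXPECTED_DIRS = ("model", "dataset", "result", "log", "reMap")
# The note asks the user to create these two folders.
CREATE_IF_MISSING = ("log", "result")


def prepare_layout(root):
    """Create log/ and result/ under root; return expected folders still missing."""
    root = Path(root)
    for name in CREATE_IF_MISSING:
        (root / name).mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `prepare_layout("reMap_materials")` returns the names of any folders that still have to be downloaded or installed; an empty list means the structure matches the tree above.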

For all experiments, open a terminal (on Linux and macOS) or an Anaconda command prompt (on Windows), navigate to the src/ folder in the reMap/ directory, and then run the commands as shown in the Examples section of Preprocessing and Training.

To display reMap's running options, use: python main.py --help. The help output should be self-explanatory.


Preprocessing

This step is crucial, and is performed only if users wish to build pathway group centroids and to recover the maximum expected pathways for each group. The outputs of this step are several supplementary files that are required for transformation and training, such as "[FILENAME]_centroid.npz" and "[DATANAME]_B.pkl".

Input:

The input files used for preprocessing are:

  1. phi.npz
  2. sigma.npz
  3. pathway2vec_embeddings.npz
  4. hin.pkl
  5. vocab.pkl
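
Before preprocessing, it can help to confirm that all five files are in place. The sketch below is an illustration only: `missing_inputs` is a hypothetical helper, and because this page does not say which folder holds each file, the directories to search are left to the caller.

```python
from pathlib import Path

# File names exactly as listed above.
PREPROCESSING_INPUTS = (
    "phi.npz",
    "sigma.npz",
    "pathway2vec_embeddings.npz",
    "hin.pkl",
    "vocab.pkl",
)


def missing_inputs(search_dirs):
    """Return the input files that are not found in any of the given folders."""
    dirs = [Path(d) for d in search_dirs]
    return [
        name
        for name in PREPROCESSING_INPUTS
        if not any((d / name).is_file() for d in dirs)
    ]
```

For example, `missing_inputs(["../../model", "../../dataset"])` (paths are assumptions) returns an empty list when every input is available.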

Command:

The basic command is shown below. Do not run it as-is for preprocessing; it only illustrates all the available flags. See the Examples below for how to preprocess your datasets.

python main.py \
--define-bags \
--recover-max-bags \
--alpha 16 \
--top-k 90 \
--v-cos 0.1 \
--vocab-name "vocab.pkl" \
--bag-phi-name "phi.npz" \
--bag-sigma-name "sigma.npz" \
--hin-name "hin.pkl" \
--features-name "pathway2vec_embeddings.npz" \
--file-name "[FILENAME]" \
--y-name "[DATANAME]_y.pkl" \
--mdpath "[absolute path to the model directory (e.g. model)]" \
--dspath "[absolute path to the dataset directory (e.g. dataset)]"

Argument descriptions:

The table below summarizes all the command-line arguments that are specific to this framework:

| Argument name | Description | Value |
| --- | --- | --- |
| --define-bags | Whether to construct pathway group centroids | False |
| --recover-max-bags | Whether to recover the maximum number of pathway groups | False |
| --alpha | A hyper-parameter for controlling pathway group centroids | 16 |
| --top-k | Top k pathways to be considered for each pathway group | 90 |
| --v-cos | A cutoff threshold for cosine similarity | 0.1 |
| --vocab-name | A dictionary file representing pathway indices as keys and MetaCyc pathway ids as values | vocab.pkl |
| --bag-phi-name | The filename for the distribution of pathways over pathway groups | phi.npz |
| --bag-sigma-name | The filename for the pathway group covariance | sigma.npz |
| --hin-name | The heterogeneous information network file | hin.pkl |
| --features-name | The features corresponding to ECs and pathways | pathway2vec_embeddings.npz |
| --y-name | The input file name to be provided for preprocessing | [DATANAME]_y.pkl |
| --file-name | The names of input preprocessed files (without extension) | [FILENAME] |
| --mdpath | The path to the supplementary files | [Outside source code] |
| --dspath | The path to the datasets | [Outside source code] |

Output:

The output files generated after running the command are:

With the --define-bags flag only

See Example 1:

| File | Description |
| --- | --- |
| [FILENAME]_centroid.npz | A matrix file (stored in the "dspath" location) representing groups centroids. |
| [FILENAME]_exp_phi_trim.npz | A matrix file (stored in the "dspath" location) representing the distribution of pathways over groups. The rows correspond to the group indices and columns represent the pathway indices. |
| [FILENAME]_features.npz | A matrix file (stored in the "dspath" location) representing pathway features. It contains 2526 pathway indices shown in the first column and their 128 features in the remaining columns. |
| [FILENAME]_rho.npz | A matrix file (stored in the "dspath" location) representing the group-group correlations. |
| [FILENAME]_idxvocab.pkl | A file (stored in the "dspath" location) representing the pathway indices. |
| [FILENAME]_labels_distr_idx.pkl | A file (stored in the "dspath" location) representing information about indices of pathways and their associated pathway groups indices. |
| [FILENAME]_pathway_group.pkl | A binary matrix file (stored in the "dspath" location) indicating the association of groups indices in rows to pathway indices in columns. |

With the --recover-max-bags flag only

See Example 2:

| File | Description |
| --- | --- |
| [FILENAME]_B.pkl | A +1/-1 matrix file (stored in the "dspath" location) indicating the presence/absence of group indices for each organism (or multi-organisms). Each row in this matrix represents an organism or multi-organisms and columns indicate pathway group indices. |
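
To make the +1/-1 encoding concrete, here is a small sketch that reads such a matrix row by row. The data is a toy stand-in written as nested lists; the container type actually stored in [FILENAME]_B.pkl is not described on this page, so loading the real file is deliberately left out, and `present_groups` is a hypothetical helper name.

```python
# Toy stand-in: 2 organisms (rows) x 4 pathway groups (columns).
# +1 marks a group as present, -1 as absent, as described above.
B = [
    [+1, -1, -1, +1],
    [-1, -1, +1, -1],
]


def present_groups(matrix):
    """For each row, list the column indices whose entry is +1."""
    return [[j for j, value in enumerate(row) if value == 1] for row in matrix]


print(present_groups(B))  # [[0, 3], [2]]
```

The first organism is associated with groups 0 and 3, the second only with group 2.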

With both the --define-bags and --recover-max-bags flags

See Example 3. You will get the combined results of running both flags separately.

Examples

Example 1:

To construct groups, execute the following command:

python main.py --define-bags --alpha 16 --top-k 90 --hin-name "hin.pkl" --vocab-name "vocab.pkl" --bag-phi-name "phi.npz" --bag-sigma-name "sigma.npz" --features-name "pathway2vec_embeddings.npz" --file-name "temp"

After running the command, the output will be saved to the dataset/ folder. All the files described in the table above are generated.

reMap_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── temp_centroid.npz
        │       ├── temp_exp_phi_trim.npz
        │       ├── temp_features.npz
        │       ├── temp_rho.npz
        │       ├── temp_labels_distr_idx.pkl
        │       ├── temp_idxvocab.pkl
        │       ├── temp_pathway_group.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── reMap/
                └── ...
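
After a run, the presence of the expected outputs can be verified programmatically. The sketch below is an illustration under stated assumptions: `missing_outputs` is a hypothetical helper, and the suffixes are copied from the output tables on this page rather than read from the reMap source.

```python
from pathlib import Path

# Suffixes copied from the --define-bags output table.
DEFINE_BAGS_SUFFIXES = (
    "_centroid.npz",
    "_exp_phi_trim.npz",
    "_features.npz",
    "_rho.npz",
    "_idxvocab.pkl",
    "_labels_distr_idx.pkl",
    "_pathway_group.pkl",
)
# Suffix copied from the --recover-max-bags output table.
RECOVER_MAX_BAGS_SUFFIXES = ("_B.pkl",)


def missing_outputs(dataset_dir, file_name, suffixes=DEFINE_BAGS_SUFFIXES):
    """Return the expected output files that are absent from dataset_dir."""
    dataset_dir = Path(dataset_dir)
    expected = [file_name + suffix for suffix in suffixes]
    return [name for name in expected if not (dataset_dir / name).is_file()]
```

With the value used in this example, `missing_outputs("../../dataset", "temp")` (the relative path is an assumption) should return an empty list once all seven files have been written.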

Example 2:

To recover the maximum set of groups, execute the following command:

python main.py --recover-max-bags --alpha 16 --v-cos 0.1 --file-name "temp" --y-name "biocyc205_tier23_9255_y.pkl"

After running the command, the output will be saved to the dataset/ folder. The file described in the --recover-max-bags table above is generated.

reMap_materials/
	├── objectset/
        │       └── ...
	├── model/
        │       └── ...
	├── dataset/
        │       ├── temp_B.pkl
        │       └── ...
	├── result/
        │       └── ...
	└── reMap/
                └── ...

Example 3:

To perform both of the above steps in a single run, execute the following command:

python main.py --define-bags --recover-max-bags --alpha 16 --top-k 90 --v-cos 0.1 --hin-name "hin.pkl" --vocab-name "vocab.pkl" --bag-phi-name "phi.npz" --bag-sigma-name "sigma.npz" --features-name "pathway2vec_embeddings.npz" --file-name "temp" --y-name "biocyc205_tier23_9255_y.pkl"

After running the command, the output will be saved to the dataset/ folder. All the files described in Example 1 and Example 2 above are generated.
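
For scripted runs, the three example commands can be assembled from one helper. This is a minimal sketch rather than part of reMap: `build_preprocess_command` is a hypothetical name, it uses only flags and values that appear in the examples above, and it groups the flags the same way Examples 1 and 2 do.

```python
def build_preprocess_command(file_name, y_name=None, define_bags=True,
                             recover_max_bags=True, alpha=16, top_k=90,
                             v_cos=0.1):
    """Assemble the argument list for one of the preprocessing examples."""
    cmd = ["python", "main.py"]
    if define_bags:
        # Flags used in Example 1.
        cmd += [
            "--define-bags",
            "--top-k", str(top_k),
            "--hin-name", "hin.pkl",
            "--vocab-name", "vocab.pkl",
            "--bag-phi-name", "phi.npz",
            "--bag-sigma-name", "sigma.npz",
            "--features-name", "pathway2vec_embeddings.npz",
        ]
    if recover_max_bags:
        # Flags used in Example 2; the examples always pass --y-name here.
        if y_name is None:
            raise ValueError("y_name is required with recover_max_bags")
        cmd += ["--recover-max-bags", "--v-cos", str(v_cos), "--y-name", y_name]
    # Flags shared by all three examples.
    cmd += ["--alpha", str(alpha), "--file-name", file_name]
    return cmd


# Equivalent of Example 3; run it from the src/ folder, for instance with
# subprocess.run(cmd, check=True).
cmd = build_preprocess_command("temp", y_name="biocyc205_tier23_9255_y.pkl")
```

Passing `define_bags=False` or `recover_max_bags=False` yields the equivalent of Example 2 or Example 1, respectively.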
