rPTMDetermine provides a fully automated methodology for the validation, site localization and retrieval of post-translational modification (PTM) identifications from the database search results of tandem mass spectrometry (MS/MS) data.
rPTMDetermine is a tool for automated validation, site localization and retrieval
of PTM identifications from protein sequence database search of tandem mass spectrometry
proteomics data. For a specified PTM, rptmdetermine_validate.py can be run to validate
the identifications from database search, resolving issues with the application of
global FDR control to sets of PTM identifications. This process includes site localization
and, optionally, correction of falsely assigned deamidation.
After construction of the machine learning model in this process, the model can
be used via rptmdetermine_retrieve.py to retrieve PTM identifications missed
during protein sequence database search.
rPTMDetermine has been most extensively tested using search results from ProteinPilot,
but other database search engines are also supported; see the
search engine configuration option.
Installation is currently a manual process; we intend to publish the package to PyPI in the near future to simplify installation.
rPTMDetermine is written using Python 3 and should be compatible with most
operating systems. The package has been tested on:
- Windows 10
- macOS 10.15
Because rPTMDetermine includes C/C++ extensions, installation requires a
C++11-compatible compiler on your machine.
- Install Python 3 (>= version 3.6).
- Get the latest release and unzip rPTMDetermine version 1.0.
- Navigate to the unzipped rPTMValidation directory and execute pip install -r requirements.txt to install dependency packages.
- From the rPTMValidation directory, execute python setup.py install to compile the C/C++ extensions and install the rPTMDetermine library, along with the scripts rptmdetermine_validate.py and rptmdetermine_retrieve.py.
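The two prerequisites above (Python >= 3.6 and a C++ compiler) can be checked before running the install commands. The following is a minimal sketch and not part of rPTMDetermine; the function names and the list of compiler executables are assumptions made for illustration.

```python
import shutil
import sys

# Minimum Python version stated in the installation steps.
MIN_PYTHON = (3, 6)

# ASSUMPTION: common compiler executable names; this list is illustrative
# and is not taken from the rPTMDetermine project.
CANDIDATE_COMPILERS = ("g++", "clang++", "c++", "cl")


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter meets the documented minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def find_compilers(candidates=CANDIDATE_COMPILERS):
    """Return the candidate compiler executables that are found on PATH."""
    return [name for name in candidates if shutil.which(name) is not None]


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Compilers found on PATH:", find_compilers() or "none")
```

An empty compiler list does not prove that no compiler is installed; it only means that none of the listed names is on the current PATH.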
rPTMDetermine ships with two scripts: rptmdetermine_validate.py and
rptmdetermine_retrieve.py. Their behaviour is customized using a JSON
configuration file with the options described below.
The required and optional configuration options are detailed below. Those labelled as "Required - [SCRIPT]" are required only for the specified script.
search_engine
- Description: The database search engine used to generate the results files.
- Type: string, one of the following available options:
  - ProteinPilot
  - Mascot
  - Comet
  - XTandem
  - TPP
  - MSGFPlus
  - Percolator
  - PercolatorText

modification
- Description: The modification to be validated, using its Unimod name.
- Type: string.

target_residues
- Description: A list of amino acid residues targeted by the modification and to be validated.
- Type: array of strings (single characters).

target_database
- Description: The path to the target protein sequence database used during database search.
- Type: string.

sim_threshold
- Description: The threshold similarity score for validation.
- Type: number.

model_file
- Description: The path to the validation model.csv file from rptmdetermine_validate.py.
- Type: string.

unmod_model_file
- Description: The path to the validation unmod_model.csv file from rptmdetermine_validate.py.
- Type: string.

validated_ids_file
- Description: The path to the validation results file from rptmdetermine_validate.py.
- Type: string.
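
The three preceding options all point at files written by rptmdetermine_validate.py, so it can be useful to confirm that they exist before running rptmdetermine_retrieve.py. The sketch below is an illustration only: the function name is hypothetical, and the keys model_file, unmod_model_file and validated_ids_file are taken from the sample configuration.

```python
from pathlib import Path

# Configuration keys that point at files written by rptmdetermine_validate.py
# (key names as they appear in the sample configuration).
VALIDATION_OUTPUT_KEYS = ("model_file", "unmod_model_file", "validated_ids_file")


def missing_validation_outputs(config):
    """Return the option names whose path is unset or not a file on disk."""
    missing = []
    for key in VALIDATION_OUTPUT_KEYS:
        path = config.get(key)
        if not path or not Path(path).is_file():
            missing.append(key)
    return missing
```

An empty return value means all three files were found; any names returned identify the outputs that still need to be produced by the validation run.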

data_sets
- Description: A dictionary/map of options for each configured data set. See Data Set Configuration Options for the available options.
- Type: object.

enzyme
- Description: The enzyme used to cleave the proteins during sample preparation.
- Type: string.
- Default: "Trypsin".

fixed_residues
- Description: Fixed modifications to be applied, in the form of a dictionary/map of residue/terminus to modification (Unimod name).
- Type: object.
- Default: {} (no fixed modifications applied).

output_dir
- Description: The directory to which to write results files.
- Type: string.
- Default: "modification_target_residues"

correct_deamidation
- Description: Whether to apply rPTMDetermine's deamidation correction algorithm to attempt to correct for non-monoisotopic precursor selection.
- Type: boolean.
- Default: false.

site_localization_threshold
- Description: The probability threshold for successful localization.
- Type: number.
- Default: 0.99.

retrieval_tolerance
- Description: The m/z tolerance for candidate matches during retrieval.
- Type: number.
- Default: 0.05.

exclude_features
- Description: A set of features to exclude from the validation model.
- Type: array.
- Default: [].
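
The documented defaults can be summarised as a small merge step. This is a sketch of how a configuration might be completed, not rPTMDetermine's own loader: the function name is hypothetical, the key names come from the sample configuration, and the handling of the output_dir default is an assumption (the modification name joined to the target residues with an underscore, which would give "Nitro_Y" for the sample configuration).

```python
import copy

# Search engines listed in the configuration options.
SEARCH_ENGINES = {
    "ProteinPilot", "Mascot", "Comet", "XTandem",
    "TPP", "MSGFPlus", "Percolator", "PercolatorText",
}

# Defaults as documented for the optional settings.
DOCUMENTED_DEFAULTS = {
    "enzyme": "Trypsin",
    "fixed_residues": {},
    "correct_deamidation": False,
    "site_localization_threshold": 0.99,
    "retrieval_tolerance": 0.05,
    "exclude_features": [],
}


def with_defaults(config):
    """Return a copy of config with the documented defaults filled in."""
    engine = config.get("search_engine")
    if engine not in SEARCH_ENGINES:
        raise ValueError(f"Unsupported search_engine: {engine!r}")

    # Deep copy so the mutable defaults are never shared between configs.
    merged = copy.deepcopy(DOCUMENTED_DEFAULTS)
    merged.update(config)

    if "output_dir" not in merged:
        # ASSUMPTION: the default "modification_target_residues" is read as
        # "<modification>_<residues>", e.g. "Nitro_Y".
        merged["output_dir"] = "{}_{}".format(
            merged["modification"], "".join(merged["target_residues"])
        )
    return merged
```

Values supplied in the configuration always take precedence over the defaults, since they are applied after the defaults are copied.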
The data_sets field of the configuration file must be an object mapping unique
data set identifiers to the configuration for each data set. See
Sample Configuration for examples.

data_dir
- Description: The directory within which the database search results and spectra are located.
- Type: string.

results
- Description: The name of the database search results file, located within data_dir.
- Type: string.

spectra_files
- Description: The names of the raw mass spectra files for the results, located within data_dir.
- Type: array.

confidence
- Description: For use with ProteinPilot search results only. The confidence score for the desired FDR cut-off.
- Type: number.
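
Because the results file and the spectra files are both named relative to data_dir, the full paths for each data set can be derived by joining them. The sketch below illustrates that relationship; the function name is hypothetical and the key names (data_sets, data_dir, results, spectra_files) are those used in the sample configuration.

```python
from pathlib import Path


def data_set_files(config):
    """Map each data set identifier to its results path and spectra paths."""
    resolved = {}
    for set_id, options in config["data_sets"].items():
        data_dir = Path(options["data_dir"])
        resolved[set_id] = {
            # Both entries are file names located within data_dir.
            "results": data_dir / options["results"],
            "spectra_files": [data_dir / name for name in options["spectra_files"]],
        }
    return resolved
```

Checking each returned path with Path.is_file() before a run is one way to catch a mistyped file name early.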

```json
{
    "search_engine": "ProteinPilot",
    "modification": "Nitro",
    "target_residues": [
        "Y"
    ],
    "enzyme": "Trypsin",
    "target_database": "Nitro_Y/target.fasta",
    "fixed_residues": {
        "nterm": "iTRAQ8plex",
        "K": "iTRAQ8plex",
        "C": "Carbamidomethyl"
    },
    "output_dir": "Nitro_Y",
    "correct_deamidation": true,
    "sim_threshold": 0.42,
    "model_file": "Nitro_Y/Nitro_Y_model.csv",
    "unmod_model_file": "Nitro_Y/Nitro_Y_unmod_model.csv",
    "validated_ids_file": "Nitro_Y/Nitro_Y_results.csv",
    "retrieval_tolerance": 0.05,
    "site_localization_threshold": 0.99,
    "exclude_features": [
        "Charge",
        "PepMass",
        "ErrPepMass"
    ],
    "data_sets": {
        "I19": {
            "data_dir": "RawData/I19",
            "confidence": 90.2,
            "results": "I19_PeptideSummary.txt",
            "spectra_files": [
                "I19_MGFPeaklist.mgf"
            ]
        },
        "I08": {
            "data_dir": "RawData/I08",
            "confidence": 87.7,
            "results": "I08_PeptideSummary.txt",
            "spectra_files": [
                "I08_MGFPeaklist.mgf"
            ]
        }
    }
}
```

rPTMDetermine is released under the GPL-3.0 license.
Dong NP, Spencer DM, Quan Q, Le Blanc JCY, Feng JW, Li MZ, Siu KWM, Chu IK. rPTMDetermine: A Fully Automated Methodology for Endogenous Tyrosine Nitration Validation, Site-Localization, and Beyond. Anal. Chem. 2020, 92, 15, 10768–10776.