Peptidoforms parsed from MaxQuant output are not always valid

Hi!

I am trying to parse a few different MaxQuant output files, and the modified sequences extracted from them do not end up as valid ProForma. Here is one row from one file:

```tsv
Raw file	Scan number	Scan index	Sequence	Length	Missed cleavages	Modifications	Modified sequence	Oxidation (M) Probabilities	Oxidation (M) Score Diffs	Oxidation (M)	Proteins	Charge	Fragmentation	Mass analyzer	Type	Scan event number
01974c_BA1-TUM_missing_first_1_01_01-3xHCD-1h-R4	18859	16929	ACVINGMQLK	10	0	Oxidation (M)	_ACVINGM(ox)QLK_	ACVINGM(1)QLK	ACVINGM(101.9)QLK	1	TUM_missing_first_1	2	HCD	FTMS	MULTI-MSMS	31	0	575.29138	1148.5682	0.17725	1.7827292	24.19	0.00036581	101.9	100.71	101.9	1	1	0.812871	0.009392453	0.1333722	18828	6096732	0.426561	-2	0.06135368	y1;y2;y3;y4;y5;y6;y7;y8;y9;y1-NH3;y6-NH3;y7-NH3;y8-NH3;y8(2+);a2;b2;b3;b4;b5	71809.2;29973.1;24011.2;32854.1;190928.3;556904.4;805166.8;685423.2;105967.8;11586.8;26761.8;29306.5;32482.7;10282.6;114158.7;652840.1;435104.2;53264.8;28631	-0.0004923382;-0.0005195443;-0.003007707;0.0006837579;0.0004915272;-0.0003559902;-0.0005144462;0.0002259294;-0.00348233;0.0003633455;-0.00334429;-0.002593298;-0.006689147;0.003100681;4.57087E-05;-6.530397E-05;-0.0001010637;-0.0003278859;-0.002078173	-3.34666;-1.996731;-7.746661;1.277359;0.8298453;-0.5039816;-0.6278023;0.2459745;-3.228739;2.79312;-4.851494;-3.231865;-7.420119;6.744212;0.2239743;-0.2813915;-0.305196;-0.7381029;-3.722506	147.113296508789;260.197387695313;388.258453369141;535.290161132813;592.311817087127;706.355592051727;819.439814488134;918.507488028666;1078.54184448957;130.085891723633;689.33203125;802.415344238281;901.487854003906;459.75439453125;204.080078125;232.075103759766;331.143553435689;444.227844238281;558.272521972656	19	0.5166685	0.2289157	None	Unknown		101.8962;1.189599;0.2601814	ACVINGMQLK;LKDSEGSGTAGK;DAHKSEVAHR	_ACVINGM(ox)QLK_;_LKDSEGSGTAGK_;_DAHKSEVAHR_	66	8	7	7	15	1
```

And here is a row from another file:

```tsv
Raw file	Scan number	Scan index	Sequence	Length	Missed cleavages	Modifications	Modified sequence	Oxidation (M) Probabilities	Phospho (STY) Probabilities	Oxidation (M) Score diffs	Phospho (STY) Score diffs	Acetyl (Protein N-term)	Oxidation (M)
OXPAL230121_44	14613	7897	AAAEGEMK	8	0	Oxidation (M)	_AAAEGEM(Oxidation (M))K_	AAAEGEM(1)K		AAAEGEM(79)K		0	1	0	P0A9B2	gapA	Glyceraldehyde-3-phosphate dehydrogenase A	2	HCD	FTMS	MULTI-MSMS	1	0.0	411.68674	821.35892	0.31799	0.00013091	-0.57361634	25.967	0.0047701	79.116	68.3	79.116	1.0	1	0	0	0	14612	8211736.5	0.0651129635157789	-8	0.0703334808349609		y1;y2;y4;y5;y6;y7;y5-H2O;y6-H2O;y1-NH3;a2;b2;b3	162904.734375;75995.5625;159223.859375;106600.2265625;201834.5625;48688.8359375;47573.99609375;7619.5693359375;65347.02734375;367253.65625;314543.28125;40487.2734375	2.646063904876428E-05;-0.0001748173209534798;0.00041258822591316857;-0.0005543576077116086;-0.00011734497638826724;0.00025578379938906437;-7.163136046983709E-05;-0.0024110601625579875;-6.3900626571467E-05;3.3939142966232794E-05;3.7367395322007724E-05;-3.99617186985779E-06	0.17986635464753814;-0.5943167934958867;0.8591796057282154;-0.9098936188837109;-0.17249205021037203;0.34044188222478894;-0.12115356235845015;-3.6405240676221244;-0.4912171170462424;0.2949010231855326;0.2611616737682475;-0.018663355086892004	147.11277770996094;294.148378216221;480.2118476304741;609.2554076725077;680.2920844476763;751.3288251067005;591.2443602599604;662.2838134765625;130.08631896972656;115.08655548095703;143.0814666748047;214.11862182617188	12	0.303405304480312	0.0923076923076923		Unknown		79.11602402068965;10.816500709941389;10.816500709941389	AAAEGEMK;ALNDMDK;SGDEWTK	_AAAEGEM(Oxidation (M))K_;_ALNDM(Oxidation (M))DK_;_SGDEWTK_				177	495	7	7	133	639
```

So when reading the PSMs with `psm_utils`, you get the following peptidoforms: `Peptidoform('ACVINGM[ox]QLK/2')` and `Peptidoform('AAAEGEM[Oxidation (M)]K/2')`, respectively.

If you then try to calculate masses, neither gives the correct result. The first one resolves `ox` as carboxymethyl, because the last-ditch resolution attempt in Pyteomics is currently a very permissive Unimod search; the second one just raises an exception. Both files have a `Modifications` column that gives the modification in the same form: `Oxidation (M)`. It looks like if we remove the site, we can use the name and have a much better chance of getting consistent ProForma. However, I'm not sure how many other kinds of MaxQuant tables are out there.
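To make the idea concrete, here is a minimal sketch in plain Python. It is not the `psm_utils` reader, and all function names (`strip_site`, `resolve_label`, `to_proforma`) are hypothetical. It assumes that the `Modifications` column is a comma-separated list of names with an optional leading count, that short labels such as `ox` are lowercase prefixes of the full name, and that a label placed before the first residue is N-terminal. None of this has been checked against other MaxQuant versions.

```python
import re


def strip_site(name):
    """Turn '2 Oxidation (M)' into 'Oxidation'; names without a site are unchanged."""
    name = re.sub(r"^\d+\s+", "", name.strip())  # drop a leading count, if any
    return re.sub(r"\s*\([^()]*\)$", "", name)   # drop one trailing "(site)" group


def resolve_label(label, modifications):
    """Map a label from 'Modified sequence' to a site-free name from 'Modifications'."""
    names = [strip_site(m) for m in modifications.split(",") if m.strip()]
    bare = strip_site(label)
    if not bare:
        raise ValueError("Empty modification label")
    if bare in names:
        return bare
    # Assumption: short labels ('ox') are lowercase prefixes of the full name.
    matches = sorted({n for n in names if n.lower().startswith(bare.lower())})
    if len(matches) == 1:
        return matches[0]
    raise ValueError(f"Cannot resolve {label!r} using {modifications!r}")


def to_proforma(modified_sequence, modifications):
    """Convert e.g. '_AAAEGEM(Oxidation (M))K_' to 'AAAEGEM[Oxidation]K'."""
    seq = modified_sequence.strip("_")
    out = []
    seen_residue = False
    i = 0
    while i < len(seq):
        if seq[i] != "(":
            out.append(seq[i])
            seen_residue = True
            i += 1
            continue
        # Find the matching ')' while allowing nested parentheses.
        depth, j = 0, i
        while j < len(seq):
            if seq[j] == "(":
                depth += 1
            elif seq[j] == ")":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if depth != 0:
            raise ValueError(f"Unbalanced parentheses in {modified_sequence!r}")
        name = resolve_label(seq[i + 1:j], modifications)
        # Assumption: a label before the first residue is an N-terminal modification.
        out.append(f"[{name}]" if seen_residue else f"[{name}]-")
        i = j + 1
    return "".join(out)


print(to_proforma("_ACVINGM(ox)QLK_", "Oxidation (M)"))
print(to_proforma("_AAAEGEM(Oxidation (M))K_", "Oxidation (M)"))
```

With the two rows above, both calls should produce a peptidoform carrying `[Oxidation]`, which can then be looked up by name instead of going through the permissive fallback search. An ambiguous or unknown label raises a `ValueError` rather than guessing.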




Peptidoforms parsed from MaxQuant output are not always valid #138
