Description
QA Resources:
- KGX Summary report: https://docs.google.com/spreadsheets/d/1plJAeoZpfiUyUFo6aNTM95g0SLEFv1pASo5v9cgQHa8/edit?gid=618684675#gid=618684675
- original RIG: https://github.com/NCATSTranslator/translator-ingests/blob/main/src/translator_ingest/ingests/ctd/ctd_rig.yaml
- RIG updates PR: CTD RIG updates from QA process #233
- Phase 2 Ingest Survey (with some data examples): https://docs.google.com/spreadsheets/d/1R9z-vywupNrD_3ywuOt_sntcTrNlGmhiUWDXUdkPVpM/edit?
Issues to consider / address for January updated release:
-
Is it right that there are so many associated_with edges with GeneOrGeneProduct as the subject? From the metaedges rollup sheet:
- GeneOrGeneProduct associated_with AnatomicalEntity (15,000)
- GeneOrGeneProduct associated_with BiologicalProcess (182,000)
- GeneOrGeneProduct associated_with DiseaseOrPhenotypicFeature (86,000)
- GeneOrGeneProduct associated_with MolecularActivity (18,000)
- GeneOrGeneProduct associated_with Pathway (34,000)
- . . . .
Based on the RIG, the associated_with predicate is used only for the Chem-Disease file and the Chem-GO/Pathway files, where chemical entities should be the subjects. I know some Chemicals may end up mapping/normalizing to genes (e.g. biological signaling molecules like insulin, interleukins, etc.), but the counts here seem too high to be explained by this.
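One way to check whether normalization alone could explain these counts is to tally the edges per metaedge and then look at how many distinct subject IDs sit behind the gene-subject associated_with edges. The following is only a rough sketch: it assumes KGX-style node and edge dicts already loaded into memory (with `id`, `category`, `subject`, `predicate`, `object` keys), and the helper names `tally_metaedges` and `subjects_for` are made up for illustration, not part of the ingest code.

```python
from collections import Counter


def first_category(node):
    """Return the first listed category of a node, or 'unknown' if absent."""
    cats = node.get("category") or ["unknown"]
    return cats[0] if isinstance(cats, list) else cats


def tally_metaedges(nodes, edges):
    """Count edges per (subject category, predicate, object category)."""
    cat_by_id = {n["id"]: first_category(n) for n in nodes}
    counts = Counter()
    for e in edges:
        key = (
            cat_by_id.get(e["subject"], "unknown"),
            e["predicate"],
            cat_by_id.get(e["object"], "unknown"),
        )
        counts[key] += 1
    return counts


def subjects_for(nodes, edges, subject_category, predicate):
    """Count edges per subject ID for one subject category and predicate."""
    cat_by_id = {n["id"]: first_category(n) for n in nodes}
    return Counter(
        e["subject"]
        for e in edges
        if e["predicate"] == predicate
        and cat_by_id.get(e["subject"]) == subject_category
    )
```

If the gene-subject edges really come from a few chemicals that normalized to genes, `subjects_for` should return a small number of IDs with large counts; thousands of distinct gene IDs would point to a mapping problem in the ingest instead.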
-
Is it right that there are so many correlated_with edges between Chemicals and Diseases/Phenotypes (65,722)? These must come from the exposures file, right? If I recall correctly, there were not that many records there where the study outcome reported an actual correlation. The same question applies to Gene/Product correlations with Disease/Phenotype (1,380).
-
I think we should implement a filter on the Chem-Disease associations from the Chem-Diseases file. There are a ton of these (>3M), and most IMO are not significant or meaningful, being based on only a small handful of shared gene associations. This has the potential to introduce spurious / unreliable edges and inferences into our Results. The data indicates these are only statistical associations in KL/AT, and gives p-values/scores - but the UI and algorithms currently have no way of showing/using this info to convey the lower quality of these edges. Can we see what was done to filter these in Phase 2, and maybe try to replicate/modify this approach?
-
Consider similar filters based on p-values for enrichment-based Chem - GO term / Pathway edges - which are also very abundant (>7M in total)
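As a starting point for discussion, a p-value cutoff for these statistically derived edges could be as simple as the sketch below. It is not what Phase 2 did (that still needs to be looked up); the field name `adjusted_p_value`, the 0.01 threshold, and the choice to keep edges that carry no p-value at all are all assumptions made for illustration.

```python
def keep_edge(edge, max_p=0.01, p_field="adjusted_p_value"):
    """Return True if the edge passes the p-value cutoff.

    Assumption: edges with no p-value (e.g. curated edges) are kept,
    since the filter is only meant for statistically derived edges.
    """
    p = edge.get(p_field)
    if p is None:
        return True
    return float(p) <= max_p


def filter_edges(edges, max_p=0.01, p_field="adjusted_p_value"):
    """Return only the edges that pass keep_edge."""
    return [e for e in edges if keep_edge(e, max_p=max_p, p_field=p_field)]
```

A score-based cutoff for the Chem-Disease file would look the same with the comparison reversed (keep edges at or above a minimum score); the right threshold in either case would need to be chosen by looking at the actual distributions.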
-
Edge types that I don’t see specified in RIG - from the metaedges rollup sheet:
- ChemicalEntity associated_with GeneOrGeneProduct (158)
- Disease/Phenotype associated_with MolecularActivity (6)
- Disease/Phenotype associated_with Pathway (5)
Low numbers on these, but thought I'd point them out. The issue behind this may have broader implications I am not appreciating.
-
The RIG indicates that we create edges using the affects_sensitivity_to predicate (and its subpredicates) from the chem-gene-ixns file - but I don't see edges of this type in the summary report.
-
RIG Issues:
- @EvanDietzMorris please address my #comments in the RIG (starting with "MHB QUESTION:")
- I updated EdgeType objects with a few new fields
Issues to consider longer term:
-
In the metaedges sheet for CTD, there are 32 edges of the type 'ChemicalEntity', 'affects'/'causes', 'CellularComponent' (see row 4 here). When I look at examples of these in the ctd samples sheet here (row 660), the object is a GO term, and the association type is 'biolink:ChemicalEntityToBiologicalProcessAssociation' - suggesting the category should be BiologicalProcess. I think the problematic edges come from the CTD_pheno_term_ixns.tsv file, which reports manually curated phenotypes associated with chemicals - but the phenotypes are represented using a combination of a GO term and a direction. The problem arises in the small number of cases where the GO term is a cellular component and the curator assigns a direction - leading to nonsensical statements like "Chem X affects/causes increase autophagosome".
Possible solutions: We can (1) leave these as is and accept they are not quite right (there are only 32), (2) implement code that removes them, or (3) implement code that removes the 'causes' qualified predicate and the object aspect direction from edges where the object is a cellular component - so the statement becomes "Chemical affects autophagosome" - which is fine (and maybe change the Association type on these as well).
-
It feels strange to me that we use the same 'correlated_with' predicate to represent two very different types of associations in CTD: (1) those based on a statistically significant over-representation of shared gene associations in the CTD data, and (2) those based on actual real-world studies that show chemical exposures correlated with incidence of Disease in a population of people. The latter is a direct, measured correlation in the real world. The former is a more indirect correlation based on shared annotations to genes in a data set. Importantly, there is a subtle difference in their KL/AT (both are statistical_associations, but the latter has a manual_agent and the former a data_analysis_pipeline). But IMO this is not sufficient to advertise the meaningful difference in provenance and utility of these two types of correlations. We should consider additional ways to distinguish these - perhaps distinct KL terms (real_world-statistical-association vs data-based-statistical-association), or perhaps different correlation predicates (e.g. directly_correlates_with, indirectly_correlates_with, or correlates_in_real_world_with and correlates_in_data_with).
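For the CellularComponent edges discussed earlier in this list, the last of the possible solutions (dropping 'causes' and the direction when the object is a cellular component) might look roughly like the sketch below. It assumes edges are plain dicts using the Biolink qualifier slot names `qualified_predicate` and `object_direction_qualifier`, and that the object's categories are available to the caller; the function name `soften_cc_edge` is hypothetical. Whether `object_aspect_qualifier` should also be dropped, and whether the Association type should change, is left open here.

```python
CELLULAR_COMPONENT = "biolink:CellularComponent"

# Assumption: these are the qualifier keys to strip; adjust as needed.
DEFAULT_KEYS_TO_DROP = ("qualified_predicate", "object_direction_qualifier")


def soften_cc_edge(edge, object_categories, keys_to_drop=DEFAULT_KEYS_TO_DROP):
    """Return a copy of the edge without 'causes'/direction qualifiers
    when the object is a cellular component; otherwise return it unchanged."""
    if CELLULAR_COMPONENT not in object_categories:
        return edge
    cleaned = dict(edge)  # shallow copy so the input edge is not mutated
    for key in keys_to_drop:
        cleaned.pop(key, None)
    return cleaned
```

With the qualifiers removed, the edge reads as a plain "Chemical affects autophagosome" statement, as described above.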
Questions: