Description
The Library Construction Approach facet in the data browser has several terms that are either incorrectly labeled or insufficiently specific, cluttering up the list.
For example, for the 10x family of library construction approaches, the browser lists:
- 10X 3' v1 sequencing
- 10x 3' v2
- 10X 3' v2 sequencing
- 10x 3' v3 sequencing
- 10X 3' v3 sequencing
- 10X 5' v2 sequencing
- 10X Ig enrichment
- 10X TCR enrichment
- 10X v2 sequencing
- 10x v3 sequencing
See the 20210401_dcp4-Library-Preparation-Protocols Spreadsheet for a report with the full list of library preparation protocol documents used in the metadata.
The above list contains several classes of errors that should be fixed. Preventing them from recurring may require changes to validation or to the ingest/wrangling SOP.
Note that in addition to TDR snapshots and Azul indexes, the incorrect ontology terms are also likely in the DCP Generated Matrices' embedded metadata. We will need to validate the DCP generated matrices, and if necessary, come up with an efficient approach for updating the metadata.
Expected Outcome
Using the correct and most specific ontology terms available, we should be able to trim the above list to:
- 10X 3' v1 sequencing
- 10X 3' v2 sequencing
- 10x 3' v3 sequencing
- 10X 5' v2 sequencing
- 10X Ig enrichment
- 10X TCR enrichment
Note that since 10X Ig enrichment and 10X TCR enrichment are subclasses of 10X 5' v2 sequencing, we may be able to eliminate 10X 5' v2 sequencing as well.
Background
library_preparation_protocol.library_construction_method is defined to have a graph restriction: Subclasses of OBI:0000711 from obo:efo.
See EBI OLS EFO / OBI_0000711 for the ontology terms we use to define this field.
The value of library_preparation_protocol.library_construction_method is a library_construction_ontology entity, which defines the following fields:
| Field | Description |
|---|---|
| library_construction_ontology.ontology | An ontology term identifier in the form prefix:accession. For example, "EFO:0009310" or "EFO:0008931". |
| library_construction_ontology.ontology_label | (string) The preferred label for the ontology term referred to in the ontology field. This may differ from the user-supplied value in the text field. For example "10X v2 sequencing" or "Smart-seq2". |
| library_construction_ontology.text | (string) The name of a library construction approach being used. For example "10X v2 sequencing" or "Smart-seq2". |
When Azul indexes this field, it uses ontology_label if present and falls back to text if not. If neither is present, it uses ontology (the term identifier).
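The fallback order can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not Azul's actual code: the function name `display_value` is hypothetical, and treating an empty string the same as a missing value is an assumption.

```python
from typing import Optional


def display_value(method: dict) -> Optional[str]:
    """Pick the value to index for a library_construction_method entity.

    Preference order: ontology_label, then text, then ontology.
    Assumption: empty or missing values are skipped.
    """
    for field in ("ontology_label", "text", "ontology"):
        value = method.get(field)
        if value:
            return value
    return None
```

One consequence of this order is that an incorrect ontology_label always wins over a correct text or ontology value, which is why label errors surface directly in the facet.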
Error Types
Looking at the spreadsheet above, it appears there are several classes of problems to be addressed:
| Type | Description | Example |
|---|---|---|
| 1 | Incorrect ontology_label | e.g. using DroNc-Seq instead of DroNc-seq, 10x 3'v2 instead of 10X 3' v2 sequencing |
| 2 | Using a general ontology identifier when a more specific term is available | e.g. using 10X v2 sequencing (EFO:0009310) instead of a more specific term that specifies the end_bias, such as EFO:0009899 |
| 3 | Mismatch between ontology_label and text | e.g. label is 10X 3' v2 sequencing and text is 10X 5' v2 sequencing (Row 66) |
We may also have internal consistency errors that show up with further validation, for example, where the end_bias does not match the ontology term.
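As a rough sketch of how the three error types might be detected per record, the snippet below checks one library_construction_method entity against small lookup tables. Everything here is an assumption for illustration: the function name `classify`, the `CANONICAL_LABELS` and `TOO_GENERAL` tables (a real check would load labels and hierarchy from the ontology), and the choice to compare label and text case-insensitively.

```python
from typing import List

# Illustrative reference data only; the single entry is taken from the
# examples in this issue. A real check would query the ontology instead.
CANONICAL_LABELS = {"EFO:0009310": "10X v2 sequencing"}

# Terms treated as too general because a more specific end-bias term exists.
TOO_GENERAL = {"EFO:0009310"}


def classify(method: dict) -> List[int]:
    """Return the error types (1, 2, 3) that apply to one entity."""
    errors = []
    term = method.get("ontology")
    label = method.get("ontology_label")
    text = method.get("text")

    # Type 1: label differs from the ontology's preferred label (exact match).
    expected = CANONICAL_LABELS.get(term)
    if expected is not None and label != expected:
        errors.append(1)

    # Type 2: a more specific term is available.
    if term in TOO_GENERAL:
        errors.append(2)

    # Type 3: label and text disagree (assumption: ignoring case).
    if label and text and label.casefold() != text.casefold():
        errors.append(3)

    return errors
```

A record can carry more than one error type at once, so the result is a list rather than a single code.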
Possible Discussion Points
- What is the best way to find, report, track, and fix these kinds of errors and create a work queue for resolving them?
- Where might we add validation to prevent incorrect ontology terms and labels?
- What validations are required, and how might they be specified and implemented? For example:
  - How can we specify when non-leaf nodes should be disallowed as ontology terms? For example, how could we specify that 10x 5' v2 sequencing is allowed, but 10x v2 sequencing is not?
  - What is the purpose of the text field when the ontology label is provided? Should we be concerned when there is an apparent mismatch between the ontology label and the text?
- Should we more aggressively use hcao to add terms where they are missing in the core ontologies? For example, to prevent "nulls" in the ontology and ontology text fields.
- Can/should we fix the incorrect metadata that has made it into DCP generated matrices?
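On the non-leaf question, one possible way to specify the rule is "reject any term that has subclasses, unless it is explicitly allow-listed". The sketch below assumes a hypothetical excerpt of the hierarchy: the 5' v2 subclasses follow the Expected Outcome section, while placing the 3' and 5' v2 terms under 10X v2 sequencing is an assumption. The names `SUBCLASSES`, `ALLOWED_NON_LEAF`, and `is_allowed` are illustrative only.

```python
from typing import Dict, List, Set

# Hypothetical excerpt of the hierarchy: term label -> direct subclasses.
SUBCLASSES: Dict[str, List[str]] = {
    "10X v2 sequencing": ["10X 3' v2 sequencing", "10X 5' v2 sequencing"],
    "10X 5' v2 sequencing": ["10X Ig enrichment", "10X TCR enrichment"],
}

# Non-leaf terms that are still considered specific enough to accept.
ALLOWED_NON_LEAF: Set[str] = {"10X 5' v2 sequencing"}


def is_allowed(term: str) -> bool:
    """Accept leaf terms, plus non-leaf terms on the allow-list."""
    has_subclasses = bool(SUBCLASSES.get(term))
    return not has_subclasses or term in ALLOWED_NON_LEAF
```

With this data, `is_allowed("10X 5' v2 sequencing")` is true and `is_allowed("10X v2 sequencing")` is false. An explicit allow-list keeps the exception visible and reviewable, at the cost of needing maintenance whenever the ontology gains new subclasses.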
Notes
The query for the above spreadsheet is listed below. The query could be modified to look for similar errors in other ontologized fields.
```sql
-- One row per (project, library preparation protocol), with the ontology
-- fields of library_construction_method and the protocol's end_bias.
SELECT
  protocol_project.project_id,
  library_preparation_protocol_id,
  JSON_EXTRACT_SCALAR(content, "$.library_construction_method.ontology") AS ontology_id,
  JSON_EXTRACT_SCALAR(content, "$.library_construction_method.ontology_label") AS ontology_label,
  JSON_EXTRACT_SCALAR(content, "$.library_construction_method.text") AS text,
  JSON_EXTRACT_SCALAR(content, "$.end_bias") AS end_bias
FROM
  `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.library_preparation_protocol` AS library_preparation_protocol
FULL JOIN (
  -- Distinct (project, protocol) pairs extracted from the links table.
  SELECT DISTINCT *
  FROM (
    SELECT
      project_id,
      JSON_EXTRACT_SCALAR(protocol, "$.protocol_type") AS protocol_type,
      JSON_EXTRACT_SCALAR(protocol, "$.protocol_id") AS protocol_id
    FROM
      `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.links`
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(content, "$.links")) AS process
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(process, "$.protocols")) AS protocol
  ) AS protocol_project
  WHERE
    protocol_type = "library_preparation_protocol"
) AS protocol_project
ON
  protocol_project.protocol_id = library_preparation_protocol.library_preparation_protocol_id
ORDER BY
  ontology_id,
  library_preparation_protocol_id
```