Metadata contains incorrect values for library_preparation_protocol.library_construction_method #13

@NoopDog

Description

The Library Construction Approach facet in the data browser has several terms that are either incorrectly labeled or insufficiently specific, cluttering up the list.

For example, for the 10x family of library construction approaches, the browser lists:

  • 10X 3' v1 sequencing
  • 10x 3' v2
  • 10X 3' v2 sequencing
  • 10x 3' v3 sequencing
  • 10X 3' v3 sequencing
  • 10X 5' v2 sequencing
  • 10X Ig enrichment
  • 10X TCR enrichment
  • 10X v2 sequencing
  • 10x v3 sequencing

See the 20210401_dcp4-Library-Preparation-Protocols Spreadsheet for a report with the full list of library preparation protocol documents used in the metadata.

The above list contains several classes of errors that should be fixed and may require changes to validation or ingest/wrangling SOP to prevent them from happening again.
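One way to surface the casing variants in the list above is to group labels by a normalized key. This is a minimal sketch, not existing DCP tooling: the function name `find_label_variants` and the normalization rule (case-folding plus whitespace collapsing) are assumptions made for illustration.

```python
from collections import defaultdict


def find_label_variants(labels):
    """Group labels that differ only by letter case or spacing."""
    groups = defaultdict(set)
    for label in labels:
        # Normalize: ignore case and collapse runs of whitespace.
        key = " ".join(label.casefold().split())
        groups[key].add(label)
    # Keep only the keys that have more than one distinct spelling.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# The facet values listed in this issue.
facet_labels = [
    "10X 3' v1 sequencing",
    "10x 3' v2",
    "10X 3' v2 sequencing",
    "10x 3' v3 sequencing",
    "10X 3' v3 sequencing",
    "10X 5' v2 sequencing",
    "10X Ig enrichment",
    "10X TCR enrichment",
    "10X v2 sequencing",
    "10x v3 sequencing",
]

for key, variants in find_label_variants(facet_labels).items():
    print(key, "->", variants)
```

A check like this only catches casing and spacing variants, such as the two spellings of 10x 3' v3 sequencing. Truncated labels like 10x 3' v2 and insufficiently specific terms like 10X v2 sequencing would still need a comparison against the ontology itself.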

Note that in addition to TDR snapshots and Azul indexes, the incorrect ontology terms are also likely present in the embedded metadata of the DCP generated matrices. We will need to validate the DCP generated matrices and, if necessary, come up with an efficient approach for updating their metadata.

Expected Outcome

Using the correct and most specific ontology terms available, we should be able to trim the above list to:

  • 10X 3' v1 sequencing
  • 10X 3' v2 sequencing
  • 10x 3' v3 sequencing
  • 10X 5' v2 sequencing
  • 10X Ig enrichment
  • 10X TCR enrichment

Note that since 10X Ig enrichment and 10X TCR enrichment are subclasses of 10X 5' v2 sequencing, we may be able to eliminate 10X 5' v2 sequencing as well.

Background

library_preparation_protocol.library_construction_method is defined to have a graph restriction: Subclasses of OBI:0000711 from obo:efo.

See EBI OLS EFO / OBI_0000711 for the ontology terms we use to define this field.

The value of library_preparation_protocol.library_construction_method is a library_construction_ontology entity, which defines the following fields:

| Field | Description |
| --- | --- |
| library_construction_ontology.ontology | An ontology term identifier in the form prefix:accession. For example, "EFO:0009310" or "EFO:0008931". |
| library_construction_ontology.ontology_label | (string) The preferred label for the ontology term referred to in the ontology field. This may differ from the user-supplied value in the text field. For example, "10X v2 sequencing" or "Smart-seq2". |
| library_construction_ontology.text | (string) The name of a library construction approach being used. For example, "10X v2 sequencing" or "Smart-seq2". |

When Azul indexes this field, it uses ontology_label if present and text if not. If neither is present, it falls back to ontology (the term identifier).
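The fallback order described above can be sketched as follows. This is not Azul's actual code: the function name `facet_value` is hypothetical, and treating a missing key, `None`, and an empty string all as "not present" is an assumption.

```python
def facet_value(term):
    """Pick the value shown for a library_construction_ontology entity.

    Preference order: ontology_label, then text, then ontology.
    Assumption: a missing key, None, or "" all count as "not present".
    """
    for field in ("ontology_label", "text", "ontology"):
        value = term.get(field)
        if value:
            return value
    return None


# The label wins when it is present, even if the text differs.
print(facet_value({
    "ontology": "EFO:0009310",
    "ontology_label": "10X v2 sequencing",
    "text": "10x v2",
}))
# Without a label, the user-supplied text is used.
print(facet_value({"ontology": "EFO:0009310", "text": "10X v2 sequencing"}))
# With neither, only the term identifier remains.
print(facet_value({"ontology": "EFO:0009310"}))
```

Because the label takes precedence, a misspelled or mismatched ontology_label shows up in the facet even when the text field is correct.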

Error Types

Looking at the spreadsheet above, it appears there are several classes of problems to be addressed:

| Type | Description | Example |
| --- | --- | --- |
| 1 | Incorrect ontology_label | Using DroNc-Seq instead of DroNc-seq, or 10x 3'v2 instead of 10X 3' v2 sequencing |
| 2 | Using a general ontology identifier when a more specific term is available | Using 10X v2 sequencing (EFO:0009310) instead of a more specific term that specifies the end_bias, such as EFO:0009899 |
| 3 | Mismatch of ontology_label and ontology_term | Label is 10X 3' v2 sequencing and text is 10X 5' v2 sequencing (Row 66) |

We may also have internal consistency errors that show up with further validation, for example, where the end_bias does not match the ontology term.
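To make the three error types concrete, here is a minimal sketch of how one entity could be classified. Everything in it is illustrative: `REFERENCE_LABELS` and `GENERAL_TERMS` are hypothetical stand-ins for a real ontology lookup, the label given for EFO:0009899 is an assumption, and type 3 is interpreted as a label/text disagreement, following the example in the table.

```python
# Hypothetical reference data; a real check would query the ontology instead.
REFERENCE_LABELS = {
    "EFO:0009310": "10X v2 sequencing",
    "EFO:0009899": "10X 3' v2 sequencing",  # assumed label for this identifier
}
# Terms assumed to be too general because a more specific subclass exists.
GENERAL_TERMS = {"EFO:0009310"}


def classify(term):
    """Return the error types (1, 2, 3) that apply to one entity."""
    errors = []
    ontology = term.get("ontology")
    label = term.get("ontology_label")
    text = term.get("text")

    # Type 1: label is missing or is not exactly the preferred label.
    expected = REFERENCE_LABELS.get(ontology)
    if expected is not None and label != expected:
        errors.append(1)
    # Type 2: a general term was used where a more specific one exists.
    if ontology in GENERAL_TERMS:
        errors.append(2)
    # Type 3: label and user-supplied text disagree (ignoring case).
    if label and text and label.casefold() != text.casefold():
        errors.append(3)
    return errors


print(classify({"ontology": "EFO:0009899",
                "ontology_label": "10x 3'v2",
                "text": "10x 3'v2"}))
print(classify({"ontology": "EFO:0009310",
                "ontology_label": "10X v2 sequencing",
                "text": "10X v2 sequencing"}))
print(classify({"ontology": "EFO:0009899",
                "ontology_label": "10X 3' v2 sequencing",
                "text": "10X 5' v2 sequencing"}))
```

Since text is user-supplied, a type 3 hit is better treated as a prompt for manual review than as a definite error. An end_bias consistency check could be added in the same style once the expected end_bias per term is known.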

Possible Discussion Points

  1. What is the best way to find, report, track, and fix these kinds of errors and create a work queue for resolving them?

  2. Where might we add validation to prevent incorrect ontology terms and labels?

  3. What validations are required, and how might they be specified and implemented? For example:

    1. How can we specify when non-leaf nodes should be disallowed as ontology terms? For example, how could we specify that 10x 5’ v2 sequencing is allowed, but 10x v2 sequencing is not?
    2. What is the purpose of the text field when the ontology label is provided? Should we be concerned when there is an apparent mismatch between the ontology label and the text?
  4. Should we more aggressively use hcao to add terms where they are missing in the core ontologies? For example, to prevent "nulls" in the ontology and ontology text fields.

  5. Can/should we fix the incorrect metadata that has made it into DCP generated matrices?
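As a starting point for discussion point 3.1, here is one possible shape for such a rule. A pure "leaf nodes only" rule would not work, because 10X 5' v2 sequencing has subclasses of its own yet should remain allowed. The sketch below instead pairs the graph restriction with an explicit exclusion list. The `EXCLUDED` set is a hypothetical schema addition, and the toy hierarchy is an assumption based on the relationships described in this issue, keyed by label for readability.

```python
# Toy hierarchy (parent -> children); an assumption, not taken from the ontology.
CHILDREN = {
    "OBI:0000711": ["10X v2 sequencing"],
    "10X v2 sequencing": ["10X 3' v2 sequencing", "10X 5' v2 sequencing"],
    "10X 5' v2 sequencing": ["10X Ig enrichment", "10X TCR enrichment"],
}
# Hypothetical exclusion list: terms inside the restriction that are too general.
EXCLUDED = {"OBI:0000711", "10X v2 sequencing"}


def descendants(root):
    """Return root plus every term reachable from it in CHILDREN."""
    seen = set()
    stack = [root]
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(CHILDREN.get(term, []))
    return seen


def is_allowed(term, root="OBI:0000711"):
    """Allow a term if it is under the restriction root and not excluded."""
    return term in descendants(root) and term not in EXCLUDED


for candidate in ("10X 5' v2 sequencing", "10X v2 sequencing", "10X Ig enrichment"):
    print(candidate, "->", is_allowed(candidate))
```

With this rule, 10X 5' v2 sequencing and its subclasses pass while 10X v2 sequencing is rejected. The trade-off is that the exclusion list has to be maintained by hand as the ontology grows.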

Notes

The query for the above spreadsheet is listed below. The query could be modified to look for similar errors in other ontologized fields.

-- List each library preparation protocol with its ontology fields and end_bias,
-- together with the project(s) whose links reference it.
SELECT
  protocol_project.project_id,
  library_preparation_protocol_id,
  JSON_EXTRACT_SCALAR(content,
    "$.library_construction_method.ontology") AS ontology_id,
  JSON_EXTRACT_SCALAR(content,
    "$.library_construction_method.ontology_label") AS ontology_label,
  JSON_EXTRACT_SCALAR(content,
    "$.library_construction_method.text") AS text,
  JSON_EXTRACT_SCALAR(content,
    "$.end_bias") AS end_bias
FROM
  `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.library_preparation_protocol` AS library_preparation_protocol
FULL JOIN (
  -- Distinct (project, protocol) pairs extracted from the links table.
  SELECT
    DISTINCT *
  FROM (
    SELECT
      project_id,
      JSON_EXTRACT_SCALAR(protocol,
        "$.protocol_type") AS protocol_type,
      JSON_EXTRACT_SCALAR(protocol,
        "$.protocol_id") AS protocol_id
    FROM
      `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.links`
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(content,
          "$.links")) AS process
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(process,
          "$.protocols")) AS protocol ) AS link_protocols
  -- Keep only links that point at library preparation protocols.
  WHERE
    protocol_type = "library_preparation_protocol") AS protocol_project
ON
  protocol_project.protocol_id = library_preparation_protocol.library_preparation_protocol_id
ORDER BY
  ontology_id,
  library_preparation_protocol_id

Labels

spike:3[process] Spike estimate of three points
