Companion R package for a publication to run enrichment analysis across protein determining factors

comp-med/r-prodente

prodente: Protein Determinants Enrichment Testing

Studying the plasma proteome in health and disease is now feasible at scale, but most proteins that can be measured in blood do not have a function in blood. Traditional pathway enrichment tools may therefore give uninformative or even misleading results when applied to lists of differentially expressed proteins from blood. We used data from the UK Biobank to systematically identify, for each of >2,900 proteins, what changes in plasma levels may represent, considering >1,800 factors measured among >40,000 participants. Using machine learning, we identified for each protein the most relevant factors, including diseases and drugs, environmental components, lifestyle factors, and technical variables.

The goal of prodente is to make results from the publication Machine learning-guided deconvolution of plasma protein levels immediately accessible for your study.

This package contains functions to easily test a list of proteins for enrichment of UK Biobank study characteristics. Analogous to pathway enrichment tools, the package delivers evidence for participant (patient) characteristics that may explain some of the differentially expressed proteins. For example, it provides a data-driven way to test for potential confounding by other treatments or by preanalytical variables, such as platelet aggregation.

Please keep in mind that this work is based on the Olink Explore platform. You need to provide a background of all proteins measured in your study, and the Olink Explore platform may not capture all proteins available in your study or may not measure them to the same quality.

Installation

You can install the development version of prodente directly from GitHub:

# install.packages("remotes")
remotes::install_github("comp-med/r-prodente")

Getting Started

The API consists of only a handful of functions that mostly make working with the results table more convenient. Each important results object is directly accessible as package data.

library(prodente)

# There is example data provided.
fasting_study_results <- prodente::fasting_study_results

# It is important to check that protein identifiers of your data are available
# in the background data.
check_protein_overlap(fasting_study_results$protein_id, return_missing = TRUE)
#> character(0)

# The input should be a vector of protein names in lowercase (or a list of
# vectors for enrichment tests across groups)
head(fasting_study_results$protein_id)
#> [1] "pcsk9"    "apoa4"    "lep"      "tmprss15" "fam3b"    "tnr"

# Make sure to check the provided mapping table in case you have missing
# proteins. Maybe they are in the data but formatted slightly differently.
# Additionally, the table provides Olink IDs and UniProt IDs to make matching
# easier.
head(prodente::protein_mapping_table)
#>    mapping_id           panel olink_id    assay uniprot hgnc_symbol
#>        <char>          <char>   <char>   <char>  <char>      <char>
#> 1:   adamts13 Cardiometabolic OID20249 ADAMTS13  Q76LX8    ADAMTS13
#> 2:      alcam Cardiometabolic OID20273    ALCAM  Q13740       ALCAM
#> 3:       blmh Cardiometabolic OID20336     BLMH  Q13867        BLMH
#> 4:        ca4 Cardiometabolic OID20241      CA4  P22748         CA4
#> 5:      casp3 Cardiometabolic OID20305    CASP3  P42574       CASP3
#> 6:      ccl15 Cardiometabolic OID20328    CCL15  Q16663       CCL15

Running Enrichment Tests

Enrichment tests can be stratified by sex or genetic ancestry. Set the test_across parameter of the function enrich_protein_characteristics() to do so.

The results table returned includes the following columns:

population: The sex or ancestry group enrichment was tested in (or ‘All’)
variable: Identifier of the explanatory characteristic tested for enrichment
or: Odds ratio of the enrichment test
pval: Raw p-value from Fisher’s exact test
intersection: Proteins both in the input and explained by the variable
d1 to d4: Counts from the 2×2 contingency table used for Fisher’s exact test
category: Category of the characteristic (e.g. “Diseases”)
id: UK Biobank field ID of the variable (if applicable)
column_name: Column name in UK Biobank format
label: Human-readable label for the characteristic (e.g. “Fasting time”)
released: Logical indicating whether the variable is released in UK Biobank
type: Data type of the characteristic (e.g. “numeric”, “factor”)
category_sort: Used for ordering categories
p_adjust: Bonferroni-adjusted p-value (adjusted within each population)
group: The group name of proteins tested for enrichment (only when testing groups)
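
To illustrate how the or, pval, d1 to d4 and p_adjust columns relate to each other, the following base R sketch runs Fisher's exact test on a 2×2 table built from toy protein sets and then applies a Bonferroni correction. The protein names, the explained_by list and the ordering of the four counts are assumptions made for illustration; the layout of d1 to d4 used inside the package may differ.

```r
# Toy data (hypothetical): background, foreground, and the proteins
# explained by each of two made-up characteristics
background <- paste0("p", 1:20)
foreground <- paste0("p", 1:6)
explained_by <- list(
  factor_a = paste0("p", c(1, 2, 3, 4, 7)),
  factor_b = paste0("p", c(6, 12, 15, 18))
)

test_one <- function(explained) {
  in_fg <- background %in% foreground
  in_ex <- background %in% explained
  # Assumed ordering of the 2x2 contingency table counts
  d1 <- sum(in_fg & in_ex)    # in foreground and explained
  d2 <- sum(in_fg & !in_ex)   # in foreground, not explained
  d3 <- sum(!in_fg & in_ex)   # not in foreground, explained
  d4 <- sum(!in_fg & !in_ex)  # neither
  ft <- fisher.test(matrix(c(d1, d2, d3, d4), nrow = 2))
  data.frame(or = unname(ft$estimate), pval = ft$p.value, d1, d2, d3, d4)
}

toy_results <- do.call(rbind, lapply(explained_by, test_one))
toy_results$variable <- names(explained_by)

# Bonferroni adjustment across all characteristics tested
toy_results$p_adjust <- p.adjust(toy_results$pval, method = "bonferroni")
toy_results
```

In this sketch the four counts always sum to the size of the background, which is why the choice of background directly affects the odds ratio and p-value.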

# The background by default includes all proteins in the protein mapping table
background <- prodente::protein_mapping_table$mapping_id # n=2919

# Run enrichment tests for a list of proteins
results <- enrich_protein_characteristics(
  protein_foreground = fasting_study_results[group == "day_3", protein_id],
  factor_minimum_explained_variance = 0,
  test_across = "sex",
  n_cores = 8,
  protein_background = background
)

# The output is a data.table with enrichment results
head(results[p_adjust < 0.05 & or > 5, .(population, label)])
#>    population        label
#>        <char>       <char>
#> 1:     Female      Digoxin
#> 2:     Female  Simvastatin
#> 3:        All Atorvastatin
#> 4:     Female Atorvastatin
#> 5:        All Fasting time
#> 6:     Female Fasting time

Running Enrichment Tests Across Groups

Enrichment tests can also be performed across several groups. Use the function enrichment_test_across_groups() for this.

The default background includes 2919 unique protein targets measured by Olink Explore. If your data does not cover the full set of proteins, set a custom background of proteins to test against using the protein_background parameter.

# When data for several groups is available, supply a named list to this function
group_data <- list(
  day_1 = fasting_study_results[group == "day_1", protein_id],
  day_2 = fasting_study_results[group == "day_2", protein_id],
  day_3 = fasting_study_results[group == "day_3", protein_id],
  day_10 = fasting_study_results[group == "day_10", protein_id]
)

group_results <- enrichment_test_across_groups(
  protein_foreground_list = group_data, 
  test_across = "sex", 
  n_cores = 8,
  protein_background = prodente::protein_mapping_table$mapping_id # this is the default
)
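
If your results are stored in long format, with one row per protein and group, a named list like the one above can be built with base R's split(). This is a minimal sketch on a hypothetical toy table; long_results and its column names are assumptions, not part of the package.

```r
# Hypothetical long-format results: one row per significant protein and group
long_results <- data.frame(
  protein_id = c("pcsk9", "apoa4", "lep", "pcsk9", "fam3b"),
  group = c("day_1", "day_1", "day_3", "day_3", "day_10"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one character vector per group
group_list <- split(long_results$protein_id, long_results$group)
str(group_list)
```

The list names come from the group column, so they should be unique and descriptive enough to tell the groups apart in the results.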

Plotting Enrichment Results

Enrichment test output can be plotted using the function plot_enrichment_results(). The outline of a dot is black if the result is significant after the chosen adjustment.

# You can plot enrichment results for a single population (within either `sex`
# or `ancestry`).
enrichment_plot <- plot_enrichment_results(
  group_results,
  plot_population = "All",
  pvalue_adjustment = "bonferroni" # default
)

Plotting Explained Variance for a Single Protein

You can plot the top 20 factors that contribute to the variance for a single protein using the function plot_variance_explained(). The top chart shows the cumulative explained variance.

variance_plot <- plot_variance_explained(protein_id = "pcsk9", plot_population = "All")
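
As a rough sketch of what "top factors" and "cumulative explained variance" mean here, the base R snippet below ranks factors by their explained variance and accumulates the values. The factor names and numbers are made up for illustration, and the package may compute and display these quantities differently.

```r
# Hypothetical explained-variance fractions for one protein
explained <- c(age = 0.02, bmi = 0.10, fasting_time = 0.05, statin_use = 0.08)

# Keep at most the 20 factors with the largest explained variance
top_n <- 20
ranked <- head(sort(explained, decreasing = TRUE), top_n)

# Running total of explained variance across the ranked factors
cumulative <- cumsum(ranked)

variance_table <- data.frame(
  factor = names(ranked),
  explained = unname(ranked),
  cumulative = unname(cumulative)
)
variance_table
```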

Accessing the Data

To access the data provided in the package, use the prodente::<DATA_OBJECT> notation.

fasting_study_results <- prodente::fasting_study_results
participant_characteristics_labels <- prodente::participant_characteristics_labels
protein_mapping_table <- prodente::protein_mapping_table
variance_decomposition_background <- prodente::variance_decomposition_background
