Merge pull request #131 from fhdsl/adding-gene-set-expression-analysis

ehumph · web-flow · commit 5a6ddfd260dc · 2025-08-22T08:48:35.000-04:00
[Mouse GutBrain] adding gene set expression analysis section
diff --git a/module/mouse_gutbrain_de_miniCURE_guide.qmd b/module/mouse_gutbrain_de_miniCURE_guide.qmd
@@ -917,8 +917,8 @@ When R moves onto this second command, it's now checking to see if the geneID fo
 
 ```{r}
 asd_vs_c$gene_label <- case_when(
-  asd_vs_c$gene == "ENSMUSG00000079516" ~ "Reg3a",
-  asd_vs_c$gene == "ENSMUSG00000024440" ~ "Pcdh12",
+  asd_vs_c$GeneID == "ENSMUSG00000079516" ~ "Reg3a",
+  asd_vs_c$GeneID == "ENSMUSG00000024440" ~ "Pcdh12",
   TRUE ~ NA_character_
 )
 ```
@@ -990,3 +990,119 @@ ggplot(data = asd_vs_c, aes(x = log2FoldChange, y = padj_trans, col = diffexpres
   ggtitle("Differential Expression in Mouse Prefrontal Cortex")
 ```
 
+
+## Running a gene set enrichment analysis
+
+We can use a gene set enrichment analysis to explore which biological processes might be affected by the genes that are differentially expressed in our comparison. We can also use this to figure out the types of processes the genes on our gene list are involved in.
+
+If you are using the SciServer environment, you can skip the **Creating the analysis functions** section below and go straight to the **Loading the data** section. The functions you need have already been created and are saved for your use on SciServer.
+
+### Creating the analysis functions
+
+Before we get started with the actual analysis, we'll need to run a couple of commands to create two special functions for the gene set analysis. These functions will allow us to do a complex set of analysis steps in a single line of code.
+
+Before we can create our functions, we need to install a couple of packages. Run the code below.
+
+```{r}
+if (!require("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
+
+BiocManager::install("clusterProfiler")
+BiocManager::install("org.Mm.eg.db")
+```
+
+You may have noticed that the installation commands for these two packages are different from the ones you used previously with the `tidyverse` and `ggrepel` packages. That's because the `clusterProfiler` and `org.Mm.eg.db` packages are saved in a different location (called a repository) than `tidyverse` and `ggrepel`. Both `clusterProfiler` and `org.Mm.eg.db` are part of the Bioconductor Project, which focuses on developing, maintaining, and distributing software specifically for biological and bioinformatics research.
+
+Once these two packages have been installed, make sure you load them.
+
+```{r}
+library("clusterProfiler")
+library("org.Mm.eg.db")
+```
+
+Now we can make our function! The code below makes two functions. A _function_ in R is just a block of organized, reusable code that performs a specific task. The commands you have been using in your analysis (like `library`, `filter`, and `heatmap`, to name a few examples) are actually functions that someone else wrote and saved in packages for everyone to use. 
+
+The functions you're creating also perform special tasks. `runClusterProfiler` is going to test which KEGG pathways are over-represented among the genes in your list. (KEGG pathways are curated sets of genes that work together in a biological process. They are similar in spirit to GO terms, which, remember, are standardized names for molecular functions, biological processes, and cellular locations.) The function `getClusterProfilerGenes` will let you see which genes in your gene list belong to each pathway.
+
+If this sounds confusing, be patient: it should become clearer as you use the functions.
+
+Just run the code below.
+
+```{r}
+runClusterProfiler <- function (x) {
+  ids <- bitr( x$GeneID, "ENSEMBL", "ENTREZID", "org.Mm.eg.db" )  # convert Ensembl gene IDs to Entrez IDs
+  kegg <- enrichKEGG(ids$ENTREZID, "mmu", keyType="ncbi-geneid")  # test for over-represented mouse KEGG pathways
+  kegg@result$Description <- sub( " - Mus musculus \\(house mouse\\)", "", kegg@result$Description )  # trim the species suffix from pathway names
+  return(kegg)
+}
+
+getClusterProfilerGenes <- function (x, i) {
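+  data.frame( x ) %>%  # turn the enrichment result into a table
+  filter( str_detect( Description, i ) ) %>%  # keep pathways whose name matches i
+  pull( geneID ) %>%  # slash-separated Entrez IDs for those pathways
+  strsplit( "/" ) %>%
+  unlist() %>%
+  bitr( fromType="ENTREZID", toType="SYMBOL", OrgDb="org.Mm.eg.db") %>%  # convert Entrez IDs to gene symbols
+  pull( SYMBOL )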
+}
+```
+
+Okay, that's all the prep work we need to do! Now let's move on to the actual analysis.
+
+### Loading the data
+
+For this analysis, we're using the differential expression dataset that you used to make your volcano plot. If it's loaded into your session, great! You don't need to do anything. If you need to load it again, here's the code and links from before. We're loading the data that compares gene expression between the prefrontal cortex and striatum in the control mice.
+
+```{r}
+prefront_vs_striatum_control <- read_csv("https://genomicseducation.org/data/mouse_gutbrain_de_tissuetype_in_controlmice.csv")
+```
+
+Here are the URLs for all the possible comparisons you can examine with this dataset:
+
+**Comparing gene expression between ASD and control mice**
+
+Both brain regions: <https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol.csv>
+
+Prefrontal cortex only: <https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol_in_prefrontalcortex.csv>
+
+Striatum only: <https://genomicseducation.org/data/mouse_gutbrain_de_autismVcontrol_in_striatum.csv>
+
+**Comparing gene expression between prefrontal cortex and striatum**
+
+All mice: <https://genomicseducation.org/data/mouse_gutbrain_de_tissuetype.csv>
+
+Only ASD mice: <https://genomicseducation.org/data/mouse_gutbrain_de_tissuetype_in_ASDmice.csv>
+
+Only control mice: <https://genomicseducation.org/data/mouse_gutbrain_de_tissuetype_in_controlmice.csv>
+
+### Filtering the data based on padj
+
+In a gene set analysis, we're interested in identifying the biological pathways that are likely to be different based on the gene expression differences between our two compared groups. We run these kinds of analyses on the genes with the lowest padj values in the dataset (because these are the genes that we feel confident are differentially expressed).
+
+We use `filter` to create this gene list. You can set the `padj` threshold to whatever value you like.
+
+```{r}
+sig_gene_list <- filter(prefront_vs_striatum_control, padj <= 0.001)
+```
+
+### Visualizing the enriched biological pathways in the gene set
+
+Now we're going to use the special functions you created earlier. First we want to use `runClusterProfiler`. This is the function that finds the KEGG pathways that are over-represented among the genes in your gene list.
+
+```{r}
+gene_clusters <- runClusterProfiler(sig_gene_list)
+```
+
+While R can easily read the output from `runClusterProfiler`, it's a little harder for us unless we make something called a dot plot. 
+
+```{r}
+dotplot(gene_clusters, showCategory=34, title="Gene List, padj <= 0.001", font.size=10, label_format = 50)
+```
+In the dot plot, the enriched pathways are on the y-axis. The x-axis gives us an idea of how many genes in our gene list are found in each pathway (GeneRatio is the proportion of the differentially expressed genes that belong to a specific pathway). The dot plot also colors the dots based on the adjusted p-value of the enrichment test for each pathway (this is not the `padj` column from the differential expression results). In this case, the red dots indicate lower (or more statistically certain) adjusted p-values.
+
+The top pathway on the plot (which is the category with the greatest number of differentially expressed genes in it) is "Pathways of neurodegeneration - multiple diseases." How interesting! Clearly the expression of neurodegeneration-associated genes differs between the two parts of the brain, even in control mice. Let's take a closer look at which genes belong to this pathway using the `getClusterProfilerGenes` function.
+
+```{r}
+getClusterProfilerGenes(gene_clusters, "Pathways of neurodegeneration - multiple diseases")
+```
+There are a lot of genes in this category! At this point, researchers might pick a few genes from this list and go back to the MGI database to research them for future work.
diff --git a/module/mouse_gutbrain_de_student_guide.qmd b/module/mouse_gutbrain_de_student_guide.qmd
@@ -261,9 +261,9 @@ striatum <- read_csv("https://genomicseducation.org/data/mouse_gutbrain_de_autis
 Then we'll filter out the gene in which we're interested from each object. Let's take a look at gene ENSMUSG00000079516, which is the reg3a gene we previously looked up on MGI.
 
 ```{r}
-reg3a_prefrontal <- filter(prefrontal, gene == "ENSMUSG00000079516")
+reg3a_prefrontal <- filter(prefrontal, GeneID == "ENSMUSG00000079516")
 
-reg3a_striatum <- filter(striatum, gene == "ENSMUSG00000079516")
+reg3a_striatum <- filter(striatum, GeneID == "ENSMUSG00000079516")
 ```
 
 Finally, take a look at the differential expression of reg3a in each region.
diff --git a/module/mouse_gutbrain_de_student_only_code.qmd b/module/mouse_gutbrain_de_student_only_code.qmd
@@ -219,9 +219,9 @@ striatum <- read_csv("https://genomicseducation.org/data/mouse_gutbrain_de_autis
 Then we'll filter out the gene in which we're interested from each object. Let's take a look at gene ENSMUSG00000079516, which is the reg3a gene we previously looked up on MGI.
 
 ```{r}
-reg3a_prefrontal <- filter(prefrontal, gene == "ENSMUSG00000079516")
+reg3a_prefrontal <- filter(prefrontal, GeneID == "ENSMUSG00000079516")
 
-reg3a_striatum <- filter(striatum, gene == "ENSMUSG00000079516")
+reg3a_striatum <- filter(striatum, GeneID == "ENSMUSG00000079516")
 ```
 
 Finally, take a look at the differential expression of reg3a in each region.
diff --git a/resources/dictionary.txt b/resources/dictionary.txt
@@ -86,6 +86,7 @@ GeneCards
 geneID
 GeneID
 geneIDs
+GeneRatio
 GH
 Github
 GitHub
@@ -144,6 +145,7 @@ myelin
 myelination
 ncbi
 neuroanatomy
+neurodegeneration
 neurodegenerative
 Neurodegenerative
 neurodevelopmental