Gene List Analysis in R

Author

Simon Andrews

Introduction

This document works through the gene set analysis exercises from the Gene List Analysis course.

Exercise 1: Categorical Gene Set Analysis

Loading packages

We’ll start by loading the packages we’re going to use for the exercise.

library(clusterProfiler)
clusterProfiler v4.14.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/

Please cite:

S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang,
W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize
multiomics data. Nature Protocols. 2024, 19(11):3292-3320

Attaching package: 'clusterProfiler'
The following object is masked from 'package:stats':

    filter
library(enrichplot)
enrichplot v1.26.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/

Please cite:

Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for
reactome pathway analysis and visualization. Molecular BioSystems.
2016, 12(2):477-479
library(org.Mm.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
    table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'
The following object is masked from 'package:clusterProfiler':

    rename
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname

Attaching package: 'IRanges'
The following object is masked from 'package:clusterProfiler':

    slice

Attaching package: 'AnnotationDbi'
The following object is masked from 'package:clusterProfiler':

    select
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ lubridate::%within%() masks IRanges::%within%()
✖ dplyr::collapse()     masks IRanges::collapse()
✖ dplyr::combine()      masks Biobase::combine(), BiocGenerics::combine()
✖ dplyr::desc()         masks IRanges::desc()
✖ tidyr::expand()       masks S4Vectors::expand()
✖ dplyr::filter()       masks clusterProfiler::filter(), stats::filter()
✖ dplyr::first()        masks S4Vectors::first()
✖ dplyr::lag()          masks stats::lag()
✖ ggplot2::Position()   masks BiocGenerics::Position(), base::Position()
✖ purrr::reduce()       masks IRanges::reduce()
✖ dplyr::rename()       masks S4Vectors::rename(), clusterProfiler::rename()
✖ lubridate::second()   masks S4Vectors::second()
✖ lubridate::second<-() masks S4Vectors::second<-()
✖ dplyr::select()       masks AnnotationDbi::select(), clusterProfiler::select()
✖ purrr::simplify()     masks clusterProfiler::simplify()
✖ dplyr::slice()        masks IRanges::slice(), clusterProfiler::slice()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading the data

read_delim("KAT6AB_knockout.txt") -> kat6ab
Rows: 18993 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): Gene, Chromosome, Strand, ID, Description
dbl (6): Start, End, Pvalue, FDR, Log2FC, ShrunkLFC

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Creating the gene lists

For the background we pull the names of all measured genes. For the hit list we take the genes with an FDR below 0.05 and a log2 fold change above 1 (so upregulated more than 2-fold).

We could do a separate analysis of the downregulated genes by filtering for Log2FC < -1 instead.

kat6ab |>
  pull(Gene) -> background_genes

kat6ab |>
  filter(FDR < 0.05) |>
  filter(Log2FC > 1) |>
  pull(Gene) -> up_genes
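As a minimal sketch of the downregulated filter mentioned above, the same pattern can be applied with the sign of the threshold flipped. The `toy` table and its gene names are made up for illustration; on the real data the same `filter()` call would be applied to `kat6ab`.

```r
library(dplyr)

# Hypothetical stand-in for the kat6ab table (values are invented)
toy <- tibble(
  Gene   = c("GeneA", "GeneB", "GeneC", "GeneD"),
  FDR    = c(0.01, 0.20, 0.03, 0.001),
  Log2FC = c(-2.1, -3.0, 1.5, -0.4)
)

# Keep significant genes that are downregulated more than 2-fold
toy |>
  filter(FDR < 0.05, Log2FC < -1) |>
  pull(Gene) -> down_genes

down_genes
```

Only GeneA passes both filters here: GeneB fails the FDR cutoff, GeneC is upregulated, and GeneD changes by less than 2-fold.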

Running the analysis

Here we’re just going to analyse the Biological Process (BP) section of the Gene Ontology. This will take a minute or so to run.

enrichGO(
  up_genes,
  OrgDb = org.Mm.eg.db,
  keyType = "SYMBOL",
  universe = background_genes,
  ont = "BP",
  minGSSize = 10,
  maxGSSize = 100,
  readable = TRUE
) -> enrichgo_results
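To get a feel for what an over-representation test of this kind computes for a single gene set, here is a minimal base R sketch. It assumes a one-sided hypergeometric test and assumes that fold enrichment is the hit ratio divided by the background ratio. All of the counts are hypothetical and are not taken from the KAT6AB data.

```r
# Hypothetical counts for one gene set
n_background <- 10000  # genes in the background (universe)
n_set        <- 100    # background genes annotated to the gene set
n_hits       <- 200    # genes in the hit list
n_overlap    <- 20     # hit-list genes annotated to the gene set

# Probability of seeing n_overlap or more set members in the hit list by chance
p_value <- phyper(
  n_overlap - 1,
  n_set,
  n_background - n_set,
  n_hits,
  lower.tail = FALSE
)

# Ratio of the observed proportion to the proportion expected from the background
fold_enrichment <- (n_overlap / n_hits) / (n_set / n_background)

p_value
fold_enrichment
```

With these numbers we would expect about 2 overlapping genes by chance, so observing 20 gives a 10-fold enrichment and a very small p-value.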

Viewing tabular results

We can embed the results in the document here.

enrichgo_results |>
  as_tibble() |>
  head(n=100)

We can also see how many significant hits we have.

enrichgo_results |>
  filter(p.adjust<0.05) |>
  filter(FoldEnrichment > 2) |>
  nrow()
[1] 302
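The `p.adjust` column holds p-values corrected for multiple testing. No `pAdjustMethod` was set in the call above, so we assume the Benjamini-Hochberg ("BH") default applies. The sketch below shows the same correction in base R on a handful of made-up raw p-values.

```r
# Hypothetical raw p-values for five gene sets
raw_p <- c(0.001, 0.01, 0.02, 0.045, 0.5)

# Benjamini-Hochberg correction, then count the sets passing the 0.05 cutoff
adjusted_p <- p.adjust(raw_p, method = "BH")
n_significant <- sum(adjusted_p < 0.05)

adjusted_p
n_significant
```

Note that the fourth set has a raw p-value below 0.05 but no longer passes after correction.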

Barplot

enrichgo_results |>
  barplot(showCategory = 20) 

Dotplot

enrichgo_results |>
  dotplot(
    showCategory = 20,
    x="Count"
  ) 

Cnetplot

enrichgo_results |>
  cnetplot(
    showCategory=15,
    node_label = "category"
  )

Treeplot

enrichgo_results |>
    pairwise_termsim() |>
    treeplot(
        showCategory=50,
        cluster.params = list(n = 10)
    )

Upsetplot

We’ll try an upsetplot. To make the overlaps easier to see, we first filter the results down to a group of categories we expect to share genes.

enrichgo_results |>
  filter(str_detect(Description,"negative regulation")) |>
  upsetplot(
    n=6
  )

We can see that there is a lot of overlap between these terms.

Custom plot

We can always pull the tibble of results out and use standard ggplot to plot them.

enrichgo_results |>
  as_tibble() |>
  ggplot(aes(x=FoldEnrichment, y=-10*log10(p.adjust), label=Description)) +
  geom_point() +
  ggrepel::geom_text_repel(
    data = . %>% filter(p.adjust<0.001, FoldEnrichment > 5),
    size=3
  )
Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Exercise 2: Quantitative Gene Set Analysis

Making gene lists

We’re going to use the ShrunkLFC column from the data for this analysis. We first need to sort the data by this value, from highest to lowest.

kat6ab |> 
  arrange(desc(ShrunkLFC)) -> kat6ab

We can take a quick look at the distribution of values we have.

kat6ab |>
  mutate(index=1:n()) |>
  ggplot(aes(x=index, y=ShrunkLFC)) +
  geom_line() +
  geom_hline(yintercept = 0)

Now we can pull the ShrunkLFC values into a vector.

kat6ab |> 
  pull(ShrunkLFC) -> kat6ab_slfc

...and we annotate this with the gene names.

kat6ab |>
  pull(Gene) -> names(kat6ab_slfc)

We can look at the first few values to check they look OK.

kat6ab_slfc[1:10]
    Ptx3     Saa3      Hgf  Hsd11b1      Dpt   Prl2c3  Apbb1ip   Prl3b1 
3.513270 3.271450 3.075175 2.987640 2.710258 2.659219 2.565703 2.510138 
     Bgn    Foxf1 
2.504019 2.485355 
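The same kind of named, sorted vector can also be built in base R. This is a minimal sketch with made-up values; `toy_lfc` and its gene names are hypothetical. The final check confirms the vector is in decreasing order, which is the order a ranked GSEA input should be in.

```r
# Hypothetical shrunken log fold changes, named by gene
toy_lfc <- setNames(
  c(0.4, 3.1, -1.2, 2.0),
  c("GeneA", "GeneB", "GeneC", "GeneD")
)

# Sort from most upregulated to most downregulated; names travel with the values
toy_ranked <- sort(toy_lfc, decreasing = TRUE)

# Sanity check: reversing a decreasing vector should give an increasing one
stopifnot(!is.unsorted(rev(toy_ranked)))

toy_ranked
```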

Running gseGO

We can now plug this data into gseGO. As before we’re only analysing the Biological Process subset of the Gene Ontology, but we could extend to other gene sets.

gseGO(
  kat6ab_slfc,
  ont="BP",
  OrgDb = "org.Mm.eg.db",
  keyType = "SYMBOL",
  minGSSize = 10,
  maxGSSize = 100
) -> gsego_result
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (2.94% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some of the pathways the P-values were likely overestimated. For
such pathways log2err is set to NA.
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some pathways, in reality P-values are less than 1e-10. You can
set the `eps` argument to zero for better estimation.
leading edge analysis...
done...
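The first warning above is about tied values in the ranked list. A rough way to check for ties before running the analysis is sketched below on a made-up vector. The counting here is our own approximation and may not match exactly how fgsea arrives at its percentage.

```r
# Hypothetical ranked statistics containing some repeated values
toy_stats <- c(2.5, 1.0, 1.0, 0.3, -0.2, -0.2, -0.2, -1.4)

# Count values that repeat an earlier value, as a share of the whole list
n_tied <- sum(duplicated(toy_stats))
pct_tied <- 100 * n_tied / length(toy_stats)

n_tied
pct_tied
```

On the real data the same two lines could be run on `kat6ab_slfc`. The last warning in the output also suggests setting the `eps` argument to zero if more precise estimates of very small p-values are needed.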

View tabular results

gsego_result |>
  as_tibble() |>
  head(n=30)

Most of the top hits show a negative skew in their values. If we want to focus on the positive hits (to match what we did in the categorical analysis) we can filter the results.

gsego_result |>
  as_tibble() |>
  filter(NES > 0) |>
  head(n=30)

You can probably recognise some of the same terms as we saw in the categorical analysis.

GSEA Plots

We can use the built-in plots, which mimic the plots produced by the native version of GSEA.

gsego_result |>
  gseaplot(
    geneSetID = 1,
    title=gsego_result$Description[1]
  )
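The curve in this kind of plot is a running sum which steps up at each gene in the set and down at each gene outside it. The sketch below is a simplified version of that classic GSEA running sum, written for illustration only. It is not the fgsea implementation, it does no normalisation or significance testing, and the `running_es()` helper, gene names and values are all hypothetical.

```r
# Simplified running enrichment score for one gene set.
# ranked_stats: named numeric vector sorted in decreasing order
# gene_set: character vector of gene names in the set
running_es <- function(ranked_stats, gene_set, weight = 1) {
  in_set <- names(ranked_stats) %in% gene_set
  # Genes in the set step up in proportion to their (weighted) statistic
  hit_weights <- abs(ranked_stats)^weight * in_set
  step_up <- hit_weights / sum(hit_weights)
  # Genes outside the set all step down by the same amount
  step_down <- (!in_set) / sum(!in_set)
  cumsum(step_up - step_down)
}

toy_ranked <- c(A = 3, B = 2, C = 1, D = -1, E = -2, F = -3)
toy_curve <- running_es(toy_ranked, gene_set = c("A", "B"))

# The enrichment score is the point of largest deviation from zero
toy_es <- toy_curve[[which.max(abs(toy_curve))]]

toy_curve
toy_es
```

Because both set members sit at the very top of the toy list, the curve peaks after the second gene and then falls steadily back to zero.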

We can try the newer version of this too.

gsego_result |>
  filter(NES>0) -> gsego_result_up

gsego_result_up |>
  gseaplot2(
    geneSetID = 1,
    title=gsego_result_up$Description[1]
  )

gsego_result |>
  mutate(direction=if_else(NES>0, "Positive","Negative")) |>
  group_by(direction) |>
  slice(1:20) |>
  dotplot(showCategory=40, x="NES") +
  facet_grid(rows=vars(direction), scales="free_y") +
  geom_vline(xintercept = 0)

Emapplot

gsego_result |>
  filter(NES > 0) |>
  pairwise_termsim() |>
  emapplot(
    min_edge=0.5,
    showCategory=100,
    node_label="none"
)

Interactive emapplot

gsego_result |>
  filter(NES > 0) |>
  pairwise_termsim() |>
  emapplot(
    min_edge=0.5,
    showCategory=100,
    node_label="category"
  ) -> emap_plot

emap_plot$layers[[2]]$mapping$label = emap_plot$data$label

plotly::ggplotly(emap_plot, width = 800, height=800)
Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
  If you'd like to see this geom implemented,
  Please open an issue with your example code at
  https://github.com/ropensci/plotly/issues