This is an example of a workflow to process data in Seurat v5. Here we’re using a simple dataset consisting of a single set of cells which we believe should split into subgroups. In this exercise we will:

  • Load in the data

  • Do some basic QC and Filtering

  • Select genes which we believe are going to be informative

  • Perform dimensionality reduction

  • Detect clusters within the data

  • Find genes which define the clusters

  • Examine how robust the clusters and genes are

Setup

We’re going to start with some basic housekeeping. First we’ll load the packages we’re going to use. These will be:

  • Seurat (for general single cell loading and processing)

  • Tidyverse (for non-standard data manipulation and plotting)

  • SCINA (for cell type prediction)

We’re also going to set a nicer default theme for ggplot so our graphs are easier to read.

library(Seurat)
library(tidyverse)
library(SCINA)
theme_set(theme_bw(base_size = 14))

Loading Data

We have the CellRanger output files already produced. The files we’re working with here are the raw output from the CellRanger pipeline; we didn’t do anything else to them. We’re working with the filtered h5 file, which means that we also need to install the hdf5r package in R. This means we can load the data from a single file using the Read10X_h5 function in Seurat.

Read10X_h5("../filtered_feature_bc_matrix.h5") -> data

CreateSeuratObject(
  counts=data, 
  project="course", 
  min.cells = 3, 
  min.features=200
) -> data

data
## An object of class Seurat 
## 17136 features across 5070 samples within 1 assay 
## Active assay: RNA (17136 features, 0 variable features)
##  1 layer present: counts

QC

Before we do any analysis it’s really important to do some quality control and filtering to make sure we’re working with good data. Seurat automatically creates two metrics we can use:

  1. nCount_RNA the total number of reads (or more correctly UMIs) in each cell

  2. nFeature_RNA the number of genes observed in each cell (anything with a nonzero count)

We can supplement this with other metrics which we can calculate ourselves.
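These built-in metrics live in the object’s metadata, so we can take a quick look at them before calculating anything else (a minimal sketch):

# The per-cell metadata can be accessed with data[[]]; passing column names
# returns just those columns.
head(data[[c("nCount_RNA","nFeature_RNA")]])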

Amount of MT genes

Single cell datasets can be filled with large numbers of reads coming from mitochondria. These often indicate a sick cell undergoing apoptosis. We want to check for this. Seurat comes with a load of built-in functions for accessing certain aspects of your data, but you can also dig into the raw data fairly easily.

The main data can be found under @assays$RNA. In Seurat v5 this assay stores its data in layers: the raw counts are in @assays$RNA@layers$counts, a matrix with genes as rows and cell barcodes as columns. This will become important later when we start to calculate normalised expression values, which go into a separate layer alongside the counts.

We can find the gene names as the rownames of the Seurat object itself, and we identify the mitochondrial genes by their names starting with “MT-”. Be aware that in other species the naming of the mitochondrial genes may not be the same, so you’d need to adjust the pattern.
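As a quick sanity check of this structure, we can look at the dimensions of the counts layer and confirm that the gene and cell names line up with the object itself (a minimal sketch):

# The counts layer is a genes x cells sparse matrix. In Seurat v5 the layer
# itself doesn't carry dimnames, but its rows and columns are in the same
# order as the object's rownames (genes) and colnames (cell barcodes).
dim(data@assays$RNA@layers$counts)

head(rownames(data))
head(colnames(data))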

grep("^MT-",rownames(data),value = TRUE)
##  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

We can calculate the percentage of the data coming from a set of genes using the PercentageFeatureSet function. We can store the result in the main metadata store for the object by defining a new column called “percent.MT”.

PercentageFeatureSet(data,pattern="^MT-") -> data$percent.MT

head(data$percent.MT)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1 
##           3.151927           4.809843           3.133159           4.295333 
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1 
##           2.648676           3.776435

Amount of Ribosomal Genes

Ribosomal genes also tend to be very highly represented, and can vary between cell types, so it can be instructive to see how prevalent they are in the data. These are ribosomal protein genes rather than the actual rRNA, so they’re more a measure of the translational activity of the cell rather than the cleanliness of the polyA selection.

grep("^RP[LS]",rownames(data),value = TRUE)
##  [1] "RPL22"       "RPL11"       "RPS6KA1"     "RPS8"        "RPL5"       
##  [6] "RPS27"       "RPS6KC1"     "RPS7"        "RPS27A"      "RPL31"      
## [11] "RPL37A"      "RPL32"       "RPL15"       "RPSA"        "RPL14"      
## [16] "RPL29"       "RPL24"       "RPL22L1"     "RPL39L"      "RPL35A"     
## [21] "RPL9"        "RPL34-AS1"   "RPL34"       "RPS3A"       "RPL37"      
## [26] "RPS23"       "RPS14"       "RPL26L1"     "RPS18"       "RPS10"      
## [31] "RPL10A"      "RPL7L1"      "RPS12"       "RPS6KA2"     "RPS20"      
## [36] "RPL7"        "RPL30"       "RPL8"        "RPS6"        "RPL35"      
## [41] "RPL12"       "RPL7A"       "RPS24"       "RPLP2"       "RPL27A"     
## [46] "RPS13"       "RPS6KA4"     "RPS6KB2"     "RPS6KB2-AS1" "RPS3"       
## [51] "RPS25"       "RPS26"       "RPL41"       "RPL6"        "RPLP0"      
## [56] "RPL21"       "RPS29"       "RPL36AL"     "RPS6KL1"     "RPS6KA5"    
## [61] "RPS27L"      "RPL4"        "RPLP1"       "RPS17"       "RPS2"       
## [66] "RPS15A"      "RPL13"       "RPL26"       "RPL23A"      "RPL23"      
## [71] "RPL19"       "RPL27"       "RPS6KB1"     "RPL38"       "RPL17"      
## [76] "RPS15"       "RPL36"       "RPS28"       "RPL18A"      "RPS16"      
## [81] "RPS19"       "RPL18"       "RPL13A"      "RPS11"       "RPS9"       
## [86] "RPL28"       "RPS5"        "RPS21"       "RPL3"        "RPS19BP1"   
## [91] "RPS6KA3"     "RPS4X"       "RPS6KA6"     "RPL36A"      "RPL39"      
## [96] "RPL10"
PercentageFeatureSet(data,pattern="^RP[LS]") -> data$percent.Ribosomal

head(data$percent.Ribosomal)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1 
##           47.34694           27.49441           35.16101           26.25958 
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1 
##           42.52874           48.79154

Percentage of Largest Gene

We can also go into the count matrix and make our own metrics. The data is stored in a “Sparse Matrix” which is more efficient for storing data with a large proportion of unobserved values (such as 10X data). In this example we run apply over the columns (cells) and calculate what percentage of the data comes from the single most observed gene. Again, having a high proportion of your data dominated by a single gene is a metric which could either give biological context or indicate a technical problem, depending on what the gene is.

When we calculate this we normally find that MALAT1 is the largest gene by some distance - it’s a non-coding nuclear gene expressed at very high levels. This has such a big effect that we’ll measure it separately, and exclude it from the largest-gene calculation here.

We will get:

  • The count for the largest gene per cell

  • The index position of the gene with the largest count

  • The name of the most highly expressed gene per cell

data[rownames(data) != "MALAT1",] -> data.nomalat

apply(
  data.nomalat@assays$RNA@layers$counts,
  2,
  max
) -> data.nomalat$largest_count

apply(
  data.nomalat@assays$RNA@layers$counts,
  2,
  which.max
) -> data.nomalat$largest_index

rownames(data.nomalat)[data.nomalat$largest_index] -> data.nomalat$largest_gene

100 * data.nomalat$largest_count / data.nomalat$nCount_RNA -> data.nomalat$percent.Largest.Gene

data.nomalat$largest_gene -> data$largest_gene
data.nomalat$percent.Largest.Gene -> data$percent.Largest.Gene

rm(data.nomalat)

Seurat QC Plots

Seurat comes with some convenience methods for plotting out certain types of visualisation, such as the distribution of certain QC metrics. We can view this on both a linear and log scale to see which looks more helpful.

VlnPlot(data, layer = "counts", features=c("nCount_RNA","percent.MT", "percent.Ribosomal","percent.Largest.Gene"))

For some metrics it’s better to view on a log scale.

VlnPlot(data, layer="counts", features=c("nCount_RNA","percent.MT", "percent.Largest.Gene"), log = TRUE)

We can also plot metrics against each other to see what the relationship between them is. There is a built-in Seurat plot (FeatureScatter) to do this which is easy to use, but isn’t especially pretty.

FeatureScatter(data,feature1 = "nCount_RNA", feature2 = "percent.Largest.Gene")

ggplot QC plots

Rather than using Seurat’s built-in plots, you also have the option to extract the data yourself and plot it using any of R’s conventional plotting systems.

as_tibble(
  data[[]],
  rownames="Cell.Barcode"
) -> qc.metrics

head(qc.metrics)

We can then plot out with ggplot if we want to do more.

qc.metrics %>%
  arrange(percent.MT) %>%
  ggplot(aes(nCount_RNA,nFeature_RNA,colour=percent.MT)) + 
  geom_point() + 
  scale_color_gradientn(colors=c("black","blue","green2","red","yellow")) +
  ggtitle("Example of plotting QC metrics") +
  geom_hline(yintercept = 750) +
  geom_hline(yintercept = 2000) 

This plot often makes more sense on a log scale too:

qc.metrics %>%
  arrange(percent.MT) %>%
  ggplot(aes(nCount_RNA,nFeature_RNA,colour=percent.MT)) + 
  geom_point(size=0.7) + 
  scale_color_gradientn(colors=c("black","blue","green2","red","yellow")) +
  ggtitle("Example of plotting QC metrics") +
  geom_hline(yintercept = 750) +
  geom_hline(yintercept = 2000) +
  scale_x_log10() + scale_y_log10()

We can see that there are potentially a couple of different populations here with different relationships between the read (UMI) counts and number of genes detected. We can try to quantitate this by calculating a complexity value which just divides genes by UMIs. Higher values indicate that we’re getting shallower coverage of more genes, and lower values mean that we’re seeing fewer genes overall. This can often link to the percent largest gene value from before, but the effect can be more widespread than that.

Plotting complexity

The standard way of calculating this is log10(genes)/log10(counts), however this gives absolute values which are difficult to judge. A possibly better approach is to fit a line through the cloud and then calculate the difference between the observed and expected values.

qc.metrics %>%
  mutate(complexity=log10(nFeature_RNA) / log10(nCount_RNA))  -> qc.metrics

lm(log10(qc.metrics$nFeature_RNA)~log10(qc.metrics$nCount_RNA)) -> complexity.lm

complexity.lm
## 
## Call:
## lm(formula = log10(qc.metrics$nFeature_RNA) ~ log10(qc.metrics$nCount_RNA))
## 
## Coefficients:
##                  (Intercept)  log10(qc.metrics$nCount_RNA)  
##                       0.4626                        0.7362
qc.metrics %>%
  mutate(
    complexity_diff = log10(nFeature_RNA) - ((log10(qc.metrics$nCount_RNA)*complexity.lm$coefficients[2])+complexity.lm$coefficients[1])
  ) -> qc.metrics

Now we can plot this

qc.metrics %>%
  ggplot(aes(x=complexity_diff)) +
  geom_density(fill="yellow")

min(c(max(qc.metrics$complexity_diff),0-min(qc.metrics$complexity_diff))) -> complexity_scale

qc.metrics %>%
  mutate(complexity_diff=replace(complexity_diff,complexity_diff< -0.1,-0.1)) %>%
  ggplot(aes(x=log10(nCount_RNA), y=log10(nFeature_RNA), colour=complexity_diff)) +
  geom_point(size=0.5) +
  geom_abline(slope=complexity.lm$coefficients[2], intercept = complexity.lm$coefficients[1]) +
  scale_colour_gradient2(low="blue2",mid="grey",high="red2")

qc.metrics %>%
  ggplot(aes(x=complexity_diff, y=percent.Largest.Gene)) +
  geom_point()

Examining Largest Genes

Some of the unusual populations in these plots can derive from the activity of a single gene, so we can look into this more closely.

First, let’s see which genes turn up most often as the largest gene in a cell.

qc.metrics %>%
  group_by(largest_gene) %>%
  count() %>%
  arrange(desc(n)) -> largest_gene_list

largest_gene_list

We can see what the big genes are doing in any of the previous plots.

largest_gene_list %>%
  filter(n>140) %>%
  pull(largest_gene) -> largest_genes_to_plot

qc.metrics %>%
  filter(largest_gene %in% largest_genes_to_plot) %>%
  mutate(largest_gene=factor(largest_gene, levels=largest_genes_to_plot)) %>%
  arrange(largest_gene) %>%
  ggplot(aes(x=log10(nCount_RNA), y=log10(nFeature_RNA), colour=largest_gene)) +
  geom_point(size=1) +
  scale_colour_manual(values=c("grey",RColorBrewer::brewer.pal(9,"Set1")))

qc.metrics %>%
  filter(largest_gene %in% largest_genes_to_plot) %>%
  mutate(largest_gene=factor(largest_gene, levels=largest_genes_to_plot)) %>%
  arrange(largest_gene) %>%
  ggplot(aes(x=complexity_diff, y=percent.Largest.Gene, colour=largest_gene)) +
  geom_point()+
  scale_colour_manual(values=c("grey",RColorBrewer::brewer.pal(9,"Set1")))

We have some super-outliers which are being driven by IGKC.

For the remainder, it looks like the lower complexity cells are mostly either mitochondrial, dominated by MT-CO1, or ribosomal, with either RPL10 or RPS18. Let’s project those metrics to see more clearly.

qc.metrics %>%
  arrange(percent.MT) %>%
  ggplot(aes(x=complexity_diff, y=percent.Largest.Gene, colour=percent.MT)) +
  geom_point() +
  scale_colour_gradient(low="grey", high="red2")

qc.metrics %>%
  arrange(percent.Ribosomal) %>%
  ggplot(aes(x=complexity_diff, y=percent.Largest.Gene, colour=percent.Ribosomal)) +
  geom_point() +
  scale_colour_gradient(low="grey", high="red2")

That seems to fit with the rest of the story. It’s maybe not surprising that cells which have a lot of their reads taken by highly active ribosomes or mitochondria show less diversity overall.

Setting QC Cutoffs

In general it’s a good idea to be fairly permissive when filtering your initial data. Depending on the source of your counts and the way they were imported you’ll probably already have removed the cells with very low counts, or the genes represented in only 1 or 2 cells.

Here we’ll set a cutoff on two of the metrics we calculated, but you will need to look at the QC of your own data to help decide. Remember, we will look at QC again after quantitating and clustering the data, so we can always come back and filter more harshly later if we wish.

qc.metrics %>%
  ggplot(aes(percent.MT)) + 
  geom_histogram(binwidth = 0.5, fill="yellow", colour="black") +
  ggtitle("Distribution of Percentage Mitochondrion") +
  geom_vline(xintercept = 10)

qc.metrics %>%
  ggplot(aes(percent.Largest.Gene)) + 
  geom_histogram(binwidth = 0.7, fill="yellow", colour="black") +
  ggtitle("Distribution of Percentage Largest Gene") +
  geom_vline(xintercept = 10)

Filtering

From the QC we can then filter the data to get rid of cells with unusual QC metrics. We’ve set cutoffs based on the plots we made before.

subset(
  data,
  nFeature_RNA>750 & 
    nFeature_RNA < 2000 & 
    percent.MT < 10 & 
    percent.Largest.Gene < 10
) -> data

data
## An object of class Seurat 
## 17136 features across 3939 samples within 1 assay 
## Active assay: RNA (17136 features, 0 variable features)
##  1 layer present: counts

Ideally, after filtering we should re-plot to make sure that the data really does look better, as shown in the sketch below.
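For example, we could quickly re-run one of the violin plots on the filtered object (a sketch; any of the earlier QC plots could be repeated in the same way):

# Re-check the distributions after filtering - the extreme values we
# filtered on should now be gone.
VlnPlot(data, layer = "counts", features=c("nCount_RNA","percent.MT","percent.Largest.Gene"))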

Normalisation, Selection and Scaling

Normalisation

Before we do any analysis with the data we need to normalise the raw counts we currently have to get values which are more comparable between cells.

The default normalisation in Seurat is pretty simple - it scales the counts by the total counts in each cell, multiplies by 10,000 and then log transforms the result.

NormalizeData(data, normalization.method = "LogNormalize") -> data
## Normalizing layer: counts
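To make the arithmetic explicit, here is a rough manual equivalent for a single cell (a sketch only, using the layer paths we’ve already seen; it doesn’t modify the object):

# Raw counts for the first cell, taken straight from the counts layer.
data@assays$RNA@layers$counts[,1] -> raw.counts

# LogNormalize: divide by the cell's total counts, multiply by 10,000,
# then take log(x + 1).
log1p(raw.counts / sum(raw.counts) * 10000) -> manual.norm

# Compare with what NormalizeData stored in the data layer for the same cell.
head(manual.norm[manual.norm > 0])
head(data@assays$RNA@layers$data[,1][manual.norm > 0])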

We can now access the normalised data in data@assays$RNA@layers$data. We can use this to show that we can get a list of the most highly expressed genes overall.

apply(data@assays$RNA@layers$data,1,mean) -> gene.expression

names(gene.expression) <- rownames(data)

sort(gene.expression, decreasing = TRUE) -> gene.expression

head(gene.expression, n=50)
##   MALAT1      B2M   TMSB4X    RPL10    RPL41   MT-CO1   EEF1A1   RPL13A 
## 6.343440 4.957780 4.842431 4.502590 4.499025 4.463847 4.427481 4.423139 
##    RPL13    RPS18    RPLP1    RPS27    RPL21    RPLP2    RPS12    RPL34 
## 4.392785 4.296266 4.292195 4.271100 4.252506 4.146282 4.144840 4.107383 
##    RPL32    RPS14   MT-CO3     RPS2   RPS27A   MT-CO2    RPL26   RPS15A 
## 4.071603 4.066967 4.053748 4.029942 4.025062 4.003151 3.986368 3.983466 
##    RPL28    RPL11    RPS19   RPL23A    RPS3A   RPL18A   RPL27A     RPS6 
## 3.958833 3.942238 3.930690 3.913685 3.881003 3.879557 3.872916 3.857568 
##    RPS23    RPS4X    RPL19   TMSB10     TPT1     RPS3     RPS8    RPL17 
## 3.833777 3.825899 3.818896 3.795400 3.780543 3.777781 3.776495 3.752844 
##    RPL39   MT-ND4    RPL30    RPS24     RPL7    RPS15  MT-ATP6     RPL3 
## 3.749268 3.738643 3.730554 3.715959 3.706176 3.697959 3.679005 3.617746 
##    RPL12    RPS25 
## 3.583404 3.573899

We can already see that there may be some issues to address in this data. MALAT1 is a nuclear-expressed transcript which tends to persist when cells have lysed and the cytoplasm has gone. It is generally highly expressed anyway, but cells with a very high level might indicate a problem.
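If we wanted to keep an eye on this, we could record the percentage of counts coming from MALAT1 per cell in the same way we did for the mitochondrial and ribosomal genes (a sketch; percent.MALAT1 is just a column name we’re inventing for this purpose):

# Percentage of each cell's signal coming from MALAT1 alone.
PercentageFeatureSet(data, features="MALAT1") -> data$percent.MALAT1

head(data$percent.MALAT1)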

We can also see high amounts of ribosomal proteins. Again, these are generally highly expressed but their presence in specific subsets might also be of concern in regards to the accuracy of quantitation in the data.

We can look in various ways at how well the data have been normalised. We can pick out a specific gene:

ggplot(mapping = aes(data["GAPDH",]@assays$RNA@layers$data)) + 
  geom_histogram(binwidth = 0.05, fill="yellow", colour="black") + 
  ggtitle("GAPDH expression")

So even for a so-called housekeeping gene we still see a significant proportion of dropout cells, and expression values which spread over 3 orders of magnitude.

We can also go a bit wider and pick the first 100 cells and look at the distributions of their expression values.

as_tibble(
  data@assays$RNA@layers$data[,1:100]
) %>%
  pivot_longer(
    cols=everything(),
    names_to="cell",
    values_to="expression"
  ) %>%
  ggplot(aes(x=cell, y=expression)) +
  stat_summary(geom="crossbar", fun.data=mean_sdl)

Cell Cycle Scoring

Now that we have quantitated the data we can have a look at whether the cell cycle is having any effect on it. Seurat comes with a set of marker genes for different cell cycle stages which we can use.

cc.genes.updated.2019
## $s.genes
##  [1] "MCM5"     "PCNA"     "TYMS"     "FEN1"     "MCM7"     "MCM4"    
##  [7] "RRM1"     "UNG"      "GINS2"    "MCM6"     "CDCA7"    "DTL"     
## [13] "PRIM1"    "UHRF1"    "CENPU"    "HELLS"    "RFC2"     "POLR1B"  
## [19] "NASP"     "RAD51AP1" "GMNN"     "WDR76"    "SLBP"     "CCNE2"   
## [25] "UBR7"     "POLD3"    "MSH2"     "ATAD2"    "RAD51"    "RRM2"    
## [31] "CDC45"    "CDC6"     "EXO1"     "TIPIN"    "DSCC1"    "BLM"     
## [37] "CASP8AP2" "USP1"     "CLSPN"    "POLA1"    "CHAF1B"   "MRPL36"  
## [43] "E2F8"    
## 
## $g2m.genes
##  [1] "HMGB2"   "CDK1"    "NUSAP1"  "UBE2C"   "BIRC5"   "TPX2"    "TOP2A"  
##  [8] "NDC80"   "CKS2"    "NUF2"    "CKS1B"   "MKI67"   "TMPO"    "CENPF"  
## [15] "TACC3"   "PIMREG"  "SMC4"    "CCNB2"   "CKAP2L"  "CKAP2"   "AURKB"  
## [22] "BUB1"    "KIF11"   "ANP32E"  "TUBB4B"  "GTSE1"   "KIF20B"  "HJURP"  
## [29] "CDCA3"   "JPT1"    "CDC20"   "TTK"     "CDC25C"  "KIF2C"   "RANGAP1"
## [36] "NCAPD2"  "DLGAP5"  "CDCA2"   "CDCA8"   "ECT2"    "KIF23"   "HMMR"   
## [43] "AURKA"   "PSRC1"   "ANLN"    "LBR"     "CKAP5"   "CENPE"   "CTCF"   
## [50] "NEK2"    "G2E3"    "GAS2L3"  "CBX5"    "CENPA"

We can use these to try to predict the cell cycle of each cell.

CellCycleScoring(data, s.features = cc.genes.updated.2019$s.genes, g2m.features = cc.genes.updated.2019$g2m.genes, set.ident = TRUE) -> data
## Warning: The following features are not present in the object: PIMREG, CDC25C,
## CDCA2, ANLN, NEK2, not searching for symbol synonyms

We should now have a set of new metadata columns giving the scores for S and G2M, along with a predicted Phase for each cell.

data[[]]
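Since data[[]] returns the whole metadata table, it can be easier to pull out just the new columns (S.Score, G2M.Score and Phase are the columns CellCycleScoring adds):

head(data[[c("S.Score","G2M.Score","Phase")]])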

We can look at the spread of the cells in different states.

as_tibble(data[[]]) %>%
  ggplot(aes(Phase)) + geom_bar()

as_tibble(data[[]]) %>%
  ggplot(aes(x=S.Score, y=G2M.Score, color=Phase)) + 
  geom_point() +
  coord_cartesian(xlim=c(-0.15,0.15), ylim=c(-0.15,0.15))

Although the tool has made predictions of the stage for each cell, there isn’t a huge separation between the groups it’s picked so we have some hope that this will have a relatively minor influence on the overall expression patterns we see. We’ll pick this up later once we’ve clustered the data and we can see what the content of the different clusters looks like.

Gene Selection

Before going on to do the dimensionality reduction we’re going to do some filtering of genes to remove those which are likely to be uninformative in the overall structure of the data. The main method to do this is to find unusually variable genes - these are calculated in the context of the gene’s expression since lowly expressed genes are more likely to be variable by standard measures.

Seurat provides a method to calculate a normalised intensity for each gene, and can then select the top ‘n’ most variable features. In this case we’re selecting the 500 most variable genes.

FindVariableFeatures(
  data, 
  selection.method = "vst", 
  nfeatures=500
) -> data
## Finding variable features for layer counts

The variability information can be accessed using the HVFInfo method. The names of the variable features can be accessed with VariableFeatures().

as_tibble(HVFInfo(data),rownames = "Gene") -> variance.data

variance.data %>% 
  mutate(hypervariable=Gene %in% VariableFeatures(data)) -> variance.data

head(variance.data, n=10)

We can plot out a graph of the variance vs mean and highlight the selected genes so we can see whether we think we’re likely to capture what we need.

variance.data %>% 
  ggplot(aes(log(mean),log(variance),color=hypervariable)) + 
  geom_point() + 
  scale_color_manual(values=c("black","red"))

Scaling

Before putting the data into PCA for dimensionality reduction we will scale the genes so that they have a mean of 0 and a variance of 1. This is claimed to make the analysis less biased by expression level in the PCA.

ScaleData(data,features=rownames(data)) -> data
## Centering and scaling data matrix

Dimensionality Reduction

Now we’ve got to the stage where we can do the reduction. We’re going to use two methods - PCA and tSNE.

PCA

We can start by actually running the PCA. We will only use the variable features which we previously selected. The PCA will calculate all of our PCs and will also give us a list of the genes which were most highly (and lowly) weighted in the different PCs.

RunPCA(data,features=VariableFeatures(data)) -> data
## PC_ 1 
## Positive:  FCN1, CST3, LYZ, CSTA, MNDA, LST1, CFD, S100A9, IFI30, AIF1 
##     FTL, CTSS, S100A8, TYMP, CD14, TYROBP, VCAN, MS4A6A, S100A12, FOS 
##     PSAP, FGL2, SERPINA1, AP1S2, FTH1, LGALS1, S100A11, CYBB, SAT1, TSPO 
## Negative:  IL32, CD3D, LTB, TRAC, CTSW, CD7, CD3E, CD2, CCL5, GZMA 
##     NKG7, IL7R, KLRK1, TRBC1, TRBC2, CD247, CST7, CD3G, GNLY, KLRB1 
##     BCL11B, PRF1, KLRD1, HOPX, C12orf75, LDHB, KLRG1, TRDC, FGFBP2, CCL4 
## PC_ 2 
## Positive:  NKG7, CTSW, GZMA, CCL5, GNLY, CST7, S100A4, PRF1, KLRD1, IL32 
##     CD7, ANXA1, SRGN, KLRB1, KLRK1, CD247, FGFBP2, ID2, CD3D, CD2 
##     CCL4, GZMH, TRDC, HOPX, TRBC1, KLRG1, CD3E, GZMB, S100A6, ACTB 
## Negative:  CD79A, MS4A1, IGHM, CD79B, BANK1, HLA-DQA1, CD74, HLA-DRA, HLA-DPA1, IGKC 
##     IGHD, HLA-DQB1, VPREB3, HLA-DPB1, HLA-DRB1, SPIB, CD22, LINC00926, RALGPS2, HLA-DRB5 
##     AFF3, HLA-DQA2, FCRLA, LINC02397, CD24, HLA-DMA, ADAM28, TCL1A, BLNK, TCF4 
## PC_ 3 
## Positive:  TRAC, IL7R, LDHB, LEF1, BCL11B, CD3D, CCR7, LTB, CD3E, NOSIP 
##     CD3G, AQP3, RGCC, PASK, LRRN3, CD8B, BEX3, LINC02446, IL32, VIM 
##     TSHZ2, TRBC2, CD2, RTKN2, GPR183, TRBC1, CRIP2, GRAP2, S100A12, BIRC3 
## Negative:  GZMB, CLIC3, GNLY, KLRF1, NKG7, PRF1, FCGR3A, KLRD1, FGFBP2, CST7 
##     SPON2, GZMA, CCL4, HOPX, PLAC8, RHOC, AKR1C3, CTSW, FCER1G, LILRA4 
##     GZMH, IFITM2, IL2RB, PLD4, XCL2, C12orf75, CD160, LRRC26, SERPINF1, CCL5 
## PC_ 4 
## Positive:  CD79B, FCGR3A, MS4A1, CD79A, GNLY, KLRD1, FGFBP2, IGHD, PRF1, IFITM2 
##     NKG7, CST7, BANK1, KLRF1, HOPX, CD22, VPREB3, HLA-DPA1, HLA-DRB1, HLA-DPB1 
##     LINC00926, SPON2, IGHM, CCL4, HLA-DQB1, HLA-DQA1, GZMA, FCER2, ADAM28, GZMH 
## Negative:  PTCRA, GNG11, TUBB1, PPBP, PF4, CAVIN2, LILRA4, SERPINF1, LRRC26, ACRBP 
##     SMIM5, MAP1A, TPM2, TSC22D1, HIST1H2AC, CLEC1B, TMEM40, CLU, MAP3K7CL, MPIG6B 
##     TREML1, CLEC4C, GP9, PPP1R14B, SCT, PLD4, PF4V1, HIST1H3H, BEX3, SPARC 
## PC_ 5 
## Positive:  PPBP, TUBB1, PF4, GNG11, CAVIN2, ACRBP, HIST1H2AC, MAP3K7CL, CLEC1B, TMEM40 
##     TSC22D1, CLU, GP9, TREML1, MPIG6B, SPARC, CA2, AC147651.1, HIST1H3H, PF4V1 
##     CLDN5, TRIM58, ITGA2B, C2orf88, MYL9, CMTM5, RAB27B, NRGN, GP1BB, HBG2 
## Negative:  SERPINF1, LILRA4, LRRC26, TPM2, PLD4, SCT, CLEC4C, LAMP5, PPP1R14B, SMPD3 
##     MYBL2, UGCG, MAP1A, ITM2C, DERL3, LINC00996, LGMN, IL3RA, P2RY14, PRXL2A 
##     MZB1, APP, IRF7, TRAF4, SHD, ZFAT, SMIM5, LDHB, ASIP, AC097375.1

We can use the DimPlot function to plot all of our projections - we just need to tell it which one to use. Here we’re going to just plot the first two PCs from our PCA. As we classified our cells by cell cycle before it will pick this up and colour the clusters by that so we can see if the cell cycle is having a big effect on the clusters we’re picking out.

DimPlot(data,reduction="pca")

We can use the group.by option to colour by any other metadata column. We can also add labels to the plot. Finally we can add a call to the NoLegend() function to suppress the automatic colour legend which is drawn.

DimPlot(data,reduction="pca", group.by = "largest_gene", label = TRUE, label.size = 3) + NoLegend()

We can look at later PCs by passing the dims argument.

DimPlot(data,reduction="pca", dims=c(3,4))

This nicely shows us the power, but also the limitations of PCA in that we can see that not all of the useful information is captured in the first two principal components. The question then becomes how far down the set of PCs do we need to go to capture all of the biologically relevant information.

We can start with a simple plot called the elbow plot, which simply quantitates the amount of variance captured in the different PCs.

ElbowPlot(data)

From this we can see that there are fairly high amounts of information captured in the first 10 PCs, and that maybe we can see some additional information up to around 15 PCs, but beyond that the plot is very flat. Taking somewhere between 10 and 15 PCs should therefore capture what we want to see.

For a more detailed view we can do dimensionality heatmaps. These are plots of PCA weightings for the most highly and lowly weighted genes, shown against the set of cells which are most highly influenced by the PC. The idea is that as long as we’re seeing clear structure in one of these plots then we’re still adding potentially useful information to the analysis.

DimHeatmap(data,dims=1:15, cells=500)

We can see that there is still clear structure right up to PC15, so we can keep all of these PCs, but we probably don’t need to go any further.

tSNE

To try to capture more information in a single 2D plot we’re going to take the first 15 dimensions of the PCA - which were calculated on only the 500 most variable genes - forward into a tSNE projection. We can run this in a very similar way to the PCA, except that we specify the number of dimensions we want to use.

Since tSNE uses a randomised starting position, if we want to be able to reproduce the plot we see then we’ll need to know the random ‘seed’ which was used to create the plot. We can capture the current state of the random number generator (stored in the .Random.seed variable) and report it. This will change every time we run, but at least we will have recorded the value we used.

In our case, because we want everyone to get the same answer I’ve saved the seed from when I prepared this tutorial and we’ll re-use that.

8482 -> saved.seed
set.seed(saved.seed)
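If you were running this on your own data and wanted a fresh seed instead, one option is to pick a random integer, print it so it can be recorded, and then pass it to seed.use below (a sketch, not part of this workflow):

# Pick a seed at random and print it so it can be written down and re-used.
sample(1:1000000, 1) -> my.seed
my.seed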

We are now going to run the tSNE. The one parameter we might need to play around with is the perplexity value (expected number of nearest neighbours). By default this is set (somewhat arbitrarily) to 30. Setting this to a low value will help resolve small clusters, but at the expense of large clusters becoming more diffuse. Setting it to higher values will make the larger clusters more distinct, but may lose smaller clusters.

RunTSNE(
  data,
  dims=1:15,
  seed.use = saved.seed, 
  perplexity=10
) -> data

DimPlot(data,reduction = "tsne", pt.size = 1) + ggtitle("tSNE with Perplexity 10")

RunTSNE(
  data,dims=1:15,
  seed.use = saved.seed, 
  perplexity=200
) -> data

DimPlot(data,reduction = "tsne", pt.size = 1) + ggtitle("tSNE with Perplexity 200")

RunTSNE(
  data,
  dims=1:15,
  seed.use = saved.seed
) -> data

DimPlot(data,reduction = "tsne", pt.size = 1) + ggtitle("tSNE with default Perplexity (30)")

We can see the differences in clustering between the different perplexities - the structures aren’t completely different but the compactness of them and the emphasis on smaller clusters certainly changes. We can also see that there isn’t a huge effect of cell cycle, in that all cell cycle phases are generally represented in all clusters, with maybe one cluster being somewhat depleted for cells in S-phase.

Defining Cell Clusters

At the moment in our PCA and tSNE we can see that there are clusters of cells, but we haven’t tried to identify what these are. We will come to this problem now. We’re going to use a graph based method to detect clusters. This finds the ‘k’ nearest neighbours to each cell and makes this into a graph. It then looks for highly inter-connected subgraphs within the graph and uses these to define clusters.

In the first instance we just define the graph. We can control the number of neighbours used with the k.param argument; the default is 20. As before we use the first 15 dimensions of the PCA to calculate the neighbours.

FindNeighbors(data,dims=1:15) -> data
## Computing nearest neighbor graph
## Computing SNN

Since we only record values for each cell’s nearest neighbours, we get another sparse matrix - in this case the shared nearest neighbour (SNN) graph, which holds similarities rather than distances.

data@graphs$RNA_snn[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 10 column names 'AAACCTGAGAAACCAT-1', 'AAACCTGAGATAGCAT-1', 'AAACCTGAGCGTGAAC-1' ... ]]
##                                                                       
## AAACCTGAGAAACCAT-1 1.0000000 .         . .         . 0.5384615 . . . .
## AAACCTGAGATAGCAT-1 .         1.0000000 . 0.1428571 . .         . . . .
## AAACCTGAGCGTGAAC-1 .         .         1 .         . .         . . . .
## AAACCTGCAACGCACC-1 .         0.1428571 . 1.0000000 . .         . . . .
## AAACCTGCATCGATTG-1 .         .         . .         1 .         . . . .
## AAACCTGTCCAGATCA-1 0.5384615 .         . .         . 1.0000000 . . . .
## AAACGGGAGCGCTTAT-1 .         .         . .         . .         1 . . .
## AAACGGGAGCTACCTA-1 .         .         . .         . .         . 1 . .
## AAACGGGCAACTGCTA-1 .         .         . .         . .         . . 1 .
## AAACGGGCAAGTCTGT-1 .         .         . .         . .         . . . 1

We can then segment the graph using the FindClusters method. The resolution controls how fragmented the graph will be: larger values give more, smaller clusters, and smaller values give fewer, larger clusters.

FindClusters(data,resolution = 0.5) -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3939
## Number of edges: 149206
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9134
## Number of communities: 13
## Elapsed time: 0 seconds
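If you’re not sure which resolution suits your data, FindClusters will accept a vector of resolutions and store each result in its own metadata column so they can be compared (a sketch kept in a separate object so it doesn’t disturb the clustering we use below; the column names assume the default RNA_snn graph name):

# Cluster at several resolutions in one call; each result is stored as its
# own metadata column, e.g. RNA_snn_res.0.2, RNA_snn_res.0.5 and so on.
FindClusters(data, resolution = c(0.2, 0.5, 1.0)) -> data.multires

head(data.multires$RNA_snn_res.0.2)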

The clusters are stored in the “seurat_clusters” metadata annotation so they can be used in any way the previous QC data was used. They will also be picked up automatically when projections are plotted.

head(data$seurat_clusters, n=50)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1 
##                  2                  0                  4                  0 
## AAACCTGCATCGATTG-1 AAACCTGTCCAGATCA-1 AAACGGGAGCGCTTAT-1 AAACGGGAGCTACCTA-1 
##                  1                  2                 12                  0 
## AAACGGGCAACTGCTA-1 AAACGGGCAAGTCTGT-1 AAACGGGCAGACGCAA-1 AAACGGGCAGTCCTTC-1 
##                  0                  0                  4                  5 
## AAACGGGCATGCTAGT-1 AAACGGGGTTACGACT-1 AAACGGGGTTCAGTAC-1 AAAGATGAGAGCTTCT-1 
##                 11                  6                  3                  0 
## AAAGATGAGGAGCGTT-1 AAAGATGCACAACTGT-1 AAAGATGCACACCGCA-1 AAAGATGCAGACAAGC-1 
##                  2                  4                  7                  1 
## AAAGATGCATTAACCG-1 AAAGATGGTCATGCCG-1 AAAGATGTCCTTGACC-1 AAAGCAAAGAAGATTC-1 
##                  2                  4                  4                  5 
## AAAGCAACATTAGGCT-1 AAAGCAAGTTAAGGGC-1 AAAGCAATCAGTCAGT-1 AAAGCAATCCAGTATG-1 
##                  0                  7                  4                  2 
## AAAGCAATCTTGTACT-1 AAAGTAGAGAAACGAG-1 AAAGTAGAGATAGTCA-1 AAAGTAGAGTAGATGT-1 
##                  0                  0                  1                  2 
## AAAGTAGCACATTCGA-1 AAAGTAGCATTGAGCT-1 AAAGTAGCATTTGCTT-1 AAAGTAGGTCTCAACA-1 
##                  2                  5                  3                  4 
## AAAGTAGGTGGGTCAA-1 AAAGTAGTCCAACCAA-1 AAATGCCAGTAGGTGC-1 AAATGCCAGTGAACGC-1 
##                  2                  5                  0                  2 
## AAATGCCCAATGGAAT-1 AAATGCCCACCATGTA-1 AAATGCCCAGCTGTTA-1 AAATGCCTCCACTGGG-1 
##                  2                  3                  3                  6 
## AAATGCCTCCCTAATT-1 AACACGTCAGCTGGCT-1 AACACGTGTAATAGCA-1 AACACGTGTCAACTGT-1 
##                  3                  1                  2                  1 
## AACACGTGTCCGACGT-1 AACACGTGTTCGTCTC-1 
##                  5                  0 
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12

If we go back and plot our PCA we can see the clusters, but we can see that some of the clusters don’t resolve very well in PC1 vs PC2.

DimPlot(data,reduction="pca",label = TRUE)+ggtitle("PC1 vs PC2 with Clusters")

If we start looking further through the PCs we can see that some of the clusters which are overlaid in PC1 start to separate. These differences represent a small proportion of the overall variance but can be important in resolving changes.

In PC4 we get a clear resolution of cluster 9, which was previously conflated with 0 and 8. In PC9 we separate out cluster 10.

DimPlot(data,reduction="pca", dims=c(4,9), label=TRUE)+ggtitle("PC4 vs PC9 with Clusters")

If we look at the same thing with the tSNE plot we can see that all of the information across the 15 PCs used is preserved and we see the overall similarity of the cells.

DimPlot(data,reduction="tsne",pt.size = 1, label = TRUE, label.size = 7)

Examining the properties of the clusters

Now that we have our clusters we can look to see if they are being influenced by any of the QC metrics we calculated earlier. We can see that some of the clusters are skewed in one or more of the metrics we’ve calculated so we will want to take note of this. Some of these skews could be biological in nature, but they could be noise coming from the data.

Number of reads

VlnPlot(data,features="nCount_RNA")

Number of genes

VlnPlot(data,features="nFeature_RNA")

It might be tempting to think that clusters 8, 9, 11 and 12 could be from GEMs where two or more cells were captured since they all have unusually high coverage and diversity. They are also small and tightly clustered away from the main groups of points.
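One quick way to back this up is to count how many cells fall into each cluster:

# Number of cells per cluster - the clusters flagged above should show up
# as small groups here.
table(data$seurat_clusters)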

Percent Mitochondrion

VlnPlot(data,features="percent.MT")

MALAT1

VlnPlot(data,features="MALAT1")

Cell Cycle

data@meta.data %>%
  group_by(seurat_clusters,Phase) %>%
  count() %>%
  group_by(seurat_clusters) %>%
  mutate(percent=100*n/sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x=seurat_clusters,y=percent, fill=Phase)) +
  geom_col() +
  ggtitle("Percentage of cell cycle phases per cluster")

Percent Largest Gene

VlnPlot(data,features="percent.Largest.Gene")

Which largest gene

data[[]] %>%
  group_by(seurat_clusters, largest_gene) %>%
  count() %>%
  arrange(desc(n)) %>%
  group_by(seurat_clusters) %>%
  slice(1:2) %>%
  ungroup() %>%
  arrange(seurat_clusters, desc(n))
data@reductions$tsne@cell.embeddings %>%
  as_tibble() %>%
  add_column(seurat_clusters=data$seurat_clusters, largest_gene=data$largest_gene) %>%
  filter(largest_gene %in% largest_genes_to_plot) %>%
  ggplot(aes(x=tSNE_1, y=tSNE_2, colour=seurat_clusters)) +
  geom_point() +
  facet_wrap(vars(largest_gene))

That’s already quite nice for explaining some of the functionality of the clusters, but there’s more in there than just the behaviour of the most expressed gene, so let’s do a more systematic search for markers.

Finding Markers for each Cluster

Now that we have defined the different clusters we can start to evaluate them. One way to do this will be to identify genes whose expression defines each cluster which has been identified.

Seurat provides the FindMarkers function to identify genes which are specific to a given cluster. This is a somewhat generic function which can run a number of different tests. We are only going to focus on two of these, but you can find the others in the Seurat documentation.

The two tests we are going to use are:

  1. The Wilcoxon rank sum test. This identifies genes which are differentially regulated between two groups of cells. It is a non-parametric test which makes very few assumptions about the behaviour of the data and just looks for genes which have expression which is consistently ranked more highly in one group of cells compared to another.

  2. The ROC test. This is a measure of how specifically a gene can predict membership of two groups. It gives a value between 0.5 (no predictive value) and 1 (perfectly predictive on its own) to say how useful each gene is at predicting. Again this is a non-parametric test which just cares about the ranked expression measures for each gene.

Single Prediction

In the simplest case we can find genes which appear to be upregulated in a specific cluster compared to all cells not in that cluster. The additional min.pct parameter says that the gene must be detected in at least 25% of the cells in either cluster 0 or all of the other cells in order to be tested. This cuts down on testing genes which are effectively unexpressed.

FindMarkers(data,ident.1 = 0, min.pct = 0.25)

We can then use a convenience plotting method VlnPlot to show the expression levels of these genes in the cells in each cluster.

VlnPlot(data,features="VCAN")

We can indeed see that the VCAN gene is more highly expressed in cluster 0 than any of the other clusters, but we can also see that it is also reasonably highly expressed in clusters 9 and 12. These were both clusters we suspected of being multiple cells though.

Multiple Prediction

We can extend the same methodology to make predictions for all of the clusters. Here we’re calling FindMarkers for all of the clusters and combining the results into a single table. We add an additional column which just says which cluster each hit initially came from. This will take a little while to run, but at the end we’ll have predictions for all clusters.

# This loop just runs the FindMarkers function on all of the clusters
lapply(
  levels(data[["seurat_clusters"]][[1]]),
  function(x)FindMarkers(data,ident.1 = x,min.pct = 0.25)
) -> cluster.markers


# This simply adds the cluster number to the results of FindMarkers
sapply(0:(length(cluster.markers)-1),function(x) {
  cluster.markers[[x+1]]$gene <<- rownames(cluster.markers[[x+1]])
  cluster.markers[[x+1]]$cluster <<- x
})
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12
# Finally we collapse the list of hits down to a single table and sort it by FDR to put the most significant ones first
as_tibble(do.call(rbind,cluster.markers)) %>% arrange(p_val_adj) -> cluster.markers
cluster.markers

We can extract from this list the most upregulated gene from each cluster

cluster.markers %>%
  group_by(cluster) %>%
    slice(1) %>%
      pull(gene) -> best.wilcox.gene.per.cluster

best.wilcox.gene.per.cluster
##  [1] "S100A12" "CCR7"    "MS4A1"   "CD8B"    "TRGC1"   "GZMB"    "IGHD"   
##  [8] "IL7R"    "CDKN1C"  "TUBB1"   "XCL1"    "LILRA4"  "ENHO"

We can then plot these out.

VlnPlot(data,features=best.wilcox.gene.per.cluster)

We can see that for some clusters (eg cluster 8, CDKN1C) we really do have a gene which can uniquely predict the cluster, but for many others (eg cluster 7, IL7R) we have a hit which also picks up other clusters (clusters 1, 3 and 4 in this case).

We can try to clean this up for any individual cluster by using the ROC test.

FindMarkers(data,ident.1 = 7, ident.2 = 4, test.use = "roc", only.pos = TRUE)

We want to look at the power value here. A value of 1 is perfectly separating, and a value of 0 is random. Our best positive hit (more expressed in cluster 7) has a power of 0.808. We can see what that looks like.

VlnPlot(data,features="LTB")

That does indeed do a slightly better job at separating cluster 7 from cluster 4, but it also comes up all over the place in other clusters. Let’s look at a hit lower down the scale.

VlnPlot(data,features="TPT1")

This could actually be a better option to use as a marker for this cluster.

Automated Cell Type Annotation

We can use knowledge of cell type marker genes to classify our cells. Lots of systems exist to do this; we’re going to use SCINA. This analysis requires a list of marker genes for each of the cell types we want to find. We’re using a small set distributed with SCINA, but you can make a larger collection relevant to the cell types you’re working with.
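If you wanted to supply your own markers, SCINA just needs a named list of character vectors, one per cell type. The gene sets below are an illustrative sketch rather than a curated reference:

# A hand-built signature list could be passed to SCINA in place of the
# example signatures loaded below.
list(
  t_cells   = c("CD3D", "CD3E", "TRAC", "IL7R"),
  platelets = c("PPBP", "PF4", "TUBB1")
) -> my.signatures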

as.data.frame(GetAssayData(data,layer="data")) -> scina.data
load(system.file('extdata','example_signatures.RData', package = "SCINA"))

signatures
## $cd14_monocytes
##  [1] "AIF1"   "CST3"   "FCN1"   "FTH1"   "FTL"    "GPX1"   "LST1"   "LYZ"   
##  [9] "S100A8" "S100A9" "TYMP"  
## 
## $b_cells
## [1] "CD37"     "CD74"     "CD79A"    "CD79B"    "HLA-DPA1" "HLA-DRA" 
## 
## $cd56_nk
##  [1] "CLIC3"  "CST7"   "FGFBP2" "GNLY"   "GZMA"   "GZMB"   "HOPX"   "IFITM2"
##  [9] "KLRB1"  "NKG7"   "PRF1"
SCINA(
  scina.data,
  signatures, 
  max_iter = 100, 
  convergence_n = 10, 
  convergence_rate = 0.999, 
  sensitivity_cutoff = 0.9, 
  rm_overlap=TRUE, 
  allow_unknown=TRUE
) -> scina.results

data$scina_labels <- scina.results$cell_labels

Now we can plot out the tSNE projection coloured by the automatic annotation.

DimPlot(data,reduction = "tsne", pt.size = 1, label = TRUE, group.by = "scina_labels", label.size = 5)

We can also relate this to the clusters which we automatically detected.

tibble(
  cluster = data$seurat_clusters,
  cell_type = data$scina_labels
) %>%
  group_by(cluster,cell_type) %>%
  count() %>%
  group_by(cluster) %>%
  mutate(
    percent=(100*n)/sum(n)
  ) %>%
  ungroup() %>%
  mutate(
    cluster=paste("Cluster",cluster)
  ) %>%
  ggplot(aes(x="",y=percent, fill=cell_type)) +
  geom_col(width=1) +
  coord_polar("y", start=0) +
  facet_wrap(vars(cluster)) +  
  theme(axis.text.x=element_blank()) +
  xlab(NULL) +
  ylab(NULL)

Colouring by genes

The other view we might want to use once we have picked out some genes we like is to colour the entirety of a projection by that gene. Sometimes we want to do this to look at genes we expect to be diagnostic for different cell subtypes, but sometimes we will use the same method to explore hits which come out of our own analysis.

We can use the FeaturePlot function to colour the tSNE projection with the expression level of a number of different genes.

FeaturePlot(data,features=best.wilcox.gene.per.cluster)

We can see that some of these genes very specifically isolate to their own cluster, but for others we see expression which is more widely spread over a number of clusters. This then leads to a larger problem which is how we evaluate the clustering which has been done in our data.

If you remember, the tSNE is built on pairwise distances between cells, and these in turn come from the PCA distances over (in our case) 15 PCs, which are calculated on (in our case) 500 hypervariable genes. Trying to work our way back from the final plot positions to this more abstract set of relationships is a challenging task.

Export to Loupe

Finally, we can go back from the analysis we’ve done here to the Loupe browser to try to take advantage of both the flexibility of R and the interactivity of Loupe. To do this we’re going to write out the tSNE data from R into a file which we can then open in Loupe.

One oddity is that, depending on how the data were imported, Seurat can strip the “-1” suffix from the end of the cell barcode names. If the suffixes are missing then Loupe won’t recognise the barcodes, so they have to be added back before saving the file. In our data the suffix has been retained (you can see it in the embeddings below), so we only add it where it’s missing.

data@reductions$tsne@cell.embeddings[1:10,]
##                        tSNE_1     tSNE_2
## AAACCTGAGAAACCAT-1 -30.398186   4.353388
## AAACCTGAGATAGCAT-1  11.565635 -26.299060
## AAACCTGAGCGTGAAC-1  17.920886  22.358555
## AAACCTGCAACGCACC-1   9.137368 -23.906778
## AAACCTGCATCGATTG-1  -6.156601  -2.620276
## AAACCTGTCCAGATCA-1 -29.886957   4.725325
## AAACGGGAGCGCTTAT-1  24.422809 -24.654272
## AAACGGGAGCTACCTA-1  22.388476 -11.855792
## AAACGGGCAACTGCTA-1  20.737263 -17.817923
## AAACGGGCAAGTCTGT-1  -1.726548 -21.998830
data@reductions$tsne@cell.embeddings[,] %>% 
  as_tibble(rownames = "barcode") %>% 
    mutate(barcode=if_else(endsWith(barcode,"-1"), barcode, paste0(barcode,"-1"))) %>% 
      write_csv("for_loupe_import.csv")