This is an example of a workflow to process data in Seurat v3. Here we’re using a simple dataset consisting of a single set of cells which we believe should split into subgroups. In this exercise we will:

  • Load in the data

  • Do some basic QC and Filtering

  • Select genes which we believe are going to be informative

  • Perform dimensionality reduction

  • Detect clusters within the data

  • Find genes which define the clusters

  • Examine how robust the clusters and genes are


We’re going to start with some basic housekeeping. We’re going to load the packages we’re going to use, these will be:

  • Seurat (for general single cell loading and processing)

  • Sleepwalk (for data projection visualisation and exploration)

  • Tidyverse (for non-standard data manipulation and plotting)

  • SCINA (for cell type prediction)

We’re also going to set a nicer theme for ggplot in tidyverse so our graphs look nicer.

theme_set(theme_bw(base_size = 14))

Loading Data

We have the cellranger output files already produced. The files we’re working with here are the raw output from the CellRanger pipeline. We didn’t do anything else to them. We can see the files we have to work with. We’re working with the filtered h5 file, which means that we have to also install the hdf5 package in R. This means we can load the data from a single file using the Read10x_h5 function in Seurat.

Read10X_h5("../filtered_feature_bc_matrix.h5") -> data

  min.cells = 3, 
) -> data

## An object of class Seurat 
## 17136 features across 5070 samples within 1 assay 
## Active assay: RNA (17136 features, 0 variable features)


Before we do any analysis it’s really important to do some quality control and filtering to make sure we’re working with good data. Seurat automatically creates two metrics we can use:

  1. nCount_RNA the total number of reads (or more correctly UMIs) in the dataset

  2. nFeature_RNA the number of observed genes (anything with a nonzero count)

We can supplement this with other metrics which we can calculate ourselves.

Amount of MT genes

Single cell datasets can be filled with large numbers of reads coming from mitochondria. These often indicate a sick cell undergoing apoptosis. We want to check for this. Seurat comes with a load of built-in functions for accessing certain aspects of your data, but you can also dig into the raw data fairly easily.

The main data can be found under @assays$RNA. This is a matrix with genes as rownames and cell barcodes as columns. Technically we’re looking at @assays$RNA@counts which is the matrix of raw counts. This will become important later when we start to calculate normalised expression values to go alongside the counts.

We can find the gene names as the rownames of the @assays$RNA@counts slot of the Seurat object and we identify the mitochondrial genes by their names starting with “MT-”. Be aware that in other species the naming of the mitochondrial genes may not be the same so you’d need to adjust the pattern.

grep("^MT-",rownames(data@assays$RNA@counts),value = TRUE)
##  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

We can calculate the percentage of the data coming from a set of genes using the PercentageFeatureSet function. We can store the result in the main metadata store for the object by defining a new column called “percent.MT”.

PercentageFeatureSet(data,pattern="^MT-") -> data$percent.MT

##           3.151927           4.809843           3.133159           4.295333 
##           2.648676           3.776435

Amount of Ribosomal Genes

Ribosomal genes also tend to be very highly represented, and can vary between cell types, so it can be instructive to see how prevalent they are in the data. These are ribosomal protein genes rather than the actual rRNA, so they’re more a measure of the translational activity of the cell rather than the cleanliness of the polyA selection.

grep("^RP[LS]",rownames(data@assays$RNA@counts),value = TRUE)
##  [1] "RPL22"       "RPL11"       "RPS6KA1"     "RPS8"        "RPL5"       
##  [6] "RPS27"       "RPS6KC1"     "RPS7"        "RPS27A"      "RPL31"      
## [11] "RPL37A"      "RPL32"       "RPL15"       "RPSA"        "RPL14"      
## [16] "RPL29"       "RPL24"       "RPL22L1"     "RPL39L"      "RPL35A"     
## [21] "RPL9"        "RPL34-AS1"   "RPL34"       "RPS3A"       "RPL37"      
## [26] "RPS23"       "RPS14"       "RPL26L1"     "RPS18"       "RPS10"      
## [31] "RPL10A"      "RPL7L1"      "RPS12"       "RPS6KA2"     "RPS20"      
## [36] "RPL7"        "RPL30"       "RPL8"        "RPS6"        "RPL35"      
## [41] "RPL12"       "RPL7A"       "RPS24"       "RPLP2"       "RPL27A"     
## [46] "RPS13"       "RPS6KA4"     "RPS6KB2"     "RPS6KB2-AS1" "RPS3"       
## [51] "RPS25"       "RPS26"       "RPL41"       "RPL6"        "RPLP0"      
## [56] "RPL21"       "RPS29"       "RPL36AL"     "RPS6KL1"     "RPS6KA5"    
## [61] "RPS27L"      "RPL4"        "RPLP1"       "RPS17"       "RPS2"       
## [66] "RPS15A"      "RPL13"       "RPL26"       "RPL23A"      "RPL23"      
## [71] "RPL19"       "RPL27"       "RPS6KB1"     "RPL38"       "RPL17"      
## [76] "RPS15"       "RPL36"       "RPS28"       "RPL18A"      "RPS16"      
## [81] "RPS19"       "RPL18"       "RPL13A"      "RPS11"       "RPS9"       
## [86] "RPL28"       "RPS5"        "RPS21"       "RPL3"        "RPS19BP1"   
## [91] "RPS6KA3"     "RPS4X"       "RPS6KA6"     "RPL36A"      "RPL39"      
## [96] "RPL10"
PercentageFeatureSet(data,pattern="^RP[LS]") -> data$percent.Ribosomal

##           47.34694           27.49441           35.16101           26.25958 
##           42.52874           48.79154

Percentage of Largest Gene

We can also go into the count matrix and make our own metrics. The data is stored in a “Sparse Matrix” which is more efficient for storing data with a large proportion of unobserved values (such as 10X data). In this example we run apply over the columns (cells) and calculate what percentage of the data comes from the single most observed gene. Again, having a high proportion of your data dominated by a single gene is a metric which could either give biological context or indicate a technical problem, depending on what the gene is.

When we calculate this we normally find that MALAT1 is normally the largest gene by some distance - it’s a non-coding nuclear gene expressed at very high levels. This has such a big effect that we’ll measure it separately, and exclude it from our analysis here.

We will get:

  • The count for the largest gene per cell

  • The index position of the gene with the largest count

  • The name of the most highly expressed gene per cell

data[rownames(data) != "MALAT1",] -> data.nomalat

) -> data.nomalat$largest_count

) -> data.nomalat$largest_index

rownames(data.nomalat)[data.nomalat$largest_index] -> data.nomalat$largest_gene

100 * data.nomalat$largest_count / data.nomalat$nCount_RNA -> data.nomalat$percent.Largest.Gene

data.nomalat$largest_gene -> data$largest_gene
data.nomalat$percent.Largest.Gene -> data$percent.Largest.Gene


Seurat QC Plots

Seurat comes with some convenience methods for plotting out certain types of visualisation, such as the distribution of certain QC metrics. We can view this on both a linear and log scale to see which looks more helpful.

VlnPlot(data, features=c("nCount_RNA","percent.MT", "percent.Ribosomal","percent.Largest.Gene"))

For some metrics it’s better to view on a log scale.

VlnPlot(data, features=c("nCount_RNA","percent.MT", "percent.Largest.Gene")) + scale_y_log10()
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.