This is an example of a workflow to process data in Seurat v3. Here we’re using a simple dataset consisting of a single set of cells which we believe should split into subgroups. In this exercise we will:

  • Load in the data

  • Do some basic QC and Filtering

  • Select genes which we believe are going to be informative

  • Perform dimensionality reduction

  • Detect clusters within the data

  • Find genes which define the clusters

  • Examine how robust the clusters and genes are

Setup

We’re going to start with some basic housekeeping. We’re going to load the packages we’re going to use, these will be:

  • Seurat (for general single cell loading and processing)

  • Sleepwalk (for data projection visualisation and exploration)

  • Tidyverse (for non-standard data manipulation and plotting)

  • SCINA (for cell type prediction)

We’re also going to set a nicer theme for ggplot in tidyverse so our graphs look nicer.

library(Seurat)
library(tidyverse)
## Registered S3 method overwritten by 'cli':
##   method     from    
##   print.boxx spatstat
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(sleepwalk)
library(SCINA)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
theme_set(theme_bw(base_size = 14))

Loading Data

We have the cellranger output files already produced. The files we’re working with here are the raw output from the CellRanger pipeline. We didn’t do anything else to them. We can see the files we have to work with. We’re working with the text versions of the count tables, but we could also have loaded the h5 files using the Read10X_h5 function in Seurat. The output would be the same either way.

list.files("../filtered_feature_bc_matrix/")
## [1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"

We can load these as a set, Seurat will know how to combine them into a single object describing the experiment. We use the Read10X function to parse the data and then extract the parsed data from there to form the actual Seurat object.

Read10X("../filtered_feature_bc_matrix/") -> data

CreateSeuratObject(
  counts=data, 
  project="course", 
  min.cells = 3, 
  min.features=200
) -> data

data
## An object of class Seurat 
## 15969 features across 5058 samples within 1 assay 
## Active assay: RNA (15969 features, 0 variable features)

QC

Before we do any analysis it’s really important to do some quality control and filtering to make sure we’re working with good data. Seurat automatically creates two metrics we can use:

  1. nCount_RNA the total number of reads in the dataset

  2. nFeature_RNA the number of observed genes

We can supplement this with other metrics. Here we’re going to look at the number of genes from mitochondrial sources.

Amount of MT genes

Single cell datasets can be filled with large numbers of reads coming from mitochondria. These often indicate a sick cell undergoing apoptosis. We want to check for this. Seurat comes with a load of built-in functions for accessing certain aspects of your data, but you can also dig into the raw data fairly easily.

The main data can be found under @assays$RNA. This is a matrix with genes as rownames and cell barcodes as columns. Technically we’re looking at @assays$RNA@counts which is the matrix of raw counts. This will become important later when we start to calculate normalised expression values to go alongside the counts.

We can find the gene names as the rownames of the @assays$RNA@counts slot of the Seurat object and we identify the mitochondrial genes by their names starting with “MT-”.

grep("^MT-",rownames(data@assays$RNA@counts),value = TRUE)
##  [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6" "MT-CO3" 
##  [8] "MT-ND3"  "MT-ND4L" "MT-ND4"  "MT-ND5"  "MT-ND6"  "MT-CYB"

We can calculate the percentage of the data coming from a set of genes using the PercentageFeatureSet function. We can store the result in the main metadata store for the object by defining a new column called “percent.MT”.

PercentageFeatureSet(data,pattern="^MT-") -> data$percent.MT

head(data$percent.MT)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1 
##           3.289162           4.944290           3.188145           4.441226 
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1 
##           2.803738           3.971005

Percentage of Largest Gene

We can also go into the count matrix and make our own metrics. The data is stored in a “Sparse Matrix” which is more efficient for storing data with a large proportion of unobserved values (such as 10X data). In this example we run apply over the columns (cells) and calculate what percentage of the data comes from the single most observed gene. Again, having a high proportion of your data dominated by a single gene would be a concerning sign. We will also look later at the specific most highly expressed genes.

apply(
  data@assays$RNA@counts,
  2,
  function(x)(100*max(x))/sum(x)
) -> data$Percent.Largest.Gene

head(data$Percent.Largest.Gene)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1 
##           7.359205           4.387187           8.756174           2.968863 
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1 
##          11.682243           8.383234

Plotting QC Metrics

Seurat comes with some convenience methods for plotting out certain types of visualisation, such as the distribution of certain QC metrics. We can view this on both a linear and log scale to see which looks more helpful.

VlnPlot(data, features=c("nCount_RNA","percent.MT"))

VlnPlot(data, features=c("nCount_RNA","percent.MT")) + scale_y_log10()
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.