This is an example of a workflow to process data in Seurat v5. Here we’re using a simple dataset consisting of a single set of cells which we believe should split into subgroups. In this exercise we will:
Load in the data
Do some basic QC and Filtering
Select genes which we believe are going to be informative
Perform dimensionality reduction
Detect clusters within the data
Find genes which define the clusters
Examine how robust the clusters and genes are
We’re going to start with some basic housekeeping: loading the packages we’re going to use. These will be:
Seurat (for general single cell loading and processing)
Tidyverse (for non-standard data manipulation and plotting)
Azimuth (for cell type prediction)
LoupeR (for exporting cloupe files)
Clustree (for comparing clustering results at different resolutions)
We’re also going to set a cleaner default theme for ggplot so our graphs look nicer.
library(Seurat)
library(tidyverse)
library(Azimuth)
library(loupeR)
library(clustree)
theme_set(theme_bw(base_size = 14))
We have the CellRanger output files already produced. The files we’re
working with here are the raw output from the CellRanger pipeline;
nothing else has been done to them. We’re working with the filtered h5
file, which means that we also need to install the hdf5r package in R.
This lets us load the data from a single file using the
Read10X_h5 function in Seurat.
Read10X_h5("../filtered_feature_bc_matrix.h5") -> data
CreateSeuratObject(
counts=data,
project="course"
) -> data
data
## An object of class Seurat
## 36601 features across 5083 samples within 1 assay
## Active assay: RNA (36601 features, 0 variable features)
## 1 layer present: counts
Before we do any analysis it’s really important to do some quality control and filtering to make sure we’re working with good data. Seurat automatically creates two metrics we can use:
nCount_RNA the total number of reads (or more
correctly UMIs) in each cell
nFeature_RNA the number of observed genes in each cell (anything
with a nonzero count)
We can supplement this with other metrics which we can calculate ourselves.
We can recreate the knee plot we saw in the 10X QC report. This is just the number of UMIs in each cell plotted against that cell’s rank when the cells are sorted by decreasing UMI count.
data[[]] %>%
arrange(desc(nCount_RNA)) %>%
mutate(index=1:n()) %>%
ggplot(aes(x=index,y=nCount_RNA)) +
geom_line() +
scale_x_log10() +
scale_y_log10()
We can see that the default filtering has gone beyond the “knee” point and into the descending part of the distribution, so when we look at the distribution of counts we may want to remove some of the lower coverage cells. At the top end we may also have cells with unusually high coverage which could represent doublets.
Single cell datasets can be filled with large numbers of reads coming from mitochondria. These often indicate a sick cell undergoing apoptosis. We want to check for this. Seurat comes with a load of built-in functions for accessing certain aspects of your data, but you can also dig into the raw data fairly easily.
The raw data can be found with
LayerData(data, layer="counts"). This is a matrix with
genes as rownames and cell barcodes as columns. Later we will create a
new “data” layer with a matrix of normalised counts in it.
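For example, we can peek at the first few genes and cells of the counts matrix:
# Genes are in rows, cell barcodes are in columns
LayerData(data, layer="counts")[1:5,1:3]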
We can find the gene names as the rownames of the Seurat object and we identify the mitochondrial genes by their names starting with “MT-”. Be aware that in other species the naming of the mitochondrial genes may not be the same so you’d need to adjust the pattern.
grep("^MT-",rownames(data),value = TRUE)
## [1] "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" "MT-ATP8" "MT-ATP6" "MT-CO3"
## [8] "MT-ND3" "MT-ND4L" "MT-ND4" "MT-ND5" "MT-ND6" "MT-CYB"
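To make that concrete, in mouse references the mitochondrial gene names are lower case (mt-Nd1, mt-Co1 etc.), so the equivalent pattern there would be:
grep("^mt-",rownames(data),value = TRUE)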
We can calculate the percentage of the data coming from a set of
genes using the PercentageFeatureSet function. We can store
the result in the main metadata store for the object by defining a new
column called “percent.MT”.
PercentageFeatureSet(data,pattern="^MT-") -> data$percent.MT
data[[]] %>%
arrange(desc(percent.MT)) %>%
select(percent.MT) %>%
head()
Ribosomal genes also tend to be very highly represented, and can vary between cell types, so it can be instructive to see how prevalent they are in the data. These are ribosomal protein genes rather than the actual rRNA, so they’re more a measure of the translational activity of the cell rather than the cleanliness of the polyA selection. Secretory cells and other specialised cell types can have large amounts of these genes. In some instances the amount of ribosomal proteins can mask the measurement of other protein types.
grep("^RP[LS]",rownames(data),value = TRUE)
## [1] "RPL22" "RPL11" "RPS6KA1" "RPS8" "RPL5"
## [6] "RPS27" "RPS6KC1" "RPS7" "RPS27A" "RPL31"
## [11] "RPL37A" "RPL32" "RPL15" "RPSA" "RPL14"
## [16] "RPL29" "RPL24" "RPL22L1" "RPL39L" "RPL35A"
## [21] "RPL9" "RPL34-AS1" "RPL34" "RPS3A" "RPL37"
## [26] "RPS23" "RPS14" "RPL26L1" "RPS18" "RPS10-NUDT3"
## [31] "RPS10" "RPL10A" "RPL7L1" "RPS12" "RPS6KA2"
## [36] "RPS6KA2-IT1" "RPS6KA2-AS1" "RPS20" "RPL7" "RPL30"
## [41] "RPL8" "RPS6" "RPL35" "RPL12" "RPL7A"
## [46] "RPS24" "RPLP2" "RPL27A" "RPS13" "RPS6KA4"
## [51] "RPS6KB2" "RPS6KB2-AS1" "RPS3" "RPS25" "RPS26"
## [56] "RPL41" "RPL6" "RPLP0" "RPL21" "RPL10L"
## [61] "RPS29" "RPL36AL" "RPS6KL1" "RPS6KA5" "RPS27L"
## [66] "RPL4" "RPLP1" "RPS17" "RPL3L" "RPS2"
## [71] "RPS15A" "RPL13" "RPL26" "RPL23A" "RPL23"
## [76] "RPL19" "RPL27" "RPS6KB1" "RPL38" "RPL17"
## [81] "RPS15" "RPL36" "RPS28" "RPL18A" "RPS16"
## [86] "RPS19" "RPL18" "RPL13A" "RPS11" "RPS9"
## [91] "RPL28" "RPS5" "RPS21" "RPL3" "RPS19BP1"
## [96] "RPS6KA3" "RPS4X" "RPS6KA6" "RPL36A" "RPL39"
## [101] "RPL10" "RPS4Y1" "RPS4Y2"
PercentageFeatureSet(data,pattern="^RP[LS]") -> data$percent.Ribosomal
head(data$percent.Ribosomal)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1
## 47.33620 27.49441 35.16101 26.25348
## AAACCTGCACACCGAC-1 AAACCTGCATCGATTG-1
## 42.52874 48.76208
You can see that these can make up a significant proportion of all of the reads in some cells.
MALAT1 is a highly abundant nuclear transcript which remains in the nucleus even after the cell membrane ruptures. It can be an interesting metric to look at.
You can try adding a new QC metric to measure the abundance of this gene.
# Measure MALAT1
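One possible solution, following the same pattern we used for the mitochondrial genes (the column name percent.MALAT1 is just our suggestion):
PercentageFeatureSet(data,features="MALAT1") -> data$percent.MALAT1
head(data$percent.MALAT1)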
Seurat comes with some convenience methods for plotting out certain types of visualisation, such as the distribution of the QC metrics we’ve calculated. We can view these on both a linear and a log scale to see which looks more helpful.
VlnPlot(data, layer = "counts", features=c("nCount_RNA","percent.MT", "percent.Ribosomal"))
For some metrics it’s better to view on a log scale.
VlnPlot(data, layer="counts", features=c("nCount_RNA","percent.MT"), log = TRUE)
What we are generally looking for in each metric are the outlier cells which fall away from the main distribution. These are the cells we would consider removing. You can also look for other structure - for example the percent ribosomal has a bimodal distribution, which may suggest that there are two big sub-populations in the data.
We can also plot metrics against each other to see what the relationship between them is. There is a built-in Seurat plot to do this which is easy to use, but isn’t especially pretty.
FeatureScatter(
data,
feature1 = "nCount_RNA",
feature2 = "percent.Ribosomal",
log = TRUE,
pt.size = 0.2
)
Rather than using Seurat’s built-in plots you also have the option to
extract the data yourself and plot it using any of R’s conventional
plotting systems. By using data[[]] you can get a data
frame of QC metrics which you can then manipulate yourself.
data[[]] %>%
as_tibble(
rownames="Cell.Barcode"
) -> qc.metrics
head(qc.metrics)
We can then plot this out with ggplot if we want to do more.
qc.metrics %>%
arrange(percent.Ribosomal) %>%
ggplot(aes(nCount_RNA,nFeature_RNA,colour=percent.Ribosomal)) +
geom_point() +
scale_color_gradientn(colours = c("grey","grey","red3")) +
ggtitle("Example of plotting QC metrics") +
geom_hline(yintercept = 750) +
geom_hline(yintercept = 2000)
These plots often make more sense on a log scale too:
qc.metrics %>%
arrange(percent.Ribosomal) %>%
ggplot(aes(nCount_RNA,nFeature_RNA,colour=percent.Ribosomal)) +
geom_point(size=1) +
scale_color_gradientn(colours = c("grey","grey","red3")) +
ggtitle("Example of plotting QC metrics") +
geom_hline(yintercept = 750) +
geom_hline(yintercept = 2000) +
scale_x_log10() + scale_y_log10()
We can see that there are potentially a couple of different populations seen here with different relationships between the read (UMI) counts and number of genes detected. We can also see that the population with fewer observed genes than expected given the number of reads has much higher levels of ribosomal proteins, which can soak up the sequencing capacity for some cells, hiding other genes.
More generally we can try to quantitate this effect by calculating a complexity value which divides the number of observed genes by the number of UMIs (both on a log scale). Higher values indicate that we’re getting shallower coverage of more genes, and lower values mean that we’re seeing fewer genes overall than we’d expect for that depth. In extreme cases this can be an indicator of poor QC.
The standard way of calculating this is
log10(genes)/log10(counts) however this gives absolute
values which are difficult to judge. A possibly better approach is to
fit a line through the cloud and then calculate the difference from the
observed value to the expected.
qc.metrics %>%
mutate(complexity=log10(nFeature_RNA) / log10(nCount_RNA)) -> qc.metrics
lm(log10(qc.metrics$nFeature_RNA)~log10(qc.metrics$nCount_RNA)) -> complexity.lm
complexity.lm
##
## Call:
## lm(formula = log10(qc.metrics$nFeature_RNA) ~ log10(qc.metrics$nCount_RNA))
##
## Coefficients:
## (Intercept) log10(qc.metrics$nCount_RNA)
## 0.4220 0.7477
qc.metrics %>%
mutate(
complexity_diff = log10(nFeature_RNA) - ((log10(nCount_RNA)*complexity.lm$coefficients[2])+complexity.lm$coefficients[1])
) -> qc.metrics
Now we can plot this
qc.metrics %>%
ggplot(aes(x=complexity_diff)) +
geom_density(fill="yellow")
# Work out a symmetric limit we could use for the colour scale
min(c(max(qc.metrics$complexity_diff),0-min(qc.metrics$complexity_diff))) -> complexity_scale
qc.metrics %>%
mutate(complexity_diff=replace(complexity_diff,complexity_diff< -0.1,-0.1)) %>%
ggplot(aes(x=log10(nCount_RNA), y=log10(nFeature_RNA), colour=complexity_diff)) +
geom_point(size=0.5) +
geom_abline(slope=complexity.lm$coefficients[2], intercept = complexity.lm$coefficients[1]) +
scale_colour_gradient2(low="blue2",mid="grey",high="red2")
qc.metrics %>%
ggplot(aes(x=complexity_diff, y=percent.Ribosomal)) +
geom_point(size=1)
We can see a small subset of cells which have very low complexity and
are probably QC failures, but we can also see a larger effect whereby
cells with higher ribosomal content progressively lose complexity as
that content increases.
In general it’s a good idea to be fairly permissive when filtering your initial data. Depending on the source of your counts and the way they were imported you’ll probably already have removed the cells with very low counts, or the genes represented in only a few cells.
Here we’ll set a cutoff on two of the metrics we calculated, but you will need to look at the QC of your own data to help decide. Remember, we will look at QC again after quantitating and clustering the data, so we can always come back and filter more harshly later if we wish. It’s often easiest to apply additional filters after performing clustering where you can look at the properties of a group of clustered cells.
qc.metrics %>%
ggplot(aes(log10(nCount_RNA))) +
geom_histogram(binwidth = 0.05, fill="yellow", colour="black") +
ggtitle("Distribution of UMI Counts") +
geom_vline(xintercept = log10(1000))
qc.metrics %>%
ggplot(aes(log10(nFeature_RNA))) +
geom_histogram(binwidth = 0.05, fill="yellow", colour="black") +
ggtitle("Distribution of Feature Counts") +
geom_vline(xintercept = log10(2000))
qc.metrics %>%
ggplot(aes(percent.MT)) +
geom_histogram(binwidth = 0.5, fill="yellow", colour="black") +
ggtitle("Distribution of Percentage Mitochondrion") +
geom_vline(xintercept = 10)
You shouldn’t recycle QC cutoffs between experiments. Using a cutoff which represents outliers in your current dataset is by far the most reliable way to set a sensible value.
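As a sketch of what a data-driven cutoff might look like, we could take the median of a metric plus three median absolute deviations as an outlier bound (three MADs is just our assumption here, not a universal rule):
# An outlier-based upper bound for percent.MT
median(qc.metrics$percent.MT) + 3*mad(qc.metrics$percent.MT)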
From the QC we can then filter the data to get rid of cells with unusual QC metrics. We’ve set cutoffs based on the plots we made before.
subset(
data,
nFeature_RNA>750 &
nFeature_RNA < 2000 &
percent.MT < 10
) -> data
data
## An object of class Seurat
## 36601 features across 3942 samples within 1 assay
## Active assay: RNA (36601 features, 0 variable features)
## 1 layer present: counts
Ideally, after filtering we should re-plot to make sure that the data really does look better.
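For example, we could re-draw the mitochondrial histogram on the filtered object to check that the extreme tail has gone:
data[[]] %>%
ggplot(aes(percent.MT)) +
geom_histogram(binwidth = 0.5, fill="yellow", colour="black") +
ggtitle("Percentage Mitochondrion after filtering")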
Before we do any analysis with the data we need to normalise the raw counts we currently have to get values which are more comparable between cells.
The default normalisation in Seurat is pretty simple: it scales the counts by the total counts in each cell, multiplies by 10,000 and then log transforms the result.
NormalizeData(data, normalization.method = "LogNormalize") -> data
## Normalizing layer: counts
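As a quick sanity check we can recompute this transformation manually for the first cell and compare it to what Seurat stored in the new “data” layer (this assumes the default scale factor of 10,000):
# Manual LogNormalize for one cell: log1p(count / total * 10000)
LayerData(data, layer="counts")[,1] -> raw.counts
LayerData(data, layer="data")[,1] -> seurat.values
all.equal(log1p(raw.counts/sum(raw.counts)*10000), seurat.values)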
As we just saw, the normalised data can be accessed with
LayerData(data, layer="data"). We can use this to pull out
a list of the most highly expressed genes overall.
apply(LayerData(data,layer="data"),1,mean) -> gene.expression
## Warning in asMethod(object): sparse->dense coercion: allocating vector of size
## 1.1 GiB
names(gene.expression) <- rownames(data)
sort(gene.expression, decreasing = TRUE) -> gene.expression
head(gene.expression, n=50)
## MALAT1 B2M TMSB4X RPL10 RPL41 MT-CO1 EEF1A1 RPL13A
## 6.342245 4.957456 4.840603 4.501549 4.497800 4.463789 4.426327 4.421787
## RPL13 RPS18 RPLP1 RPS27 RPL21 RPLP2 RPS12 RPL34
## 4.391305 4.294999 4.291187 4.269649 4.250806 4.144890 4.143300 4.106036
## RPL32 RPS14 MT-CO3 RPS2 RPS27A MT-CO2 RPL26 RPS15A
## 4.070093 4.065816 4.053589 4.028858 4.023677 4.002888 3.985147 3.982837
## RPL28 RPL11 RPS19 RPL23A RPS3A RPL18A RPL27A RPS6
## 3.957320 3.941015 3.929464 3.912566 3.879806 3.878352 3.871731 3.856089
## RPS23 RPS4X RPL19 TMSB10 TPT1 RPS3 RPS8 RPL17
## 3.832583 3.825052 3.817570 3.794648 3.779531 3.776823 3.775376 3.751418
## RPL39 MT-ND4 RPL30 RPS24 RPL7 RPS15 MT-ATP6 RPL3
## 3.747930 3.738239 3.729084 3.714900 3.704778 3.696783 3.678792 3.616755
## RPL12 RPS25
## 3.582130 3.572445
We already mentioned MALAT1, an abundant nuclear transcript, and we can also see both mitochondrial and ribosomal proteins. The high abundance of these transcripts, their influence on the data, and the way they change in poor quality cells are the reasons we often focus on them.
We can take a look at how effective the normalisation has been. This type of normalisation is pretty crude and we don’t expect perfectly normalised values, but this generally doesn’t matter given the sparse nature of the data and the magnitude of the differences between cells.
Let’s look at the distribution of values for a single gene (GAPDH).
tibble(
GAPDH = LayerData(data, layer = "data")["GAPDH",]
) %>%
ggplot(aes(x=GAPDH)) +
geom_histogram(binwidth = 0.05, fill="yellow", colour="black") +
ggtitle("GAPDH expression")
So even for a so-called housekeeping gene we still see a significant proportion of dropout cells, and expression values which spread over 3 orders of magnitude.
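We can quantify that dropout directly as the fraction of cells with a zero value for this gene:
# Proportion of cells in which GAPDH was not detected
mean(LayerData(data, layer="data")["GAPDH",] == 0)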
We can also go a bit wider and pick the first 100 cells and look at the distributions of their expression values.
as_tibble(
LayerData(data)[,1:100]
) %>%
pivot_longer(
cols=everything(),
names_to="cell",
values_to="expression"
) %>%
ggplot(aes(x=cell, y=expression)) +
stat_summary(geom="crossbar", fun.data=mean_sdl) +
theme(axis.text.x = element_blank())
We can see that the distributions of values across the cells are at least broadly similar.
Now that we have quantitated the data we can have a look at whether the cell cycle is having any effect on it. Seurat comes with a set of marker genes for different cell cycle stages which we can use:
cc.genes.updated.2019
## $s.genes
## [1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM7" "MCM4"
## [7] "RRM1" "UNG" "GINS2" "MCM6" "CDCA7" "DTL"
## [13] "PRIM1" "UHRF1" "CENPU" "HELLS" "RFC2" "POLR1B"
## [19] "NASP" "RAD51AP1" "GMNN" "WDR76" "SLBP" "CCNE2"
## [25] "UBR7" "POLD3" "MSH2" "ATAD2" "RAD51" "RRM2"
## [31] "CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1" "BLM"
## [37] "CASP8AP2" "USP1" "CLSPN" "POLA1" "CHAF1B" "MRPL36"
## [43] "E2F8"
##
## $g2m.genes
## [1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2" "TOP2A"
## [8] "NDC80" "CKS2" "NUF2" "CKS1B" "MKI67" "TMPO" "CENPF"
## [15] "TACC3" "PIMREG" "SMC4" "CCNB2" "CKAP2L" "CKAP2" "AURKB"
## [22] "BUB1" "KIF11" "ANP32E" "TUBB4B" "GTSE1" "KIF20B" "HJURP"
## [29] "CDCA3" "JPT1" "CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1"
## [36] "NCAPD2" "DLGAP5" "CDCA2" "CDCA8" "ECT2" "KIF23" "HMMR"
## [43] "AURKA" "PSRC1" "ANLN" "LBR" "CKAP5" "CENPE" "CTCF"
## [50] "NEK2" "G2E3" "GAS2L3" "CBX5" "CENPA"
We can use these to try to predict the cell cycle stage of each cell using a built-in scoring scheme.
CellCycleScoring(
data,
s.features = cc.genes.updated.2019$s.genes,
g2m.features = cc.genes.updated.2019$g2m.genes,
set.ident = TRUE
) -> data
We should now have a set of new QC metrics giving the scores for S and G2M, plus a call for the Phase:
data[[]]
We can look at the proportion of cells in different states. This would be particularly important if you had multiple samples and the spread of different cell cycle phases was wildly different between them.
data[[]] %>%
ggplot(aes(Phase)) + geom_bar() +
xlab(NULL) +
ggtitle("Cell Cycle Calls")
We can also look quantitatively at the S and G2M scores to see how it made its decision:
data[[]] %>%
ggplot(aes(x=S.Score, y=G2M.Score, color=Phase)) +
geom_point() +
coord_cartesian(xlim=c(-0.15,0.15), ylim=c(-0.15,0.15))
Although the tool has made predictions of the stage for each cell, there isn’t a huge separation between the groups it’s picked so we have some hope that this will have a relatively minor influence on the overall expression patterns we see. We’ll pick this up later once we’ve clustered the data and we can see what the content of the different clusters looks like.
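If the separation had been stronger we could consider removing the effect, for example by regressing the phase scores out when we scale the data later. We show this only as a sketch and don’t run it here:
# Hypothetical: remove cell cycle effects during scaling
# ScaleData(data, vars.to.regress = c("S.Score","G2M.Score")) -> data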
Before going on to do the dimensionality reduction we’re going to do some filtering of genes to remove those which are likely to be uninformative in the overall structure of the data. The main method to do this is to find unusually variable genes - these are calculated in the context of the gene’s expression since lowly expressed genes are more likely to be variable by standard measures.
We need to run FindVariableFeatures to calculate the
variability stats. We can look at the variability information with
HVFInfo() once this has been done. We will optimise the
settings after looking at the results.
FindVariableFeatures(data) -> data
## Finding variable features for layer counts
HVFInfo(data) %>%
arrange(desc(variance.standardized)) %>%
head(n=20)
The most variable genes are likely to be those which are informative in distinguishing the major cell populations. In this instance we see Immunoglobulin genes at the very top, but other more normal genes below.
We’re going to need to decide how many variable genes to keep, so we can look at the distribution of standardised variance to help us decide on a cutoff.
HVFInfo(data) %>%
filter(variance.standardized > 0) %>%
ggplot(aes(x=variance.standardized)) +
geom_density() +
coord_cartesian(xlim=c(0,2)) +
geom_vline(xintercept=1.3, colour="red") +
ggtitle("Distribution of standardised variance")
I had to focus in on the main distribution a bit since the scaling is so wide for the super-variable genes. We can see that variance above about 1.3 is outside the main distribution and likely to represent real biological variance.
Let’s see what that means for a number of genes.
HVFInfo(data) %>%
filter(variance.standardized >= 1.3) %>%
nrow()
## [1] 686
So just under 700 genes, which seems a sensible number. We can round up to 700 to keep things simple.
Seurat provides a method to calculate a normalised intensity for each gene, and can then select the top ‘n’ most variable features. In this case we’re selecting the 700 most variable genes.
FindVariableFeatures(
data,
selection.method = "vst",
nfeatures=700
) -> data
## Finding variable features for layer counts
The names of the variable features can be accessed with
VariableFeatures().
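For example, to see the top of that list:
head(VariableFeatures(data), n=10)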
This is the full list of variable features:
HVFInfo(data) %>%
as_tibble(rownames="Gene") %>%
mutate(hypervariable = Gene %in% VariableFeatures(data)) -> variance.data
variance.data %>%
filter(hypervariable) %>%
arrange(desc(variance.standardized))
We can plot out a graph of the variance vs mean and highlight the selected genes so we can see whether we think we’re likely to capture what we need.
variance.data %>%
ggplot(aes(log(mean),log(variance),color=hypervariable)) +
geom_point() +
scale_color_manual(values=c("black","red"))
Before putting the data into PCA for dimensionality reduction we will scale the genes so that they have a mean of 0 and a variance of 1. This is claimed to make the analysis less biased by expression level in the forthcoming PCA.
ScaleData(data,features=rownames(data)) -> data
## Centering and scaling data matrix
Now we’ve got to the stage where we can do the reduction. We’re going to use two methods - PCA and tSNE. We can also see what a UMAP would look like.
We can start by actually running the PCA. We will only use the variable features which we previously selected. The PCA will calculate all of our PCs and will also give us a list of the genes which were most highly (and lowly) weighted in the different PCs.
RunPCA(data,features=VariableFeatures(data)) -> data
## PC_ 1
## Positive: CST3, CSTA, MNDA, S100A9, CFD, AIF1, FTL, CTSS, CD14, VCAN
## S100A12, TYROBP, FOS, FGL2, SERPINA1, AP1S2, FTH1, CYBB, S100A11, SAT1
## GRN, TSPO, TMEM176B, CFP, FCER1G, S100A6, SPI1, NPC2, TNFSF13B, COTL1
## Negative: IL32, CD3D, TRAC, CD7, CCL5, GZMA, NKG7, CD2, CST7, CD247
## CD3G, GZMM, PRF1, CD27, BCL11B, KLRD1, SYNE2, HOPX, KLRG1, FGFBP2
## LDHB, CCL4, GZMH, TRGC2, GZMB, NCR3, ITM2A, TRG-AS1, LYAR, CD8B
## PC_ 2
## Positive: NKG7, GZMA, S100A4, CCL5, CST7, PRF1, ANXA1, IL32, KLRD1, CD7
## GZMM, FGFBP2, CD247, CD3D, CCL4, CD2, GZMH, S100A6, KLRG1, HOPX
## GZMB, ACTB, IFITM2, CD3G, KLRF1, LYAR, SPON2, TRGC2, TRAC, S100A10
## Negative: CD79A, MS4A1, CD79B, BANK1, HLA-DRA, IGKC, HLA-DPA1, IGHD, HLA-DQB1, HLA-DPB1
## CD22, CD37, AFF3, FCRLA, HLA-DRB5, HLA-DQA2, HLA-DMB, ADAM28, HLA-DOB, BLNK
## FCER2, TCF4, PAX5, BCL11A, CD19, PDLIM1, CD72, NIBAN3, JCHAIN, TSPAN13
## PC_ 3
## Positive: GZMB, KLRF1, CLIC3, PRF1, FGFBP2, SPON2, NKG7, KLRD1, CST7, CCL4
## GZMA, HOPX, AKR1C3, PLAC8, RHOC, FCER1G, IGFBP7, GZMH, IFITM2, CD160
## ABI3, TYROBP, ZEB2, CCL5, DAB2, CMC1, CD247, CD38, HLA-DPB1, HLA-DPA1
## Negative: TRAC, LDHB, CD3D, BCL11B, CD27, CD3G, TRABD2A, RGCC, PRKCQ-AS1, RGS10
## PASK, IL32, ITM2A, CD8B, VIM, CD2, SCGB3A1, GPR171, RTKN2, CRIP2
## GRAP2, LINC00892, S100A10, GZMK, S100A12, ID3, VCAN, COTL1, GBP1, FXYD2
## PC_ 4
## Positive: PTCRA, GNG11, TUBB1, CAVIN2, ACRBP, TSC22D1, SERPINF1, SMIM5, HIST1H2AC, TPM2
## CLU, TMEM40, MPIG6B, GP9, TREML1, CLEC4C, PF4V1, PPP1R14B, CLDN5, AC147651.1
## SCT, LAMP5, CA2, ITGA2B, MYL9, TRIM58, DAB2, CMTM5, MYBL2, UGCG
## Negative: CD79B, MS4A1, CD79A, IGHD, CD22, BANK1, FGFBP2, PRF1, KLRD1, IFITM2
## CST7, NKG7, HOPX, HLA-DOB, FCER2, ADAM28, KLRF1, HLA-DQB1, CD72, SPON2
## CCL4, PAX5, GZMH, FCRLA, GZMA, HLA-DPA1, HLA-DPB1, CD19, ARHGAP24, HLA-DRA
## PC_ 5
## Positive: TUBB1, GNG11, CAVIN2, ACRBP, HIST1H2AC, TSC22D1, TMEM40, CLU, GP9, MPIG6B
## TREML1, CA2, AC147651.1, TRIM58, CLDN5, ITGA2B, PF4V1, MYL9, CMTM5, RAB27B
## RUFY1, NEXN, DMTN, SMOX, RGS18, ESAM, PDLIM1, PRKAR2B, LIMS1, GMPR
## Negative: SERPINF1, TPM2, CLEC4C, SCT, LAMP5, PPP1R14B, MYBL2, UGCG, DERL3, LINC00996
## TNFRSF21, PRXL2A, APP, IRF7, PLXNA4, SMIM5, SHD, JCHAIN, ASIP, IRF8
## LINC00865, PPM1J, SLC12A3, SCN9A, RNASE6, ALOX5AP, TCF4, TIFAB, ZNF385C, STMN1
We can use the DimPlot function to plot any of our
projections - we just need to tell it which one to use. Here we’re going
to plot the first two PCs from our PCA. Because we classified our cells
by cell cycle before, DimPlot will pick this up and colour the cells by
phase, so we can see whether the cell cycle is having a big effect on the
clusters we’re picking out.
We can use the group.by option to colour by any
metadata column. We can also add labels to the plot. Finally we can add
a call to the NoLegend() function to suppress the automatic
colour legend which is drawn.
DimPlot(data,reduction="pca")
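As an example of those options, we can make the phase colouring explicit, add labels and drop the legend (this just reproduces the default ident-based colouring):
DimPlot(data, reduction="pca", group.by="Phase", label=TRUE) + NoLegend()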
We can look at later PCs by passing the dims
argument.
DimPlot(data,reduction="pca", dims=2:3)
This nicely shows us the power, but also the limitations of PCA in that we can see that not all of the useful information is captured in the first two principal components. The question then becomes how far down the set of PCs do we need to go to capture all of the biologically relevant information.
We can start with a simple plot called the elbow plot which simply quantitates the amount of variance captured in the different PCs
ElbowPlot(data)
From this we can see that there are fairly high amounts of information captured in the first 8 PCs, and that maybe we can see some additional information up to around 15 PCs, but beyond that the plot is very flat. Taking somewhere between 8 and 15 PCs should therefore capture what we want to see.
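If we want numbers rather than a picture we can calculate the percentage of variance captured by each PC from the standard deviations stored in the reduction (a sketch using Seurat’s Stdev accessor; note these are percentages of the variance within the computed PCs, not of the whole dataset):
Stdev(data, reduction="pca")^2 -> pc.var
head(round(100*pc.var/sum(pc.var), 2), n=15)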
For a more detailed view we can do dimensionality heatmaps. These are plots of PCA weightings for the most highly and lowly weighted genes, shown against the set of cells which are most highly influenced by the PC. The idea is that as long as we’re seeing clear structure in one of these plots then we’re still adding potentially useful information to the analysis.
DimHeatmap(data,dims=1:18, cells=500)
Things are pretty clearly grouped up to PC10, beyond which the separation becomes more vague. It doesn’t matter hugely exactly where you draw the line to separate PCs, but here the first 10 are definitely nice, so we can try with those.
To try to capture more information in a single 2D plot we’re going to take the first 10 dimensions of the PCA - which were calculated on only the 700 most variable genes - forward into a tSNE projection. We can run this in a very similar way to the PCA, except that we specify the number of dimensions we want to use.
Since tSNE uses a randomised starting position, if we want to be able
to reproduce the plot we see then we’ll need to know the random ‘seed’
which was used to create it. We can capture the current state of
the random number generator (from the .Random.seed
variable) and report it. This will change every time we run, but at
least we will report the result.
In our case, because we want everyone to get the same answer I’ve saved the seed from when I prepared this tutorial and we’ll re-use that.
8482 -> saved.seed
set.seed(saved.seed)
We are now going to run the tSNE. The one parameter we might need to play around with is the perplexity value (expected number of nearest neighbours). By default this is set (somewhat arbitrarily) to 30. Setting this to a low value will help resolve small clusters, but at the expense of large clusters becoming more diffuse. Setting it to higher values will make the larger clusters more distinct, but may lose smaller clusters.
RunTSNE(
data,
dims=1:10,
seed.use = saved.seed,
perplexity=10
) -> data
DimPlot(data,reduction = "tsne", pt.size = 1) + ggtitle("tSNE with Perplexity 10") + NoLegend()
RunTSNE(
data,dims=1:10,
seed.use = saved.seed,
perplexity=200
) -> data
DimPlot(data,reduction = "tsne", pt.size = 1) + ggtitle("tSNE with Perplexity 200") + NoLegend()
RunTSNE(
data,
dims=1:10,
seed.use = saved.seed
) -> data
DimPlot(data,reduction = "tsne", pt.size = 1) + ggtitle("tSNE with default Perplexity (30)") + NoLegend()
We can see the differences in clustering between the different perplexities - the structures aren’t completely different but the compactness of them and the emphasis on smaller clusters certainly changes. We can also see that there isn’t a huge effect of cell cycle in that all cycles are generally represented in all clusters, with maybe one cluster being somewhat depleted for cells in S-phase.
Running UMAP is just as straightforward, and more practical as cell numbers get higher.
RunUMAP(
data,
dims=1:10,
seed.use = saved.seed
) %>%
DimPlot(reduction = "umap", pt.size = 1) +
ggtitle("tSNE with default parameters") +
NoLegend()
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 09:18:34 UMAP embedding parameters a = 0.9922 b = 1.112
## 09:18:34 Read 3942 rows and found 10 numeric columns
## 09:18:34 Using Annoy for neighbor search, n_neighbors = 30
## 09:18:34 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 09:18:35 Writing NN index file to temp file /tmp/RtmpqnU5Ql/file18dd2ea5191ab
## 09:18:35 Searching Annoy index using 1 thread, search_k = 3000
## 09:18:36 Annoy recall = 100%
## 09:18:38 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 09:18:40 Initializing from normalized Laplacian + noise (using RSpectra)
## 09:18:41 Commencing optimization for 500 epochs, with 162268 positive edges
## 09:18:49 Optimization finished
The UMAP tends to separate the data more sparsely and use less of the plot area in its default configuration.
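The spread can be tuned if needed; min.dist and n.neighbors are the usual knobs. The values below are purely illustrative, shown as a sketch rather than run:
# Hypothetical settings: larger n.neighbors favours global structure,
# larger min.dist spreads points out more evenly
# RunUMAP(data, dims=1:10, min.dist=0.5, n.neighbors=50, seed.use=saved.seed)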
At the moment in our PCA and tSNE we can see that there are clusters of cells, but we haven’t tried to identify what these are. We will come to this problem now. We’re going to use a graph based method to detect clusters. This finds the ‘k’ nearest neighbours to each cell and makes this into a graph. It then looks for highly inter-connected subgraphs within the graph and uses these to define clusters.
In the first instance we just define the graph. We can control the
number of neighbours used via the k.param value; the
default is 20. As before we use the first 10 dimensions of the PCA to
calculate the neighbours.
FindNeighbors(data,dims=1:10) -> data
## Computing nearest neighbor graph
## Computing SNN
Since we’re only connecting each cell to its 20 nearest neighbours we get another sparse matrix, holding the strength of the links between cells.
data@graphs$RNA_snn[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'AAACCTGAGAAACCAT-1', 'AAACCTGAGATAGCAT-1', 'AAACCTGAGCGTGAAC-1' ... ]]
##
## AAACCTGAGAAACCAT-1 1 . . . . . . . . .
## AAACCTGAGATAGCAT-1 . 1.0000000 . 0.2903226 . . . . . .
## AAACCTGAGCGTGAAC-1 . . 1 . . . . . . .
## AAACCTGCAACGCACC-1 . 0.2903226 . 1.0000000 . . . . . .
## AAACCTGCATCGATTG-1 . . . . 1 . . . . .
## AAACCTGTCCAGATCA-1 . . . . . 1 . . . .
## AAACGGGAGCGCTTAT-1 . . . . . . 1 . . .
## AAACGGGAGCTACCTA-1 . . . . . . . 1 . .
## AAACGGGCAACTGCTA-1 . . . . . . . . 1 .
## AAACGGGCAAGTCTGT-1 . . . . . . . . . 1
We can then segment the graph using the FindClusters
method. The resolution controls how fragmented the graph will be:
larger values give more, smaller clusters, while smaller values give
fewer, larger clusters. It can be difficult to know what a sensible
value to pick is, so you can try a range of values between about 0.1
and 2 to see what effect this has. We can run the clustering multiple
times and then plot out the results.
FindClusters(data,resolution = 0.1, cluster.name = "Clust1") -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3942
## Number of edges: 143276
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9723
## Number of communities: 7
## Elapsed time: 0 seconds
FindClusters(data,resolution = 0.2, cluster.name = "Clust2") -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3942
## Number of edges: 143276
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9525
## Number of communities: 8
## Elapsed time: 0 seconds
FindClusters(data,resolution = 0.3, cluster.name = "Clust3") -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3942
## Number of edges: 143276
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9342
## Number of communities: 9
## Elapsed time: 0 seconds
FindClusters(data,resolution = 0.5, cluster.name = "Clust5") -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3942
## Number of edges: 143276
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9014
## Number of communities: 10
## Elapsed time: 0 seconds
FindClusters(data,resolution = 0.8, cluster.name = "Clust8") -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3942
## Number of edges: 143276
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8670
## Number of communities: 13
## Elapsed time: 0 seconds
FindClusters(data,resolution = 1.5, cluster.name = "Clust15") -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3942
## Number of edges: 143276
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8034
## Number of communities: 15
## Elapsed time: 0 seconds
We can now plot these out using clustree.
clustree(data,prefix = "Clust")
We’ll take 0.8 as a fairly sensible split as it’s the first time the big cluster 0 splits. We can refer back and see what we could have further split or grouped if we’d gone higher or lower. This is often a bit of an exploration. Plotting out the results on the dimensionality plots is often a good way to evaluate too.
DimPlot(data,reduction="tsne",pt.size = 1, label = TRUE, label.size = 7, group.by = "Clust1")
DimPlot(data,reduction="tsne",pt.size = 1, label = TRUE, label.size = 7, group.by = "Clust3")
DimPlot(data,reduction="tsne",pt.size = 1, label = TRUE, label.size = 7, group.by = "Clust8")
DimPlot(data,reduction="tsne",pt.size = 1, label = TRUE, label.size = 7, group.by = "Clust15")
There often isn’t an obviously “correct” answer here. It’s just a matter of convenience and assessment. You can always come back later and change your mind as to the best solution.
FindClusters(data,resolution = 0.8) -> data
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3942
## Number of edges: 143276
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8670
## Number of communities: 13
## Elapsed time: 0 seconds
The clusters are stored in the “seurat_clusters” metadata annotation so they can be used in any way the previous QC data was used. They will also be picked up automatically when projections are plotted.
head(data$seurat_clusters, n=20)
## AAACCTGAGAAACCAT-1 AAACCTGAGATAGCAT-1 AAACCTGAGCGTGAAC-1 AAACCTGCAACGCACC-1
## 2 0 5 0
## AAACCTGCATCGATTG-1 AAACCTGTCCAGATCA-1 AAACGGGAGCGCTTAT-1 AAACGGGAGCTACCTA-1
## 1 2 7 7
## AAACGGGCAACTGCTA-1 AAACGGGCAAGTCTGT-1 AAACGGGCAGACGCAA-1 AAACGGGCAGTCCTTC-1
## 7 0 5 8
## AAACGGGCATGCTAGT-1 AAACGGGGTTACGACT-1 AAACGGGGTTCAGTAC-1 AAAGATGAGAGCTTCT-1
## 12 6 4 0
## AAAGATGAGGAGCGTT-1 AAAGATGCACAACTGT-1 AAAGATGCACACCGCA-1 AAAGATGCAGACAAGC-1
## 2 5 3 1
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12
If we go back and plot our PCA we can see the clusters, but we can see that many of the clusters don’t resolve very well in PC1 vs PC2.
DimPlot(data,reduction="pca",label = TRUE)+ggtitle("PC1 vs PC2 with Clusters")
If we look at the same thing with the tSNE plot we can see that the information across all 10 PCs we used is preserved and we see the overall similarity of the cells.
DimPlot(data,reduction="tsne",pt.size = 1, label = TRUE, label.size = 7)
Examining the properties of the clusters
========================================
Now that we have our clusters we can look to see if they are being influenced by any of the QC metrics we calculated earlier. We can see that some of the clusters are skewed in one or more of the metrics we’ve calculated so we will want to take note of this. Some of these skews could be biological in nature, but they could be noise coming from the data.
VlnPlot(data,features="nCount_RNA", group.by = "seurat_clusters")
Clusters with very low coverage could be mis-called cells, so we could look at cluster 8. Clusters with high coverage could be doublets, especially if they have relatively small cell numbers, so we’d be suspicious of clusters 7, 9, 10 and 12.
VlnPlot(data,features="nFeature_RNA", group.by = "seurat_clusters")
It might be tempting to think that clusters 7, 9, 10 and 12 could be
from GEMs where two or more cells were captured, since they all have
unusually high coverage and diversity.
VlnPlot(data,features="percent.MT", group.by = "seurat_clusters")
Nothing obviously problematic there.
data@meta.data %>%
group_by(seurat_clusters,Phase) %>%
count() %>%
group_by(seurat_clusters) %>%
mutate(percent=100*n/sum(n)) %>%
ungroup() %>%
ggplot(aes(x=seurat_clusters,y=percent, fill=Phase)) +
geom_col() +
ggtitle("Percentage of cell cycle phases per cluster")
The QC and exploration of this data often happens in cycles where you can try things and see what they look like.
Now that we have defined the different clusters we can start to evaluate them. One way to do this will be to identify genes whose expression defines each cluster which has been identified.
Seurat provides the FindMarkers function to identify
genes which are specific to a given cluster. This is a somewhat generic
function which can run a number of different tests. We are only going to
focus on two of these, but you can find the others in the Seurat
documentation.
The two tests we are going to use are:
The Wilcoxon rank sum test. This identifies genes which are differentially regulated between two groups of cells. It is a non-parametric test which makes very few assumptions about the behaviour of the data, and just looks for genes whose expression is consistently ranked more highly in one group of cells compared to the other.
The ROC test. This is a measure of how specifically a gene can predict membership of two groups. It gives a value between 0.5 (no predictive value) and 1 (perfectly predictive on its own) to say how useful each gene is at predicting. Again this is a non-parametric test which just cares about the ranked expression measures for each gene.
In the simplest case we can find genes which appear to be upregulated
in a specific cluster compared to all cells not in that cluster. The
additional min.pct parameter says that the gene must be
measured in at least 25% of the cells in either cluster 1 or all of the
other cells in order to be tested. This cuts down on testing genes
which are effectively unexpressed.
FindMarkers(data,ident.1 = 1, min.pct = 0.25)
We can then use a convenience plotting method VlnPlot to
show the expression levels of these genes in the cells in each
cluster.
VlnPlot(data,features="LEF1")
We can indeed see that the LEF1 gene is more highly expressed in
cluster 1 than in any of the other clusters, but it is also fairly
highly expressed in cluster 3. This is the adjacent cluster, and the
two would have been merged at a lower resolution.
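We could run the same comparison using the ROC test described earlier by changing the test.use argument; the output then reports each gene’s power to predict cluster membership rather than a p-value:
FindMarkers(data, ident.1 = 1, min.pct = 0.25, test.use = "roc") %>%
head()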
We can extend the same methodology to make predictions for all of the
clusters. In more recent versions of Seurat we can use the
FindAllMarkers function to do this systematically.
By default this method searches for both positive (enriched) and negative (depleted) markers, so you can filter on avg_log2FC if you just want the positive ones.
FindAllMarkers(data,min.pct = 0.25) -> cluster.markers
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
cluster.markers %>%
filter(avg_log2FC > 0) -> cluster.markers
head(cluster.markers, n=20)
We can extract from this list the best marker gene for each cluster.
cluster.markers %>%
group_by(cluster) %>%
slice(1) %>%
ungroup() %>%
pull(gene) -> best.wilcox.gene.per.cluster
best.wilcox.gene.per.cluster
## [1] "S100A12" "CCR7" "CD79A" "IL7R" "GZMH" "GZMK" "CD79A"
## [8] "CLEC10A" "GZMB" "CDKN1C" "TUBB1" "XCL1" "LILRA4"
We can then plot these out.
VlnPlot(data,features=best.wilcox.gene.per.cluster)
We can see that for some clusters (eg cluster 7 - CLEC10A) we really do have a gene which can uniquely predict membership, but for many others (eg cluster 0 - S100A12) we have a hit which also picks up other clusters (clusters 7 and 10 in this case).
Let’s look at the unique and multiple hits for each cluster.
cluster.markers %>%
filter(p_val_adj < 0.05) %>%
group_by(gene) %>%
mutate(unique = if_else(n()>1, "Multiple", "Unique")) %>%
ungroup() %>%
arrange(unique) %>%
ggplot(aes(x=cluster, y=avg_log2FC, colour=unique)) +
geom_jitter(height=0) +
scale_colour_manual(values = c("grey","red2"))
This gives a good idea of how many hits there are for each cluster, how strong they are, and whether they come up in just that cluster.
We could filter the hits to find the best hit for each cluster which wasn’t a hit in any other cluster to see if we can get better individual markers.
cluster.markers %>%
filter(p_val_adj < 0.05) %>%
group_by(gene) %>%
filter(n() == 1) %>%
ungroup() %>%
arrange(p_val_adj) %>%
group_by(cluster) %>%
slice(1) %>%
ungroup() %>%
pull(gene) -> best.unique.wilcox.gene.per.cluster
best.unique.wilcox.gene.per.cluster
## [1] "LRRK2" "TRABD2A" "TNFRSF13B" "NSG1" "TIGIT" "TC2N"
## [7] "FCER2" "CLEC10A" "SPON2" "CDKN1C" "TUBB1" "XCL1"
## [13] "LILRA4"
We can then plot these out.
VlnPlot(data,features=best.unique.wilcox.gene.per.cluster)
We can see that these look much more specific.
The other view we might want to use once we have picked out some genes we like is to colour the entirety of a projection by that gene. Sometimes we want to do this to look at genes we expect to be diagnostic for different cell subtypes, but sometimes we will use the same method to explore hits which come out of our own analysis.
We can use the FeaturePlot function to colour the tSNE
projection with the expression level of a number of different genes.
FeaturePlot(data,features=best.unique.wilcox.gene.per.cluster, reduction = "tsne", label = TRUE, label.size = 5)
There are many programs to predict cell types. They all work well if you have a good set of marker genes.
Here we’re using the Azimuth pipeline. It integrates your data with one of a number of publicly available cell collections and then returns a labelling for each cell, as well as the integrated dimensionality reduction from the reference data.
We’re going to plot this out not using their UMAP, but the tSNE we calculated before.
RunAzimuth(data, reference = "pbmcref") -> data
data
## An object of class Seurat
## 36696 features across 3942 samples within 4 assays
## Active assay: RNA (36601 features, 700 variable features)
## 3 layers present: counts, data, scale.data
## 3 other assays present: prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3
## 4 dimensional reductions calculated: pca, tsne, integrated_dr, ref.umap
DimPlot(data, reduction = "tsne", group.by = "predicted.celltype.l1", label = TRUE, pt.size = 2)
We can see that the level 1 annotation picks out the largest structures within the data, but doesn’t have the same resolution as our own clustering.
DimPlot(data, reduction = "tsne", group.by = "predicted.celltype.l2", label = TRUE, pt.size = 2)
At level 2 some of our sub-clusters start to resolve, some cleanly but others more messily. You can see that certain cell types are highly similar and not easy to differentiate.
We can also relate the Azimuth labels to our own clusters to see which cell type is predicted for each cluster.
data[[]] %>%
as_tibble(rownames="Cell") %>%
group_by(seurat_clusters, predicted.celltype.l2) %>%
count() %>%
ungroup() %>%
ggplot(aes(x=seurat_clusters, y=n, fill=predicted.celltype.l2)) +
geom_col()
Many of these split cleanly at level 2 annotation but some are still a mix of a couple of different annotations.
data[[]] %>%
as_tibble(rownames="Cell") %>%
group_by(seurat_clusters, predicted.celltype.l1,predicted.celltype.l2) %>%
count() %>%
group_by(seurat_clusters) %>%
mutate(
percentage = 100*n/sum(n)
) %>%
ungroup() %>%
arrange(desc(percentage)) %>%
group_by(seurat_clusters) %>%
slice(1) %>%
ungroup() %>%
arrange(desc(percentage))
We can see clean assignments for many clusters, but some at the bottom are still a bit ambiguous.
Finally, we can go back from the analysis we’ve done here to the Loupe browser to try to take advantage of both the flexibility in R and the interaction within Loupe. To do this we’re using an R package which 10X produce which can export information from a Seurat object into a cloupe file.
Information on installing and using this package can be found at:
https://www.10xgenomics.com/support/software/loupe-browser/latest/tutorials/introduction/lb-louper
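As a sketch, installation typically looks something like the lines below, but check the page above for the current instructions (the setup step fetches the louper executable and asks you to accept the EULA):
# remotes::install_github("10XGenomics/loupeR")
# loupeR::setup()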
create_loupe_from_seurat(data, output_name = "seurat_exported", force = TRUE)
## 2025/11/18 09:20:54 extracting matrix, clusters, and projections
## 2025/11/18 09:20:55 selected assay: RNA
## 2025/11/18 09:20:55 selected clusters: active_cluster orig.ident Phase old.ident Clust1 seurat_clusters Clust2 Clust3 Clust5 Clust8 Clust15 RNA_snn_res.0.8 predicted.celltype.l1 predicted.celltype.l2 predicted.celltype.l3
## 2025/11/18 09:20:55 selected projections: tsne ref.umap
## 2025/11/18 09:20:55 validating count matrix
## 2025/11/18 09:20:55 validating clusters
## 2025/11/18 09:20:55 validating projections
## 2025/11/18 09:20:55 creating temporary hdf5 file: /tmp/RtmpqnU5Ql/file18dd2e4202f3b2.h5
## 2025/11/18 09:20:59 invoking louper executable
## 2025/11/18 09:20:59 running command: "/bi/home/andrewss/.local/share/R/loupeR/louper create --input='/tmp/RtmpqnU5Ql/file18dd2e4202f3b2.h5' --output='/bi/home/andrewss/10XCourse Data/Seurat/seurat_exported.cloupe' --force"