Orchestrating Single-Cell Analysis with Bioconductor: Workshop
Instructors names and contact information
This workshop gives an introductory overview of analyzing single-cell data, particularly RNA-seq, using Bioconductor software. This workshop will help participants to understand essential Bioconductor infrastructure, such as the SingleCellExperiment class, and various analytical routines using real-world data. Finally, this workshop is modeled after the manuscript “Orchestrating Single-Cell Analysis with Bioconductor” (Amezquita et al. 2019). Students will analyze provided example datasets on their personal laptop. This workshop will be a mixture of example code shown by instructors (available through this repository), short exercises, and discussion.
- Basic knowledge of R syntax
- Some familiarity with S4 objects may be helpful, but is not required
- Some familiarity with tidyverse syntax and methods, such as the usage of pipes, may be helpful, but is not required
- Familiarity with high-throughput gene expression data as obtained from RNA-seq; familiarity with single-cell RNA-seq helpful, but not required
Relevant background reading:
- (Amezquita et al. 2019)
In this workflow, we will be using the
TENxPBMCData package’s 3k cell dataset. This dataset contains various cell types, and is small enough to demonstrate essential Bioconductor packages and workflows.
Students will be able to run code interactively during the workshop on their personal computers during a live demonstration of the code used herein, with ample opportunity for questions, answers, and discussion.
R / Bioconductor packages used
First, please make sure you have the latest Bioconductor release version 3.9. See Bioconductor installation instructions:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(version = "3.10") BiocManager::install(c( 'TENxPBMCData', # dataset 'DropletUtils', # quality control 'scater', 'scran', # normalization/downstream analysis/viz 'iSEE', # interactive viz (not run) 'BiocParallel', # faster computation 'devtools', # for installing cellassign 'tidyr', # data munging 'tensorflow')) # cellassign dependency ## optional for annotation section: devtools::install_github('irrationone/cellassign') # automated annotation of cells
library(tidyr) library(BiocParallel) library(iSEE) library(TENxPBMCData) library(DropletUtils) library(scater) library(scran)
50 min workshop consisting of the following:
|Introduction to Workshop||5m|
|Infrastructure: the SingleCellExperiment class||10m|
|Framework of Single-Cell RNA-seq analysis||10m|
|Data processing (QC, Norm, Feature Selection)||10m|
|Downstream: Clustering, DE, Annotation||10m|
|Sharing Data: Interactive Analysis||5m|
Workshop goals and objectives
- Describe the framework needed to implement a basic single-cell RNA-seq analysis
- Describe the utility and design of the SingleCellExperiment object in the context of the analysis framework
- Identify critical steps within a typical single-cell RNA-seq analysis that can greatly influence end-results
- Import single-cell RNA-seq data from raw counts into a SingleCellExperiment object
- Utilize the SingleCellExperiment object for annotation, subsetting, and processing via ad-hoc and established methods
- Produce descriptive plots from SingleCellExperiment objects to evaluate key quality control and processing steps
- Perform a basic end-to-end analysis of a simple yet heterogeneous scRNA-seq dataset
Overview of the SingleCellExperiment Class
The SingleCellExperiment class is a lightweight data container for storing data from single-cell assays or experiments that benefits from continual validity checks to prevent malformed data input. This has made the SingleCellExperiment data container the foundation of many packages oriented toward single-cell analysis available in Bioconductor, providing seamless interoperability between packages and facilitating the development and usage of cutting-edge computational methods. The SingleCellExperiment class was developed as an extension of the existing SummarizedExperiment class originally designed for bulk assays. In addition to supporting many existing software packages, SingleCellExperiment provides convenient methods and data structures that are specific to single-cell analyses.
The SingleCellExperiment object is organized into components, which are written to and accessed programmatically with accessors named after their corresponding component. Primary data, such as count matrices representing sequencing read or unique molecular identifier (UMI) counts, are stored in the assays component as one or more matrices (including sparse or dense matrices e.g. using the Matrix package), where rows represent features (e.g. genes, transcripts, genomic regions) and columns represent cells. Furthermore, each row and column can be annotated with a rich set of metadata. For example, row metadata are stored in the rowData component and could contain alternative gene id mappings, such as Entrez gene IDs alongside HUGO gene symbols. Furthermore, for rows corresponding to potentially disjoint genomic features, a special rowRanges component can be created to hold sets of genomic coordinates. Column metadata is stored in the colData component and contains information pertinent to cell-level characteristics. In particular, the data within the colData component can be used for subsetting of the SingleCellExperiment object, such as removing poor quality cells. An additional feature of the SingleCellExperiment class is the reducedDims component, which contains low-dimensional representations of data such as Principal Components Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP).
Because all of the primary data, transformations of the data, and associated metadata reside in a single SingleCellExperiment object, this design feature enables the continual validity checking that assures data integrity as well as functional cross-references for each cell across all (meta)data characteristics. To enable analyses in computing environments where the data are too large to fit into memory, disk-backed representations of the data (e.g. HDF5) are supported in the SingleCellExperiment class.
A Common Framework for Analyzing Single-Cell RNA-seq data
In this workshop, we will be following the workflow above and, in particular, demonstrating the utility of the SingleCellExperiment class to store the record of our analysis. We assume here that the reader has performed necessary preprocessing of the data - specifically, read alignment and subsequent quantification of the reads into a counts matrix. For the purposes of the workshop, we will be utilizing a publicly available dataset from the TabulaMurisData package, utilizing a subset of the data for further analysis.
The framework of the analytical portion following preprocessing can be divided conceptually into three sections:
Data Processing: these steps ultimately lead to the creation of a clean expression matrix, which in general serves as the common basis for most downstream analysis. This also includes the calculation of reduced dimension representations derived from the clean expression matrix, such as PCA, t-SNE, and UMAP, that are often used in downstream analysis and results visualization.
Downstream Analysis: the choice of analysis hereafter are typically performed in service of answering a biological objective, and thus are not necessarily a linear series of steps as in data processing (the possible exception to this being clustering, as it often serves as the basis for other analyses). A variety of analyses can be performed, ranging from differential expression, to trajectory analysis, to annotation tasks such as gene set enrichment analysis and cell labeling.
Accessible and Reproducible Analysis: sharing the record of analysis - most of if not all of - can be shared via the SingleCellExperiment object created throughout the analysis above - can be shared with external collaborators via platforms such as iSEE.
Please note, the workflow demonstrated in this workshop is written with the aim of simplicity, and thus will likely require nontrivial tweaking of parameters or alternate methods in real-world analyses.
One note: in this workflow, we will be loading libraries as they become necessary to clearly link libraries to their respective functions, which usually runs counter to the norm of loading libraries first, at the top of the analysis script.
Preprocessing & Import to R
We will assume here that sequencing alignment and quantification of the data into a counts matrix, as well as the subsequent import to R has already been performed since this is highly platform- or technology-dependent.
Note that for 10X Genomics data (which is used in this example workflow), the counts matrix and associated metadata (cell barcodes, data path, etc.) can be imported via the
read10xCounts() function. For data processed through
Kallisto, we recommend checking out the
tximeta Bioconductor packages. These are either imported as
SingleCellExperiment or as a counts matrix which can be then coerced into a
SingleCellExperiment object as demonstrated below.
Constructing the SingleCellExperiment
Below we show an example of creating a
SingleCellExperiment class object from a counts matrix and associated experimental metadata.
library(SingleCellExperiment) ## More realistic: read in your experimental design metadata ## If its per cell metadata, make sure it lines up with your ## counts matrix row IDs correctly ## my_metadata <- read.csv("my_metadata.csv") ## Example data ncells <- 100 my_counts_matrix <- matrix(rpois(20000, 5), ncol = ncells) my_metadata <- data.frame(genotype = rep(c('A', 'B'), each = 50), experiment_id = 'Experiment1') ## Construct the sce object manually sce <- SingleCellExperiment(assays = list(counts = my_counts_matrix), colData = my_metadata) ## Manually adding a variable that is the same across all cells colData(sce) <- cbind(colData(sce), date = '2020-01-01') sce
## class: SingleCellExperiment ## dim: 200 100 ## metadata(0): ## assays(1): counts ## rownames: NULL ## rowData names(0): ## colnames: NULL ## colData names(3): genotype experiment_id date ## reducedDimNames(0): ## spikeNames(0):
From Publicly Available Data
From here on out, we will be working with a small example dataset from the
TENxPBMCData Bioconductor package which has already been packaged into a
SingleCellExperiment class object:
library(TENxPBMCData) sce <- TENxPBMCData('pbmc3k')
## class: SingleCellExperiment ## dim: 32738 2700 ## metadata(0): ## assays(1): counts ## rownames(32738): ENSG00000243485 ENSG00000237613 ... ## ENSG00000215616 ENSG00000215611 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(11): Sample Barcode ... Individual Date_published ## reducedDimNames(0): ## spikeNames(0):
One decision that should be made early on in the analysis is what row identifier to identify genes. Depending on how the data is imported, the
rowData component may already have additional annotation information, such as multiple row mappings. For our new
sce object from the
pbmc3k dataset, we can take a look at
rowData to see our options:
## DataFrame with 32738 rows and 3 columns ## ENSEMBL_ID Symbol_TENx Symbol ## <character> <character> <character> ## ENSG00000243485 ENSG00000243485 MIR1302-10 NA ## ENSG00000237613 ENSG00000237613 FAM138A FAM138A ## ENSG00000186092 ENSG00000186092 OR4F5 OR4F5 ## ENSG00000238009 ENSG00000238009 RP11-34P13.7 LOC100996442 ## ENSG00000239945 ENSG00000239945 RP11-34P13.8 NA ## ... ... ... ... ## ENSG00000215635 ENSG00000215635 AC145205.1 NA ## ENSG00000268590 ENSG00000268590 BAGE5 NA ## ENSG00000251180 ENSG00000251180 CU459201.1 NA ## ENSG00000215616 ENSG00000215616 AC002321.2 NA ## ENSG00000215611 ENSG00000215611 AC002321.1 NA
We see that we could choose between
ENSEMBL_ID (the default),
Symbol. For ease of readability and subsetting, we will utilize the
Symbol_TENx identifier as our object’s rownames, making it possible to subset the
sce with gene symbols as in
rownames(sce) <- rowData(sce)$Symbol_TENx
Now, while this seems to work just fine, eventually we may run into an issue because we actually have duplicated row names here. Depending on how a downstream function is coded, this may cause an esoteric error to pop-up. In fact, here we have 95 duplicates based on the
Symbol_TENx gene identifier mapping.
We can avoid future errors (and many headaches) by removing duplicates before any analysis:
## counts dupes from top to bottom dupes <- duplicated(rownames(sce)) ## keep all the not (!) duplicated genes sce <- sce[!dupes, ]
Keep in mind, the above is likely the most inelegant solution to the problem. Other methods could include, from the duplicated set of genes, choosing the one with the highest expression, aggregating the counts per cell, or keeping them all by adding an additional suffix to make the row names unique. Each has its own tradeoffs, so we leave this choice up to the diligent reader.
And one more bit of preprocessing to prevent a potential downstream error is to assign our columns proper names. We can grab the barcodes of each cell from
colData and assign them as column names as follows:
colnames(sce) <- sce$Barcode
The aim of this section is to form the basis for more interesting downstream analyses. Thus, the objective here is to transform the data into a “clean” expression matrix that has been normalized and freed of technical artifacts, as well as a dimensionality reduction representation that can be used in subsequent analyses and visualization.
Quality Control Metrics
The first step is to ensure that our dataset only contains viable cells, e.g. droplets that contain proper mRNA libraries.
One way to do that is to use the popular “knee plot”, which shows the relationship between the log rank vs the log total counts, and then calculate where the “knee” of the plot is. We use the
DropletUtils package to demonstrate this in our example PBMC dataset.
library(DropletUtils) ## Calculate the rank vs total counts per cell br <- barcodeRanks(counts(sce)) ## Create the knee plot plot(log10(br$rank), log10(br$total)) abline(h = log10(metadata(br)$knee))
## Save the calculated knee from `barcodeRanks()` knee <- log10(metadata(br)$knee)
We see that the knee calculated via this method (horizontal line) is at 1740, or on the log scale, 3.2405.
This can be used as a filter to remove cells that are likely to be empty droplets. Before we do that, we will finish calculating other quality control (QC) metrics via the
scater package and show the results from the first three cells.
library(scater) sce <- calculateQCMetrics(sce)
We can display some of the calculated QC metrics appended to the
colData component - there are a number of other columns present, but for brevity will only show two pertinent ones.
colData(sce)[1:3, c("log10_total_features_by_counts", "log10_total_counts")]
## DataFrame with 3 rows and 2 columns ## log10_total_features_by_counts log10_total_counts ## <numeric> <numeric> ## AAACATACAACCAC-1 2.89 3.38 ## AAACATTGAGCTAC-1 3.13 3.69 ## AAACATTGATCAGC-1 3.05 3.5
We can further inspect these cells based on their total counts as well as vs the total features detected by counts (e.g. the number of genes that have nonzero counts).
hist(sce$log10_total_counts, breaks = 100) abline(v = knee)
smoothScatter(sce$log10_total_counts, sce$log10_total_features_by_counts, nbin = 250) abline(v = knee)
While there are various ways to filter cells, here we actually will not need to perform any filtering, as the data has already undergone a stringent quality control, and thus all the cells can be considered high quality.
For the sake of completeness, we will demonstrate here - without evaluating - how to subset based on the previously calculated barcode ranks knee:
## not run sce <- sce[, sce$log10_total_counts > knee]
Next up we will transform the primary data, the counts, into a (log) normalized version. In this section, we will use the
scran package throughout.
First however, we will need to calculate scaling factors per cell. This function relies on an initial “quick and dirty” clustering to get roughly similar pools of cells. These are used to generate pool-based estimates, from which the subsequent cell-based size factors are generated. To learn more about the method, see the
?computeSumFactors documentation. Here we will use the
quickCluster() function to do the initial clustering.
library(scran) ## not run: AMI hangs here ## quick_clusters <- quickCluster(sce, use.ranks = FALSE) ## sce <- computeSumFactors(sce, clusters = quick_clusters)
These (cell) size factors are then used to log-normalize the counts data:
sce <- scater::normalize(sce)
## List of length 2 ## names(2): counts logcounts
If we inspect our
sce, we can see that we now have two assays,
This section will once again feature the
scran package heavily, as we select for informative genes by selecting for those with high coefficients of biological variation.
Since this experiment does not have spike-ins, we will fit the mean-variance trend across the endogenous genes.
fit <- trendVar(sce, use.spikes = FALSE) plot(fit$mean, fit$var) curve(fit$trend(x), col = 'red', lwd = 2, add = TRUE)
We can see that the trend line goes through the central mass of genes, and thus continue on with looking at the decomposed variance. In this method, it is assumed that the total variance is the sum of the technical and biological variance, where the technical variance can be determined by interpolating the fitted trend at the mean log-count for that gene. Thus the biological variance is the total variance minus this interpolated (technical) variance.
We can then rank and choose genes which have a biological coefficient of variance greater than zero.
dec <- decomposeVar(sce, fit) dec <- dec[order(dec$bio, decreasing = TRUE), ] # order by bio var
## mean total bio tech p.value FDR ## LYZ 1.629 3.925 3.116 0.809 0 0 ## S100A9 1.068 3.352 2.628 0.725 0 0 ## HLA-DRA 1.543 3.071 2.278 0.793 0 0 ## FTL 3.637 2.941 2.203 0.738 0 0 ## CD74 2.209 2.696 1.900 0.796 0 0
The total number of genes with biological variance greater than zero as 3770.
Alternatively, we could use the p-value/FDR as a way to rank our genes, but do note the following (from the
“Ranking based on p-value tends to prioritize HVGs that are more likely to be true positives but, at the same time, less likely to be interesting. This is because the ratio can be very large for HVGs that have very low total variance and do not contribute much to the cell-cell heterogeneity.”
However we choose, we can save these highly variable genes and use them for subsequent analyses:
## hvg_genes <- rownames(dec)[1:2000] hvg_genes <- rownames(dec)[dec$bio > 0]
For the purpose of sharing and saving this list of genes, we can stash the result into the
metadata component of our
sce object as follows:
metadata(sce)$hvg_genes <- hvg_genes metadata(sce)$hvg_genes[1:10]
##  "LYZ" "S100A9" "HLA-DRA" "FTL" "CD74" "CST3" "S100A8" ##  "TYROBP" "NKG7" "FTH1"
metadata component can hold any object, as it is a list container. Any results that you’d like to keep are safe to store here, and a great way to save or share intermediate results that would otherwise be kept in separate objects.
We now can perform dimensionality reduction using our highly variable genes (
hvg_genes) subset. To do this, we will first calculate the PCA representation via the
runPCA() function from the
scater package. We will calculate 50 components on our highly variable genes:
sce <- runPCA(sce, ncomponents = 50, feature_set = hvg_genes)
The results of these calculations will be stored in the
reducedDims component. This method saves the percent variance explained per component as an attribute, which can be accessed as follows, and subsequently plot the “elbow plot”:
## access the attribute where percentVar is saved in reducedDim pct_var_explained <- attr(reducedDim(sce, 'PCA'), 'percentVar') plot(pct_var_explained) # elbow plot
To calculate a 2-dimensional representation of the data, we will use the top 20 components of our PCA result to compute the UMAP representation.
sce <- runUMAP(sce, use_dimred = 'PCA', n_dimred = 20) plotUMAP(sce)
With that, we have a canvas on which to paint our downstream analyses.
Downstream Statistical Analyses
There are a plethora of potential downstream analyses to run, the choice of which is highly dependent on the biological objective. For this example dataset, our aim will be to identify the key cell types via a combination of clustering and differential expression.
Based on our earlier UMAP plot, it appears that we have a few distinct clusters. To do this computationally, we can utilize the
scran package to:
- build a shared nearest neighbor (SNN) graph
- calculate based on the SNN graph the most representative clusters
In this first step, we will specify that we will consider
k nearest neighbors, and
d dimensions from the PCA calculation as follows:
set.seed(1234) # to make results reproducible snng <- buildSNNGraph(sce, k = 50, d = 20)
Following the graph construction, we can calculate the clusters using a variety of different graph-based methods from the
igraph package. Here, we use the louvain method to determine our cell’s cluster memberships.
set.seed(1234) snng_clusters <- igraph::cluster_louvain(snng)
We see that we have the following numbers of cells per cluster:
## ## 1 2 3 4 5 ## 687 350 556 528 579
To view this result graphically on the UMAP plot, we first assign the result to the
colData component as a new column, and specify this as our color variable in the
colData(sce)$clusters <- as.factor(snng_clusters$membership) plotUMAP(sce, colour_by = 'clusters')
Naturally, this result will change as we tweak the number of
k neighbors to consider and with the specific clustering algorithm, but for now we will go onwards to find markers of each of our clusters.
In this section, we will look to identify genes that are unique to each of our clusters. To accomplish this, we will lean on the
scran package to perform the analysis, and then the
scater package to visualize the results.
For this analysis, we will limit ourselves to a top subset of highly variable genes in our
hvg_genes set, purely for the sake of computation time. Furthermore, we will limit our consideration to genes with an increased log fold-change of at least 1.5 versus other clusters. We will also use the
BiocParallel package to parallelize the computation and speed up our processing via the
markers <- findMarkers(sce, clusters = colData(sce)$clusters, subset.row = hvg_genes[1:250], lfc = 1.5, direction = 'up', log.p = TRUE, BPPARAM = BiocParallel::MulticoreParam(1))
We can view the top 5 markers that are differentially expressed (by our specified metrics):
## Top log.p.value log.FDR logFC.2 logFC.3 logFC.4 logFC.5 ## CST3 1 -831.4 -825.9 3.446 3.470 3.413 3.480 ## TYROBP 1 -778.7 -773.8 3.325 3.324 2.691 3.301 ## LYZ 2 -526.9 -522.8 4.028 4.011 4.072 4.000 ## FTL 3 -667.3 -662.9 3.110 3.634 3.463 3.534 ## FTH1 4 -474.3 -470.6 2.771 3.118 3.100 2.755
We can see that CD3D, a marker of T cells, is one of our top differentially expressed genes in cluster 1. We can plot the expression of this gene across all our clusters as follows:
plotExpression(sce, 'CD3D', x = 'clusters')