Analysis of Large Single-Cell RNA-Seq Datasets in R/Bioconductor

Introduction

This workshop illustrates the latest Bioconductor infrastructure to analyze large single-cell datasets that may not entirely fit in memory.

We focus on the most common application in exploratory single-cell analysis, namely to find subpopulations of cells.

The proposed workflow consists of the following steps:

  1. Normalization;
  2. Dimensionality reduction;
  3. Clustering.

We will exploit a number of Bioconductor packages able to interact with the HDF5 on-disk data representation, freeing us of the need to load the full dataset in memory.

Note that this workshop is a high-level introduction to the analysis of large datasets from a user perspective. The focus is on packages that the user may directly use to analyze such data. The focus is NOT the direct interaction with infrastructural packages (such as the DelayedArray package) and is not aimed at developers wishing to build upon this infrastructure. If you are interested in these topics, see the workshop “Effectively using the DelayedArray framework to support the analysis of large data sets.”

Interacting with HDF5 files in R/Bioconductor

Here, we provide a brief introduction on the HDF5 infrastrucure in R/Bioconductor. See “Effectively using the DelayedArray framework to support the analysis of large data sets” for a more technical treatment of the topic.

At the low-level, the main interface between HDF5 and Bioconductor is implemented in the packages rhdf5, which provides read/write functionalities, Rhdf5lib, which provides C and C++ HDF5 libraries, and beachmat (Lun, Pagès, and Smith 2018), which provides a consistent C++ class interface for a variety of commonly used matrix types, including sparse and HDF5-backed matrices.

These packages are useful for developers that want to develop methods able to interact with HDF5 datasets. However, for most Bioconductor users interested in the analysis of single-cell data, the entry point is represented by the high-level class SingleCellExperiment (implemented in the SingleCellExperiment package ) and the lower level classes HDF5Matrix and DelayedMatrix, which can be stored in the assay slot of a SingleCellExperiment object. These matrix classes are implemented in the HDF5Array and DelayedArray packages, respectively. Once the data are stored in a SingleCellExperiment object with HDF5Matrix or DelayedMatrix as its assay, the packages scater (McCarthy et al. 2017), scran (Lun, Bach, and Marioni 2016), BiocSingular and mbkmeans can be seamlessly used.

The package DelayedMatrixStats deserves a special mention: it implements the rich API of the CRAN package matrixStats for HDF5Matrix and DelayedMatrix objects.

We invite the reader to find more details on all the mentioned packages in their relative vignettes. In the remainder of this workshop, we will use these methods to find cell sub-populations in a real datasets.

The SingleCellExperiment class

Add description of the class, with figure.

The dataset

To illustrate the analysis of large datasets, we use the largest of the Human Cell Atlas preview datasets available in the HCAData Bioconductor package.

The Human Cell Atlas is an international collaborative effort to map every cell type present in the human body (Regev et al. 2017). We focus on a set on the Census of Immune Cells dataset, specifically on the bone marrow subset (see https://preview.data.humancellatlas.org for additional details).

The dataset consists of 378,000 immune cells from the bone marrow of 8 healthy donors, processed with the 10X Genomics Chromium v2 platform at the Broad Institute. For each donor, 8 libraries were prepared, and 6000 cells were sequenced for each library.

Preprocessing

The data available in the package has already been preprocessed and the package contains a matrix of raw counts (in HDF5 format) of the top 6000 cell barcodes ranked by total UMI per barcode.

The raw counts were generated by Cell Ranger with GRCh38, standard 10X reference.

Note that:

  • For each channel approximately 7000 cells were loaded, and the authors expect about 4000 non-empty droplets.
  • Filtering of low quality barcodes is recommended.

More information on the preprocessing is available on the HCA website.

Loading the data in R

We use the HCAData package, which uses ExperimentHub to retrieve the data. Note that this is a time consuming step, because it needs to download the remotely hosted HDF5 matrix. DR: CONSIDER HOSTING SOMEWHERE JUST THE SUBSET?

start_time <- Sys.time()
library(HCAData)
library(ExperimentHub)
library(SingleCellExperiment)

eh <- ExperimentHub()
query(eh, "HCAData")
## ExperimentHub with 6 records
## # snapshotDate(): 2019-06-20 
## # $dataprovider: Human Cell Atlas
## # $species: Homo sapiens
## # $rdataclass: character
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH2047"]]' 
## 
##            title                                                         
##   EH2047 | Human Cell Atlas - Census of Immune Cells, Bone marrow, 'de...
##   EH2048 | Human Cell Atlas - Census of Immune Cells, Bone marrow, sam...
##   EH2049 | Human Cell Atlas - Census of Immune Cells, Bone marrow, gen...
##   EH2050 | Human Cell Atlas - Census of Immune Cells, Umbilical cord b...
##   EH2051 | Human Cell Atlas - Census of Immune Cells, Umbilical cord b...
##   EH2052 | Human Cell Atlas - Census of Immune Cells, Umbilical cord b...
sce <- HCAData("ica_bone_marrow")
sce
## class: SingleCellExperiment 
## dim: 33694 378000 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames(378000): MantonBM1_HiSeq_1-AAACCTGAGCAGGTCA-1
##   MantonBM1_HiSeq_1-AAACCTGCACACTGCG-1 ...
##   MantonBM8_HiSeq_8-TTTGTCATCTGCCAGG-1
##   MantonBM8_HiSeq_8-TTTGTCATCTTGAGAC-1
## colData names(1): Barcode
## reducedDimNames(0):
## spikeNames(0):

The object sce of class SingleCellExperiment contains the data in the assay counts, as well as a set of row and column data, accessible with the rowData() and colData() functions, respectively.

The assay counts is a DelayedMatrix with HDF5 backend that contains the gene expression data. It can be retrieve with either the counts() or assay() functions.

counts(sce)
## <33694 x 378000> DelayedMatrix object of type "integer":
##                 MantonBM1_HiSeq_1-AAACCTGAGCAGGTCA-1 ...
## ENSG00000243485                                    0   .
## ENSG00000237613                                    0   .
## ENSG00000186092                                    0   .
## ENSG00000238009                                    0   .
## ENSG00000239945                                    0   .
##             ...                                    .   .
## ENSG00000277856                                    0   .
## ENSG00000275063                                    0   .
## ENSG00000271254                                    0   .
## ENSG00000277475                                    0   .
## ENSG00000268674                                    0   .
##                 MantonBM8_HiSeq_8-TTTGTCATCTTGAGAC-1
## ENSG00000243485                                    0
## ENSG00000237613                                    0
## ENSG00000186092                                    0
## ENSG00000238009                                    0
## ENSG00000239945                                    0
##             ...                                    .
## ENSG00000277856                                    0
## ENSG00000275063                                    0
## ENSG00000271254                                    0
## ENSG00000277475                                    0
## ENSG00000268674                                    0

To check how where data is stored, we can use the seed() function, which gives valuable information on the geometry of the HDF5 file.

seed(counts(sce))
## An object of class "HDF5ArraySeed"
## Slot "filepath":
## [1] "/home/rstudio/.cache/ExperimentHub/325c13401453_2047"
## 
## Slot "name":
## [1] "counts"
## 
## Slot "dim":
## [1]  33694 378000
## 
## Slot "first_val":
## [1] 0
## 
## Slot "chunkdim":
## [1]  184 2060

Library information

Unfortunately, the column data does not include any metadata on the libraries, although some is available at the HCA website.

We can extract information on the donors and libraries by parsing the barcodes.

library(stringr)
cell_info <- str_split(colData(sce)$Barcode, "_", simplify = TRUE)
donor <- cell_info[,1]
lib <- paste0(donor, "_", str_split(cell_info[,3], "-", simplify = TRUE)[,1])
colData(sce)$donor <- donor
colData(sce)$lib <- lib

We add donor and library information to the column data. This will be useful for subsetting.

Subsetting

To work with a small enough dataset, we will randomly subsample the data to have only 50 cells per library. This will allow us to run the workshop in a reasonable time, while preserving the batch structure typical of a large-scale experiment.

Note that it is sufficient to add eval=FALSE to the next code chunk to run the workflow on the full dataset.

library(dplyr)

set.seed(50)

colData(sce) %>%
  as.data.frame() %>%
  group_by(lib) %>%
  sample_n(50) %>% 
  pull(Barcode) -> cell_idx

sce <- sce[, cell_idx]
sce
## class: SingleCellExperiment 
## dim: 33694 3150 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames(3150): MantonBM1_HiSeq_1-TGCCCTATCGGCGCTA-1
##   MantonBM1_HiSeq_1-GGGAATGAGTAACCCT-1 ...
##   MantonBM8_HiSeq_8-GGACATTTCCTACAGA-1
##   MantonBM8_HiSeq_8-TCAATCTTCGTTGCCT-1
## colData names(3): Barcode donor lib
## reducedDimNames(0):
## spikeNames(0):
counts(sce)
## <33694 x 3150> DelayedMatrix object of type "integer":
##                 MantonBM1_HiSeq_1-TGCCCTATCGGCGCTA-1 ...
## ENSG00000243485                                    0   .
## ENSG00000237613                                    0   .
## ENSG00000186092                                    0   .
## ENSG00000238009                                    0   .
## ENSG00000239945                                    0   .
##             ...                                    .   .
## ENSG00000277856                                    0   .
## ENSG00000275063                                    0   .
## ENSG00000271254                                    0   .
## ENSG00000277475                                    0   .
## ENSG00000268674                                    0   .
##                 MantonBM8_HiSeq_8-TCAATCTTCGTTGCCT-1
## ENSG00000243485                                    0
## ENSG00000237613                                    0
## ENSG00000186092                                    0
## ENSG00000238009                                    0
## ENSG00000239945                                    0
##             ...                                    .
## ENSG00000277856                                    0
## ENSG00000275063                                    0
## ENSG00000271254                                    0
## ENSG00000277475                                    0
## ENSG00000268674                                    0

Note that because we have not actually made any computations, the original HDF5 file is untouched.

seed(counts(sce))
## An object of class "HDF5ArraySeed"
## Slot "filepath":
## [1] "/home/rstudio/.cache/ExperimentHub/325c13401453_2047"
## 
## Slot "name":
## [1] "counts"
## 
## Slot "dim":
## [1]  33694 378000
## 
## Slot "first_val":
## [1] 0
## 
## Slot "chunkdim":
## [1]  184 2060

To avoid reading from this huge file, it is best to save the subset on a new, smaller HDF5 file. This can be done with the realize() function.

counts(sce) <- realize(counts(sce), BACKEND = "HDF5Array")
seed(counts(sce))
## An object of class "HDF5ArraySeed"
## Slot "filepath":
## [1] "/tmp/RtmpkXNUiN/HDF5Array_dump/auto00001.h5"
## 
## Slot "name":
## [1] "/HDF5ArrayAUTO00001"
## 
## Slot "dim":
## [1] 33694  3150
## 
## Slot "first_val":
## [1] 0
## 
## Slot "chunkdim":
## [1] 3270  305

This is a time consuming step, but will make the subsequent calls much faster.

Parallel computing

For the workshop, we run the workflow in serial mode. When running the workflow on large datasets, we recommend running the workflow in parallel, by using the BiocParallel package. See the BiocParallel vignette for more information on parallel computing.

library(BiocParallel)
register(SerialParam())
register(MulticoreParam(6)) ## remove this!

Quality control

Filtering of low-quality samples

First, we use the scater package to compute a set of QC measures and filter out the low-quality samples.

library(scater)

sce <- calculateQCMetrics(sce, 
          feature_controls=list(Mito=grep("^MT", rowData(sce)$Symbol)))

We remove cells with low UMI counts and high proportion of mitocondrial reads, using it as a proxy for cell damage.

high_mito <- isOutlier(sce$pct_counts_Mito, nmads=3, type="higher")
low_counts <- sce$total_counts < 100
table((!high_mito) & (!low_counts))
## 
## FALSE  TRUE 
##   412  2738
sce <- sce[, (!high_mito) & (!low_counts)]
sce
## class: SingleCellExperiment 
## dim: 33694 2738 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(10): ID Symbol ... total_counts log10_total_counts
## colnames(2738): MantonBM1_HiSeq_1-TGCCCTATCGGCGCTA-1
##   MantonBM1_HiSeq_1-GGGAATGAGTAACCCT-1 ...
##   MantonBM8_HiSeq_8-CTAGCCTGTTTACTCT-1
##   MantonBM8_HiSeq_8-GGACATTTCCTACAGA-1
## colData names(39): Barcode donor ...
##   pct_counts_in_top_200_features_Mito
##   pct_counts_in_top_500_features_Mito
## reducedDimNames(0):
## spikeNames(0):

Removal of lowly expressed genes

In addition, we keep only those genes that have at least 1 UMI in at least 5% of the data. These threshold are dataset-specific and may need to be taylored to specific applications.

num_reads <- 1
num_cells <- 0.05*ncol(sce)
keep <- which(DelayedArray::rowSums(counts(sce) >= num_reads ) >= num_cells)
sce <- sce[keep,]
sce
## class: SingleCellExperiment 
## dim: 4442 2738 
## metadata(0):
## assays(1): counts
## rownames(4442): ENSG00000279457 ENSG00000188976 ...
##   ENSG00000198695 ENSG00000198727
## rowData names(10): ID Symbol ... total_counts log10_total_counts
## colnames(2738): MantonBM1_HiSeq_1-TGCCCTATCGGCGCTA-1
##   MantonBM1_HiSeq_1-GGGAATGAGTAACCCT-1 ...
##   MantonBM8_HiSeq_8-CTAGCCTGTTTACTCT-1
##   MantonBM8_HiSeq_8-GGACATTTCCTACAGA-1
## colData names(39): Barcode donor ...
##   pct_counts_in_top_200_features_Mito
##   pct_counts_in_top_500_features_Mito
## reducedDimNames(0):
## spikeNames(0):

Again, it is worth realize()-ing the matrix to speed up calculations.

counts(sce) <- realize(counts(sce), BACKEND = "HDF5Array")
seed(counts(sce))
## An object of class "HDF5ArraySeed"
## Slot "filepath":
## [1] "/tmp/RtmpkXNUiN/HDF5Array_dump/auto00002.h5"
## 
## Slot "name":
## [1] "/HDF5ArrayAUTO00002"
## 
## Slot "dim":
## [1] 4442 2738
## 
## Slot "first_val":
## [1] 0
## 
## Slot "chunkdim":
## [1] 1273  785

Normalization

Normalization is a crucial step in the preprocessing of the results. Here, we use the scran package to compute size factors that we will use to compute the normalized log-expression values.

It has been shown that the scran method works best if the size factors are computed within roughly homogeneous cell populations; hence, it is beneficial to run a quick clustering on the raw data to compute better size factors. This ensures that we do not pool cells that are very different. Note that this is not the final clustering to identify cell sub-populations.

In order to carry out this clustering, we use a mini-batch k-means algorithm, implemented in the mbkmeans package. We will talk more about the details of this method in the Clustering section.

DR: CONSIDER NORMALIZING EACH BATCH INDEPENDENTLY AND THEN MERGING THEM WITH MNN.

library(scran)
library(mbkmeans)
set.seed(1000)

clusters <- mbkmeans(counts(sce), clusters=10, batch_size = 100)
table(clusters$Clusters)
## 
##   1   2   3   4   5   6   7   8   9  10 
##   3   8  12 430 667   6 799 715  64  34
sce <- computeSumFactors(sce, min.mean=0.1, cluster=clusters$Clusters)

It is useful to check whether the size factors are correlated with the total number of reads per cell.

plot(sce$total_counts, sizeFactors(sce), log="xy", xlab="Total reads", ylab="scran size factors")

Finally, we compute normalized log-expression values with the normalize() function from the scater package.

sce <- scater::normalize(sce)

Note that the log-normalized data are stored in the logcounts assay of the object. Since the counts assay is a DelayedMatrix and we have only one set of size factors in the object, also the normalized data are stored as a DelayedMatrix.

logcounts(sce)
## <4442 x 2738> DelayedMatrix object of type "double":
##                 MantonBM1_HiSeq_1-TGCCCTATCGGCGCTA-1 ...
## ENSG00000279457                                    0   .
## ENSG00000188976                                    0   .
## ENSG00000187608                                    0   .
## ENSG00000078808                                    0   .
## ENSG00000160087                                    0   .
##             ...                                    .   .
## ENSG00000212907                             0.000000   .
## ENSG00000198886                             3.996768   .
## ENSG00000198786                             2.687402   .
## ENSG00000198695                             0.000000   .
## ENSG00000198727                             3.395429   .
##                 MantonBM8_HiSeq_8-GGACATTTCCTACAGA-1
## ENSG00000279457                            0.0000000
## ENSG00000188976                            0.2639181
## ENSG00000187608                            0.0000000
## ENSG00000078808                            0.0000000
## ENSG00000160087                            0.0000000
##             ...                                    .
## ENSG00000212907                            0.4869413
## ENSG00000198886                            4.0753686
## ENSG00000198786                            1.7692736
## ENSG00000198695                            0.0000000
## ENSG00000198727                            3.7273595

This allows us to store in memory only the colData and rowData, resulting in a fairly small object.

library(pryr)
object_size(sce)
## 2.71 MB

Dimensionality reduction

library(BiocSingular)
library(DelayedMatrixStats)

## find most variable genes
vars <- DelayedMatrixStats::rowVars(logcounts(sce))
names(vars) <- rownames(sce)
vars <- sort(vars, decreasing = TRUE)

for_pca <- t(logcounts(sce)[names(vars)[1:1000],])

pca <- BiocSingular::runPCA(for_pca, rank = 30,
                             scale = TRUE,
                             BSPARAM = RandomParam(deferred = FALSE))

reducedDim(sce, "PCA") <- pca$x
sce
## class: SingleCellExperiment 
## dim: 4442 2738 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(4442): ENSG00000279457 ENSG00000188976 ...
##   ENSG00000198695 ENSG00000198727
## rowData names(10): ID Symbol ... total_counts log10_total_counts
## colnames(2738): MantonBM1_HiSeq_1-TGCCCTATCGGCGCTA-1
##   MantonBM1_HiSeq_1-GGGAATGAGTAACCCT-1 ...
##   MantonBM8_HiSeq_8-CTAGCCTGTTTACTCT-1
##   MantonBM8_HiSeq_8-GGACATTTCCTACAGA-1
## colData names(39): Barcode donor ...
##   pct_counts_in_top_200_features_Mito
##   pct_counts_in_top_500_features_Mito
## reducedDimNames(1): PCA
## spikeNames(0):
plotPCA(sce, colour_by = "donor")

Clustering with Mini-batch k-means

Choose the number of clusters

wcss <- lapply(10:20, function(k) {
  cl <- mbkmeans(sce, reduceMethod = "PCA", clusters = k,
                 batch_size = 100, num_init=10, max_iters=100,
                 calc_wcss = TRUE)
})
plot(10:20, sapply(wcss, function(x) sum(x$WCSS_per_cluster)),
     type="b",
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

clusters2 <- mbkmeans(sce, reduceMethod = "PCA", clusters=16, 
                                  batch_size=500,
                                  num_init = 10, max_iters = 100)

sce$cluster <- as.factor(clusters2$Clusters)

Visualization

sce <- runTSNE(sce, use_dimred = "PCA",
               external_neighbors=TRUE, 
               BNPARAM = BiocNeighbors::AnnoyParam(),
               nthreads = 1)

plotTSNE(sce, colour_by = "cluster")