Bioconductor 5xx:
Analysis of multi-sample
multi-group scRNA-seq data

Introduction

What is DS analysis?

A fundamental task in the analysis of single-cell RNA-sequencing (scRNA-seq) data is the identification of systematic transcriptional changes(Stegle, Teichmann, and Marioni, n.d.). Such analyses are a critical step in the understanding of molecular responses, and have applications in development, in perturbation studies or in disease.
Most of the current scRNA-seq differential expression (DE) analysis methods are designed to test one set of cells against another (or more generally, multiple sets together), and can be used to compare cell clusters (e.g., for identifying marker genes) or across conditions (cells from one condition versus another) (Soneson and Robinson 2018). In such statistical models, the cells are the experimental units and thus represent the population that inferences will extrapolate to.

Using established terminology, we refer to cell identity as the combination of cell type, a stable molecular signature, and cell state, a transient snapshot of a cell’s molecular events (Wagner, Regev, and Yosef 2016; Trapnell, n.d.). This classification is inherently arbitrary, but still provides a basis for biological interpretation and a framework for discovering interesting expression patterns from scRNA-seq datasets. For example, T cells could be defined as a single (albeit diverse) cell type or could be divided into discrete subtypes, if relevant information to categorize each cell at this level were available. In either case, the framework presented here would be able to focus on the cell type of interest and look for changes (in expression) across samples.
Given the emergence of multi-sample multi-group scRNA-seq datasets, the goal becomes making sample-level inferences (i.e., experimental units are samples). Thus, differential state (DS) analysis is defined as following a given cell type across a set of samples (e.g., individuals) and experimental conditions (e.g., treatments), in order to identify cell-type-specific responses, i.e., changes in cell state. DS analysis: i) should be able to detect diluted changes that only affect a single cell type, a subset of cell types or even a subset of a single subpopulation; and, ii) is intended to be orthogonal to clustering or cell type assignment.

Starting point

The starting point for a DS analysis is a (sparse) matrix of gene expression, either as counts or some kind of normalized data, where rows = genes and columns = cells. Each cell additionally has a cluster (subpopulation) label as well as a sample label; metadata should accompany the list of samples, such that they can be organized into comparable groups with sample-level replicates (e.g., via a design matrix).

The approach presented here is modular and thus the type label could originate from an earlier step in the analysis, such as clustering (Duó, Robinson, and Soneson 2018; Freytag et al. 2018), perhaps after integration (Butler et al. 2018; Stuart et al. 2018) or after labeling of clusters (Diaz-Mejia et al. 2019) or after cell-level type assignment (Zhang et al. 2019). Although we have pipelines for these various entry points, the specific details and suitability of all these pre-processing steps needs careful exploration and are beyond the scope of this workflow.

Getting started

Data description

To illustrate a standard DS analysis workflow, we will use Kang et al. (2018)’s droplet-based scRNA-seq data of PBMCs cells from 8 lupus patients measured before and after 6h-treatment with INF-β (16 samples in total). The original dataset is deposited under Gene Expression Ombnibus (GEO) accession GSE96583, and has been made available via Bioconductor’s ExperimentHub web resource as a SingleCellExperiment (SCE) object that contains unfiltered raw counts, and any gene and cell metadata available from the original data source.

Loading the data

We first initialize a Hub instance to search for and load available data with the ExperimentHub function, and store the complete list of records in the variable eh. Using query, we then retrieve any records that match our keyword(s) of interest, as well as their accession IDs (EH1234).

library(ExperimentHub)
eh <- ExperimentHub()
query(eh, "Kang")
## ExperimentHub with 3 records
## # snapshotDate(): 2019-06-20 
## # $dataprovider: NCI_GDC, GEO
## # $species: Homo sapiens
## # $rdataclass: BSseq, SingleCellExperiment, character
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH1661"]]' 
## 
##            title                                               
##   EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples
##   EH2259 | Kang18_8vs8

Finally, we load the data of interest into R via [[ and the corresponding accession ID:

(sce <- eh[["EH2259"]])
## class: SingleCellExperiment 
## dim: 35635 29065 
## metadata(0):
## assays(1): counts
## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
## rowData names(2): ENSEMBL SYMBOL
## colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ...
##   TTTGCATGGTTTGG-1 TTTGCATGTCTTAC-1
## colData names(5): ind stim cluster cell multiplets
## reducedDimNames(1): TSNE
## spikeNames(0):

For data handling and visualization throughout this workflow, we will require the following packages:

# data handling
library(dplyr)
library(magrittr)
library(Matrix)
library(purrr)
library(reshape2)
library(S4Vectors)
library(tibble)

# visualzation
library(ComplexHeatmap)
library(ggplot2)
library(pheatmap)
library(scales)

Data preparation

Before proceeding with basic preprocessing and filtering steps, we drop non-singlet cells as well as cells that have not been assigned a cluster ID:

sce <- sce[, sce$multiplets == "singlet" & !is.na(sce$cell)]
dim(sce)
## [1] 35635 24673

For simplicity, we will further retain only cell-metadata columns that are relevant to our analysis (and use intuitive names for these):

colData(sce) %>% 
    as.data.frame %>% 
    transmute(
        group_id = stim, 
        patient_id = ind,
        sample_id = paste0(stim, ind),
        cluster_id = cell) %>%
    mutate_all(as.factor) %>% 
    set_rownames(colnames(sce)) %>% 
    DataFrame -> colData(sce)
head(colData(sce))
## DataFrame with 6 rows and 4 columns
##                  group_id patient_id sample_id      cluster_id
##                  <factor>   <factor>  <factor>        <factor>
## AAACATACATTTCC-1     ctrl       1016  ctrl1016 CD14+ Monocytes
## AAACATACCAGAAA-1     ctrl       1256  ctrl1256 CD14+ Monocytes
## AAACATACCATGCA-1     ctrl       1488  ctrl1488     CD4 T cells
## AAACATACCTCGCT-1     ctrl       1256  ctrl1256 CD14+ Monocytes
## AAACATACCTGGTA-1     ctrl       1039  ctrl1039 Dendritic cells
## AAACATACGATGAA-1     ctrl       1488  ctrl1488     CD4 T cells

For consistency and easy accession throughout this workflow, we will store cluster and sample IDs, as well as the number of clusters and samples:

nk <- length(kids <- set_names(levels(sce$cluster_id)))
ns <- length(sids <- set_names(levels(sce$sample_id)))

Finally, we compile a table that summarizes the experimental design:

m <- match(sids, sce$sample_id)
n_cells <- as.numeric(table(sce$sample_id))
(ei <- data.frame(colData(sce)[m, ], 
    n_cells, row.names = NULL) %>% 
    select(-"cluster_id"))
##    group_id patient_id sample_id n_cells
## 1      ctrl        101   ctrl101     858
## 2      ctrl       1015  ctrl1015    2738
## 3      ctrl       1016  ctrl1016    1723
## 4      ctrl       1039  ctrl1039     461
## 5      ctrl        107   ctrl107     517
## 6      ctrl       1244  ctrl1244    1879
## 7      ctrl       1256  ctrl1256    2108
## 8      ctrl       1488  ctrl1488    2031
## 9      stim        101   stim101    1166
## 10     stim       1015  stim1015    2352
## 11     stim       1016  stim1016    1635
## 12     stim       1039  stim1039     641
## 13     stim        107   stim107     525
## 14     stim       1244  stim1244    1464
## 15     stim       1256  stim1256    2026
## 16     stim       1488  stim1488    2549

Preprocessing

The scater package (McCarthy et al. 2017) provides a variey of tools for preprocessing and quality control of single-cell transcriptomic data. For completeness, we will apply some minimal filtering steps to

  • remove undetected genes
  • remove cells with very few or many detected genes
  • remove very lowly expressed genes
  • compute normalized expression values for visualization

For more thorough preprocessing, we refer to the Quality control with scater vignette.

# remove undetected genes
sce[rowSums(counts(sce) > 0) > 0, ]
## class: SingleCellExperiment 
## dim: 18464 24673 
## metadata(0):
## assays(1): counts
## rownames(18464): RP11-34P13.8 AL627309.1 ... S100B PRMT2
## rowData names(2): ENSEMBL SYMBOL
## colnames(24673): AAACATACATTTCC-1 AAACATACCAGAAA-1 ...
##   TTTGCATGGGACGA-1 TTTGCATGTCTTAC-1
## colData names(4): group_id patient_id sample_id cluster_id
## reducedDimNames(1): TSNE
## spikeNames(0):
dim(sce)
## [1] 35635 24673

We use calculateQCMetrics to compute various quality control metrics for each cell and gene, stored in the colData and rowData, respectively, and proceed with filtering cells and genes as noted above:

library(scater)

# calculate quality control (QC) metrics
sce <- calculateQCMetrics(sce)

# get cells w/ few/many detected genes
sce$is_outlier <- isOutlier(
    metric = sce$total_features_by_counts,
    nmads = 2, type = "both", log = TRUE)

# remove outlier cells
sce <- sce[, !sce$is_outlier]
dim(sce)
## [1] 35635 22221
# remove lowly expressed genes & normalize
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
dim(sce)
## [1]  6413 22221

Finally, we use normalize to calculate log2-transformed normalized expression values by dividing each count by its size factor, adding a pseudo-count of 1, and log-transforming1.

sizeFactors(sce) <- librarySizeFactors(sce)
sce <- normalize(sce)
assayNames(sce)
## [1] "counts"    "logcounts"
range(logcounts(sce))
## [1] 0.000000 9.728775

Alternatively, expression values could be obtained via vst (variance stabilizing transformation) from the sctransform package (Hafemeister and Satija 2019), which returns Pearson residuals from a regularized negative binomial regression model that can be interpreted as normalized expression values.

DS analysis

Aggregation of single-cell to pseudo-bulk data

In order to leverage existing robust bulk RNA-seq DE frameworks, such as DESeq2, edgeR and limma, we first aggregate measurements for each cluster at the sample level to obtain pseudobulk data. While, in principle, various combinations of input data (raw/(log-)normalized counts, CPM ect.) and summary statistics (sum, mean, median) could be applied, we here default to the sum of raw counts.

Aggregation of single-cell to pseudo-bulk data. Cell-level measurements for each subpopulation (cluster) and sample are aggregated to obtain pseudo-bulk data.

Figure 1: Aggregation of single-cell to pseudo-bulk data. Cell-level measurements for each subpopulation (cluster) and sample are aggregated to obtain pseudo-bulk data.

For aggregation, we use Matrix.utils’s aggregate.Matrix function with the colData columns "cluster_id" and "sample_id" as groupings. Note that aggregate.Matrix initially yields a matrix of dimensions #(cluster-sample-instances) × #(genes), which we re-split and transform to obtain, for each cluster, and matrix of dimensions #(genes) × #(samples):

library(Matrix.utils)

system.time({
    # aggregate by cluster-sample
    groups <- colData(sce)[, c("cluster_id", "sample_id")]
    pb <- aggregate.Matrix(t(counts(sce)), 
        groupings = groups, fun = "sum") 

    # split by cluster, transform & rename columns
    pb <- split.data.frame(pb, rep(kids, ns)) %>% 
        lapply(function(u) set_colnames(t(u), unname(sids)))
})
##    user  system elapsed 
##   0.823   0.096   0.918

From the code snippet above, we obtain a list of length nk, where each element contains a dgCMatrix of pseudobulk counts with rows = genes and columns = samples. For a more elegant show-method and easier data accession, we construct a SCE where each assay sheet corresponds to one cluster:

# construct SCE of pseudo-bulk counts
# (assays = clusters, rows = genes, columns = samples)
(pb <- SingleCellExperiment(assays = pb))
## class: SingleCellExperiment 
## dim: 6413 16 
## metadata(0):
## assays(8): B cells CD14+ Monocytes ... Megakaryocytes NK cells
## rownames(6413): NOC2L HES4 ... S100B PRMT2
## rowData names(0):
## colnames(16): ctrl101 ctrl1015 ... stim1256 stim1488
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

The approach used here for aggregation presents but one of many ways, and was selected due to its simplicity and efficiency. Alternatively, one could, for example, split the character vector of cells (colnames(sce)) by cluster-sample, and use rowSums(counts(sce[, ...])) for aggregation. However, this approach is comparatively more verbose and less efficient2:

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:reshape2':
## 
##     dcast, melt
## The following object is masked from 'package:purrr':
## 
##     transpose
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:SummarizedExperiment':
## 
##     shift
## The following object is masked from 'package:GenomicRanges':
## 
##     shift
## The following object is masked from 'package:IRanges':
## 
##     shift
## The following objects are masked from 'package:S4Vectors':
## 
##     first, second
system.time({
    # split cell by cluster-sample
    cs_by_ks <- as.data.frame(colData(sce)) %>% 
        rownames_to_column("cells") %>% setDT %>% 
        split(flatten = FALSE, sorted = TRUE,
            by = c("cluster_id", "sample_id")) %>% 
        map_depth(2, "cells")
    
    # for ea. cluster-sample..
    pb2 <- map_depth(cs_by_ks, 2, function(cs)
        rowSums(counts(sce[, cs]))) %>% # ..compute pseudobulks
        map(data.frame) # column-bind samples
})
##    user  system elapsed 
##   9.006   0.062   8.838

Pseudobulk-level MDS plot

Prior to conducting any formal testing, we can compute a multi-dimensional scaling (MDS) plot of aggregated signal to explore overall sample similarities. Ideally, such a represenation of the data should separate both clusters and groups from one another. Vice versa, samples from the same cluster/group should fall close to each other.
In our MDS plot on pseudobulk counts (Fig. 2), we can appreciate that the horizontal dimension (MDS dim. 1) clearly separates cell-populations (clusters), while control and stimulated samples (groups) are separated vertically (MDS dim. 2).

library(edgeR)

# compute MDS coordinates
mds <- as.list(assays(pb)) %>% 
    lapply(as.data.frame.matrix) %>% 
    bind_cols %>% 
    DGEList(remove.zeros = TRUE) %>% 
    calcNormFactors %>% 
    plotMDS.DGEList(plot = FALSE)
    
# prep. data.frame for plotting
gg_df <- data.frame(mds[c("x", "y")],
    cluster_id = rep(kids, each = ns),
    sample_id = rep(sids, nk),
    group_id = ei$group_id[match(sids, ei$sample_id)])

ggplot(gg_df, aes(x, y, col = cluster_id, shape = group_id)) + 
    geom_point(size = 3, alpha = 0.8) +
    labs(x = "MDS dim. 1", y = "MDS dim. 2") + 
    theme(panel.grid.minor = element_blank()) +
    coord_fixed() + theme_bw()
Pseudobulk-level MDS plot. Each point represents one cluster-sample instance; points are colored by cluster ID and shaped by group ID.

Figure 2: Pseudobulk-level MDS plot. Each point represents one cluster-sample instance; points are colored by cluster ID and shaped by group ID.

Cluster-sample cell-counts

While DE analysis is typically used for comparison between cell-types, and may struggle with rare subpopulations, DS analysis compares cluster-sample instances that are likely to be much smaller. Thus, DS analysis may only be applicable to more prominent populations. It is therefore recommended to check cluster-sample cell-counts, and to possibly exclude small instances from downstream analyses. In our example, we might consider, for instance, omitting DS analysis on the “Megakaryocytes” and “Dendritic cells” clusters, as these contain less than 30 cells across almost all samples.

options(width = 100)
table(sce$cluster_id, sce$sample_id)
##                    
##                     ctrl101 ctrl1015 ctrl1016 ctrl1039 ctrl107 ctrl1244 ctrl1256 ctrl1488 stim101
##   B cells               101      424      119       25      43      109      205      203     130
##   CD14+ Monocytes       136      644      315       86     176      314      292      251     166
##   CD4 T cells           288      819      413      177     155     1041      973     1151     375
##   CD8 T cells            72      174      532       26      20       56      126       55     104
##   Dendritic cells         5        6        6        3       1       17        5        9      13
##   FCGR3A+ Monocytes      50      177       85       14      25       28       36       75      91
##   Megakaryocytes          9       16        7        4       4       12       14       18       7
##   NK cells               69      174      127       17      40      104      257      116     102
##                    
##                     stim1015 stim1016 stim1039 stim107 stim1244 stim1256 stim1488
##   B cells                319      114       35      50       86      196      259
##   CD14+ Monocytes        550      282      111     146      244      307      309
##   CD4 T cells            714      352      265     176      835      934     1449
##   CD8 T cells            128      485       29      15       30      109       39
##   Dendritic cells          9        6        3       6       15       13       21
##   FCGR3A+ Monocytes      159       93       22      26       22       51      113
##   Megakaryocytes          17       10       10       3        9       15       26
##   NK cells               191      194       23      46      113      231      167

Testing for DS

To calculate differential tests, we require a matrix describing the experimental design, and a contrast matrix specifying the comparison of interest, i.e., the combination of model parameters assumed to equal zero under the null hypothesis.

Design and contrast matrices of appropriate format can be created using the model.matrix (stats package) and makeContrast function (limma package), respectively. For clarity, we also set the dimension names of the design matrix (though this is not required by edgeR).

library(limma)

# construct design & contrast matrix
(design <- model.matrix(~ 0 + ei$group_id) %>% 
    set_rownames(ei$sample_id) %>% 
    set_colnames(levels(ei$group_id)))
##          ctrl stim
## ctrl101     1    0
## ctrl1015    1    0
## ctrl1016    1    0
## ctrl1039    1    0
## ctrl107     1    0
## ctrl1244    1    0
## ctrl1256    1    0
## ctrl1488    1    0
## stim101     0    1
## stim1015    0    1
## stim1016    0    1
## stim1039    0    1
## stim107     0    1
## stim1244    0    1
## stim1256    0    1
## stim1488    0    1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`ei$group_id`
## [1] "contr.treatment"
(contrast <- makeContrasts("stim-ctrl", levels = design))
##       Contrasts
## Levels stim-ctrl
##   ctrl        -1
##   stim         1
# for ea. cluster, run edgeR w/ default parameters
res <- lapply(kids, function(k) {
    y <- assays(pb)[[k]]
    y <- DGEList(y, remove.zeros = TRUE)
    y <- calcNormFactors(y)
    y <- estimateDisp(y, design)
    fit <- glmQLFit(y, design)
    fit <- glmQLFTest(fit, contrast = contrast)
    topTags(fit, n = Inf, sort.by = "none")$table %>% 
        dplyr::mutate(gene = rownames(.), cluster_id = k) %>% 
        dplyr::rename(p_val = PValue, p_adj = FDR)
})

Results filtering & overview

To get a general overview of the differential testing results, we first filter them to retain hits with FDR < 5% and  | logFC |  > 1, and count the number of differential findings by cluster.

# filter FDR < 0.05, |logFC| > 1 & sort by FDR
res_fil <- lapply(res, 
    function(u)  u %>% 
        dplyr::filter(p_adj < 0.05, abs(logFC) > 1) %>% 
        dplyr::arrange(p_adj))

# nb. & % of DE genes per cluster
n_de <- vapply(res_fil, nrow, numeric(1))
cbind(n_de, p_gs = n_de / nrow(sce) * 100)
##                   n_de      p_gs
## B cells            591  9.215656
## CD14+ Monocytes   1730 26.976454
## CD4 T cells        440  6.861063
## CD8 T cells        241  3.757992
## Dendritic cells    196  3.056292
## FCGR3A+ Monocytes  854 13.316700
## Megakaryocytes     117  1.824419
## NK cells           373  5.816311

Between-cluster concordance

DS analysis aims at identifying cell-type-specific changes (in expression) across conditions. In this setting, key questions of interest arise, e.g., which genes are DE in only a single (or very few) clusters? How many DE genes are shared between clusters? In summary, what is the general concordance in differential findings between clusters?

To gain an impression of the between-cluster (dis-)agreement on DE genes, we generate an upset-plot that visualizas the number of DE genes that are shared across or unique to certain clusters:

library(UpSetR)
upset(fromList(map(res_fil, "gene")))

From the upset-plot above we can deduce that, for instance,  > 150 of genes are DE across all clusters, about half of CD14+ Monocytes DE genes are unique to that cluster, and that the two Monocytes clusters (FCGR3A+ and CD4+) share the largest set of DE genes ( ≈ 350), ect..

Visualization

Dimension reduction

One of the most popular plots for representing single-cell data are t-SNE plots, where each cell is represented in a lower, usually two-dimensional, space computed using t-stochastic neighbor embedding (t-SNE).

Dimensionality reductions available within our SCE can be accessed via reducedDims from the scater package, and visualized using plotReducedDim. For our dataset, the t-SNE colored by cluster IDs (Fig. 3; left) shows that cell-populations are well-separated from one another. INF-β stimulation manifests as a severe shift in the t-SNE projection of cells (Fig. 3; right), indicating widespread, genome-scale transcriptiontal changes.

# t-SNE colored by cluster ID
plotReducedDim(sce, use_dimred = "TSNE", 
    colour_by = "cluster_id", point_size = 0.8, point_alpha = 0.4) + 
    guides(fill = guide_legend(override.aes = list(alpha = 1, size = 5)))

# t-SNE colored by group ID
plotReducedDim(sce, use_dimred = "TSNE", 
    colour_by = "group_id", point_size = 0.8, point_alpha = 0.4) + 
    guides(fill = guide_legend(override.aes = list(alpha = 1, size = 5)))