Effectively using the DelayedArray framework as a user to support the analysis of large datasets

BioC 2019: 24-27 June

Workshop description

This workshop gives an introductory overview of the DelayedArray framework, which can be used by Bioconductor packages to support the analysis of large datasets. A DelayedArray is like an ordinary array in R, but allows for the data to be in-memory1, on-disk in a file, or even hosted on a remote server. Participants will learn where they might encounter a DelayedArray in the wild while using Bioconductor and help them understand the fundamental concepts underlying the framework. This workshop will be a mixture of lecture with example code and discussion. Examples will mostly be drawn from the analysis of single-cell RNA-sequencing data.

Instructor

Pre-requisites

  • Basic knowledge of R syntax.
  • Familiarity with common operations on matrices in R, such as colSums() and colMeans().
  • Some familiarity with S4 objects may be helpful but is not required.
  • Some familiarity with single-cell RNA-sequencing may be helpful but is not required.

Workshop Participation

Students will be able to run code interactively during the workshop. There will be opportunities throughout the workshop for questions and discussion.

Time outline

Activity Time
Introductory material 10m
First contact 40m
Package ecosystem 10m
Real world encounters analysing scRNA-seq data 25m
Workflow tips for DelayedArray-backed analyses 15m

Workshop goals and objectives

Learning goals

  • Learn of existing packages and functions that use the DelayedArray framework.
  • Develop a high-level understanding of classes and packages that implement the DelayedArray framework.
  • Become familiar with the fundamental concepts of delayed operations, block-processing, and realization.
  • Reason about potential bottlenecks in algorithms operating on DelayedArray objects.

Learning objectives

  • Identify when an object is a DelayedArray or one of its derivatives.
  • Be able to recognise when it is useful to use a DelayedArray instead of an ordinary array or other array-like data structure.
  • Learn how to load and save a DelayedArray-backed object.
  • Learn how the ‘block size’ and ‘chunking’ of the dataset affect performance when operating on DelayedArray objects.
  • Take away some miscellaneous tips and tricks I’ve learnt over the years when working with DelayedArray-backed objects.

Introductory material

Data from a high-throughput biological assay, such as single-cell RNA-sequencing (scRNA-seq), will often be summarised as a matrix of counts, where rows correspond to features and columns to samples2. Within Bioconductor, the SummarizedExperiment class is the recommended container for such data, offering a rich interface that tightly links assay measurements to data on the features and the samples.

The SummarizedExperiment class is used to store rectangular arrays of experimental results (assays). Here, each assay is drawn as a matrix but higher-dimensional arrays are also supported.

Traditionally, the assay data are stored in-memory as an ordinary array object3. Storing the data in-memory becomes a real pain with the ever-growing size of ’omics datasets. It is now not uncommon to collect 10, 000 − 100, 000, 000 measurements on 100 − 1, 000, 000 samples, which would occupy 10 − 1, 000 gigabytes (Gb) if stored in-memory as ordinary R arrays.

The DelayedArray framework offers a solution to this problem. Wrapping an array-like object (typically an on-disk object) in a DelayedArray object allows one to perform common array operations on it without loading the object in memory. In order to reduce memory usage and optimize performance, operations on the object are either delayed or executed using a block processing mechanism.

Projects enabled by DelayedArray

The DelayedArray framework enables the analysis of datasets that are too large to be stored or processed in-memory. This has become particularly relevant with the advent of large single-cell RNA-sequencing (scRNA-seq) studies containing tens of thousands to millions of cells.

In my own research I have made extensive use of the DelayedArray framework when analysing whole genome bisulfite sequencing (WGBS) datasets. To give a recent example, we profiled 45 human brain samples using WGBS to measure DNA methylation (Rizzardi et al. 2019). For most tissues, we focus on so-called so-called CpG methylation, which requires analysing matrices with roughly 20 million rows (CpG loci) and 45 columns (samples). This sized data is challenging, but well within the realms of high performance computing available at a modern research institute. However, the brain also has extensive non-CpG methylation and there are an order of magnitude more non-CpG loci. This necessitated extensive re-factoring of our software tools and we successfully adopted the DelayedArray framework to enable this research.

In this workshop we will see how the analysis of large scRNA-seq datasets is enabled by the DelayedArray framework.

First contact

The heart of the DelayedArray framework is implemented in the DelayedArray package, which we now load.

library(DelayedArray)
#> Loading required package: stats4
#> Loading required package: matrixStats
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter,
#>     Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
#>     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
#>     setdiff, sort, table, tapply, union, unique, unsplit, which,
#>     which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> Loading required package: BiocParallel
#> 
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#> 
#>     aperm, apply, rowsum

# NOTE: This turns off parallelisation, which we do for didactic purposes.
setAutoBPPARAM(SerialParam())

We will begin with an example using some scRNA-seq data on 1.3 million brain cells from embryonic mice, generated by 10X Genomics. These data are available in the TENxBrainData Bioconductor package.

# NOTE: The TENxBrainData package loads and attaches the HDF5Array package, 
#       amongst others.
library(TENxBrainData)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: GenomicRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
#> Loading required package: HDF5Array
#> Loading required package: rhdf5

# NOTE: This will download the data and may take a little while on the first 
#       run. The result will be cached, however, so subsequent runs avoid 
#       re-downloading the data.
tenx <- TENxBrainData()
#> snapshotDate(): 2019-06-20
#> see ?TENxBrainData and browseVignettes('TENxBrainData') for documentation
#> downloading 0 resources
#> loading from cache 
#>     'EH1042 : 1042'

Let’s take a look at the tenx object.

tenx
#> class: SingleCellExperiment 
#> dim: 27998 1306127 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(2): Ensembl Symbol
#> colnames(1306127): AAACCTGAGATAGGAG-1 AAACCTGAGCGGCTTC-1 ...
#>   TTTGTCAGTTAAAGTG-133 TTTGTCATCTGAAAGA-133
#> colData names(4): Barcode Sequence Library Mouse
#> reducedDimNames(0):
#> spikeNames(0):

The data are stored in a SingleCellExperiment object, an extension of the SummarizedExperiment class. With data from 1.3 million cells, this is roughly 100,000-times more samples than a typical bulk RNA-seq dataset and would occupy 146 GB in-memory if stored as an ordinary matrix.

We might expect that interacting with an object containing this many samples would feel sluggish, but this is not the case. For example, we can efficiently subset the object.

# Take a random sample of 1000 genes and 200000 samples
tenx[sample(nrow(tenx), 1000), sample(ncol(tenx), 200000)]
#> class: SingleCellExperiment 
#> dim: 1000 200000 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(2): Ensembl Symbol
#> colnames(200000): GACGGCTAGGCTATCT-59 CGTTAGACAGCGTCCA-90 ...
#>   AAAGTAGCAGTTCCCT-107 ACATACGTCGTACCGG-48
#> colData names(4): Barcode Sequence Library Mouse
#> reducedDimNames(0):
#> spikeNames(0):

Now, this efficiency could all be down to some trickery through the outer SingleCellExperiment class. To make things clearer, we’ll extract the counts assay.

# NOTE: We'll discuss the use of `withDimnames = FALSE` later in the workshop.
tenx_counts <- assay(tenx, "counts", withDimnames = FALSE)

Now, let’s do something that would ordinarily be a terrible idea, and something that’s frustrated me way too many times: let’s “accidentally” print out the entire counts matrix.

tenx_counts
#> <27998 x 1306127> HDF5Matrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     [1,]          0          0          0          0   .          0
#>     [2,]          0          0          0          0   .          0
#>     [3,]          0          0          0          0   .          0
#>     [4,]          0          0          0          0   .          0
#>     [5,]          0          0          0          0   .          0
#>      ...          .          .          .          .   .          .
#> [27994,]          0          0          0          0   .          0
#> [27995,]          1          0          0          2   .          0
#> [27996,]          0          0          0          0   .          0
#> [27997,]          0          0          0          0   .          0
#> [27998,]          0          0          0          0   .          0
#>          [,1306125] [,1306126] [,1306127]
#>     [1,]          0          0          0
#>     [2,]          0          0          0
#>     [3,]          0          0          0
#>     [4,]          0          0          0
#>     [5,]          0          0          0
#>      ...          .          .          .
#> [27994,]          0          0          0
#> [27995,]          1          0          0
#> [27996,]          0          0          0
#> [27997,]          0          0          0
#> [27998,]          0          0          0

Hallelujah! Unlike what you may have experienced when printing out a large matrix, this didn’t overwhelm the screen with thousands of lines of output nor did it cause the R session to hang indefinitely. In fact, this gives us a rather pretty printing of the counts matrix4. No need for panicked mashing of Ctrl-c or Esc.

We can now clearly see that tenx_counts is no ordinary matrix. In fact, it is an HDF5Matrix, which is a type of DelayedArray5.

class(tenx_counts)
#> [1] "HDF5Matrix"
#> attr(,"package")
#> [1] "HDF5Array"
is(tenx_counts, "DelayedArray")
#> [1] TRUE

The data contained in an HDF5Matrix is actually stored on disk in a Hierarchical Data Format (HDF5) file. Consequently, the tenx_counts object takes up very little space in memory.

print(object.size(tenx_counts), units = "auto")
#> 2.1 Kb

We can learn more about the internals of the tenx_counts object using the seed() function6.

seed(tenx_counts)
#> An object of class "HDF5ArraySeed"
#> Slot "filepath":
#> [1] "/home/rstudio/.cache/ExperimentHub/655d7be6dc93_1040"
#> 
#> Slot "name":
#> [1] "counts"
#> 
#> Slot "dim":
#> [1]   27998 1306127
#> 
#> Slot "first_val":
#> [1] 0
#> 
#> Slot "chunkdim":
#> [1] 100 100

Three examples of computing on a DelayedArray

We will now play around with computing on the counts matrix. To make things slightly easier, we will first subset the data to 1000 samples.

tenx_counts_subset <- tenx_counts[, 1:1000]

Library sizes

Firstly, let’s compute the library sizes for this subset of samples. We can do this using colSums().

lib_sizes <- colSums(tenx_counts_subset)
summary(lib_sizes)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1453    2755    3898    4689    5548   34233

Proportion of cells with non-zero expression for each gene

Secondly, suppose we want to know for each gene the proportion of cells with non-zero expression. We can do this using rowSums() in conjunction with some standard R commands (logical comparisons and division).

prop_non_zero <- rowSums(tenx_counts_subset > 0) /  ncol(tenx_counts_subset)
summary(prop_non_zero)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> 0.00000 0.00000 0.00300 0.07235 0.07400 1.00000

Median expression of each gene

Finally, suppose we want to know the median expression of each gene. Here, we will quantify expression as counts per million (CPM) using library size normalization.

cpm <- t(t(1e6 * tenx_counts_subset) / lib_sizes)
cpm
#> <27998 x 1000> DelayedMatrix object of type "double":
#>              [,1]     [,2]     [,3] ...   [,999]  [,1000]
#>     [1,]        0        0        0   .        0        0
#>     [2,]        0        0        0   .        0        0
#>     [3,]        0        0        0   .        0        0
#>     [4,]        0        0        0   .        0        0
#>     [5,]        0        0        0   .        0        0
#>      ...        .        .        .   .        .        .
#> [27994,]   0.0000   0.0000   0.0000   .   0.0000   0.0000
#> [27995,] 247.1577   0.0000   0.0000   . 277.1619   0.0000
#> [27996,]   0.0000   0.0000   0.0000   .   0.0000   0.0000
#> [27997,]   0.0000   0.0000   0.0000   .   0.0000   0.0000
#> [27998,]   0.0000   0.0000   0.0000   .   0.0000   0.0000

We can then compute the median expression of each gene using DelayedMatrixStats::rowMedians().

library(DelayedMatrixStats)
#> 
#> Attaching package: 'DelayedMatrixStats'
#> The following object is masked from 'package:Biobase':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyMissings, colAnyNAs, colAnys, colAvgsPerRowSet,
#>     colCollapse, colCounts, colCummaxs, colCummins, colCumprods,
#>     colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps,
#>     colMadDiffs, colMads, colMeans2, colMedians, colOrderStats,
#>     colProds, colQuantiles, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyMissings, rowAnyNAs, rowAnys,
#>     rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs,
#>     rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs,
#>     rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMeans2,
#>     rowMedians, rowOrderStats, rowProds, rowQuantiles, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs,
#>     rowVars, rowWeightedMads, rowWeightedMeans,
#>     rowWeightedMedians, rowWeightedSds, rowWeightedVars
median_expression <- rowMedians(cpm)
summary(median_expression)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>     0.00     0.00     0.00    18.73     0.00 22069.23

Summary

These 3 examples highlight the power of the DelayedArray framework. Recall that the data in these examples live on disk in an HDF5 file, yet we interacted with tenx_counts_subset and computed on it much as we would if the data were in-memory as an ordinary matrix. Also note that all 3 examples returned ordinary R vectors.

class(lib_sizes)
#> [1] "numeric"
class(prop_non_zero)
#> [1] "numeric"
class(cpm)
#> [1] "DelayedMatrix"
#> attr(,"package")
#> [1] "DelayedArray"

To do so, we made (implicit) use of the three fundamental concepts of the DelayedArray framework:

  1. Delayed operations
  2. Block-processing
  3. Realization

We’ll now discuss each of these in turn.

Delayed operations

Taking a careful look at tenx_counts_subset, we see that it is a DelayedMatrix rather than an HDF5Matrix.

tenx_counts_subset
#> <27998 x 1000> DelayedMatrix object of type "integer":
#>             [,1]    [,2]    [,3]    [,4] ...  [,997]  [,998]  [,999]
#>     [1,]       0       0       0       0   .       0       0       0
#>     [2,]       0       0       0       0   .       0       0       0
#>     [3,]       0       0       0       0   .       0       0       0
#>     [4,]       0       0       0       0   .       0       0       0
#>     [5,]       0       0       0       0   .       0       0       0
#>      ...       .       .       .       .   .       .       .       .
#> [27994,]       0       0       0       0   .       0       0       0
#> [27995,]       1       0       0       2   .       5       3       2
#> [27996,]       0       0       0       0   .       0       0       0
#> [27997,]       0       0       0       0   .       0       0       0
#> [27998,]       0       0       0       0   .       0       0       0
#>          [,1000]
#>     [1,]       0
#>     [2,]       0
#>     [3,]       0
#>     [4,]       0
#>     [5,]       0
#>      ...       .
#> [27994,]       0
#> [27995,]       0
#> [27996,]       0
#> [27997,]       0
#> [27998,]       0

The subsetting operation has ‘degraded’ the tenx_counts_subset object to a DelayedMatrix.

is(tenx_counts_subset, "HDF5Matrix")
#> [1] FALSE
is(tenx_counts_subset, "DelayedMatrix")
#> [1] TRUE

The showtree() function can help us see what changed when we subsetted the data.

showtree(tenx_counts)
#> 27998x1306127 integer: HDF5Matrix object
#> └─ 27998x1306127 integer: [seed] HDF5ArraySeed object
showtree(tenx_counts_subset)
#> 27998x1000 integer: DelayedMatrix object
#> └─ 27998x1000 integer: Subset
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

The subsetting operation has been registered in what is termed a ‘delayed operation’. Registering a delayed operation does not modify the underlying data. Instead, the operation is recorded and only performed when the DelayedArray object is ‘realized’. Realization of a DelayedArray triggers the execution of the delayed operations carried by the object and returns the result as an ordinary array.

This allows us to chain together multiple operations and only perform them as required. Here is a contrived example.

# Add 1 to every element (a delayed op).
x <- tenx_counts_subset + 1L
showtree(x)
#> 27998x1000 integer: DelayedMatrix object
#> └─ 27998x1000 integer: Unary iso op stack
#>    └─ 27998x1000 integer: Subset
#>       └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

# Compute log of every element (another delayed op).
lx <- log(x)
showtree(lx)
#> 27998x1000 double: DelayedMatrix object
#> └─ 27998x1000 double: Unary iso op stack
#>    └─ 27998x1000 integer: Subset
#>       └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

# Transpose the result (another delayed op).
tlx <- t(lx)
showtree(tlx)
#> 1000x27998 double: DelayedMatrix object
#> └─ 1000x27998 double: Unary iso op stack
#>    └─ 1000x27998 integer: Aperm (perm=c(2,1))
#>       └─ 27998x1000 integer: Subset
#>          └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

# Realize a subset of the data as an ordinary matrix.
as.array(tlx[1:5, 1:10])
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7]     [,8]      [,9] [,10]
#> [1,]    0    0    0    0    0    0    0 0.000000 0.0000000     0
#> [2,]    0    0    0    0    0    0    0 0.000000 0.0000000     0
#> [3,]    0    0    0    0    0    0    0 0.000000 0.6931472     0
#> [4,]    0    0    0    0    0    0    0 0.000000 0.0000000     0
#> [5,]    0    0    0    0    0    0    0 1.098612 0.6931472     0

Many common operations can be registered as delayed operations. Here are some examples7. Notice that in each case the result is ‘degraded’ to a DelayedMatrix8.

DelayedSubset

val <- tenx_counts[, 1:100]
val
#> <27998 x 100> DelayedMatrix object of type "integer":
#>            [,1]   [,2]   [,3]   [,4] ...  [,97]  [,98]  [,99] [,100]
#>     [1,]      0      0      0      0   .      0      0      1      0
#>     [2,]      0      0      0      0   .      0      0      0      0
#>     [3,]      0      0      0      0   .      0      0      0      0
#>     [4,]      0      0      0      0   .      0      0      0      0
#>     [5,]      0      0      0      0   .      0      0      0      0
#>      ...      .      .      .      .   .      .      .      .      .
#> [27994,]      0      0      0      0   .      0      0      0      0
#> [27995,]      1      0      0      2   .      0      1      2      0
#> [27996,]      0      0      0      0   .      0      0      0      0
#> [27997,]      0      0      0      0   .      0      0      0      0
#> [27998,]      0      0      0      0   .      0      0      0      0
showtree(val)
#> 27998x100 integer: DelayedMatrix object
#> └─ 27998x100 integer: Subset
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

DelayedAperm

val <- t(tenx_counts)
val
#> <1306127 x 27998> DelayedMatrix object of type "integer":
#>                [,1]     [,2]     [,3]     [,4] ... [,27995] [,27996]
#>       [1,]        0        0        0        0   .        1        0
#>       [2,]        0        0        0        0   .        0        0
#>       [3,]        0        0        0        0   .        0        0
#>       [4,]        0        0        0        0   .        2        0
#>       [5,]        0        0        0        0   .        2        0
#>        ...        .        .        .        .   .        .        .
#> [1306123,]        0        0        0        0   .        0        2
#> [1306124,]        0        0        0        0   .        0        0
#> [1306125,]        0        0        0        0   .        1        0
#> [1306126,]        0        0        0        0   .        0        0
#> [1306127,]        0        0        0        0   .        0        0
#>            [,27997] [,27998]
#>       [1,]        0        0
#>       [2,]        0        0
#>       [3,]        0        0
#>       [4,]        0        0
#>       [5,]        0        0
#>        ...        .        .
#> [1306123,]        0        0
#> [1306124,]        0        0
#> [1306125,]        0        0
#> [1306126,]        0        0
#> [1306127,]        0        0
showtree(val)
#> 1306127x27998 integer: DelayedMatrix object
#> └─ 1306127x27998 integer: Aperm (perm=c(2,1))
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

DelayedUnaryIsoOp

val <- tenx_counts + 1L
val
#> <27998 x 1306127> DelayedMatrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     [1,]          1          1          1          1   .          1
#>     [2,]          1          1          1          1   .          1
#>     [3,]          1          1          1          1   .          1
#>     [4,]          1          1          1          1   .          1
#>     [5,]          1          1          1          1   .          1
#>      ...          .          .          .          .   .          .
#> [27994,]          1          1          1          1   .          1
#> [27995,]          2          1          1          3   .          1
#> [27996,]          1          1          1          1   .          1
#> [27997,]          1          1          1          1   .          1
#> [27998,]          1          1          1          1   .          1
#>          [,1306125] [,1306126] [,1306127]
#>     [1,]          1          1          1
#>     [2,]          1          1          1
#>     [3,]          1          1          1
#>     [4,]          1          1          1
#>     [5,]          1          1          1
#>      ...          .          .          .
#> [27994,]          1          1          1
#> [27995,]          2          1          1
#> [27996,]          1          1          1
#> [27997,]          1          1          1
#> [27998,]          1          1          1
showtree(val)
#> 27998x1306127 integer: DelayedMatrix object
#> └─ 27998x1306127 integer: Unary iso op stack
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object
val <- tenx_counts + 1:2
val
#> <27998 x 1306127> DelayedMatrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     [1,]          1          1          1          1   .          1
#>     [2,]          2          2          2          2   .          2
#>     [3,]          1          1          1          1   .          1
#>     [4,]          2          2          2          2   .          2
#>     [5,]          1          1          1          1   .          1
#>      ...          .          .          .          .   .          .
#> [27994,]          2          2          2          2   .          2
#> [27995,]          2          1          1          3   .          1
#> [27996,]          2          2          2          2   .          2
#> [27997,]          1          1          1          1   .          1
#> [27998,]          2          2          2          2   .          2
#>          [,1306125] [,1306126] [,1306127]
#>     [1,]          1          1          1
#>     [2,]          2          2          2
#>     [3,]          1          1          1
#>     [4,]          2          2          2
#>     [5,]          1          1          1
#>      ...          .          .          .
#> [27994,]          2          2          2
#> [27995,]          2          1          1
#> [27996,]          2          2          2
#> [27997,]          1          1          1
#> [27998,]          2          2          2
showtree(val)
#> 27998x1306127 integer: DelayedMatrix object
#> └─ 27998x1306127 integer: Unary iso op with args
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

DelayedSubassign

# NOTE: Be careful with delayed subassignment. You can end up with objects that 
#       are surprisingly large in-memory.
tmp <- tenx_counts
tmp[, 1] <- 100
tmp
#> <27998 x 1306127> DelayedMatrix object of type "double":
#>                [,1]       [,2]       [,3] ... [,1306126] [,1306127]
#>     [1,]        100          0          0   .          0          0
#>     [2,]        100          0          0   .          0          0
#>     [3,]        100          0          0   .          0          0
#>     [4,]        100          0          0   .          0          0
#>     [5,]        100          0          0   .          0          0
#>      ...          .          .          .   .          .          .
#> [27994,]        100          0          0   .          0          0
#> [27995,]        100          0          0   .          0          0
#> [27996,]        100          0          0   .          0          0
#> [27997,]        100          0          0   .          0          0
#> [27998,]        100          0          0   .          0          0
showtree(tmp)
#> 27998x1306127 double: DelayedMatrix object
#> └─ 27998x1306127 double: Subassign
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

DelayedDimnames

tmp <- tenx_counts
rownames(tmp) <- paste0("R", seq_len(nrow(tmp)))
tmp
#> <27998 x 1306127> DelayedMatrix object of type "integer":
#>              [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     R1          0          0          0          0   .          0
#>     R2          0          0          0          0   .          0
#>     R3          0          0          0          0   .          0
#>     R4          0          0          0          0   .          0
#>     R5          0          0          0          0   .          0
#>    ...          .          .          .          .   .          .
#> R27994          0          0          0          0   .          0
#> R27995          1          0          0          2   .          0
#> R27996          0          0          0          0   .          0
#> R27997          0          0          0          0   .          0
#> R27998          0          0          0          0   .          0
#>        [,1306125] [,1306126] [,1306127]
#>     R1          0          0          0
#>     R2          0          0          0
#>     R3          0          0          0
#>     R4          0          0          0
#>     R5          0          0          0
#>    ...          .          .          .
#> R27994          0          0          0
#> R27995          1          0          0
#> R27996          0          0          0
#> R27997          0          0          0
#> R27998          0          0          0
showtree(tmp)
#> 27998x1306127 integer: DelayedMatrix object
#> └─ 27998x1306127 integer: Set dimnames
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

DelayedNaryIsoOp

val <- tenx_counts + tenx_counts
val
#> <27998 x 1306127> DelayedMatrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     [1,]          0          0          0          0   .          0
#>     [2,]          0          0          0          0   .          0
#>     [3,]          0          0          0          0   .          0
#>     [4,]          0          0          0          0   .          0
#>     [5,]          0          0          0          0   .          0
#>      ...          .          .          .          .   .          .
#> [27994,]          0          0          0          0   .          0
#> [27995,]          2          0          0          4   .          0
#> [27996,]          0          0          0          0   .          0
#> [27997,]          0          0          0          0   .          0
#> [27998,]          0          0          0          0   .          0
#>          [,1306125] [,1306126] [,1306127]
#>     [1,]          0          0          0
#>     [2,]          0          0          0
#>     [3,]          0          0          0
#>     [4,]          0          0          0
#>     [5,]          0          0          0
#>      ...          .          .          .
#> [27994,]          0          0          0
#> [27995,]          2          0          0
#> [27996,]          0          0          0
#> [27997,]          0          0          0
#> [27998,]          0          0          0
showtree(val)
#> 27998x1306127 integer: DelayedMatrix object
#> └─ 27998x1306127 integer: N-ary iso op
#>    ├─ 27998x1306127 integer: [seed] HDF5ArraySeed object
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

DelayedAbind

val <- cbind(tenx_counts, tenx_counts)
val
#> <27998 x 2612254> DelayedMatrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,2612251]
#>     [1,]          0          0          0          0   .          0
#>     [2,]          0          0          0          0   .          0
#>     [3,]          0          0          0          0   .          0
#>     [4,]          0          0          0          0   .          0
#>     [5,]          0          0          0          0   .          0
#>      ...          .          .          .          .   .          .
#> [27994,]          0          0          0          0   .          0
#> [27995,]          1          0          0          2   .          0
#> [27996,]          0          0          0          0   .          0
#> [27997,]          0          0          0          0   .          0
#> [27998,]          0          0          0          0   .          0
#>          [,2612252] [,2612253] [,2612254]
#>     [1,]          0          0          0
#>     [2,]          0          0          0
#>     [3,]          0          0          0
#>     [4,]          0          0          0
#>     [5,]          0          0          0
#>      ...          .          .          .
#> [27994,]          0          0          0
#> [27995,]          1          0          0
#> [27996,]          0          0          0
#> [27997,]          0          0          0
#> [27998,]          0          0          0
showtree(val)
#> 27998x2612254 integer: DelayedMatrix object
#> └─ 27998x2612254 integer: Abind (along=2)
#>    ├─ 27998x1306127 integer: [seed] HDF5ArraySeed object
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

No-op

The DelayedArray framework is smart enough to recognise that some combinations of operations are ‘no-ops’.

val <- t(t(tenx_counts))
val
#> <27998 x 1306127> HDF5Matrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     [1,]          0          0          0          0   .          0
#>     [2,]          0          0          0          0   .          0
#>     [3,]          0          0          0          0   .          0
#>     [4,]          0          0          0          0   .          0
#>     [5,]          0          0          0          0   .          0
#>      ...          .          .          .          .   .          .
#> [27994,]          0          0          0          0   .          0
#> [27995,]          1          0          0          2   .          0
#> [27996,]          0          0          0          0   .          0
#> [27997,]          0          0          0          0   .          0
#> [27998,]          0          0          0          0   .          0
#>          [,1306125] [,1306126] [,1306127]
#>     [1,]          0          0          0
#>     [2,]          0          0          0
#>     [3,]          0          0          0
#>     [4,]          0          0          0
#>     [5,]          0          0          0
#>      ...          .          .          .
#> [27994,]          0          0          0
#> [27995,]          1          0          0
#> [27996,]          0          0          0
#> [27997,]          0          0          0
#> [27998,]          0          0          0
showtree(val)
#> 27998x1306127 integer: HDF5Matrix object
#> └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

But it can be fooled.

# NOTE: This is a no-op but DelayedArray doesn't recognise it as one.
val <- tenx_counts + 0L
val
#> <27998 x 1306127> DelayedMatrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     [1,]          0          0          0          0   .          0
#>     [2,]          0          0          0          0   .          0
#>     [3,]          0          0          0          0   .          0
#>     [4,]          0          0          0          0   .          0
#>     [5,]          0          0          0          0   .          0
#>      ...          .          .          .          .   .          .
#> [27994,]          0          0          0          0   .          0
#> [27995,]          1          0          0          2   .          0
#> [27996,]          0          0          0          0   .          0
#> [27997,]          0          0          0          0   .          0
#> [27998,]          0          0          0          0   .          0
#>          [,1306125] [,1306126] [,1306127]
#>     [1,]          0          0          0
#>     [2,]          0          0          0
#>     [3,]          0          0          0
#>     [4,]          0          0          0
#>     [5,]          0          0          0
#>      ...          .          .          .
#> [27994,]          0          0          0
#> [27995,]          1          0          0
#> [27996,]          0          0          0
#> [27997,]          0          0          0
#> [27998,]          0          0          0
showtree(val)
#> 27998x1306127 integer: DelayedMatrix object
#> └─ 27998x1306127 integer: Unary iso op stack
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

# NOTE: This also a no-op but DelayedArray doesn't recognise it as one.
val <- cbind(tenx_counts)
val
#> <27998 x 1306127> HDF5Matrix object of type "integer":
#>                [,1]       [,2]       [,3]       [,4] ... [,1306124]
#>     [1,]          0          0          0          0   .          0
#>     [2,]          0          0          0          0   .          0
#>     [3,]          0          0          0          0   .          0
#>     [4,]          0          0          0          0   .          0
#>     [5,]          0          0          0          0   .          0
#>      ...          .          .          .          .   .          .
#> [27994,]          0          0          0          0   .          0
#> [27995,]          1          0          0          2   .          0
#> [27996,]          0          0          0          0   .          0
#> [27997,]          0          0          0          0   .          0
#> [27998,]          0          0          0          0   .          0
#>          [,1306125] [,1306126] [,1306127]
#>     [1,]          0          0          0
#>     [2,]          0          0          0
#>     [3,]          0          0          0
#>     [4,]          0          0          0
#>     [5,]          0          0          0
#>      ...          .          .          .
#> [27994,]          0          0          0
#> [27995,]          1          0          0
#> [27996,]          0          0          0
#> [27997,]          0          0          0
#> [27998,]          0          0          0
showtree(val)
#> 27998x1306127 integer: HDF5Matrix object
#> └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

Block-processing

In Library sizes, we needed to compute the column sums of tenx_counts_subset, whose data live on disk in an HDF5 file. One way of achieving this would be to load the entire dataset into memory as an ordinary matrix and then run base::colSums()9.

lib_sizes_in_mem <- colSums(as.array(tenx_counts_subset))
# Check we get the same result as before.
identical(lib_sizes_in_mem, lib_sizes)
#> [1] TRUE

Here the data at are small enough to load into memory, but what if that’s not the case10? One way you might think to do this is to loop over the columns of the matrix, load that column into memory, and compute it’s sum11

lib_sizes_loop_over_cols <- vector("numeric", ncol(tenx_counts_subset))
for (j in seq_len(ncol(tenx_counts_subset))) {
  # A simple progress report.
  if (j %% 100 == 0) message("Processed ", j, "/", ncol(tenx_counts_subset))
  lib_sizes_loop_over_cols[j] <- colSums(
    as.array(tenx_counts_subset[, j, drop = FALSE]))
}
#> Processed 100/1000
#> Processed 200/1000
#> Processed 300/1000
#> Processed 400/1000
#> Processed 500/1000
#> Processed 600/1000
#> Processed 700/1000
#> Processed 800/1000
#> Processed 900/1000
#> Processed 1000/1000
# Check we get the same result as before.
identical(lib_sizes_loop_over_cols, lib_sizes)
#> [1] TRUE

But what if you can’t load even a single column into memory12? We might loop over the columns of the matrix, partition each column, load each partition, compute its sum, and then compute the sum of the partition sums.

# NOTE: Here we will partition each column into two equal-sized subsets.
lib_sizes_loop_over_cols_subset <- vector("numeric", ncol(tenx_counts_subset))
tmp_colsums <- vector("numeric", 2)
nr <- nrow(tenx_counts_subset)

for (j in seq_len(ncol(tenx_counts_subset))) {
  # A simple progress report.
  if (j %% 100 == 0) message("Processed ", j, "/", ncol(tenx_counts_subset))
  for (i in 1:2) {
    i1 <- (i - 1) * nr / 2 + 1
    i2 <- i * nr / 2
    tmp_colsums[i] <- colSums(
      as.array(tenx_counts_subset[seq(i1, i2), j, drop = FALSE]))
  }
  lib_sizes_loop_over_cols_subset[j] <- sum(tmp_colsums)
}
#> Processed 100/1000
#> Processed 200/1000
#> Processed 300/1000
#> Processed 400/1000
#> Processed 500/1000
#> Processed 600/1000
#> Processed 700/1000
#> Processed 800/1000
#> Processed 900/1000
#> Processed 1000/1000
# Check we get the same result as before.
identical(lib_sizes_loop_over_cols_subset, lib_sizes)
#> [1] TRUE

Hopefully you can now begin to see the general pattern, a strategy which the DelayedArray package calls ‘block-processing’:

  1. Load a ‘block’ of the data into memory.
  2. Compute a summary statistic.
  3. Combine the block-level statistics in an appropriate way to get the final result.

Block-processing illustrated

Some examples of block-processing are illustrated in the following figures:

Block-processed column sums

When we run colSums(tenx_counts_subset) we are using a block-processed version of column sums, specifically the colSums,DelayedMatrix-method implemented in the DelayedArray package. We can see this more clearly by turning on verbose progress reporting from the DelayedArray package.

# Turn on progress reports for DelayedArray's block-processing routines.
DelayedArray:::set_verbose_block_processing(TRUE)
#> [1] FALSE
head(colSums(tenx_counts_subset))
#> Processing block 1/2 ... OK
#> Processing block 2/2 ... OK
#> [1]  4046  2087  4654  3193  8444 11178

More examples of block-processed operations in DelayedArray

Some of the most useful functions in the DelayedArray package implement common operations on a DelayedMatrix using block-processing. These include the following row and column summarization methods:

  • rowSums()
  • colSums()
  • rowMeans()
  • colMeans()
  • rowMaxs()
  • colMaxs()
  • rowMins()
  • colMins()
  • rowRanges()
  • colRanges()

Two useful but lesser known functions use block-processing to compute column/row sums of a DelayedMatrix based on a grouping variable:

  • rowsum()
  • colsum()
# Compute separate library sizes for mitochondrial and non-mitochondrial genes.
is_mito <- grepl("^mt-", rowData(tenx)$Symbol)
summary(is_mito)
#>    Mode   FALSE    TRUE 
#> logical   27985      13
lib_sizes_grouped <- rowsum(tenx_counts_subset, group = is_mito)
#> Processing block [[1/2, 1/1]] ... OK
#> Processing block [[2/2, 1/1]] ... OK
head(t(lib_sizes_grouped))
#>      FALSE TRUE
#> [1,]  3888  158
#> [2,]  2014   73
#> [3,]  4572   82
#> [4,]  3115   78
#> [5,]  8273  171
#> [6,] 10777  401

Finally, matrix multiplication is implemented as a block-processed operation.

# This is mathematically equivalent to rowSums(tenx_counts_subset).
tenx_counts_subset %*% matrix(1, nrow = ncol(tenx_counts_subset))
#> <27998 x 1> DelayedMatrix object of type "double":
#>          [,1]
#>     [1,]   18
#>     [2,]    0
#>     [3,]    0
#>     [4,]    0
#>     [5,]    0
#>      ...    .
#> [27994,]    0
#> [27995,] 1122
#> [27996,]   93
#> [27997,]    0
#> [27998,]    1

DelayedMatrixStats

We’ve already seen the DelayedMatrixStats package in action back when computing the Median expression of each gene. DelayedMatrixStats is a port of the matrixStats package’s API for use with DelayedMatrix objects. It provides more than 70 functions that apply to rows and columns of DelayedMatrix objects

General block-processing

As we have seen, many common block-processing operations on a DelayedMatrix have already been implemented in DelayedArray or are provided by DelayedMatrixStats. Nonetheless, there may be times you need to implement your own algorithm using block-processing. The documentation on this topic is a little sparse, but some details can be found in help("block_processing", "DelayedArray") and help("ArrayGrid", "DelayedArray") or by reading the source code of the aforementioned packages. Briefly, to perform block-processing requires that you:

  1. Set up an ArrayGrid over your DelayedArray. This specifies the ‘block’ structure that will be traversed when processing the DelayedArray.
  2. Iterate over the DelayedArray via the ArrayGrid
    1. Read each block of data into memory as an ordinary array using read_block().
    2. Compute the statistic for that block
  3. Appropriately combine the block-level statistics to get your final result.

The blockApply() and blockReduce() functions can help facilitate steps 1-3 , even incorporating parallelization via the BiocParallel package.

Realization

To realize a DelayedArray object is to trigger execution of the delayed operations carried by the object and return the result as an ordinary array, One way to achieve this is to call as.array() on it.

tenx_counts_subset_realized <- as.array(tenx_counts_subset)
tenx_counts_subset_realized[1:10, 1:10]
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#>  [1,]    0    0    0    0    0    0    0    0    0     0
#>  [2,]    0    0    0    0    0    0    0    0    0     0
#>  [3,]    0    0    0    0    0    0    0    0    0     0
#>  [4,]    0    0    0    0    0    0    0    0    0     0
#>  [5,]    0    0    0    0    0    0    0    0    0     0
#>  [6,]    0    0    0    0    0    0    0    0    0     0
#>  [7,]    0    0    0    0    0    0    0    0    0     0
#>  [8,]    0    0    0    0    2    0    2    0    1     1
#>  [9,]    0    0    1    0    1    0    0    0    0     0
#> [10,]    0    0    0    0    0    0    0    0    0     0

However, this realizes the full object at once in memory which could require too much memory if the object is big13.

A large DelayedArray object is preferably realized on disk, which is most commonly an HDF5 file.

Realizing to an HDF5 file

Realizing to an HDF5 file requires that the HDF5Array package is installed14

tenx_counts_subset_hdf5 <- writeHDF5Array(tenx_counts_subset)
#> Realizing block 1/2 ... OK, writing it ... OK
#> Realizing block 2/2 ... OK, writing it ... OK
tenx_counts_subset_hdf5
#> <27998 x 1000> HDF5Matrix object of type "integer":
#>             [,1]    [,2]    [,3]    [,4] ...  [,997]  [,998]  [,999]
#>     [1,]       0       0       0       0   .       0       0       0
#>     [2,]       0       0       0       0   .       0       0       0
#>     [3,]       0       0       0       0   .       0       0       0
#>     [4,]       0       0       0       0   .       0       0       0
#>     [5,]       0       0       0       0   .       0       0       0
#>      ...       .       .       .       .   .       .       .       .
#> [27994,]       0       0       0       0   .       0       0       0
#> [27995,]       1       0       0       2   .       5       3       2
#> [27996,]       0       0       0       0   .       0       0       0
#> [27997,]       0       0       0       0   .       0       0       0
#> [27998,]       0       0       0       0   .       0       0       0
#>          [,1000]
#>     [1,]       0
#>     [2,]       0
#>     [3,]       0
#>     [4,]       0
#>     [5,]       0
#>      ...       .
#> [27994,]       0
#> [27995,]       0
#> [27996,]       0
#> [27997,]       0
#> [27998,]       0

# Alternatively, we could 'coerce' the result to be an HDF5Array.
as(tenx_counts_subset, "HDF5Array")
#> Realizing block 1/2 ... OK, writing it ... OK
#> Realizing block 2/2 ... OK, writing it ... OK
#> <27998 x 1000> HDF5Matrix object of type "integer":
#>             [,1]    [,2]    [,3]    [,4] ...  [,997]  [,998]  [,999]
#>     [1,]       0       0       0       0   .       0       0       0
#>     [2,]       0       0       0       0   .       0       0       0
#>     [3,]       0       0       0       0   .       0       0       0
#>     [4,]       0       0       0       0   .       0       0       0
#>     [5,]       0       0       0       0   .       0       0       0
#>      ...       .       .       .       .   .       .       .       .
#> [27994,]       0       0       0       0   .       0       0       0
#> [27995,]       1       0       0       2   .       5       3       2
#> [27996,]       0       0       0       0   .       0       0       0
#> [27997,]       0       0       0       0   .       0       0       0
#> [27998,]       0       0       0       0   .       0       0       0
#>          [,1000]
#>     [1,]       0
#>     [2,]       0
#>     [3,]       0
#>     [4,]       0
#>     [5,]       0
#>      ...       .
#> [27994,]       0
#> [27995,]       0
#> [27996,]       0
#> [27997,]       0
#> [27998,]       0

Notice that this process of realization used block-processing, which avoids loading the entire dataset entire memory. Also notice that the result of realization is here returned as an HDF5Matrix, which no longer carries around the delayed operations.

class(tenx_counts_subset)
#> [1] "DelayedMatrix"
#> attr(,"package")
#> [1] "DelayedArray"
class(tenx_counts_subset_hdf5)
#> [1] "HDF5Matrix"
#> attr(,"package")
#> [1] "HDF5Array"

showtree(tenx_counts_subset)
#> 27998x1000 integer: DelayedMatrix object
#> └─ 27998x1000 integer: Subset
#>    └─ 27998x1306127 integer: [seed] HDF5ArraySeed object
showtree(tenx_counts_subset_hdf5)
#> 27998x1000 integer: HDF5Matrix object
#> └─ 27998x1000 integer: [seed] HDF5ArraySeed object

Used like this, writeHDF5Array() and as(..., "HDF5Array") will write their results to a file in HDF5 dump directory, a dumping ground for automatically created HDF5 datasets. We can see a log of the operations that have written to the HDF5 dump directory using showHDF5DumpLog().

showHDF5DumpLog()
#> [2019-06-22 01:52:37] #1 In file '/tmp/RtmpsWpKRJ/HDF5Array_dump/auto00001.h5': creation of dataset '/HDF5ArrayAUTO00001' (27998x1000:integer, chunkdims=5291x188, level=6)
#> [2019-06-22 01:52:42] #2 In file '/tmp/RtmpsWpKRJ/HDF5Array_dump/auto00002.h5': creation of dataset '/HDF5ArrayAUTO00002' (27998x1000:integer, chunkdims=5291x188, level=6)

Often, however, we will want full control of where and how the data are written to the HDF5 file15 and the writeHDF5Array() function gives you full control over this and more.

# Write the data to a user-specified HDF5 file using maximum compression and 
# 'chunking' along the columns.
my_hdf5_file <- tempfile(fileext = ".h5")
tenx_counts_subset_my_file_hdf5 <- writeHDF5Array(
  tenx_counts_subset,
  filepath = my_hdf5_file,
  chunkdim = c(nrow(tenx_counts_subset), 1),
  level = 9)
#> Realizing block 1/2 ... OK, writing it ... OK
#> Realizing block 2/2 ... OK, writing it ... OK

# Compare `tenx_counts_subset_hdf5` to `tenx_counts_subset_my_file_hdf5`
seed(tenx_counts_subset_hdf5)
#> An object of class "HDF5ArraySeed"
#> Slot "filepath":
#> [1] "/tmp/RtmpsWpKRJ/HDF5Array_dump/auto00001.h5"
#> 
#> Slot "name":
#> [1] "/HDF5ArrayAUTO00001"
#> 
#> Slot "dim":
#> [1] 27998  1000
#> 
#> Slot "first_val":
#> [1] 0
#> 
#> Slot "chunkdim":
#> [1] 5291  188
seed(tenx_counts_subset_my_file_hdf5)
#> An object of class "HDF5ArraySeed"
#> Slot "filepath":
#> [1] "/tmp/RtmpsWpKRJ/file91da21ac87b3.h5"
#> 
#> Slot "name":
#> [1] "/HDF5ArrayAUTO00003"
#> 
#> Slot "dim":
#> [1] 27998  1000
#> 
#> Slot "first_val":
#> [1] 0
#> 
#> Slot "chunkdim":
#> [1] 27998     1

Realization backends

We’ve now seen that we can realize to an HDF5 file. This is called as the HDF5Array ‘realization backend’ and is implemented in the HDF5Array package. There are a few other realization backends to be aware of.

supportedRealizationBackends()
#>        BACKEND      package  
#> 1     RleArray DelayedArray  
#> 2    HDF5Array    HDF5Array  
#> 3   TENxMatrix    HDF5Array

There is also the NULL backend, which means the data are realized in memory as an ordinary array and then wrapped in a DelayedArray. This is the default realization backend upon loading/attaching the DelayedArray package.

getRealizationBackend()
#> NULL

The default realization backend can be altered with setRealizationBackend().

setRealizationBackend("HDF5Array")
getRealizationBackend()
#> [1] "HDF5Array"
setRealizationBackend(NULL)
getRealizationBackend()
#> NULL

It can be important to know what your current realization backend is because it will be used implicitly by some functions. For example, matrix multiplication that involves a DelayedMatrix uses the current realization backend.

setRealizationBackend("HDF5Array")
tenx_counts_subset %*% matrix(1, nrow = ncol(tenx_counts_subset))
#> Realizing block 1/1 ... OK, writing it ... OK
#> <27998 x 1> HDF5Matrix object of type "double":
#>          [,1]
#>     [1,]   18
#>     [2,]    0
#>     [3,]    0
#>     [4,]    0
#>     [5,]    0
#>      ...    .
#> [27994,]    0
#> [27995,] 1122
#> [27996,]   93
#> [27997,]    0
#> [27998,]    1

setRealizationBackend(NULL)
tenx_counts_subset %*% matrix(1, nrow = ncol(tenx_counts_subset))
#> <27998 x 1> DelayedMatrix object of type "double":
#>          [,1]
#>     [1,]   18
#>     [2,]    0
#>     [3,]    0
#>     [4,]    0
#>     [5,]    0
#>      ...    .
#> [27994,]    0
#> [27995,] 1122
#> [27996,]   93
#> [27997,]    0
#> [27998,]    1

The realize() function

We’ve seen that we can realize to the HDF5Array backend using writeHDF5Array(tenx_counts_subset) and as(tenx_counts_subset, "HDF5Array"). A third way of realizing a DelayedArray to an HDF5 file is with the realize() function.

realize(tenx_counts_subset, BACKEND = "HDF5Array")
#> Realizing block 1/2 ... OK, writing it ... OK
#> Realizing block 2/2 ... OK, writing it ... OK
#> <27998 x 1000> HDF5Matrix object of type "integer":
#>             [,1]    [,2]    [,3]    [,4] ...  [,997]  [,998]  [,999]
#>     [1,]       0       0       0       0   .       0       0       0
#>     [2,]       0       0       0       0   .       0       0       0
#>     [3,]       0       0       0       0   .       0       0       0
#>     [4,]       0       0       0       0   .       0       0       0
#>     [5,]       0       0       0       0   .       0       0       0
#>      ...       .       .       .       .   .       .       .       .
#> [27994,]       0       0       0       0   .       0       0       0
#> [27995,]       1       0       0       2   .       5       3       2
#> [27996,]       0       0       0       0   .       0       0       0
#> [27997,]       0       0       0       0   .       0       0       0
#> [27998,]       0       0       0       0   .       0       0       0
#>          [,1000]
#>     [1,]       0
#>     [2,]       0
#>     [3,]       0
#>     [4,]       0
#>     [5,]       0
#>      ...       .
#> [27994,]       0
#> [27995,]       0
#> [27996,]       0
#> [27997,]       0
#> [27998,]       0

So why might you use realize() instead of these other options? Because it allows us to easily switch out the realization backend and will defer to the current realization backend if none is supplied.

# Realize to the current realization backend.
getRealizationBackend()
#> NULL
realize(tenx_counts_subset)
#> <27998 x 1000> DelayedMatrix object of type "integer":
#>             [,1]    [,2]    [,3]    [,4] ...  [,997]  [,998]  [,999]
#>     [1,]       0       0       0       0   .       0       0       0
#>     [2,]       0       0       0       0   .       0       0       0
#>     [3,]       0       0       0       0   .       0       0       0
#>     [4,]       0       0       0       0   .       0       0       0
#>     [5,]       0       0       0       0   .       0       0       0
#>      ...       .       .       .       .   .       .       .       .
#> [27994,]       0       0       0       0   .       0       0       0
#> [27995,]       1       0       0       2   .       5       3       2
#> [27996,]       0       0       0       0   .       0       0       0
#> [27997,]       0       0       0       0   .       0       0       0
#> [27998,]       0       0       0       0   .       0       0       0
#>          [,1000]
#>     [1,]       0
#>     [2,]       0
#>     [3,]       0
#>     [4,]       0
#>     [5,]       0
#>      ...       .
#> [27994,]       0
#> [27995,]       0
#> [27996,]       0
#> [27997,]       0
#> [27998,]       0

# Realize as an RleArray.
setRealizationBackend("RleArray")
tenx_counts_subset_rlearray <- realize(tenx_counts_subset)
#> Realizing block 1/2 ... OK, writing it ... OK
#> Realizing block 2/2 ... OK, writing it ... OK
tenx_counts_subset_rlearray
#> <27998 x 1000> RleMatrix object of type "integer":
#>             [,1]    [,2]    [,3]    [,4] ...  [,997]  [,998]  [,999]
#>     [1,]       0       0       0       0   .       0       0       0
#>     [2,]       0       0       0       0   .       0       0       0
#>     [3,]       0       0       0       0   .       0       0       0
#>     [4,]       0       0       0       0   .       0       0       0
#>     [5,]       0       0       0       0   .       0       0       0
#>      ...       .       .       .       .   .       .       .       .
#> [27994,]       0       0       0       0   .       0       0       0
#> [27995,]       1       0       0       2   .       5       3       2
#> [27996,]       0       0       0       0   .       0       0       0
#> [27997,]       0       0       0       0   .       0       0       0
#> [27998,]       0       0       0       0   .       0       0       0
#>          [,1000]
#>     [1,]       0
#>     [2,]       0
#>     [3,]       0
#>     [4,]       0
#>     [5,]       0
#>      ...       .
#> [27994,]       0
#> [27995,]       0
#> [27996,]       0
#> [27997,]       0
#> [27998,]       0

# Switch back to the default backend.
setRealizationBackend(NULL)

This is probably most useful when writing package code so that you can allow the user control over the realization backend.

Package ecosystem

The DelayedArray framework is, unsurprisingly, implemented in the DelayedArray package. However, there are several other key packages that are an important part of the broader ‘ecosystem’. More importantly, as a user of Bioconductor software, it is increasingly likely that you will encounter DelayedArray objects during a data analysis, especially if you are analysing single-cell data16. The following table lists packages that depend upon the DelayedArray package.

dep_tbl <- BiocPkgTools::buildPkgDependencyDataFrame()
da_dep_tbl <- dep_tbl[dep_tbl$dependency == "DelayedArray", 
                      c("Package", "edgetype")]
da_dep_tbl <- da_dep_tbl[with(da_dep_tbl, order(edgetype, Package)), ]
colnames(da_dep_tbl) <- c("Package", "Dependency Type")
rmarkdown::paged_table(da_dep_tbl)

We will briefly highlight some of the key packages in this table, broadly categorising these as ‘user-focused’/‘user-facing’ or ‘developer-focused’ packages and those that span the spectrum.

Packages that both users and developers should probably know about

DelayedArray

Well, duh. Implements the DelayedArray and RleArray classes, along with all the fundamentals the enable the delayed operations, block processing, and realization that underpin the DelayedArray framework.

HDF5Array

Implements the HDF5Array and TENxMatrix classes, two convenient and memory-efficient array-like containers for on-disk representation of HDF5 datasets. HDF5Array is for datasets that use the conventional (i.e. dense) HDF5 representation. TENxMatrix is for datasets that use the HDF5-based sparse matrix representation from 10x Genomics.

VCFArray and GDSArray

Implements the VCFArray and GDSArray classes, types of DelayedArray, to represent VCF files and GDS-files in an array-like representation. VCF and GDS files are widely used to represent genotyping or sequence data.

rhdf5client and restfulSE

Provide functions and classes to interface with remote data stores by operating on SummarizedExperiment-like objects. These data are HDF5 files living on a remote server running h5serv, a REST-based service for HDF5 data.

User-focused/user-facing packages

These are the packages that as a user you might directly load/attach with library() as part of a data analysis. Alternatively, these may be loaded/attached as a dependency17 of another package you load/attach as part of an analysis.

DropletUtils

Provides a number of utility functions for handling single-cell (RNA-seq) data from droplet technologies such as 10X Genomics. This includes read10xCounts() for data loading from the count matrices produced by 10x Genomics’ CellRanger software, which may be stored in an HDF5 file. To do this, it makes use of the TENxMatrix class.

LoomExperiment

Provides a means to convert from ‘loom’ files to standard Bioconductor classes and back again. The Loom file format uses HDF5 to store experimental data and is used by some tools and labs producing data using single-cell assays. This includes the import() function for data loading from loom files into an HDF5Matrix.

scater

A collection of tools for doing various analyses of single-cell RNA-seq gene expression data, with a focus on quality control and visualization. These methods work on DelayedMatrix objects as well as ordinary matrix objects and some sparse matrix objects from the Matrix package.

batchelor

Implements a variety of methods for batch correction of single-cell (RNA sequencing) data, such asmultiBatchPCA() and fastMNN(). These methods work on DelayedMatrix objects as well as ordinary matrix objects and some sparse matrix objects from the Matrix package.

BiocSingular

Implements exact and approximate methods for singular value decomposition and principal components analysis using a framework that allows them to be easily switched within Bioconductor packages or workflows. These methods work on DelayedMatrix objects as well as ordinary matrix objects and some sparse matrix objects from the Matrix package.

mbkmeans

Implements the mini-batch k-means algorithm for large datasets, including support for on-disk data representation (i.e. HDF5Matrix objects).

bsseq

A collection of tools for analyzing and visualizing bisulfite sequencing data. This was one of the first packages to make use of the DelayedArray framework and it supports these throughout the package. This was needed in order to store and analyse large non-CpG methylation datasets (> 300 million loci, hundreds of samples) using HDF5 files. Disclaimer: I did this re-write of bsseq and learnt a lot along the way.

minfi

Tools to analyze & visualize Illumina Infinium methylation arrays. This doesn’t have the same level of support for DelayedMatrix objects as bsseq, but perhaps one day. This is needed in order to store and analyse large methylation datasets (> 850,000 loci, tens of thousands of) using HDF5 files. Disclaimer: This was the second package, after bsseq, I started to re-write to support the DelayedArray framework. Here, it is rather more difficult because it is a ‘widely’ used package and has code from lots of different authors with different styles.

DelayedMatrixStats

A port of the matrixStats API for use with DelayedMatrix objects. High-performing functions operating on rows and columns of DelayedMatrix objects, e.g. col / rowMedians(), col / rowRanks(), and col / rowSds(). Functions optimized per data type and for subsetted calculations such that both memory usage and processing time is minimized. Disclaimer: I wrote this one, too.

Developer-focused packages

beachmat

Provides a consistent C++ class interface for reading from and writing data to a variety of commonly used matrix types. Ordinary matrices and several sparse/dense Matrix classes are directly supported, third-party S4 classes may be supported by external linkage (such as the HDF5Matrix class), while all other matrices are handled by DelayedArray block processing.

Real world encounter with DelayedArray analysing scRNA-seq data

We will perform a brief analysis of a peripheral blood mononuclear cell (PBMC) scRNA-seq dataset from 10X Genomics (Zheng et al. 2017) to illustrate a real world analysis that makes use of the DelayedArray framework18.

This dataset contains more than 4000 cells, which isn’t that large in the grand scheme of things, but does allow us to demonstrate some important concepts in a workshop format. It has been pre-packaged and is available from the TENxPBMCData package as a SingleCellExperiment object with a HDF5Matrix in the counts assay. In order to efficiently analyse this data, we would like to use software that supports the HDF5Matrix class via the DelayedArray framework. Thanks to the tireless efforts of Aaron Lun, this is true of many of the key Bioconductor packages for analysing scRNA-seq data: DropletUtils, scater, scran, and BiocSingular.

Loading the data

A typical analysis of a 10x Genomics dataset would start by using the DropletUtils::read10xCounts() function to load the output of 10x Genomics’ CellRanger software into R. One of the CellRanger outputs is an HDF5 file, which DropletUtils::read10xCounts() can read and return as a TENxMatrix, an HDF5-backed DelayedMatrix. For this workshop, we’ll use one we prepared earlier and simply load the ‘filtered’ dataset from the TENxPBMCData package which stores the count data in an HDF5Matrix.

library(TENxPBMCData)
# NOTE: This will download the data and may take a little while on the first 
#       run. The result will be cached, however, so subsequent runs avoid 
#       re-downloading the data.
sce <- TENxPBMCData("pbmc4k")
#> snapshotDate(): 2019-06-20
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> downloading 0 resources
#> loading from cache 
#>     'EH1613 : 1613'
sce
#> class: SingleCellExperiment 
#> dim: 33694 4340 
#> metadata(0):
#> assays(1): counts
#> rownames(33694): ENSG00000243485 ENSG00000237613 ...
#>   ENSG00000277475 ENSG00000268674
#> rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
#> colnames: NULL
#> colData names(11): Sample Barcode ... Individual Date_published
#> reducedDimNames(0):
#> spikeNames(0):
# NOTE: The counts data are stored in an HDF5Matrix.
class(assay(sce, withDimnames = FALSE))
#> [1] "HDF5Matrix"
#> attr(,"package")
#> [1] "HDF5Array"

We’ll also switch to use the gene symbols provided by 10x Genomics rather than referring to genes by their Ensembl IDs.

library(scater)
#> Loading required package: ggplot2
rownames(sce) <- uniquifyFeatureNames(
  rowData(sce)$ENSEMBL_ID,
  rowData(sce)$Symbol_TENx)

Quality control on the cells

After loading the data into R, we compute some quality control metrics of the data using scater::calculateQCMetrics(). Under the hood, this involves computing column sums, comparisons of values, and subsetting of the HDF5-backed count matrix, all seamlessly handled by this function.

# NOTE: Need to know which genes are mitochondrial to perform QC.
library(EnsDb.Hsapiens.v86)
#> Loading required package: ensembldb
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: AnnotationFilter
#> 
#> Attaching package: 'ensembldb'
#> The following object is masked from 'package:stats':
#> 
#>     filter
location <- mapIds(
  EnsDb.Hsapiens.v86,
  keys = rowData(sce)$ENSEMBL_ID, 
  column = "SEQNAME", 
  keytype = "GENEID")
is_mito <- location == "MT"

library(scater)
sce <- calculateQCMetrics(sce, feature_controls = list(Mito = which(is_mito)))
multiplot(
  plotColData(sce, "log10_total_counts") + 
    ylab("Log-total UMI count"),
  plotColData(sce, "total_features_by_counts") + 
    ylab("Log-total number of expressed features"),
  plotColData(sce, "pct_counts_Mito") +
    ylab("Proportion of reads in mitochondrial genes"),
  cols = 2)