Public data resources and Bioconductor

Instructor names and contact information

  • Levi Waldron <levi.waldron at> (City University of New York, New York, NY, USA)
  • Benjamin Haibe-Kains <benjamin.haibe.kains at> (Princess Margaret Cancer Center, Toronto, Canada)
  • Sean Davis <sdavis2 at> (Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA)


Workshop Description

The goal of this workshop is to introduce Bioconductor packages for finding, accessing, and using large-scale public data resources including the Gene Expression Omnibus GEO, the NCI Genomic Data Commons GDC, and Bioconductor-hosted curated data resources for metagenomics, pharmacogenomics PharmacoDB, and The Cancer Genome Atlas.


  • Basic knowledge of R syntax
  • Familiarity with the ExpressionSet and SummarizedExperiment classes
  • Basic familiarity with ’omics technologies such as microarray and NGS sequencing

Interested students can prepare by reviewing vignettes of the packages listed in “R/Bioconductor packages used” to gain background on aspects of interest to them.

Some more general background on these resources is published in Kannan et al. (2016).

Workshop Participation

Each component will include runnable examples of typical usage that students are encouraged to run during demonstration of that component.

R/Bioconductor packages used

  • GEOquery: Access to the NCBI Gene Expression Omnibus (GEO), a public repository of gene expression (primarily microarray) data.
  • GenomicDataCommons: Access to the NIH / NCI Genomic Data Commons RESTful service.
  • curatedTCGAData: Curated data from The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
  • curatedMetagenomicData: Curated metagenomic data of the human microbiome
  • HMP16SData: Curated metagenomic data of the human microbiome
  • PharmacoGx: Curated large-scale preclinical pharmacogenomic data and basic analysis tools

Time outline

This is a 1h45m workshop.

Activity Time
Overview 10m
GEOquery 15m
GenomicDataCommons 20m
curatedTCGAData 10m
curatedMetagenomicData and HMP16SData 15m
PharmacoGx 20m
Q & A 20m

Workshop goals and objectives

Bioconductor provides access to significant amounts of publicly available experimental data. This workshop introduces students to Bioconductor interfaces to the NCBI’s Gene Expression Omnibus, Genomic Data Commons, and PharmacoDB. It additionally introduces curated resources providing The Cancer Genome Atlas, the Human Microbiome Project and other microbiome studies, and major pharmacogenomic studies, as native Bioconductor objects ready for analysis and comparison to in-house datasets.

Learning goals

  • search NCBI resources for publicly available ’omics data
  • quickly use data from the TCGA and the Human Microbiome Project

Learning objectives

  • find and download processed microarray and RNA-seq datasets from the Gene Expression Omnibus
  • find and download ’omics data from the Genomic Data Commons
  • download and manipulate data from The Cancer Genome Atlas and Human Microbiome Project
  • download and explore pharmacogenomics data


Before proceeding, ensure that the following packages are installed.

required_pkgs = c(


(Davis and Meltzer 2007)

The NCBI Gene Expression Omnibus (GEO) serves as a public repository for a wide range of high-throughput experimental data. These data include single and dual channel microarray-based experiments measuring mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE), mass spectrometry proteomic data, and high-throughput sequencing data. The GEOquery package (Davis and Meltzer 2007) forms a bridge between this public repository and the analysis capabilities in Bioconductor.

Overview of GEO

At the most basic level of organization of GEO, there are four basic entity types. The first three (Sample, Platform, and Series) are supplied by users; the fourth, the dataset, is compiled and curated by GEO staff from the user-submitted data. See the GEO home page for more information.


A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A Platform may reference many Samples that have been submitted by multiple submitters.


A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform and may be included in multiple Series.


A Series record defines a set of related Samples considered to be part of a group, how the Samples are related, and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique and stable GEO accession number (GSExxx). Series records are available in a couple of formats which are handled by GEOquery independently. The smaller and new GSEMatrix files are quite fast to parse; a simple flag is used by GEOquery to choose to use GSEMatrix files (see below).


GEO DataSets (GDSxxx) are curated sets of GEO Sample data. There are hundreds of GEO datasets available, but GEO discontinued creating GDS records several years ago. We mention them here for completeness only.

Getting Started using GEOquery

Getting data from GEO is really quite easy. There is only one command that is needed, getGEO. This one function interprets its input to determine how to get the data from GEO and then parse the data into useful R data structures.


With the library loaded, we are free to access any GEO accession.

Use case: MDS plot of cancer data

The data we are going to access are from this paper.

Background: The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, we analyzed immune gene expression differences between matching normal and tumor tissue.Methods: We analyzed public and new gene expression data from solid cancers and isolated immune cell populations. We also determined the correlation between CD8, FoxP3 IHC, and our gene signatures.Results: We observed that regulatory T cells (Tregs) were one of the main drivers of immune gene expression differences between normal and tumor tissue. A tumor-specific CD8 signature was slightly lower in tumor tissue compared with normal of most (12 of 16) cancers, whereas a Treg signature was higher in tumor tissue of all cancers except liver. Clustering by Treg signature found two groups in colorectal cancer datasets. The high Treg cluster had more samples that were consensus molecular subtype 1/4, right-sided, and microsatellite-instable, compared with the low Treg cluster. Finally, we found that the correlation between signature and IHC was low in our small dataset, but samples in the high Treg cluster had significantly more CD8+ and FoxP3+ cells compared with the low Treg cluster.Conclusions: Treg gene expression is highly indicative of the overall tumor immune environment.Impact: In comparison with the consensus molecular subtype and microsatellite status, the Treg signature identifies more colorectal tumors with high immune activation that may benefit from cancer immunotherapy.

In this little exercise, we will:

  1. Access public omics data using the GEOquery package
  2. Convert the public omics data to a SummarizedExperiment object.
  3. Perform a simple unsupervised analysis to visualize these public data.

Use the GEOquery package to fetch data about GSE103512.

gse = getGEO("GSE103512")[[1]]
## Warning: 102 parsing failures.
##   row     col           expected    actual         file
## 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
## ..... ....... .................. ......... ............
## See problems(...) for more details.

Note that getGEO, when used to retrieve GSE records, returns a list. The members of the list each represent one GEO Platform, since each GSE record can contain multiple related datasets (eg., gene expression and DNA methylation). In this case, the list is of length one, but it is still necessary to grab the first elment.

The first step–a detail–is to convert from the older Bioconductor data structure (GEOquery was written in 2007), the ExpressionSet, to the newer SummarizedExperiment. One line suffices.

se = as(gse, "SummarizedExperiment")

Examine two variables of interest, cancer type and tumor/normal status.

##                normal.ch1
## cancer.type.ch1 no yes
##           BC    65  10
##           CRC   57  12
##           NSCLC 60   9
##           PCA   60   7

Filter gene expression by variance to find most informative genes.

sds = apply(assay(se, 'exprs'),1,sd)
dat = assay(se, 'exprs')[order(sds,decreasing = TRUE)[1:500],]

Perform multidimensional scaling and prepare for plotting. We will be using ggplot2, so we need to make a data.frame before plotting.

mdsvals = cmdscale(dist(t(dat)))
mdsvals =
mdsvals$Normal = factor(colData(se)[,'normal.ch1'])
##                   V1       V2 Type Normal
## GSM2772660  8.531331 18.57115   BC     no
## GSM2772661  8.991591 13.63764   BC     no
## GSM2772662 10.788973 13.48403   BC     no
## GSM2772663  3.127105 19.13529   BC     no
## GSM2772664 13.056599 13.88711   BC     no
## GSM2772665  7.903717 13.24731   BC     no

And do the plot.

ggplot(mdsvals, aes(x=V1,y=V2,shape=Normal,color=Type)) + 
    geom_point( alpha=0.6) + theme(text=element_text(size = 18))

Accessing Raw Data from GEO

NCBI GEO accepts (but has not always required) raw data such as .CEL files, .CDF files, images, etc. It is also not uncommon for some RNA-seq or other sequencing datasets to supply only raw data (with accompanying sample information, of course), necessitating Sometimes, it is useful to get quick access to such data. A single function, getGEOSuppFiles, can take as an argument a GEO accession and will download all the raw data associate with that accession. By default, the function will create a directory in the current working directory to store the raw data for the chosen GEO accession.


From the Genomic Data Commons (GDC) website:

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared. As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The data model for the GDC is complex, but it worth a quick overview and a graphical representation is included here.

The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details.

The data model is encoded as a so-called property graph. Nodes represent entities such as Projects, Cases, Diagnoses, Files (various kinds), and Annotations. The relationships between these entities are maintained as edges. Both nodes and edges may have Properties that supply instance details.

The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.


This quickstart section is just meant to show basic functionality. More details of functionality are included further on in this vignette and in function-specific help.

To report bugs or problems, either submit a new issue or submit a'GenomicDataCommons') from within R (which will redirect you to the new issue on GitHub).


Installation of the GenomicDataCommons package is identical to installation of other Bioconductor packages.


After installation, load the library in order to use it.


Check connectivity and status

The GenomicDataCommons package relies on having network connectivity. In addition, the NCI GDC API must also be operational and not under maintenance. Checking status can be used to check this connectivity and functionality.

## $commit
## [1] "3e22a4257d5079ae9f7193950b51ed9dfc561ed1"
## $data_release
## [1] "Data Release 17.0 - June 05, 2019"
## $status
## [1] "OK"
## $tag
## [1] "1.21.0"
## $version
## [1] 1

Find data

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using HTSeq from ovarian cancer patients.

ge_manifest = files() %>%
    filter( ~ cases.project.project_id == 'TCGA-OV' &
                type == 'gene_expression' &
                analysis.workflow_type == 'HTSeq - Counts') %>%

Download data

After the 379 gene expression files specified in the query above. Using multiple processes to do the download very significantly speeds up the transfer in many cases. On a standard 1Gb connection, the following completes in about 30 seconds. The first time the data are downloaded, R will ask to create a cache directory (see ?gdc_cache for details of setting and interacting with the cache). Resulting downloaded files will be stored in the cache directory. Future access to the same files will be directly from the cache, alleviating multiple downloads.

fnames = lapply(ge_manifest$id[1:20],gdcdata)

If the download had included controlled-access data, the download above would have needed to include a token. Details are available in the authentication section below.

Metadata queries

The GenomicDataCommons can access the significant clinical, demographic, biospecimen, and annotation information contained in the NCI GDC.

expands = c("diagnoses","annotations",
projResults = projects() %>%
## List of 9
##  $ dbgap_accession_number: chr [1:10] "phs001287" "phs001374" "phs001628" "phs000466" ...
##  $ disease_type          :List of 10
##   ..$ CPTAC-3              : chr "Adenomas and Adenocarcinomas"
##   ..$ VAREPOP-APOLLO       : chr [1:2] "Epithelial Neoplasms, NOS" "Squamous Cell Neoplasms"
##   ..$ BEATAML1.0-CRENOLANIB: chr "Myeloid Leukemias"
##   ..$ TARGET-CCSK          : chr "Clear Cell Sarcoma of the Kidney"
##   ..$ TARGET-NBL           : chr "Neuroblastoma"
##   .. [list output truncated]
##  $ releasable            : logi [1:10] FALSE FALSE FALSE TRUE TRUE FALSE ...
##  $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ state                 : chr [1:10] "open" "open" "open" "open" ...
##   [list output truncated]
##  - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"
## [1] "dbgap_accession_number" "disease_type"          
## [3] "releasable"             "released"              
## [5] "state"                  "primary_site"          
## [7] "project_id"             "id"                    
## [9] "name"
# or listviewer::jsonedit(clinResults)

Basic design

This package design is meant to have some similarities to the “hadleyverse” approach of dplyr. Roughly, the functionality for finding and accessing files and metadata can be divided into:

  1. Simple query constructors based on GDC API endpoints.
  2. A set of verbs that when applied, adjust filtering, field selection, and faceting (fields for aggregation) and result in a new query object (an endomorphism)
  3. A set of verbs that take a query and return results from the GDC

In addition, there are exhiliary functions for asking the GDC API for information about available and default fields, slicing BAM files, and downloading actual data files. Here is an overview of functionality1.

  • Creating a query
    • projects()
    • cases()
    • files()
    • annotations()
  • Manipulating a query
    • filter()
    • facet()
    • select()
  • Introspection on the GDC API fields
    • mapping()
    • available_fields()
    • default_fields()
    • grep_fields()
    • field_picker()
    • available_values()
    • available_expand()
  • Executing an API call to retrieve query results
    • results()
    • count()
    • response()
  • Raw data file downloads
    • gdcdata()
    • transfer()
    • gdc_client()
  • Summarizing and aggregating field values (faceting)
    • aggregations()
  • Authentication
    • gdc_token()
  • BAM file slicing
    • slicing()


There are two main classes of operations when working with the NCI GDC.

  1. Querying metadata and finding data files (e.g., finding all gene expression quantifications data files for all colon cancer patients).
  2. Transferring raw or processed data from the GDC to another computer (e.g., downloading raw or processed data)

Both classes of operation are reviewed in detail in the following sections.

Querying metadata

Vast amounts of metadata about cases (patients, basically), files, projects, and so-called annotations are available via the NCI GDC API. Typically, one will want to query metadata to either focus in on a set of files for download or transfer or to perform so-called aggregations (pivot-tables, facets, similar to the R table() functionality).

Querying metadata starts with creating a “blank” query. One will often then want to filter the query to limit results prior to retrieving results. The GenomicDataCommons package has helper functions for listing fields that are available for filtering.

In addition to fetching results, the GDC API allows faceting, or aggregating,, useful for compiling reports, generating dashboards, or building user interfaces to GDC data (see GDC web query interface for a non-R-based example).

Creating a query

The GenomicDataCommons package accesses the same API as the GDC website. Therefore, a useful approach, particularly for beginning users is to examine the filters available on the GDC repository pages to find appropriate filtering criteria. From there, converting those checkboxes to a GenomicDataCommons query() is relatively straightforward. Note that only a small subset of the available_fields() are available by default on the website.