Public data resources and Bioconductor
Instructor names and contact information
- Levi Waldron <levi.waldron at sph.cuny.edu> (City University of New York, New York, NY, USA)
- Benjamin Haibe-Kains <benjamin.haibe.kains at utoronto.ca> (Princess Margaret Cancer Center, Toronto, Canada)
- Sean Davis <sdavis2 at mail.nih.gov> (Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA)
The goal of this workshop is to introduce Bioconductor packages for finding, accessing, and using large-scale public data resources including the Gene Expression Omnibus GEO, the NCI Genomic Data Commons GDC, and Bioconductor-hosted curated data resources for metagenomics, pharmacogenomics PharmacoDB, and The Cancer Genome Atlas.
- Basic knowledge of R syntax
- Familiarity with the ExpressionSet and SummarizedExperiment classes
- Basic familiarity with ’omics technologies such as microarray and NGS sequencing
Interested students can prepare by reviewing vignettes of the packages listed in “R/Bioconductor packages used” to gain background on aspects of interest to them.
Some more general background on these resources is published in Kannan et al. (2016).
Each component will include runnable examples of typical usage that students are encouraged to run during demonstration of that component.
R/Bioconductor packages used
- GEOquery: Access to the NCBI Gene Expression Omnibus (GEO), a public repository of gene expression (primarily microarray) data.
- GenomicDataCommons: Access to the NIH / NCI Genomic Data Commons RESTful service.
- curatedTCGAData: Curated data from The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
- curatedMetagenomicData: Curated metagenomic data of the human microbiome
- HMP16SData: Curated metagenomic data of the human microbiome
- PharmacoGx: Curated large-scale preclinical pharmacogenomic data and basic analysis tools
This is a 1h45m workshop.
|curatedMetagenomicData and HMP16SData||15m|
|Q & A||20m|
Workshop goals and objectives
Bioconductor provides access to significant amounts of publicly available experimental data. This workshop introduces students to Bioconductor interfaces to the NCBI’s Gene Expression Omnibus, Genomic Data Commons, and PharmacoDB. It additionally introduces curated resources providing The Cancer Genome Atlas, the Human Microbiome Project and other microbiome studies, and major pharmacogenomic studies, as native Bioconductor objects ready for analysis and comparison to in-house datasets.
- search NCBI resources for publicly available ’omics data
- quickly use data from the TCGA and the Human Microbiome Project
- find and download processed microarray and RNA-seq datasets from the Gene Expression Omnibus
- find and download ’omics data from the Genomic Data Commons
- download and manipulate data from The Cancer Genome Atlas and Human Microbiome Project
- download and explore pharmacogenomics data
Before proceeding, ensure that the following packages are installed.
required_pkgs = c( "TCGAbiolinks", "GEOquery", "GenomicDataCommons", "limma", "curatedTCGAData", "recount", "curatedMetagenomicData", "phyloseq", "HMP16SData", "caTools", "piano", "isa", "VennDiagram", "downloader", "gdata", "AnnotationDbi", "hgu133a.db", "PharmacoGx") BiocManager::install(required_pkgs)
(Davis and Meltzer 2007)
The NCBI Gene Expression Omnibus (GEO) serves as a public repository for a wide range of high-throughput experimental data. These data include single and dual channel microarray-based experiments measuring mRNA, genomic DNA, and protein abundance, as well as non-array techniques such as serial analysis of gene expression (SAGE), mass spectrometry proteomic data, and high-throughput sequencing data. The GEOquery package (Davis and Meltzer 2007) forms a bridge between this public repository and the analysis capabilities in Bioconductor.
Overview of GEO
At the most basic level of organization of GEO, there are four basic entity types. The first three (Sample, Platform, and Series) are supplied by users; the fourth, the dataset, is compiled and curated by GEO staff from the user-submitted data. See the GEO home page for more information.
A Platform record describes the list of elements on the array (e.g., cDNAs, oligonucleotide probesets, ORFs, antibodies) or the list of elements that may be detected and quantified in that experiment (e.g., SAGE tags, peptides). Each Platform record is assigned a unique and stable GEO accession number (GPLxxx). A Platform may reference many Samples that have been submitted by multiple submitters.
A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it. Each Sample record is assigned a unique and stable GEO accession number (GSMxxx). A Sample entity must reference only one Platform and may be included in multiple Series.
A Series record defines a set of related Samples considered to be part of a group, how the Samples are related, and if and how they are ordered. A Series provides a focal point and description of the experiment as a whole. Series records may also contain tables describing extracted data, summary conclusions, or analyses. Each Series record is assigned a unique and stable GEO accession number (GSExxx). Series records are available in a couple of formats which are handled by GEOquery independently. The smaller and new GSEMatrix files are quite fast to parse; a simple flag is used by GEOquery to choose to use GSEMatrix files (see below).
GEO DataSets (GDSxxx) are curated sets of GEO Sample data. There are hundreds of GEO datasets available, but GEO discontinued creating GDS records several years ago. We mention them here for completeness only.
Getting Started using GEOquery
Getting data from GEO is really quite easy. There is only one command that is needed,
getGEO. This one function interprets its input to determine how to get the data from GEO and then parse the data into useful R data structures.
With the library loaded, we are free to access any GEO accession.
Use case: MDS plot of cancer data
The data we are going to access are from this paper.
Background: The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, we analyzed immune gene expression differences between matching normal and tumor tissue.Methods: We analyzed public and new gene expression data from solid cancers and isolated immune cell populations. We also determined the correlation between CD8, FoxP3 IHC, and our gene signatures.Results: We observed that regulatory T cells (Tregs) were one of the main drivers of immune gene expression differences between normal and tumor tissue. A tumor-specific CD8 signature was slightly lower in tumor tissue compared with normal of most (12 of 16) cancers, whereas a Treg signature was higher in tumor tissue of all cancers except liver. Clustering by Treg signature found two groups in colorectal cancer datasets. The high Treg cluster had more samples that were consensus molecular subtype 1/4, right-sided, and microsatellite-instable, compared with the low Treg cluster. Finally, we found that the correlation between signature and IHC was low in our small dataset, but samples in the high Treg cluster had significantly more CD8+ and FoxP3+ cells compared with the low Treg cluster.Conclusions: Treg gene expression is highly indicative of the overall tumor immune environment.Impact: In comparison with the consensus molecular subtype and microsatellite status, the Treg signature identifies more colorectal tumors with high immune activation that may benefit from cancer immunotherapy.
In this little exercise, we will:
- Access public omics data using the GEOquery package
- Convert the public omics data to a
- Perform a simple unsupervised analysis to visualize these public data.
gse = getGEO("GSE103512")[]
## Warning: 102 parsing failures. ## row col expected actual file ## 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data ## 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data ## 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data ## 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data ## 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data ## ..... ....... .................. ......... ............ ## See problems(...) for more details.
getGEO, when used to retrieve GSE records, returns a list. The members of the list each represent
one GEO Platform, since each GSE record can contain multiple related datasets (eg., gene expression and DNA methylation).
In this case, the list is of length one, but it is still necessary to grab the first elment.
The first step–a detail–is to convert from the older Bioconductor data structure (GEOquery was written in 2007), the
ExpressionSet, to the newer
SummarizedExperiment. One line suffices.
library(SummarizedExperiment) se = as(gse, "SummarizedExperiment")
Examine two variables of interest, cancer type and tumor/normal status.
## normal.ch1 ## cancer.type.ch1 no yes ## BC 65 10 ## CRC 57 12 ## NSCLC 60 9 ## PCA 60 7
Filter gene expression by variance to find most informative genes.
sds = apply(assay(se, 'exprs'),1,sd) dat = assay(se, 'exprs')[order(sds,decreasing = TRUE)[1:500],]
Perform multidimensional scaling and prepare for plotting. We will be using ggplot2, so we need to make a data.frame before plotting.
mdsvals = cmdscale(dist(t(dat))) mdsvals = as.data.frame(mdsvals) mdsvals$Type=factor(colData(se)[,'cancer.type.ch1']) mdsvals$Normal = factor(colData(se)[,'normal.ch1']) head(mdsvals)
## V1 V2 Type Normal ## GSM2772660 8.531331 18.57115 BC no ## GSM2772661 8.991591 13.63764 BC no ## GSM2772662 10.788973 13.48403 BC no ## GSM2772663 3.127105 19.13529 BC no ## GSM2772664 13.056599 13.88711 BC no ## GSM2772665 7.903717 13.24731 BC no
And do the plot.
library(ggplot2) ggplot(mdsvals, aes(x=V1,y=V2,shape=Normal,color=Type)) + geom_point( alpha=0.6) + theme(text=element_text(size = 18))
Accessing Raw Data from GEO
NCBI GEO accepts (but has not always required) raw data such as .CEL files, .CDF files, images, etc. It is also not uncommon for some RNA-seq or other sequencing datasets to supply only raw data (with accompanying sample information, of course), necessitating Sometimes, it is useful to get quick access to such data. A single function,
getGEOSuppFiles, can take as an argument a GEO accession and will download all the raw data associate with that accession. By default, the function will create a directory in the current working directory to store the raw data for the chosen GEO accession.
From the Genomic Data Commons (GDC) website:
The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared. As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.
The data model for the GDC is complex, but it worth a quick overview and a graphical representation is included here.
The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.
This quickstart section is just meant to show basic functionality. More details of functionality are included further on in this vignette and in function-specific help.
To report bugs or problems, either
submit a new issue
or submit a
bug.report(package='GenomicDataCommons') from within R (which will
redirect you to the new issue on GitHub).
Installation of the GenomicDataCommons package is identical to installation of other Bioconductor packages.
After installation, load the library in order to use it.
Check connectivity and status
The GenomicDataCommons package relies on having network
connectivity. In addition, the NCI GDC API must also be operational
and not under maintenance. Checking
status can be used to check this
connectivity and functionality.
## $commit ##  "3e22a4257d5079ae9f7193950b51ed9dfc561ed1" ## ## $data_release ##  "Data Release 17.0 - June 05, 2019" ## ## $status ##  "OK" ## ## $tag ##  "1.21.0" ## ## $version ##  1
The following code builds a
manifest that can be used to guide the
download of raw data. Here, filtering finds gene expression files
quantified as raw counts using
HTSeq from ovarian cancer patients.
ge_manifest = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' & type == 'gene_expression' & analysis.workflow_type == 'HTSeq - Counts') %>% manifest()
After the 379 gene expression files
specified in the query above. Using multiple processes to do the download very
significantly speeds up the transfer in many cases. On a standard 1Gb
connection, the following completes in about 30 seconds. The first time the
data are downloaded, R will ask to create a cache directory (see
for details of setting and interacting with the cache). Resulting
downloaded files will be stored in the cache directory. Future access to
the same files will be directly from the cache, alleviating multiple downloads.
fnames = lapply(ge_manifest$id[1:20],gdcdata)
If the download had included controlled-access data, the download above would
have needed to include a
token. Details are available in
the authentication section below.
The GenomicDataCommons can access the significant clinical, demographic, biospecimen, and annotation information contained in the NCI GDC.
expands = c("diagnoses","annotations", "demographic","exposures") projResults = projects() %>% results(size=10) str(projResults,list.len=5)
## List of 9 ## $ dbgap_accession_number: chr [1:10] "phs001287" "phs001374" "phs001628" "phs000466" ... ## $ disease_type :List of 10 ## ..$ CPTAC-3 : chr "Adenomas and Adenocarcinomas" ## ..$ VAREPOP-APOLLO : chr [1:2] "Epithelial Neoplasms, NOS" "Squamous Cell Neoplasms" ## ..$ BEATAML1.0-CRENOLANIB: chr "Myeloid Leukemias" ## ..$ TARGET-CCSK : chr "Clear Cell Sarcoma of the Kidney" ## ..$ TARGET-NBL : chr "Neuroblastoma" ## .. [list output truncated] ## $ releasable : logi [1:10] FALSE FALSE FALSE TRUE TRUE FALSE ... ## $ released : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ... ## $ state : chr [1:10] "open" "open" "open" "open" ... ## [list output truncated] ## - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10 ## - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"
##  "dbgap_accession_number" "disease_type" ##  "releasable" "released" ##  "state" "primary_site" ##  "project_id" "id" ##  "name"
# or listviewer::jsonedit(clinResults)
This package design is meant to have some similarities to the “hadleyverse” approach of dplyr. Roughly, the functionality for finding and accessing files and metadata can be divided into:
- Simple query constructors based on GDC API endpoints.
- A set of verbs that when applied, adjust filtering, field selection, and faceting (fields for aggregation) and result in a new query object (an endomorphism)
- A set of verbs that take a query and return results from the GDC
In addition, there are exhiliary functions for asking the GDC API for information about available and default fields, slicing BAM files, and downloading actual data files. Here is an overview of functionality1.
- Creating a query
- Manipulating a query
- Introspection on the GDC API fields
- Executing an API call to retrieve query results
- Raw data file downloads
- Summarizing and aggregating field values (faceting)
- BAM file slicing
There are two main classes of operations when working with the NCI GDC.
- Querying metadata and finding data files (e.g., finding all gene expression quantifications data files for all colon cancer patients).
- Transferring raw or processed data from the GDC to another computer (e.g., downloading raw or processed data)
Both classes of operation are reviewed in detail in the following sections.
Vast amounts of metadata about cases (patients, basically), files, projects, and
so-called annotations are available via the NCI GDC API. Typically, one will
want to query metadata to either focus in on a set of files for download or
transfer or to perform so-called aggregations (pivot-tables, facets, similar
to the R
Querying metadata starts with creating a “blank” query. One
will often then want to
filter the query to limit results prior
to retrieving results. The GenomicDataCommons package has
helper functions for listing fields that are available for
In addition to fetching results, the GDC API allows faceting, or aggregating,, useful for compiling reports, generating dashboards, or building user interfaces to GDC data (see GDC web query interface for a non-R-based example).
Creating a query
The GenomicDataCommons package accesses the same
API as the GDC website. Therefore, a useful approach, particularly for
beginning users is to examine the filters available on the
GDC repository pages
to find appropriate filtering criteria. From there, converting those checkboxes
to a GenomicDataCommons
query() is relatively straightforward. Note that only
a small subset of the
available_fields() are available by default on the website.