Bioinformatics tools for integrating and understanding molecular changes associated with with Immune Response, Stemness and Oncogenic processes

Instructors

Workshop Description

Recently, The Cancer Genome Atlas (TCGA’s) Pan-Cancer Atlas initiative presented a comprehensive collection of 27 studies covering 11,000 patient tumors from 33 cancer types. These studies investigated cancer complexity from different angles by integrating multi -omics and clinical data. In particular, computational analyses have led to the identification of 299 cancer-driver genes and over 3,400 driver mutations. However, it still remains critical to clarify the consequences of each alteration and the underlying biological effects. We will discuss several computational tools that are useful in clarifying gene functions when performing integrative analysis of multi-omics datasets. In order to deal with the challenges of data retrieval and integration, TCGAbiolinks (Colaprico et al. 2018, PMID: 26704973) and DeepBlueR (Albrecht et al., 2017, PMID: 28334349) were developed to retrieve data from TCGA, CPTAC, GTEx, GEO and IHEC, Blueprint, ENCODE, Roadmap, respectively. Tumor-specific cancer-driver-gene events and downstream impact can be elucidated by integrative analyses of these datasets using MoonlightR (Colaprico et al. 2018). TCGAbiolinks and MoolightR have been used successfully in multiple studies for oncogenic processes identification (Ding et al., 2018, PMID: 29625049), oncogenic clinically actionable driver genes discovery (Bailey et al., 2018, PMID: 29625053), and comprehensive immune landscape characterization (Thorsson et al., 2018, PMID: 29628290)

Workshop: overview

In this workshop we will demonstrate the capability of TCGAbiolinks, deepblueR and MoonlightR for (1) retrieving -omics data from different consortia (TCGA, GTEx, GEO and IHEC); (2) discovering cancer driver genes by integrating multi-omics datasets; (3) Identifying the six immune subtypes from TCGA PanCancer dataset and (4) Estimating stemness scores using gene expression data. Our goal is for the end users to understand how biological features (Immune Subtypes, Oncogenic Processes, Driver Genes and Stemness) can be used to expand understating of their own datasets.

The packages we need for this session are

library(SummarizedExperiment)
library(dplyr)
library(Bioc2019PanCancerStudy)

Data retrieval from (GDC-TCGA, GEO and IHEC)

Using TCGAbiolinks, you can easily query, download, and prepare multi-omics data from NCI’s Genomic Data Commons (GDC). The GDC provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). [https://gdc.cancer.gov/]

library(TCGAbiolinks)

We will illustrate retrieval of different types of –omics datasets including: 1. Gene expression 2. Copy number 3. Protein expression (RRPA) 4. Methylation 5. Clinical data 6. microRNA expression 7. Mutation 8. Chromatin Accessibility

TCGA Gene Expression (IlluminaHiSeq_RNASeqV2) and multi -omics data

The end user can search TCGA’s samples for a specific cancer of origin, download and prepare a matrix of gene expression. This code is an example; we provide the output of this code in this workshop package (see below).

cancerType <- "BRCA"

# Query platform Illumina HiSeq with a list of barcode 
query <- GDCquery(
  project = paste0("TCGA-",cancerType),
  data.category = "Gene expression",
  data.type = "Gene expression quantification",
  experimental.strategy = "RNA-Seq",
  platform = "Illumina HiSeq",
  file.type = "results",
  legacy = TRUE
)
# 38 seconds on GO's machine

# We select 10 tumor and 10 normal samples
Sample_sel <- query$results[[1]]$cases

# TP is Primary Tumor
# NT is Solid Tissue Normal

Sample_sel_TP <- TCGAquery_SampleTypes(barcode = Sample_sel,typesample = "TP")
Sample_sel_NT <- TCGAquery_SampleTypes(barcode = Sample_sel,typesample = "NT")                               

Sample_sel_short <- c(Sample_sel_TP[1:10], Sample_sel_NT[1:10])

# we need to create a new query with the selected barcodes
system.time(
query_down <- GDCquery(
  project = paste0("TCGA-",cancerType),
  data.category = "Gene expression",
  data.type = "Gene expression quantification",
  experimental.strategy = "RNA-Seq",
  platform = "Illumina HiSeq",
  file.type = "results",
  barcode = Sample_sel_short,
  legacy = TRUE
)
# 44.30 seconds on GO's machine
)

Now that we have defined the query, we can download the respective data. Makse sure you have a directory to save this data defined beforehand. The data will be downloaded to the folder TCGA-BRCA (the first value of query_down$project) as a subdirectory of the data directory specified. In addition to downloading the data, this function will write a file called “MANIFEST.txt” to your current working directory, not to the directory you specify.

# DataDir is a directory to save your datasets
DataDir <- "TCGAanalysis"

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query_down, directory = DataDir)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
BRCARnaseqSE <- GDCprepare(query_down, directory = DataDir)

Next, the TCGAanalyze_Preprocessing() function performs Array-Array Intensity correlation (AAIC). It defines a square symmetric matrix of Spearman correlations among samples. According this matrix and boxplot of correlation samples by samples it is possible to find samples with low correlation that can be identified as possible outliers. The cor.cut controls the correlation threshold to filter out outliers with correlation lower than the threshold. (Rather than require that you execute the above code, we have included the output as the BRCARnaseqSE object.)

data("BRCARnaseqSE")

dataPrep <- TCGAanalyze_Preprocessing(
  BRCARnaseqSE, cor.cut = 0.6
)

dataNorm <- TCGAanalyze_Normalization(
  tabDF = dataPrep,
  geneInfo = geneInfo,
  method = "gcContent"
) 
#> I Need about  5.3 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]
#> Step 1 of 4: newSeqExpressionSet ...
#> Step 2 of 4: withinLaneNormalization ...
#> Step 3 of 4: betweenLaneNormalization ...
#> Step 4 of 4: .quantileNormalization ...

Now that we have preprocessed and normalized our data, we will inspect the sample boxplots.

colnames(dataPrep) <- substr(colnames(dataPrep), 1, 15)
colnames(dataNorm) <- substr(colnames(dataNorm), 1, 15)

par(mfrow = c(2, 1))
boxplot(dataPrep, outline = FALSE, las = 2)
boxplot(dataNorm, outline = FALSE, las = 2)

par(mfrow = c(1, 1))

TCGAanalyze_Filtering allows the user to filter genes selecting a threshold. For istance returns all mRNA with mean across all samples, higher than the threshold defined as quantile mean across all samples.

dataFilt <- TCGAanalyze_Filtering(
  tabDF = dataNorm,
  method = "quantile", 
  qnt.cut =  0.25
)  

The result is shown below:

Table 1: Example of a matrix of gene expression (10 genes in rows and 2 samples in columns)
TCGA-A7-A13G-11A-51R-A13Q-07 TCGA-C8-A1HJ-01A-11R-A13Q-07
SPINK9|643394 0 0
PPM1H|57460 924 335
CEACAM5|1048 4 32
ATP5L2|267020 3 1
SSX3|10214 0 0
ADAMTS14|140766 18 115
ZC3H3|23144 701 826
CNNM2|54805 544 520
MRPL28|10573 798 1519
FOLR2|2350 3168 103

The result from TCGAanalyze_Preprocessing and TCGAanalyze_Normalization is shown below:

To retrieve other datatypes using TCGAbiolinks it is possible to follow the example for gene expression but replacing the parameters as shown in the following table:

data.type data.category platform file.type
1 Gene expression quantification Gene expression Illumina HiSeq results
2 Protein expression quantification Protein expression MDA_RPPA_Core expression
3 Copy number segmentation Copy number variation Affymetrix SNP Array 6.0 nocnv_hg19.seg
4 Methylation Beta Value DNA methylation Illumina Human Methylation 450 HumanMethylation450
5 Simple somatic mutation Simple nucleotide variation NA NA
6 miRNA gene quantification Gene expression Illumina HiSeq hg19.mirbase20.mirna.quantification
7 NA Clinical NA xml

GTEx gene expression data

The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 53 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. [https://gtexportal.org/home/] You can easily search GTEx samples, download and prepare a matrix of gene expression [RNA-Seq].

# Genotype-Tissue Expression (GTEx) project, which is another large-scale
#   repository cataloging gene expression from healthy individuals.

# In the following example we will retrieve data from brain's tissue.

data_gtex_brain <- TCGAquery_recount2("gtex", tissue = "brain")

Gtex_brain_annotation <- colData(data_gtex_brain$gtex_brain)
sort(table(Gtex_brain_annotation$smtsd))

data_gtex_brain_GeneExpr <- assay(data_gtex_brain$gtex_brain,1)

# data_gtex_brain_GeneExpr[1:5,1:5]
#SRR2166176 SRR2167642 SRR607445 SRR663753 SRR1485892
#ENSG00000000003.14      98669     122656     74728     56933      69009
#ENSG00000000005.5         764       1018       760      2105        151
#ENSG00000000419.12    1076085    1034842     56668    207245     124647
#ENSG00000000457.13     746248     752634     48502    121776      66762
#ENSG00000000460.16     603111     588176     25143     55928      34473

IHEC gene expression data

The International Human Epigenome Consortium (IHEC) is a global consortium with the primary goal of providing free access to high-resolution reference human epigenome maps for normal and disease cell types to the research community. [http://ihec-epigenomes.org/] IHEC provides data currently available from the following consortia: * AMED-CREST (Japan) * Blueprint (Europe) * CEEHRC (Canada) * DEEP (Germany) * ENCODE (USA) * GIS (Singapore) * KNIH (South Korea) * Roadmap (USA)

Within IHEC, the BLUEPRINT consortium has been formed with the aim to further the understanding of how genes are activated or repressed in both healthy and diseased human cells. Specifically, BLUEPRINT includes data on distinct types of haematopoietic cells from healthy individuals and their malignant leukaemic counterparts.

library(DeepBlueR)
#> Loading required package: XML
#> Loading required package: RCurl
#> Loading required package: bitops
#> Welcome to the DeepBlueR package
#> DeepBlue is online

You can easily search IHEC BLUEPRINT samples, download and prepare a matrix of gene expression.

# List all IHEC samples 
IHEC_samples <- deepblue_list_samples()

#table(IHEC_samples$source)

#          BLUEPRINT Epigenome Blueprint HSC differentiation         BLUEPRINT Progenitors 
#                         1451                            76                            76 
#                       CEEHRC                    ChIP-Atlas                         CREST 
#                          529                           497                            38 
#                         DEEP                   DEEP (IHEC)                        ENCODE 
#                          194                            39                          3317 
#                   ENCODE FTP           Roadmap Epigenomics   TEPIC reprocessed IHEC data 
#                            9                           144                            28 


# List all BLUEPRINT samples
blueprint_samples <- deepblue_list_samples(
    extra_metadata = list("source" = "BLUEPRINT Epigenome"))

# Extract their ids
blueprint_samples_ids <- deepblue_extract_ids(blueprint_samples)

# Select gene expression data. We assign gene names using Gencode 22
gene_exprs_query <- deepblue_select_expressions(sample_ids = blueprint_samples_ids,
                                                expression_type = "gene",
                                                gene_model = "gencode v22")

# We request the data and define the output format
request = deepblue_get_regions(query_id = gene_exprs_query, 
                               "@GENE_ID(gencode v22),FPKM,@BIOSOURCE,@SAMPLE_ID")

# We download the data
gene_regions <- deepblue_download_request_data(request)

# We retain a table mapping sample ids to bisources
sample_names <- dplyr::select(gene_regions, `@BIOSOURCE`, `@SAMPLE_ID`) %>%
    dplyr::distinct()

# We filter out duplicated gene entries
genes_one_sample <- dplyr::filter(gene_regions, `@SAMPLE_ID` == "s10678")
duplicated_genes <- genes_one_sample[
    which(duplicated(genes_one_sample$`@GENE_ID(gencode v22)`)),
    "@GENE_ID(gencode v22)"]

# We convert the gene expression from a list to a data frame and subsequently...
genes_matrix = dplyr::filter(gene_regions,
                             !(`@GENE_ID(gencode v22)` %in% duplicated_genes)) %>%
    dplyr::select(-`@BIOSOURCE`) %>%
    tidyr::spread(key = `@SAMPLE_ID`, value = FPKM)

# ...to a numeric matrix
genes <- genes_matrix[,1]
genes_matrix <- data.matrix(genes_matrix[,-1])
rownames(genes_matrix) <- genes

### OUTPUT
### genes_matrix : The gene expression matrix for all 276 BLUEPRINT samples
### sample_names : A mapping table from sample id to cell type / biosource
IHEC_genes_matrix <- genes_matrix
IHEC_sample_names <- sample_names

save(IHEC_genes_matrix,  IHEC_sample_names, file = "IHEC_BLUEPRINT_geneExpr.Rdata")

#> IHEC_genes_matrix[1:5,1:5]
#                   s10271 s10272 s10275 s10276 s10279
#ENSG00000000003.13   0.01   0.03   2.39   0.70   0.05
#ENSG00000000005.5    0.00   0.00   0.00   0.00   0.00
#ENSG00000000419.11  30.04  17.49  16.36   5.80  11.99
#ENSG00000000457.12   3.29   2.63   2.70   0.78   1.91
#ENSG00000000460.15   3.16   2.06   1.36   0.49   1.65

#> head(IHEC_sample_names)
#                          @BIOSOURCE @SAMPLE_ID
#1:   CD8-positive, alpha-beta T cell     s10506
#2:            germinal center B cell     s10678
#3:         common myeloid progenitor     s10518
#4: adult endothelial progenitor cell     s11176
#5:            neutrophilic myelocyte     s10331
#6:        common lymphoid progenitor     s10486

GEO gene expression dataset

GEO Gene Expression Omnibus (GEO) is the public repository for high throughput data at NCBI. [https://www.ncbi.nlm.nih.gov/geo/] Here we show how to retrieve GEOs’ samples, download and prepare a matrix of gene expression. In particular to retrieve the gene expression matrix for different cell types from Nazor et al. (2012, PMID: 22560082).

library(MoonlightR)
library(GEOquery)
library(devtools)

# get gene expression data
dataNazor <- getDataGEO(GEOobject = "GSE30652",
                        platform =  "GPL6947")

require(SummarizedExperiment)
GSE30652 <- as.data.frame(exprs(dataNazor))

# summarize gene expression data from multiple probes using means
GSE30652_non_norm <- cbind(ILMN = rownames(GSE30652),
                           IDmean = rowMeans(GSE30652),
                           GSE30652)


# get phenotype data
dataNazor_samples <- pData(dataNazor)
dataNazor_samples <- as.data.frame(dataNazor_samples)
dataNazor_samples <- subset(dataNazor_samples,
                              select = c("geo_accession","characteristics_ch1.2"))
colnames(dataNazor_samples)[2] <- "CellType"

dataNazor_samples$CellType <- gsub("cell type: ","",dataNazor_samples$CellType)
dataNazor_samples$CellType <- gsub(", undifferentiated","",dataNazor_samples$CellType)

GPL6947_13512 <- fData(dataNazor)
GPL6947_13512_annot <- as.data.frame(GPL6947_13512)
GPL6947_13512_annot <- subset(GPL6947_13512_annot,
                              select = c("ID","Gene.symbol"))



GSE30652_merge <- merge(x = GPL6947_13512_annot,
                        y = GSE30652_non_norm,
                        by.x = "ID",
                        by.y = "ILMN")


GSE30652_merge <- GSE30652_merge[order(GSE30652_merge$IDmean,decreasing = TRUE),]
GSE30652_merge <- GSE30652_merge[!duplicated(GSE30652_merge$Gene.symbol),]
NazorMatrix <- GSE30652_merge
rownames(NazorMatrix) <- NazorMatrix$Gene.symbol

NazorMatrix <- NazorMatrix[,dataNazor_samples$geo_accession]

save(NazorMatrix,dataNazor_samples, file = "Nazor_GSE30652.Rdata")

Oncogenic processes and driver genes using MoonlightR

MoonlightR overview

In this section, we illustrate the moonlightR analysis pipeline using the BRCA data we downloaded from previous section. In order to make light of cancer development, it is crucial to identify the genes involved in the disease mechanisms and moreover understand what functional roles it plays. Common biological processes such as proliferation and apoptosis have been linked to cancer progression. MoonlightR is a new approach to define tumor suppressor genes (TSGs) and oncogenes (OCGs) based on functional enrichment analysis, inference of gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer.

In the MoonlightR pipeline, first, based on expression data we perform functional enrichment analysis, infer gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer. We then use these scores to predict two specific roles for genes: genes that act as TSGs and genes that act as OCGs. This methodology not only allows us to identify genes with dual roles (TSG in one cancer type and OCG in another) but also to elucidate the underlying biological processes they are involved with.

library(MoonlightR)

MoonlightR pipeline

Here we describe two Moonlight approaches: Moonlight-EB (expert based) and Moonlight-ML (machine learning). The EB and ML approaches share the following three initial steps (Fig 1b, Methods): (i) DPA: Moonlight identifies a set of Differentially Expressed Genes (DEGs) between two conditions;
(ii) GRN: The gene expression data is used to infer a Gene Regulatory Network (GRN) with the DEGs as vertices and (iii) FEA: using Functional Enrichment Analysis (FEA), Moonlight considers a DEG in a biological system and quantifies the DEG-BP (biological process) association with a Moonlight Z-score. (iv) URA/PRA: Finally, we input DEGs and their GRN to Upstream Regulatory Analysis (URA), yielding upstream regulators of BPs mediated by the DEG and its targets. The second part of the pipeline’s tool provides Pattern Recognition Analysis (PRA) that incorporates two approaches.

In the first approach, PRA takes in two objects: (i) URA’s output and (ii) selection of a subset of the BP provided by the end-user. In contrast, if the BPs are not provided, their selection is automated by a ML method (e.g. random forest model) trained on gold standard oncogenes (OCG) and tumor-suppressor genes (TSG) in the second approach. In addition, dynamic recognition analysis (DRA) detects multiple patterns of BPs when different conditions are selected.

Moonlight’s pipeline is shown below::