Bioinformatics tools for integrating and understanding molecular changes associated with with Immune Response, Stemness and Oncogenic processes
Recently, The Cancer Genome Atlas (TCGA’s) Pan-Cancer Atlas initiative presented a comprehensive collection of 27 studies covering 11,000 patient tumors from 33 cancer types. These studies investigated cancer complexity from different angles by integrating multi -omics and clinical data. In particular, computational analyses have led to the identification of 299 cancer-driver genes and over 3,400 driver mutations. However, it still remains critical to clarify the consequences of each alteration and the underlying biological effects. We will discuss several computational tools that are useful in clarifying gene functions when performing integrative analysis of multi-omics datasets. In order to deal with the challenges of data retrieval and integration, TCGAbiolinks (Colaprico et al. 2018, PMID: 26704973) and DeepBlueR (Albrecht et al., 2017, PMID: 28334349) were developed to retrieve data from TCGA, CPTAC, GTEx, GEO and IHEC, Blueprint, ENCODE, Roadmap, respectively. Tumor-specific cancer-driver-gene events and downstream impact can be elucidated by integrative analyses of these datasets using MoonlightR (Colaprico et al. 2018). TCGAbiolinks and MoolightR have been used successfully in multiple studies for oncogenic processes identification (Ding et al., 2018, PMID: 29625049), oncogenic clinically actionable driver genes discovery (Bailey et al., 2018, PMID: 29625053), and comprehensive immune landscape characterization (Thorsson et al., 2018, PMID: 29628290)
In this workshop we will demonstrate the capability of TCGAbiolinks, deepblueR and MoonlightR for (1) retrieving -omics data from different consortia (TCGA, GTEx, GEO and IHEC); (2) discovering cancer driver genes by integrating multi-omics datasets; (3) Identifying the six immune subtypes from TCGA PanCancer dataset and (4) Estimating stemness scores using gene expression data. Our goal is for the end users to understand how biological features (Immune Subtypes, Oncogenic Processes, Driver Genes and Stemness) can be used to expand understating of their own datasets.
The packages we need for this session are
library(SummarizedExperiment) library(dplyr) library(Bioc2019PanCancerStudy)
Data retrieval from (GDC-TCGA, GEO and IHEC)
Using TCGAbiolinks, you can easily query, download, and prepare multi-omics data from NCI’s Genomic Data Commons (GDC). The GDC provides the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine. The GDC supports several cancer genome programs at the NCI Center for Cancer Genomics (CCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). [https://gdc.cancer.gov/]
We will illustrate retrieval of different types of –omics datasets including: 1. Gene expression 2. Copy number 3. Protein expression (RRPA) 4. Methylation 5. Clinical data 6. microRNA expression 7. Mutation 8. Chromatin Accessibility
TCGA Gene Expression (IlluminaHiSeq_RNASeqV2) and multi -omics data
The end user can search TCGA’s samples for a specific cancer of origin, download and prepare a matrix of gene expression. This code is an example; we provide the output of this code in this workshop package (see below).
cancerType <- "BRCA" # Query platform Illumina HiSeq with a list of barcode query <- GDCquery( project = paste0("TCGA-",cancerType), data.category = "Gene expression", data.type = "Gene expression quantification", experimental.strategy = "RNA-Seq", platform = "Illumina HiSeq", file.type = "results", legacy = TRUE ) # 38 seconds on GO's machine # We select 10 tumor and 10 normal samples Sample_sel <- query$results[]$cases # TP is Primary Tumor # NT is Solid Tissue Normal Sample_sel_TP <- TCGAquery_SampleTypes(barcode = Sample_sel,typesample = "TP") Sample_sel_NT <- TCGAquery_SampleTypes(barcode = Sample_sel,typesample = "NT") Sample_sel_short <- c(Sample_sel_TP[1:10], Sample_sel_NT[1:10]) # we need to create a new query with the selected barcodes system.time( query_down <- GDCquery( project = paste0("TCGA-",cancerType), data.category = "Gene expression", data.type = "Gene expression quantification", experimental.strategy = "RNA-Seq", platform = "Illumina HiSeq", file.type = "results", barcode = Sample_sel_short, legacy = TRUE ) # 44.30 seconds on GO's machine )
Now that we have defined the query, we can download the respective data. Makse sure you have a directory to save this data defined beforehand. The data will be downloaded to the folder
TCGA-BRCA (the first value of
query_down$project) as a subdirectory of the data directory specified. In addition to downloading the data, this function will write a file called “MANIFEST.txt” to your current working directory, not to the directory you specify.
# DataDir is a directory to save your datasets DataDir <- "TCGAanalysis" # Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2 GDCdownload(query_down, directory = DataDir) # Prepare expression matrix with geneID in the rows and samples (barcode) in the columns # rsem.genes.results as values BRCARnaseqSE <- GDCprepare(query_down, directory = DataDir)
TCGAanalyze_Preprocessing() function performs Array-Array Intensity correlation (AAIC). It defines a square symmetric matrix of Spearman correlations among samples. According this matrix and boxplot of correlation samples by samples it is possible to find samples with low correlation that can be identified as possible outliers. The
cor.cut controls the correlation threshold to filter out outliers with correlation lower than the threshold. (Rather than require that you execute the above code, we have included the output as the
data("BRCARnaseqSE") dataPrep <- TCGAanalyze_Preprocessing( BRCARnaseqSE, cor.cut = 0.6 ) dataNorm <- TCGAanalyze_Normalization( tabDF = dataPrep, geneInfo = geneInfo, method = "gcContent" ) #> I Need about 5.3 seconds for this Complete Normalization Upper Quantile [Processing 80k elements /s] #> Step 1 of 4: newSeqExpressionSet ... #> Step 2 of 4: withinLaneNormalization ... #> Step 3 of 4: betweenLaneNormalization ... #> Step 4 of 4: .quantileNormalization ...
Now that we have preprocessed and normalized our data, we will inspect the sample boxplots.
colnames(dataPrep) <- substr(colnames(dataPrep), 1, 15) colnames(dataNorm) <- substr(colnames(dataNorm), 1, 15) par(mfrow = c(2, 1)) boxplot(dataPrep, outline = FALSE, las = 2) boxplot(dataNorm, outline = FALSE, las = 2)
par(mfrow = c(1, 1))
TCGAanalyze_Filtering allows the user to filter genes selecting a threshold. For istance returns all mRNA with mean across all samples, higher than the threshold defined as quantile mean across all samples.
dataFilt <- TCGAanalyze_Filtering( tabDF = dataNorm, method = "quantile", qnt.cut = 0.25 )
The result is shown below:
The result from TCGAanalyze_Preprocessing and TCGAanalyze_Normalization is shown below:
To retrieve other datatypes using TCGAbiolinks it is possible to follow the example for gene expression but replacing the parameters as shown in the following table:
|1||Gene expression quantification||Gene expression||Illumina HiSeq||results|
|2||Protein expression quantification||Protein expression||MDA_RPPA_Core||expression|
|3||Copy number segmentation||Copy number variation||Affymetrix SNP Array 6.0||nocnv_hg19.seg|
|4||Methylation Beta Value||DNA methylation||Illumina Human Methylation 450||HumanMethylation450|
|5||Simple somatic mutation||Simple nucleotide variation||NA||NA|
|6||miRNA gene quantification||Gene expression||Illumina HiSeq||hg19.mirbase20.mirna.quantification|
GTEx gene expression data
The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 53 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. [https://gtexportal.org/home/] You can easily search GTEx samples, download and prepare a matrix of gene expression [RNA-Seq].
# Genotype-Tissue Expression (GTEx) project, which is another large-scale # repository cataloging gene expression from healthy individuals. # In the following example we will retrieve data from brain's tissue. data_gtex_brain <- TCGAquery_recount2("gtex", tissue = "brain") Gtex_brain_annotation <- colData(data_gtex_brain$gtex_brain) sort(table(Gtex_brain_annotation$smtsd)) data_gtex_brain_GeneExpr <- assay(data_gtex_brain$gtex_brain,1) # data_gtex_brain_GeneExpr[1:5,1:5] #SRR2166176 SRR2167642 SRR607445 SRR663753 SRR1485892 #ENSG00000000003.14 98669 122656 74728 56933 69009 #ENSG00000000005.5 764 1018 760 2105 151 #ENSG00000000419.12 1076085 1034842 56668 207245 124647 #ENSG00000000457.13 746248 752634 48502 121776 66762 #ENSG00000000460.16 603111 588176 25143 55928 34473
IHEC gene expression data
The International Human Epigenome Consortium (IHEC) is a global consortium with the primary goal of providing free access to high-resolution reference human epigenome maps for normal and disease cell types to the research community. [http://ihec-epigenomes.org/] IHEC provides data currently available from the following consortia: * AMED-CREST (Japan) * Blueprint (Europe) * CEEHRC (Canada) * DEEP (Germany) * ENCODE (USA) * GIS (Singapore) * KNIH (South Korea) * Roadmap (USA)
Within IHEC, the BLUEPRINT consortium has been formed with the aim to further the understanding of how genes are activated or repressed in both healthy and diseased human cells. Specifically, BLUEPRINT includes data on distinct types of haematopoietic cells from healthy individuals and their malignant leukaemic counterparts.
library(DeepBlueR) #> Loading required package: XML #> Loading required package: RCurl #> Loading required package: bitops #> Welcome to the DeepBlueR package #> DeepBlue is online
You can easily search IHEC BLUEPRINT samples, download and prepare a matrix of gene expression.
# List all IHEC samples IHEC_samples <- deepblue_list_samples() #table(IHEC_samples$source) # BLUEPRINT Epigenome Blueprint HSC differentiation BLUEPRINT Progenitors # 1451 76 76 # CEEHRC ChIP-Atlas CREST # 529 497 38 # DEEP DEEP (IHEC) ENCODE # 194 39 3317 # ENCODE FTP Roadmap Epigenomics TEPIC reprocessed IHEC data # 9 144 28 # List all BLUEPRINT samples blueprint_samples <- deepblue_list_samples( extra_metadata = list("source" = "BLUEPRINT Epigenome")) # Extract their ids blueprint_samples_ids <- deepblue_extract_ids(blueprint_samples) # Select gene expression data. We assign gene names using Gencode 22 gene_exprs_query <- deepblue_select_expressions(sample_ids = blueprint_samples_ids, expression_type = "gene", gene_model = "gencode v22") # We request the data and define the output format request = deepblue_get_regions(query_id = gene_exprs_query, "@GENE_ID(gencode v22),FPKM,@BIOSOURCE,@SAMPLE_ID") # We download the data gene_regions <- deepblue_download_request_data(request) # We retain a table mapping sample ids to bisources sample_names <- dplyr::select(gene_regions, `@BIOSOURCE`, `@SAMPLE_ID`) %>% dplyr::distinct() # We filter out duplicated gene entries genes_one_sample <- dplyr::filter(gene_regions, `@SAMPLE_ID` == "s10678") duplicated_genes <- genes_one_sample[ which(duplicated(genes_one_sample$`@GENE_ID(gencode v22)`)), "@GENE_ID(gencode v22)"] # We convert the gene expression from a list to a data frame and subsequently... genes_matrix = dplyr::filter(gene_regions, !(`@GENE_ID(gencode v22)` %in% duplicated_genes)) %>% dplyr::select(-`@BIOSOURCE`) %>% tidyr::spread(key = `@SAMPLE_ID`, value = FPKM) # ...to a numeric matrix genes <- genes_matrix[,1] genes_matrix <- data.matrix(genes_matrix[,-1]) rownames(genes_matrix) <- genes ### OUTPUT ### genes_matrix : The gene expression matrix for all 276 BLUEPRINT samples ### sample_names : A mapping table from sample id to cell type / biosource IHEC_genes_matrix <- genes_matrix IHEC_sample_names <- sample_names save(IHEC_genes_matrix, IHEC_sample_names, file = "IHEC_BLUEPRINT_geneExpr.Rdata") #> IHEC_genes_matrix[1:5,1:5] # s10271 s10272 s10275 s10276 s10279 #ENSG00000000003.13 0.01 0.03 2.39 0.70 0.05 #ENSG00000000005.5 0.00 0.00 0.00 0.00 0.00 #ENSG00000000419.11 30.04 17.49 16.36 5.80 11.99 #ENSG00000000457.12 3.29 2.63 2.70 0.78 1.91 #ENSG00000000460.15 3.16 2.06 1.36 0.49 1.65 #> head(IHEC_sample_names) # @BIOSOURCE @SAMPLE_ID #1: CD8-positive, alpha-beta T cell s10506 #2: germinal center B cell s10678 #3: common myeloid progenitor s10518 #4: adult endothelial progenitor cell s11176 #5: neutrophilic myelocyte s10331 #6: common lymphoid progenitor s10486
GEO gene expression dataset
GEO Gene Expression Omnibus (GEO) is the public repository for high throughput data at NCBI. [https://www.ncbi.nlm.nih.gov/geo/] Here we show how to retrieve GEOs’ samples, download and prepare a matrix of gene expression. In particular to retrieve the gene expression matrix for different cell types from Nazor et al. (2012, PMID: 22560082).
library(MoonlightR) library(GEOquery) library(devtools) # get gene expression data dataNazor <- getDataGEO(GEOobject = "GSE30652", platform = "GPL6947") require(SummarizedExperiment) GSE30652 <- as.data.frame(exprs(dataNazor)) # summarize gene expression data from multiple probes using means GSE30652_non_norm <- cbind(ILMN = rownames(GSE30652), IDmean = rowMeans(GSE30652), GSE30652) # get phenotype data dataNazor_samples <- pData(dataNazor) dataNazor_samples <- as.data.frame(dataNazor_samples) dataNazor_samples <- subset(dataNazor_samples, select = c("geo_accession","characteristics_ch1.2")) colnames(dataNazor_samples) <- "CellType" dataNazor_samples$CellType <- gsub("cell type: ","",dataNazor_samples$CellType) dataNazor_samples$CellType <- gsub(", undifferentiated","",dataNazor_samples$CellType) GPL6947_13512 <- fData(dataNazor) GPL6947_13512_annot <- as.data.frame(GPL6947_13512) GPL6947_13512_annot <- subset(GPL6947_13512_annot, select = c("ID","Gene.symbol")) GSE30652_merge <- merge(x = GPL6947_13512_annot, y = GSE30652_non_norm, by.x = "ID", by.y = "ILMN") GSE30652_merge <- GSE30652_merge[order(GSE30652_merge$IDmean,decreasing = TRUE),] GSE30652_merge <- GSE30652_merge[!duplicated(GSE30652_merge$Gene.symbol),] NazorMatrix <- GSE30652_merge rownames(NazorMatrix) <- NazorMatrix$Gene.symbol NazorMatrix <- NazorMatrix[,dataNazor_samples$geo_accession] save(NazorMatrix,dataNazor_samples, file = "Nazor_GSE30652.Rdata")
Oncogenic processes and driver genes using MoonlightR
In this section, we illustrate the moonlightR analysis pipeline using the BRCA data we downloaded from previous section. In order to make light of cancer development, it is crucial to identify the genes involved in the disease mechanisms and moreover understand what functional roles it plays. Common biological processes such as proliferation and apoptosis have been linked to cancer progression. MoonlightR is a new approach to define tumor suppressor genes (TSGs) and oncogenes (OCGs) based on functional enrichment analysis, inference of gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer.
In the MoonlightR pipeline, first, based on expression data we perform functional enrichment analysis, infer gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer. We then use these scores to predict two specific roles for genes: genes that act as TSGs and genes that act as OCGs. This methodology not only allows us to identify genes with dual roles (TSG in one cancer type and OCG in another) but also to elucidate the underlying biological processes they are involved with.
Here we describe two Moonlight approaches: Moonlight-EB (expert based) and Moonlight-ML (machine learning). The EB and ML approaches share the following three initial steps (Fig 1b, Methods):
(i) DPA: Moonlight identifies a set of Differentially Expressed Genes (DEGs) between two conditions;
(ii) GRN: The gene expression data is used to infer a Gene Regulatory Network (GRN) with the DEGs as vertices and (iii) FEA: using Functional Enrichment Analysis (FEA), Moonlight considers a DEG in a biological system and quantifies the DEG-BP (biological process) association with a Moonlight Z-score. (iv) URA/PRA: Finally, we input DEGs and their GRN to Upstream Regulatory Analysis (URA), yielding upstream regulators of BPs mediated by the DEG and its targets. The second part of the pipeline’s tool provides Pattern Recognition Analysis (PRA) that incorporates two approaches.
In the first approach, PRA takes in two objects: (i) URA’s output and (ii) selection of a subset of the BP provided by the end-user. In contrast, if the BPs are not provided, their selection is automated by a ML method (e.g. random forest model) trained on gold standard oncogenes (OCG) and tumor-suppressor genes (TSG) in the second approach. In addition, dynamic recognition analysis (DRA) detects multiple patterns of BPs when different conditions are selected.
Moonlight’s pipeline is shown below::