Integrative Pathway Analysis with pathwayPCA
With the advance in high-throughput technology for molecular assays, multi-omics datasets have become increasingly available. In this workshop, we will demonstrate how to use the pathwayPCA package to perform integrative pathway-based analyses of multi-omics datasets. In particular, three case studies will show how pathwayPCA can be used to
- perform pathway analysis with gene selection,
- estimate and visualize sample-specific pathway activities in ovarian cancer,
- integrate multi-omics datasets to identify driver genes, and
- identify pathways with sex-specific effects in kidney cancer.
- Basic knowledge of R syntax
- Familiarity with RStudio
- Familiarity with pathway analysis
- Knowledge of Principal Components Analysis (PCA) is helpful but not required
This will be a hands-on workshop in which students will live-code with us. It would be helpful for participants to bring a laptop with RStudio and the pathwayPCA package installed.
R/Bioconductor packages used
To install the workshop dependencies, assuming you have already installed Bioconductor:
install.packages(c("survival", "survminer", "tidyverse"))
BiocManager::install("rWikiPathways")
BiocManager::install("pathwayPCA")
We also strongly recommend the RStudio integrated development environment as a user-friendly graphical wrapper for R. We require R version 3.6 or later.
This workshop package, Bioc2019pathwayPCA, depends on two packages: pathwayPCA and tidyverse. The tidyverse package suite is a collection of utility packages for data science, including forcats. We will make use of some data constructs and ideas from these packages, but we do not expect users to be intimately familiar with them.
library(survival)
library(survminer)
library(Bioc2019pathwayPCA)
(OPTIONAL) Development version
Because we are currently in the development phase for version 2 of this package, you can install the package from GitHub. In order to install a package from GitHub, you will need the devtools package and either Rtools (for Windows) or Xcode (for Mac). Then you can install the development version of the pathwayPCA package from GitHub using devtools.
|Case Study 1: Estimating sample-specific pathway activities||15m|
|Case Study 2: Pathway analysis for multi-omics datasets||10m|
|Case Study 3: Analyzing experiments with covariates and interactions||10m|
|Summary and Conclusion||5m|
|Questions and Comments||5m|
Workshop Goals and Objectives
- Describe PCA-based pathway analysis
- Apply the pathwayPCA workflow to typical gene expression and copy number variation data
- Perform integrative pathway-based analysis for multiple types of -omics data jointly
- Apply pathwayPCA to experiments with complex designs, such as those with covariate and/or interaction effects
- Identify pathways significantly associated with survival, binary, or continuous outcomes
- Estimate and visualize individual gene effects within a significant pathway
- Estimate and visualize sample-specific pathway activities
- Prioritize genes most likely to be driver (instead of passenger) genes using integrative analysis
- Identify pathways with significant sex-specific effects
pathwayPCA allows users to:
- Test pathway association with binary, continuous, or survival phenotypes.
- Compute principal components (PCs) based on selected genes. These estimated latent variables represent pathway activities for individual subjects, which can then be used to perform integrative pathway analysis, such as multi-omics analysis.
- Extract relevant genes that drive pathway significance (as well as data corresponding to these relevant genes) using the SuperPCA and AESPCA approaches for additional in-depth analysis.
- Analyze studies with complex experimental designs, with multiple covariates, and with interaction effects, e.g., testing whether pathway association with clinical phenotype is different between male and female subjects.
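The core idea behind these capabilities can be sketched in a few lines of base R. The example below is only an illustration under stated assumptions: the data are simulated, the pathway membership is hypothetical, and ordinary PCA (`prcomp()`) stands in for the AESPCA and SuperPCA estimators that pathwayPCA actually uses.

```r
# Simulated data: 50 subjects (rows) by 20 genes (columns); all values are made up
set.seed(1)
n <- 50
assay <- matrix(
  rnorm(n * 20), nrow = n,
  dimnames = list(paste0("S", 1:n), paste0("G", 1:20))
)

# Hypothetical pathway containing five of the genes
pathway_genes <- c("G1", "G2", "G3", "G4", "G5")

# A shared latent signal makes the pathway genes correlated
latent <- rnorm(n)
assay[, pathway_genes] <- assay[, pathway_genes] + latent

# Continuous phenotype driven by the latent pathway signal
phenotype <- 2 * latent + rnorm(n)

# Step 1: subset the assay to the pathway, then extract the first PC
pca <- prcomp(assay[, pathway_genes], center = TRUE, scale. = TRUE)
pc1 <- pca$x[, 1]  # one pathway activity score per subject

# Step 2: test the association between PC1 and the phenotype
fit <- lm(phenotype ~ pc1)
p_value <- summary(fit)$coefficients["pc1", "Pr(>|t|)"]
```

Because `pc1` holds one score per subject, the same vector can be reused for subject-level plots or joined with scores computed from another -omics assay.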
How does pathwayPCA fit into the plethora of pathway analysis tools available?
pathwayPCA tests the self-contained null hypothesis (Q1) (PMID: 21565265); that is, the features (e.g. genes) in a pathway are not associated with the disease phenotype.
- In contrast, many popular pathway analysis tools, such as DAVID, IPA, Enrichr, and goseq, perform over-representation analysis and test the competitive null hypothesis (Q2); that is, the genes in a pathway show the same magnitude of associations with the disease phenotype compared with genes in the rest of the genome.
- When the “real” causal genes are fully contained in one particular pathway, testing Q1 and Q2 are approximately the same. However, when genes in multiple pathways are associated with the disease (as in many cancer studies) or when causal genes are shared by multiple gene sets, using competitive tests that compare pathway association signals with the rest of the genome may result in loss of power.
pathwayPCA models correlations within pathways when constructing PCA scores.
- Many tools, including globaltest, ignore expert-based biological information and fail to consider gene-gene correlations contained in biological pathways.
- Other tools, such as mogsa, also perform integrative analysis using PCA. However, there is one important difference: these tools typically extract PCs from all the genes in the dataset, and then map selected genes (e.g. those with nonzero loadings) to each pathway. In contrast, pathwayPCA subsets genes for each pathway first, then extracts PCs for each pathway; this helps to reduce noisy signals from irrelevant genes and improves the sensitivity and specificity (with respect to association with phenotype) of the PCs.
pathwayPCA provides seamless multi-omics integration via estimated subject-specific pathway activity.
- Some tools require additional software other than R packages to run, such as CoGAPS (requires OpenMP).
Caveats for using pathwayPCA
- Missing values in -omics data need to be imputed prior to analysis.
- In general, PCA is a poor choice for binary data. Therefore, pathwayPCA is a poor choice for GISTIC calls (copy number data) or mutation data.
pathwayPCA uses four main groups of functions: data import and wrangling, object creation, object analysis, and analysis inspection. The main functions are:
- read_gmt() imports a .gmt file as a pathway collection.
- SE2Tidy() extracts an assay from a SummarizedExperiment object and turns it into a “tidy” data frame.
- TransposeAssay() is a variant of the base t() function designed specifically for data frames and tibbles. It preserves row and column names after transposition.
- CreateOmics() takes in a collection of pathways, a single -omics assay, and a clinical response data frame and creates a data object of class Omics*.
- SubsetPathwayData() extracts the pathway-specific assay values and responses for a given pathway from an Omics* object.
- AESPCA_pVals() takes in an Omics* object and calculates pathway p-values (parametrically or non-parametrically), principal components, and loadings via AESPCA. This returns an object of class aespcOut.
- SuperPCA_pVals() takes in an Omics* object with valid response information and calculates pathway parametric p-values, principal components, and loadings via SuperPCA. This returns an object of class superpcOut.
- getPathPCLs() takes in an object of class aespcOut or superpcOut and the TERMS name of a pathway. This function extracts 1) the data frame of principal components and subject IDs for the given pathway, and 2) a data frame of sparse loadings and feature names for the given pathway.
- getPathpVals() takes in an object of class aespcOut or superpcOut and returns a table of the p-values and false discovery rates for each pathway.
We define our assay data format in R as follows:
- Let $X \in \mathbb{R}^{n \times p}$ be the observed single-omics data (gene expression, protein expression, copy-number variation, etc.).
- We follow Tidy Data design33 and record subjects in the rows and -ome features in the columns.
Use the two utility functions, TransposeAssay() and SE2Tidy(), to wrangle your assay data into the appropriate form for further analysis. Here is an example of “tidying” the first assay from a SummarizedExperiment object:
# DONT RUN
library(SummarizedExperiment)
data(airway, package = "airway")
airway_df <- SE2Tidy(airway, whichAssay = 1)
Based on expert knowledge of the problem at hand, curate a set of biological pathways $1, \ldots, K$, where pathway $k$ contains $p_k$ -ome measurements. These pathway collections are often stored in a .gmt file, a text file with each row corresponding to one pathway. Each row contains an ID (column TERMS), an optional description (column description), and the genes in the pathway (all subsequent columns). This file format is independently defined by the Broad Institute. Pathway collections in .gmt form can be downloaded from the MSigDB database.
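To make the row layout concrete, here is a minimal base R sketch that splits two made-up .gmt rows into the three pieces described above. The pathway IDs, descriptions, and gene names are hypothetical, and the sketch is not how pathwayPCA parses files; in practice, use read_gmt().

```r
# Two made-up pathways in .gmt layout: ID, description, then gene symbols,
# all separated by tabs
gmt_lines <- c(
  "PATH_A\tHypothetical pathway A\tGENE1\tGENE2\tGENE3",
  "PATH_B\tHypothetical pathway B\tGENE2\tGENE4"
)

# Split each row into its tab-separated fields
fields <- strsplit(gmt_lines, split = "\t", fixed = TRUE)

# Field 1 is the ID, field 2 the description, the rest are genes
toy_collection <- list(
  pathways    = lapply(fields, function(x) x[-(1:2)]),
  TERMS       = vapply(fields, function(x) x[1], character(1)),
  description = vapply(fields, function(x) x[2], character(1))
)
```

Rows may have different numbers of genes, which is why the pathways are stored as a list of character vectors rather than as a matrix.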
For WikiPathways, a collection of community-defined biological pathways, one can download monthly data releases in .gmt format using the downloadPathwayArchive() function in the rWikiPathways package from Bioconductor. For example, the following commands download the latest release of the human pathways from WikiPathways to your current directory (we do not execute this code):
# DONT RUN
library(rWikiPathways)
# library(XML) # necessary if you encounter an error with readHTMLTable
downloadPathwayArchive(
  organism = "Homo sapiens", format = "gmt"
)
trying URL 'http://data.wikipathways.org/current/gmt/wikipathways-20190110-gmt-Homo_sapiens.gmt'
Content type '' length 174868 bytes (170 KB)
downloaded 170 KB
#> [1] "wikipathways-20190110-gmt-Homo_sapiens.gmt"
pathwayPCA includes the June 2018 WikiPathways collection for Homo sapiens, which can be loaded using the read_gmt() function:
dataDir_path <- system.file(
  "extdata", package = "pathwayPCA", mustWork = TRUE
)
wikipathways_PC <- read_gmt(
  paste0(dataDir_path, "/wikipathways_human_symbol.gmt"),
  description = TRUE
)
This pathway collection contains:
pathways: a list of character vectors matching some of the names of the -omes recorded in the assay
TERMS: a character vector of the names of the pathways
description: (OPTIONAL) descriptions or URLs of the pathways
wikipathways_PC
#> Object with Class(es) 'pathwayCollection', 'list' [package 'pathwayPCA'] with 3 elements:
#>  $ pathways   :List of 457
#>  $ TERMS      : chr [1:457] "WP23" ...
#>  $ description: chr [1:457] "B Cell Receptor Signaling Pathway" ...
Pathway-based Adaptive, Elastic-net, Sparse PCA (AESPCA) combines the following methods into a single objective function: Elastic-Net, Adaptive Lasso, and Sparse PCA. AESPCA extracts principal components from pathway k which minimize this composite objective function. These extracted PCs represent activities within each pathway. The estimated latent variables are then tested against phenotypes using either a parametric or permutation test (permuting the sample responses). Note that the AESPCA approach does not use the response information to estimate pathway PCs, so it is an unsupervised approach.
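The permutation test mentioned above can be sketched as follows. This is an illustration only: the data are simulated, ordinary PCA (`prcomp()`) is used as a stand-in for the sparse AESPCA estimator, and the choice of test statistic and the number of permutations are assumptions made for this example rather than the package's internals.

```r
set.seed(2)
n <- 40

# Made-up pathway data: 6 correlated features for 40 subjects
latent <- rnorm(n)
pathway_data <- sapply(1:6, function(j) latent + rnorm(n))
response <- latent + rnorm(n)

# Unsupervised step: the PC is extracted without looking at the response
pc1 <- prcomp(pathway_data, scale. = TRUE)$x[, 1]

# Test statistic: absolute t-statistic for PC1 in a simple linear model
t_stat <- function(y, pc) {
  abs(summary(lm(y ~ pc))$coefficients["pc", "t value"])
}
observed <- t_stat(response, pc1)

# Permute the sample responses to build the null distribution
num_reps <- 500
null_stats <- replicate(num_reps, t_stat(sample(response), pc1))

# Permutation p-value, with the usual +1 correction to avoid zero
perm_p <- (1 + sum(null_stats >= observed)) / (1 + num_reps)
```

Because the PC does not depend on the response, it only needs to be computed once; each permutation re-fits just the small regression model.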
Pathway-based Supervised PCA (SuperPCA) ranks each feature in pathway k by its univariate relationship with the outcome of interest (survival time, tumor size, cancer subtype, etc.), then extracts principal components from the features most related to the outcome. Because of this gene selection step, the method is “supervised”, and test statistics from the SuperPCA model can no longer be approximated using the Student’s t-distribution. To account for the gene selection step, SuperPCA as implemented in pathwayPCA estimates p-values from a two-component mixture of Gumbel extreme value distributions instead.
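The gene selection step can be illustrated with a simplified base R sketch. Everything here is an assumption made for the example: the data are simulated, features are ranked by the absolute t-statistic from a univariate linear model, and the cutoff of three features is arbitrary. The sketch also stops before the p-value step, so it does not reproduce the Gumbel mixture approximation.

```r
set.seed(3)
n <- 60

# Made-up pathway with 10 features; only the first 3 carry signal
latent <- rnorm(n)
X <- matrix(
  rnorm(n * 10), nrow = n,
  dimnames = list(NULL, paste0("G", 1:10))
)
X[, 1:3] <- X[, 1:3] + latent
y <- latent + rnorm(n, sd = 0.5)

# Rank features by the absolute t-statistic from univariate regressions
scores <- apply(X, 2, function(g) {
  abs(summary(lm(y ~ g))$coefficients["g", "t value"])
})

# Keep the top-ranked features (the cutoff of 3 is arbitrary here)
selected <- names(sort(scores, decreasing = TRUE))[1:3]

# Extract the first PC from the selected features only
super_pc1 <- prcomp(X[, selected], scale. = TRUE)$x[, 1]
```

Since the features entering the PC were chosen because they relate to `y`, a naive t-test of `y` on `super_pc1` would be optimistic, which is the reason a different reference distribution is needed.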
Case Study 1: Identifying Significant Pathways in Protein Expressions Associated with Survival Outcome in Ovarian Cancer
Ovarian cancer dataset
For this example, we will use the mass spectrometry based global proteomics data for ovarian cancer recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The gene-level, log-ratio normalized protein abundance expression dataset for tumor samples can be obtained from the LinkedOmics database. We used the dataset “Proteome (PNNL, Gene level)” which was generated by the Pacific Northwest National Laboratory (PNNL). One subject was removed due to missing survival outcome. Proteins with greater than 20% missingness were removed from the assay. The remaining missing values were imputed using the Bioconductor package impute under default settings. The final dataset consisted of 5162 protein expression values for 83 samples.
Omics data object for pathway analysis
First, we need to create an Omics-class data object that stores
- the expression dataset
- phenotype information for the samples
- a collection of pathways
Expression and phenotype data
We can obtain protein expression and phenotype datasets for the TCGA ovarian cancer dataset by downloading the raw files from the LinkedOmics website. However, for ease and to save time, we include cleaned and imputed versions of these datasets within this package (for the cleaning process, please see the clean_multi_omics.R file).
The ovProteome_df dataset is a data frame with protein expression levels and TCGA sample IDs. The variables (columns) include expression data for 4763 proteins for each of the 83 primary tumour samples.
data("ovProteome_df")
ovProteome_df[, 1:5]
#> # A tibble: 83 x 5
#>    Sample          A1BG    A2M    AAAS    AAK1
#>    <chr>          <dbl>  <dbl>   <dbl>   <dbl>
#>  1 TCGA-13-1489  0.0154 -0.419  0.0359  0.014
#>  2 TCGA-42-2590 -0.508  -0.769 -0.0443  0.109
#>  3 TCGA-36-2529 -0.148   0.209 -0.212   0.599
#>  4 TCGA-24-1105  0.302  -0.343  0.116  -0.403
#>  5 TCGA-29-1785 -0.584  -0.353 -0.192  -0.220
#>  6 TCGA-24-2290 -0.954  -0.670 -0.0364  0.590
#>  7 TCGA-24-1428 -0.352  -1.34   0.0283 -0.200
#>  8 TCGA-24-1923  0.245  -0.471  0.0789 -0.0568
#>  9 TCGA-24-1563  0.0212  0.242  0.0636  0.785
#> 10 TCGA-24-1430 -0.205  -0.154  0.0028  0.105
#> # … with 73 more rows
The ovPheno_df data frame contains TCGA sample IDs, overall survival time, and overall survival censoring status.
data("ovPheno_df")
ovPheno_df
#> # A tibble: 565 x 3
#>    Sample       OS_time OS_status
#>    <chr>          <dbl>     <int>
#>  1 TCGA-3P-A9WA     420         0
#>  2 TCGA-59-A5PD     624         1
#>  3 TCGA-5X-AA5U     361         0
#>  4 TCGA-04-1331    1336         1
#>  5 TCGA-04-1332    1247         1
#>  6 TCGA-04-1335      55         1
#>  7 TCGA-04-1336    1495         0
#>  8 TCGA-04-1337      61         1
#>  9 TCGA-04-1338    1418         0
#> 10 TCGA-04-1341      33         0
#> # … with 555 more rows
OmicsSurv data container
Now that we have these three data components (pathway collection, proteomics, and clinical responses), we create an OmicsSurv data container. Note that when assayData_df and response are supplied from two different files, the user must match and merge these data sets by sample IDs. The assay and response must have matching row names; otherwise, the function will raise an error.
ov_OmicsSurv <- CreateOmics(
  # protein expression data
  assayData_df = ovProteome_df,
  # pathway collection
  pathwayCollection_ls = wikipathways_PC,
  # survival phenotypes
  response = ovPheno_df,
  # phenotype is survival data
  respType = "survival",
  # retain pathways with >= 5 proteins
  minPathSize = 5
)
There are 83 samples shared by the assay and phenotype data.
====== Creating object of class OmicsSurv =======
The input pathway database included 5831 unique features.
The input assay dataset included 4763 features.
Only pathways with at least 5 or more features included in the assay dataset are tested (specified by minPathSize parameter). There are 312 pathways which meet this criterion.
Because pathwayPCA is a self-contained test (PMID: 17303618), only features in both assay data and pathway database are considered for analysis. There are 1936 such features shared by the input assay and pathway database.
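When the assay and phenotype tables come from different files, one way to line them up by sample ID before building the container is sketched below in base R. The two small data frames are made up for illustration; only the column names Sample, OS_time, and OS_status mirror the data shown above.

```r
# Made-up assay and phenotype tables with partially overlapping sample IDs
assay_df <- data.frame(
  Sample = c("S1", "S2", "S3", "S4"),
  GENE1  = c(0.2, -1.1, 0.5, 0.9),
  GENE2  = c(1.3, 0.4, -0.7, 0.1)
)
pheno_df <- data.frame(
  Sample    = c("S4", "S2", "S5", "S1"),
  OS_time   = c(420, 624, 361, 1336),
  OS_status = c(0, 1, 0, 1)
)

# Keep only the samples present in both tables
shared_ids <- intersect(assay_df$Sample, pheno_df$Sample)

# Reorder both tables so that row i refers to the same sample in each
assay_matched <- assay_df[match(shared_ids, assay_df$Sample), ]
pheno_matched <- pheno_df[match(shared_ids, pheno_df$Sample), ]
```

Checking `identical(assay_matched$Sample, pheno_matched$Sample)` before analysis is a cheap safeguard against silently misaligned responses.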
To see a summary of the Omics data object we just created, simply type the name of the object:
ov_OmicsSurv
#> Formal class 'OmicsSurv' [package "pathwayPCA"] with 6 slots
#>   ..@ eventTime            : num [1:83] 2279 3785 2154 2553 2856 ...
#>   ..@ eventObserved        : logi [1:83] TRUE TRUE TRUE TRUE TRUE TRUE ...
#>   ..@ assayData_df         :Classes 'tbl_df', 'tbl' and 'data.frame': 83 obs. of 4763 variables:
#>   ..@ sampleIDs_char       : chr [1:83] "TCGA-09-1664" "TCGA-13-1484" "TCGA-13-1488" "TCGA-13-1489" ...
#>   ..@ pathwayCollection    :List of 4
#>   .. ..- attr(*, "class")= chr [1:2] "pathwayCollection" "list"
#>   ..@ trimPathwayCollection:List of 5
#>   .. ..- attr(*, "class")= chr [1:2] "pathwayCollection" "list"
Testing pathway association with phenotypes
Once we have a valid Omics object, we can perform pathway analysis using the AESPCA or SuperPCA methodology as described above. Because the syntax for performing SuperPCA is nearly identical to the AESPCA syntax, we will illustrate only the AESPCA workflow below.
We estimate pathway significance via the following model: phenotype ~ intercept + PC1. Pathway p-values are estimated based on a likelihood ratio test that compares this model to the null model phenotype ~ intercept. Note that when the value supplied to the numReps argument is greater than 0, the AESPCA_pVals() function employs a non-parametric, permutation-based test to assess the statistical significance of this model.
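The model comparison above can be sketched in base R. This is a hedged illustration: the PC scores and phenotype are simulated, and a Gaussian model is used only so that the example runs without extra packages. For the survival outcome in this case study, the analogous comparison would be between Cox models.

```r
set.seed(4)
n <- 80
pc1 <- rnorm(n)                     # stand-in for a pathway's first PC
phenotype <- 0.5 * pc1 + rnorm(n)   # made-up continuous phenotype

null_model <- glm(phenotype ~ 1)    # phenotype ~ intercept
full_model <- glm(phenotype ~ pc1)  # phenotype ~ intercept + PC1

# Likelihood ratio statistic: twice the gain in log-likelihood
lr_stat <- as.numeric(2 * (logLik(full_model) - logLik(null_model)))

# Compare against a chi-squared distribution with 1 degree of freedom,
# because the full model has one extra parameter
lr_p <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
```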
ovarian_aespcOut <- AESPCA_pVals(
  # The Omics data container
  object = ov_OmicsSurv,
  # One principal component per pathway
  numPCs = 1,
  # Use parallel computing with 2 cores
  parallel = TRUE,
  numCores = 2,
  # # Use serial computing
  # parallel = FALSE,
  # Estimate the p-values parametrically (AESPCA only)
  numReps = 0,
  # Control FDR via Benjamini-Hochberg
  adjustment = "BH"
)
Part 1: Calculate Pathway AES-PCs
Initializing Computing Cluster: DONE
Extracting Pathway PCs in Parallel: DONE
Part 2: Calculate Pathway p-Values
Initializing Computing Cluster: DONE
Extracting Pathway p-Values in Parallel: DONE
Part 3: Adjusting p-Values and Sorting Pathway p-Value Data Frame
DONE
The ovarian_aespcOut object contains 3 components: a table of pathway p-values, AESPCA-estimated PCs of each sample from each pathway, and the loadings of each protein onto the AESPCs.
names(ovarian_aespcOut)
#> [1] "pVals_df" "PCs_ls" "loadings_ls"
We haven’t yet added a print() method for these analysis outputs, so be careful: use the two accessor functions, getPathpVals() and getPathPCLs(), instead. We will show examples of these functions next.
Table of pathway analysis results
For this ovarian cancer dataset, the top 20 most significant pathways identified by AESPCA are:
|terms||description||rawp||FDR_BH|
|WP2036||TNF related weak inducer of apoptosis (TWEAK) Signaling Pathway||0.0019066||0.3301101|
|WP363||Wnt Signaling Pathway||0.0046961||0.3301101|
|WP3850||Factors and pathways affecting insulin-like growth factor (IGF1)-Akt signaling||0.0079715||0.3301101|
|WP2447||Amyotrophic lateral sclerosis (ALS)||0.0082310||0.3301101|
|WP262||EBV LMP1 signaling||0.0116029||0.3301101|
|WP2795||Cardiac Hypertrophic Response||0.0116402||0.3301101|
|WP2840||Hair Follicle Development: Cytodifferentiation (Part 3 of 3)||0.0120476||0.3301101|
|WP3617||Photodynamic therapy-induced NF-kB survival signaling||0.0128125||0.3301101|
|WP195||IL-1 signaling pathway||0.0144856||0.3301101|
|WP3680||Association Between Physico-Chemical Features and Toxicity Associated Pathways||0.0157976||0.3301101|
|WP75||Toll-like Receptor Signaling Pathway||0.0163042||0.3301101|
|WP2018||RANKL/RANK (Receptor activator of NFKB (ligand)) Signaling Pathway||0.0175655||0.3301101|
|WP2203||Thymic Stromal LymphoPoietin (TSLP) Signaling Pathway||0.0181492||0.3301101|
|WP2064||Neural Crest Differentiation||0.0181831||0.3301101|
|WP2877||Vitamin D Receptor Pathway||0.0204298||0.3301101|
|WP4136||Fibrin Complement Receptor 3 Signaling Pathway||0.0231842||0.3301101|
|WP4148||Splicing factor NOVA regulated synaptic proteins||0.0238131||0.3301101|
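The identical adjusted values in the last column of this table are a normal feature of the Benjamini-Hochberg adjustment, which forces adjusted p-values to be monotone in the raw p-values and therefore often produces ties. A small sketch with made-up raw p-values (not taken from this analysis) shows the effect using base R's `p.adjust()`:

```r
# Made-up raw p-values for six pathways, sorted from smallest to largest
raw_p <- c(0.001, 0.008, 0.039, 0.041, 0.27, 0.60)

# Benjamini-Hochberg adjustment
fdr_bh <- p.adjust(raw_p, method = "BH")

# The third and fourth adjusted values are tied at 0.0615, because the
# unconstrained value for the third (0.039 * 6 / 3 = 0.078) is capped by
# the fourth (0.041 * 6 / 4 = 0.0615)
fdr_bh
```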
(OPTIONAL) Column chart of significant pathways
Before constructing a graph of the p-values, we extract the top 10 pathways (the default value for numPaths is 20). Setting score = TRUE returns the negative natural logarithm of the unadjusted p-values for each pathway (to enhance graphical display of the top pathways).
ovOutGather_df <- getPathpVals(ovarian_aespcOut, score = TRUE, numPaths = 10)
Now we plot the pathway significance level for the top 10 pathways.
ggplot(ovOutGather_df) +
  # Set overall appearance of the plot
  theme_bw() +
  # Define the dependent and independent variables
  aes(x = reorder(terms, score), y = score) +
  # From the defined variables, create a vertical bar chart
  geom_col(position = "dodge", fill = "#005030", width = 0.7) +
  # Add a line showing the alpha = 0.0001 level (natural log scale, to match score)
  geom_hline(yintercept = -log(0.0001), size = 1, color = "#f47321") +
  # Add pathway labels
  geom_text(
    aes(x = reorder(terms, score), label = reorder(description, score), y = 0.1),
    color = "white", size = 4, hjust = 0
  ) +
  # Set main and axis titles
  ggtitle("AESPCA Significant Pathways: Ovarian Cancer") +
  xlab("Pathways") +
  ylab("Negative Log (p-Value)") +
  # Flip the x and y axes
  coord_flip()
Extract relevant genes from significant pathways
Because pathways are defined a priori, typically only a subset of genes within each pathway are relevant to the phenotype and contribute to pathway significance. In AESPCA, these relevant genes are the genes with nonzero loadings in the first AESPC.
For example, we know that the IL-1 Signaling Pathway is one of cancer’s “usual suspects”. For the “IL-1 signaling pathway” (WikiPathways WP195), we can extract the PCs and their protein loadings using the getPathPCLs() function:
wp195PCLs_ls <- getPathPCLs(ovarian_aespcOut, "WP195")
wp195PCLs_ls
#> $PCs
#> # A tibble: 83 x 2
#>    sampleID          V1
#>    <chr>          <dbl>
#>  1 TCGA-09-1664 -2.90
#>  2 TCGA-13-1484  0.0985
#>  3 TCGA-13-1488 -0.0724
#>  4 TCGA-13-1489  0.201
#>  5 TCGA-13-1494 -0.506
#>  6 TCGA-13-1495 -0.360
#>  7 TCGA-13-1499 -0.529
#>  8 TCGA-13-2071  2.11
#>  9 TCGA-23-1123 -0.874
#> 10 TCGA-23-1124  1.52
#> # … with 73 more rows
#>
#> $Loadings
#> # A tibble: 25 x 2
#>    featureID    PC1
#>    <chr>      <dbl>
#>  1 MAPK14    0
#>  2 AKT1      0
#>  3 IKBKB     0.531
#>  4 NFKB1     0.528
#>  5 PIK3R1    0.191
#>  6 PLCG1     0
#>  7 MAPK1     0.0882
#>  8 MAP2K1    0.300
#>  9 MAP2K2    0
#> 10 MAP2K6    0
#> # … with 15 more rows
#>
#> $pathway
#> [1] "path22"
#>
#> $term
#> [1] "WP195"
#>
#> $description
#> [1] "IL-1 signaling pathway"
The proteins with non-zero loadings can be extracted as follows:
wp195Loadings_df <-
  wp195PCLs_ls$Loadings %>%
  filter(PC1 != 0)
wp195Loadings_df
#> # A tibble: 8 x 2
#>   featureID    PC1
#>   <chr>      <dbl>
#> 1 IKBKB     0.531
#> 2 NFKB1     0.528
#> 3 PIK3R1    0.191
#> 4 MAPK1     0.0882
#> 5 MAP2K1    0.300
#> 6 TAB1      0.0992
#> 7 MAPK3     0.324
#> 8 SQSTM1    0.436
(OPTIONAL) Plot protein loadings for a single pathway
We can also prepare these loadings for graphics:
wp195Loadings_df <-
  wp195Loadings_df %>%
  # Sort loadings from strongest to weakest
  arrange(desc(abs(PC1))) %>%
  # Order the genes by loading strength
  mutate(featureID = factor(featureID, levels = featureID)) %>%
  # Add directional indicator (for colour)
  mutate(Direction = factor(ifelse(PC1 > 0, "Up", "Down")))
Now we will construct a column chart with ggplot2:
ggplot(data = wp195Loadings_df) +
  # Set overall appearance
  theme_bw() +
  # Define the dependent and independent variables
  aes(x = featureID, y = PC1, fill = Direction) +
  # From the defined variables, create a vertical bar chart
  geom_col(width = 0.5, fill = "#005030", color = "#f47321") +
  # Set main and axis titles
  labs(
    title = "Gene Loadings on IL-1 Signaling Pathway",
    x = "Protein",
    y = "Loadings of PC1"
  ) +
  # Remove the legend
  guides(fill = "none")
Alternatively, we can plot the correlation of each gene with the first PC. These correlations can be computed using the TidyCorrelation() function described in pathwayPCA’s vignette “Chapter 5 - Visualizing the Results”, Section 3.3.
Subject-specific PCA estimates
In the study of complex diseases, there is often considerable heterogeneity among different subjects with regard to underlying causes of disease and benefit of particular treatment. Therefore, in addition to identifying disease-relevant pathways for the entire patient group, successful (personalized) treatment regimens will also depend upon knowing if a particular pathway is dysregulated for an individual patient.
To this end, we can also assess subject-specific pathway activity. As we saw earlier, the getPathPCLs() function also returns subject-specific estimates for the individual pathway PCs. We can plot these as follows.
ggplot(data = wp195PCLs_ls$PCs) +
  # Set overall appearance
  theme_classic() +
  # Define the independent variable
  aes(x = V1) +
  # Add the histogram layer
  geom_histogram(bins = 10, color = "#005030", fill = "#f47321") +
  # Set main and axis titles
  labs(
    title = "Distribution of Sample-specific Estimate of Pathway Activities",
    subtitle = paste0(wp195PCLs_ls$pathway, ": ", wp195PCLs_ls$description),
    x = "PC1 Score for Each Sample",
    y = "Count"
  )
This graph shows there can be considerable heterogeneity in pathway activities between the patients.
Extract analysis dataset for entire pathway
Users are often also interested in examining the actual data used for analysis of the top pathways, especially for the relevant genes within the pathway. To extract this dataset, we can use the SubsetPathwayData() function. These commands extract data for the IL-1 signaling pathway:
wp195Data_df <- SubsetPathwayData(ov_OmicsSurv, "WP195")
wp195Data_df
#> # A tibble: 83 x 28
#>    sampleID EventTime EventObs  MAPK14    AKT1   IKBKB   NFKB1  PIK3R1
#>    <chr>        <dbl> <lgl>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 TCGA-09…      2279 TRUE     -0.995   0.958  -1.70   -0.730  -1.35
#>  2 TCGA-13…      3785 TRUE      1.71    0.575  -0.744   0.783   1.18
#>  3 TCGA-13…      2154 TRUE     -0.127  -0.542  -0.908   0.418   0.324
#>  4 TCGA-13…      2553 TRUE     -0.0591 -0.291  -0.416  -0.488  -2.37
#>  5 TCGA-13…      2856 TRUE      0.610   1.25   -0.943   0.394   0.309
#>  6 TCGA-13…      2749 TRUE      0.265   1.79   -0.515  -0.0499 -0.0504
#>  7 TCGA-13…      3500 FALSE     0.516  -0.519  -0.487   0.104   1.54
#>  8 TCGA-13…       773 TRUE     -0.107   0.323   1.05    1.27    1.35
#>  9 TCGA-23…      1018 TRUE     -0.419  -3.63    0.0398 -0.114  -0.0345
#> 10 TCGA-23…      1768 TRUE     -1.81   -0.0215  0.817   1.17    0.567
#> # … with 73 more rows, and 20 more variables: PLCG1 <dbl>, MAPK1 <dbl>,
#> #   MAP2K1 <dbl>, MAP2K2 <dbl>, MAP2K6 <dbl>, PTPN11 <dbl>, RELA <dbl>,
#> #   MAP3K7 <dbl>, IKBKG <dbl>, TAB1 <dbl>, IRAK4 <dbl>, ECSIT <dbl>,
#> #   TOLLIP <dbl>, MAPK3 <dbl>, MAP2K3 <dbl>, MAP2K4 <dbl>, TRAF6 <dbl>,
#> #   UBE2N <dbl>, SQSTM1 <dbl>, MAPKAPK2 <dbl>
Gene-specific CoxPH model
We can also perform analysis for individual genes belonging to the pathway:
library(survival)
NFKB1_df <-
  wp195Data_df %>%
  select(EventTime, EventObs, NFKB1)

wp195_mod <- coxph(
  Surv(EventTime, EventObs) ~ NFKB1,
  data = NFKB1_df
)
Now we inspect the output from the Cox PH model (we’ve included a prettified version of the output, produced with the prettify() function):
|coef||Hazard Ratio||CI (lower)||CI (upper)||se(coef)||z||Pr(>|z|)|
(OPTIONAL) Gene-specific survival curves
Additionally, we can estimate Kaplan-Meier survival curves for patients with high or low expression values for individual genes:
NFKB1_df <-
  NFKB1_df %>%
  # Group subjects by gene expression
  mutate(NFKB1_Expr = ifelse(NFKB1 > median(NFKB1), "High", "Low")) %>%
  # Re-code time to years
  mutate(EventTime = EventTime / 365.25) %>%
  # Ignore any events past 10 years
  filter(EventTime <= 10)

# Fit the survival model
NFKB1_fit <- survfit(
  Surv(EventTime, EventObs) ~ NFKB1_Expr,
  data = NFKB1_df
)
Finally, we can plot these K-M curves over NFKB1 protein expression.
ggsurvplot(
  NFKB1_fit,
  # No confidence intervals; add the p-value
  conf.int = FALSE, pval = TRUE,
  # Show times to median survival
  surv.median.line = "hv",
  xlab = "Time in Years",
  palette = c("#f47321", "#005030")
)
Case Study 2: An Integrative Multi-Omics Pathway Analysis of Ovarian Cancer Survival
While copy number alterations are common genomic aberrations in ovarian cancer, recent studies have shown these changes do not necessarily lead to concordant changes in protein expression. In Section 1.5.3 above, we illustrated testing pathway activities in protein expression against survival outcome. In this section, we will additionally test pathway activities in copy number against survival outcome. Moreover, we will perform integrative analysis to identify those survival-associated protein pathways, genes, and samples driven by copy number alterations.
Creating copy number Omics data object
We can identify copy number (CNV) pathways significantly associated with survival in the same way as we did for protein expressions. This gene-level CNV data was downloaded from the same LinkedOmics database.
data("ovCNV_df")
ovCNV_df[, 1:5]
#> # A tibble: 579 x 5
#>    Sample        ACAP3 ACTRT2   AGRN ANKRD65
#>    <chr>         <dbl>  <dbl>  <dbl>   <dbl>
#>  1 TCGA-04-1331 -0.703 -0.703 -0.703  -0.703
#>  2 TCGA-04-1332  0.08   0.08   0.08    0.08
#>  3 TCGA-04-1335 -0.807 -0.807 -0.807  -0.807
#>  4 TCGA-04-1336  0.101  0.101  0.101   0.101
#>  5 TCGA-04-1337  0.021  0.021  0.021   0.021
#>  6 TCGA-04-1338 -0.999 -0.999 -0.999  -0.999
#>  7 TCGA-04-1341 -0.421 -0.421 -0.421  -0.421
#>  8 TCGA-04-1342  0.089  0.089  0.089   0.089
#>  9 TCGA-04-1343  0.279  0.279  0.279   0.279
#> 10 TCGA-04-1346 -0.396 -0.396 -0.396  -0.396
#> # … with 569 more rows
And now we create an Omics data container. Note: these analysis steps take a little longer than the steps shown previously (3 minutes over 20 cores), so we include the AESPCA output directly. You don’t need to execute the code in the next two chunks.
# DONT RUN
ovCNV_Surv <- CreateOmics(
  assayData_df = ovCNV_df,
  pathwayCollection_ls = wikipathways_PC,
  response = ovPheno_df,
  respType = "survival",
  minPathSize = 5
)
1230 gene name(s) are invalid. Invalid name(s) are: ...
There are 549 samples shared by the assay and phenotype data.
====== Creating object of class OmicsSurv =======
The input pathway database included 5831 unique features.
The input assay dataset included 24776 features.
Only pathways with at least 5 or more features included in the assay dataset are tested (specified by minPathSize parameter). There are 424 pathways which meet this criterion.
Because pathwayPCA is a self-contained test (PMID: 17303618), only features in both assay data and pathway database are considered for analysis. There are 5637 such features shared by the input assay and pathway database.
AESPCA pathway analysis for copy-number data
Finally, we can apply the AESPCA method to this copy-number data container. Due to the large sample size, this will take a few moments.
# DONT RUN
ovCNV_aespcOut <- AESPCA_pVals(
  object = ovCNV_Surv,
  numPCs = 1,
  parallel = TRUE,
  numCores = 20,
  numReps = 0,
  adjustment = "BH"
)
Part 1: Calculate Pathway AES-PCs
Initializing Computing Cluster: DONE
Extracting Pathway PCs in Parallel: DONE
Part 2: Calculate Pathway p-Values
Initializing Computing Cluster: DONE
Extracting Pathway p-Values in Parallel: DONE
Part 3: Adjusting p-Values and Sorting Pathway p-Value Data Frame
DONE
Rather than executing this code yourself, you can use the output object that we have included with this package.
Combine significant pathways from CNV and protein analyses
Next, we identify the intersection of significant pathways based on both CNV and protein data. First, we will create a data frame of the pathway p-values from both CNV and proteomics.
# Copy Number
CNVpvals_df <-
  getPathpVals(ovCNV_aespcOut, alpha = 0.05) %>%
  mutate(rawp_CNV = rawp) %>%
  select(description, rawp_CNV)

# Proteomics
PROTpvals_df <-
  getPathpVals(ovarian_aespcOut, alpha = 0.05) %>%
  mutate(rawp_PROT = rawp) %>%
  select(description, rawp_PROT)

# Intersection
SigBoth_df <- inner_join(PROTpvals_df, CNVpvals_df, by = "description")
# Wnt Signaling Pathway is listed as WP363 and WP428
The results show that 32 pathways are significantly associated with survival in both CNV and protein data, which is significantly more than expected by chance (p-value = 0.00065; Fisher’s Exact Test; shown in multi_pathway_overlap_fishers.R). Here are the 10 most significant pathways (sorted by protein data significance):
|description||rawp_PROT||rawp_CNV|
|TNF related weak inducer of apoptosis (TWEAK) Signaling Pathway||0.0019066||0.0128430|
|Wnt Signaling Pathway||0.0046961||0.0303257|
|Wnt Signaling Pathway||0.0046961||0.0371408|
|Factors and pathways affecting insulin-like growth factor (IGF1)-Akt signaling||0.0079715||0.0272851|
|EBV LMP1 signaling||0.0116029||0.0498388|
|Cardiac Hypertrophic Response||0.0116402||0.0274414|
|Hair Follicle Development: Cytodifferentiation (Part 3 of 3)||0.0120476||0.0006223|
|IL-1 signaling pathway||0.0144856||0.0141512|
|RANKL/RANK (Receptor activator of NFKB (ligand)) Signaling Pathway||0.0175655||0.0003343|
|Thymic Stromal LymphoPoietin (TSLP) Signaling Pathway||0.0181492||0.0153220|
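The overlap test mentioned above can be sketched with base R's `fisher.test()`. The counts below are hypothetical and chosen only to show the mechanics; they are not the counts behind the reported p-value, and the one-sided alternative is an assumption made for this example (see multi_pathway_overlap_fishers.R for the actual analysis).

```r
# All counts below are hypothetical and chosen only for illustration
both_sig  <- 12   # pathways significant in both data types
prot_only <- 18   # significant in protein data only
cnv_only  <- 40   # significant in CNV data only
neither   <- 230  # significant in neither

# 2 x 2 table; matrix() fills column by column
overlap_table <- matrix(
  c(both_sig, prot_only, cnv_only, neither),
  nrow = 2,
  dimnames = list(
    CNV     = c("sig", "not_sig"),
    Protein = c("sig", "not_sig")
  )
)

# One-sided test: is the overlap larger than expected by chance?
overlap_test <- fisher.test(overlap_table, alternative = "greater")
overlap_p <- overlap_test$p.value
```

An odds ratio estimate above 1 (`overlap_test$estimate`) indicates that pathways significant in one data type are more likely to be significant in the other.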
Integrative pathway-based gene detection
Similar to the protein pathway analysis shown in Section 1.5.3, we can also identify relevant genes with nonzero loadings that drive pathway significance in CNV. The “IL-1 signaling pathway” (WP195) is significant in both CNV and protein data, so we look for genes with non-zero loadings in both.
# Copy Number Loadings
CNVwp195_ls <- getPathPCLs(ovCNV_aespcOut, "WP195")
CNV195load_df <-
  CNVwp195_ls$Loadings %>%
  filter(PC1 != 0) %>%
  rename(PC1_CNV = PC1)

# Protein Loadings
PROTwp195_ls <- getPathPCLs(ovarian_aespcOut, "WP195")
PROT195load_df <-
  PROTwp195_ls$Loadings %>%
  filter(PC1 != 0) %>%
  rename(PC1_PROT = PC1)

# Intersection
inner_join(CNV195load_df, PROT195load_df)
#> Joining, by = "featureID"
#> # A tibble: 5 x 3
#>   featureID PC1_CNV PC1_PROT
#>   <chr>       <dbl>    <dbl>
#> 1 IKBKB      0.0615   0.531
#> 2 NFKB1      0.133    0.528
#> 3 PIK3R1     0.0704   0.191
#> 4 TAB1       0.0384   0.0992
#> 5 MAPK3      0.0560   0.324
The results show that NFKB1, IKBKB, and other genes are selected by AESPCA when testing the IL-1 signaling pathway (WP195) against survival outcome in both the CNV and protein pathway analyses.
(OPTIONAL) Integrating sample-specific pathway activities
Also in Section 1.5.3, we have seen that there can be considerable heterogeneity in pathway activities between patients. One possible reason could be that copy number changes might not directly result in changes in protein expression for some of the patients.
pathwayPCA can be used to estimate pathway activities for each patient, for copy number, gene, and protein expressions separately. These estimates can then be viewed jointly using a Circos plot.