In this vignette, we would like to apply ezcox to a gene list containing oncogenes and tumor suppressors with TCGA datasets. All core features provided in ezcox will be presented. To focus on the presentation of ezcox features, we would only use mRNA expression data here. Without doubt, the features can be applied to any suitable variables, including:
- mRNA expression
- miRNA expression
- transcript expression
- protein expression
- mutation status
- methylation status
- promoter activity
- APOBEC activity
- clinical features
- …
All molecular profile data for public databases like TCGA can be obtained from UCSC Xena with UCSCXenaTools or UCSCXenaShiny.
library(UCSCXenaTools)
## =========================================================================================
## UCSCXenaTools version 1.4.7
## Project URL: https://github.com/ropensci/UCSCXenaTools
## Usages: https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html
##
## If you use it in published research, please cite:
## Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data
## from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq.
## Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627
## =========================================================================================
## --Enjoy it--
library(UCSCXenaShiny)
## =========================================================================================
## UCSCXenaShiny version 1.1.1
## Project URL: https://github.com/openbiox/UCSCXenaShiny
## Usages: https://openbiox.github.io/UCSCXenaShiny/
##
## If you use it in published research, please cite:
## Shixiang Wang, Yi Xiong, Longfei Zhao, Kai Gu, Yin Li, Fei Zhao, Jianfeng Li,
## Mingjie Wang, Haitao Wang, Ziyu Tao, Tao Wu, Yichao Zheng, Xuejun Li, Xue-Song Liu,
## UCSCXenaShiny: An R/CRAN Package for Interactive Analysis of UCSC Xena Data,
## Bioinformatics, 2021;, btab561, https://doi.org/10.1093/bioinformatics/btab561.
## =========================================================================================
## --Enjoy it--
library(ezcox)
## Welcome to 'ezcox' package!
## =======================================================================
## You are using ezcox version 1.0.0
##
## Project home : https://github.com/ShixiangWang/ezcox
## Documentation: https://shixiangwang.github.io/ezcox
##
## Run citation("ezcox") to see how to cite 'ezcox'.
## =======================================================================
##
library(readr)
library(dplyr)
##
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data preparation
Read gene list
Firstly, we obtain driver gene list from IntOGen (referece: A compendium of mutational cancer driver genes).
<- read_tsv("IntOGen-DriverGenes.tsv", col_types = cols())
gene_table <- gene_table$Symbol gene_list
Read gene expression
Next, we obtain the gene expression data in TCGA.
For just one gene, we can use the query_pancan_value()
in UCSCXenaShiny
package.
<- query_pancan_value(gene_list[1])
expr ## Try querying data #1
## -> Checking if the dataset has probeMap...
## -> Done. ProbeMap is found.
## More info about dataset please run following commands:
## library(UCSCXenaTools)
## XenaGenerate(subset = XenaDatasets == "TcgaTargetGtex_rsem_gene_tpm") %>% XenaBrowse()
$expression[1:5]
expr## GTEX-S4Q7-0003-SM-3NM8M TCGA-19-1787-01 TCGA-S9-A7J2-01
## 4.785 5.887 5.517
## GTEX-QV31-1626-SM-2S1QC TCGA-G3-A3CH-11
## 4.431 2.382
For a few genes, fetch_dense_values()
in UCSCXenaTools
is more convenient. However, you need to previously confirm the UCSC Xena host and dataset ID (you can obtain this from UCSC Xena or UCSCXenaShiny
Shiny interface).
<- "https://toil.xenahubs.net"
host <- "tcga_RSEM_gene_tpm"
dataset
<- fetch_dense_values(host, dataset, gene_list, use_probeMap = TRUE)
expr
save(expr, file = "expr.RData")
-> Obtaining gene symbols...
-> Checking identifiers...
://toil.xenahubs.net dataset tcga_RSEM_gene_tpm
The following identifiers have been removed from host https1] "CNOT9" "NSD2"
[-> Done.
-> Checking samples...
-> Done.
-> Checking if the dataset has probeMap...
-> Done. ProbeMap is found.
-> Query done.
Merge data
Here, we merge TCGA clinical data, survival data, expression data with inner join.
load("expr.RData")
<- setdiff(gene_list, c("CNOT9", "NSD2"))
gene_list
<- expr %>%
df t() %>%
as.data.frame() %>%
::rownames_to_column("sample") %>%
tibbleas_tibble() %>%
inner_join(tcga_surv, by = "sample") %>%
inner_join(tcga_clinical, by = "sample") %>%
rename(age = age_at_initial_pathologic_diagnosis)
head(df)
## # A tibble: 6 × 595
## sample TP53 KRAS PIK3CA BRAF LRP1B KMT2D APC ARID1A KMT2C PTEN NRAS
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TCGA-19-… 5.89 3.95 2.75 3.71 0.537 4.00 3.01 5.09 3.90 4.59 5.06
## 2 TCGA-S9-… 5.52 3.98 3.91 3.59 3.86 4.77 5.8 4.51 4.95 4.37 3.61
## 3 TCGA-G3-… 2.38 1.82 1.05 1.00 -4.61 1.25 0.576 2.28 1.55 3.86 2.55
## 4 TCGA-EK-… 4.59 3.61 2.48 2.15 -9.97 3.98 2.16 4.24 2.18 5.16 4.56
## 5 TCGA-44-… 5.30 4.06 3.35 3.47 -4.04 5.12 2.78 5.67 4.17 4.79 5.14
## 6 TCGA-F4-… 6.86 3.39 2.44 2.90 -9.97 4.66 1.18 4.34 3.94 5.28 4.44
## # … with 583 more variables: CTNNB1 <dbl>, CDKN2A <dbl>, IDH1 <dbl>, RB1 <dbl>,
## # ATM <dbl>, NF1 <dbl>, SMAD4 <dbl>, FAT1 <dbl>, FBXW7 <dbl>, NOTCH1 <dbl>,
## # FAT4 <dbl>, EGFR <dbl>, ARID2 <dbl>, CREBBP <dbl>, KDM6A <dbl>, FAT3 <dbl>,
## # ATRX <dbl>, SETD2 <dbl>, ERBB2 <dbl>, NFE2L2 <dbl>, SMARCA4 <dbl>,
## # CDH1 <dbl>, SF3B1 <dbl>, VHL <dbl>, KEAP1 <dbl>, SPOP <dbl>, PBRM1 <dbl>,
## # EP300 <dbl>, FOXA1 <dbl>, PIK3R1 <dbl>, AR <dbl>, MAP3K1 <dbl>,
## # TRRAP <dbl>, ERBB3 <dbl>, GATA3 <dbl>, FGFR3 <dbl>, HRAS <dbl>, …
Scaling
To obtain pretty HR value, here we scale gene expression to 1-5, so every 1 increase in gene expression indicates 20% expression increase. As for HR, e.g., HR = 2
means hazard ratio would double when a gene expression per 20% increase.
<- function(x, by = 0.2) {
scaling <- quantile(x, seq(0, 1, by)) # some variables may have same value for 10% 20% etc.
y <- y[!c(FALSE, diff(y) == 0)]
y as.integer(cut(x, breaks = y, include.lowest = TRUE))
}
<- df %>%
df mutate_at(gene_list, scaling)
Simple usage
In this section, we will show simple usage of ezcox by constructing single unvariable and multivariable Cox models.
Single unvariable model
Here we simply construct a Cox model for overall survival (OS) only with TP53
.
<- ezcox(df, covariates = "TP53", time = "OS.time", status = "OS")
rv ## => Processing variable TP53
## ==> Building Surv object...
## ==> Building Cox model...
## ==> Done.
rv## # A tibble: 1 × 12
## Variable is_control contrast_level ref_level n_contrast n_ref beta HR
## <chr> <lgl> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 TP53 FALSE TP53 TP53 10496 10496 0.0393 1.04
## # … with 4 more variables: lower_95 <dbl>, upper_95 <dbl>, p.value <dbl>,
## # global.pval <dbl>
Useful information about the result model has been output as a data.frame
.
If you just want to show a forest plot for this model, you can use the following command. You only need to change the function name!
show_forest(df, covariates = "TP53", time = "OS.time", status = "OS")
## => Processing variable TP53
## ==> Building Surv object...
## ==> Building Cox model...
## ==> Done.
## Resized limits to included dashed line in forest panel
To simplify, use:
show_forest(df,
covariates = "TP53", time = "OS.time", status = "OS",
merge_models = TRUE, add_caption = FALSE
)## => Processing variable TP53
## ==> Building Surv object...
## ==> Building Cox model...
## ==> Done.
## Resized limits to included dashed line in forest panel
The data shows when TP53 gene expression increases, the patient would have poor clinical outcome.
Actually, you may want to obtain the model behind the plot, you can use the command to return model firstly:
<- ezcox(df, covariates = "TP53", time = "OS.time", status = "OS", return_models = TRUE)
rv ## => Processing variable TP53
## ==> Building Surv object...
## ==> Building Cox model...
## ==> Done.
rv## $res
## # A tibble: 1 × 12
## Variable is_control contrast_level ref_level n_contrast n_ref beta HR
## <chr> <lgl> <chr> <chr> <int> <int> <dbl> <dbl>
## 1 TP53 FALSE TP53 TP53 10496 10496 0.0393 1.04
## # … with 4 more variables: lower_95 <dbl>, upper_95 <dbl>, p.value <dbl>,
## # global.pval <dbl>
##
## $models
## # A tibble: 1 × 5
## Variable control model_file model status
## <chr> <chr> <chr> <list> <lgl>
## 1 TP53 <NA> /Volumes/Extra/R/Rtmp//RtmpCWQCzZ/ezcox/ezcox_… <coxp… TRUE
##
## attr(,"class")
## [1] "ezcox" "list"
## attr(,"controls")
## character(0)
Then you can get it:
get_models(rv)
## $`Surv ~ TP53`
## Call:
## coxph(formula = fm, data = data)
##
## coef exp(coef) se(coef) z p
## TP53 0.03933 1.04012 0.01252 3.142 0.00168
##
## Likelihood ratio test=9.88 on 1 df, p=0.001669
## n= 10434, number of events= 3251
## (因为不存在,62个观察量被删除了)
##
## attr(,"class")
## [1] "ezcox_models" "list"
## attr(,"has_control")
## [1] FALSE
Single multivariable model
It’s very easy to construct a multivariable model by introducing another parameter controls
. For example, if we want to explore the TP53
expression with patients’ age and clinical stage also taking into consideration.
Let’s clean the data firstly:
<- df %>%
df mutate(
stage = case_when(
%in% c("I", "Stage I", "Stage IA1", "Stage IA2", "Stage IB", "Stage IB1", "Stage IB2", "Stage IC") ~ "I",
clinical_stage %in% c("IIa", "IIb", "Stage II", "Stage IIA", "Stage IIA1", "Stage IIA2", "Stage IIB", "Stage IIC") ~ "II",
clinical_stage %in% c("III", "Stage III", "Stage IIIA", "Stage IIIB", "Stage IIIC", "Stage IIIC1", "Stage IIIC2") ~ "III",
clinical_stage %in% c("IVa", "IVb", "Stage IV", "Stage IVA", "Stage IVB", "Stage IVC") ~ "IV",
clinical_stage TRUE ~ NA_character_
),stage = factor(stage, levels = c("I", "II", "III", "IV"))
)
<- ezcox(df,
rv covariates = "TP53", controls = c("age", "stage"),
time = "OS.time", status = "OS", return_models = TRUE
)## => Processing variable TP53
## ==> Building Surv object...
## ==> Building Cox model...
## ==> Done.
rv## $res
## # A tibble: 5 × 12
## Variable is_control contrast_level ref_level n_contrast n_ref beta HR
## <chr> <lgl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 TP53 FALSE TP53 TP53 10496 10496 -0.0404 0.96
## 2 TP53 TRUE age age 10445 10445 0.022 1.02
## 3 TP53 TRUE II I 423 463 0.457 1.58
## 4 TP53 TRUE III I 711 463 1.1 3
## 5 TP53 TRUE IV I 445 463 1.37 3.93
## # … with 4 more variables: lower_95 <dbl>, upper_95 <dbl>, p.value <dbl>,
## # global.pval <dbl>
##
## $models
## # A tibble: 1 × 5
## Variable control model_file model status
## <chr> <chr> <chr> <list> <lgl>
## 1 TP53 age,stage /Volumes/Extra/R/Rtmp//RtmpCWQCzZ/ezcox/ezco… <coxp… TRUE
##
## attr(,"class")
## [1] "ezcox" "list"
## attr(,"controls")
## [1] "age" "stage"
The code constructed a model as below:
names(get_models(rv))
## [1] "Surv ~ TP53 + age + stage"
For single model, we can exchange the role of covariable and control variables, they are the same model. The key point is how what we want to explore and how we treat and explain the model.
# If you passing the parameters below to ezcox() will return the same model
= "TP53", controls = c("age", "stage")
covariates = "age", controls = c("TP53", "stage")
covariates = "stage", controls = c("age", "TP53") covariates
Still, let’s show the result more clearly with forest plot:
show_forest(df,
covariates = "TP53", controls = c("age", "stage"),
time = "OS.time", status = "OS", merge_models = TRUE,
show_global_p = "bottom", add_caption = FALSE
)## => Processing variable TP53
## ==> Building Surv object...
## ==> Building Cox model...
## ==> Done.
## Warning in recalculate_width_panels(panel_positions, mapped_text =
## mapped_text, : Unable to resize forest panel to be smaller than its heading;
## consider a smaller text size
Ooops! Why TP53 has no significant association with OS when the clinical features have been included?
There are some reasons:
- The first reason we can read from the forest plots above. TP53 is not an independent variable, it has association with patient’s age and clinical stages. You could explore this with correlation analysis and box plots.
- In the model above, we ignored a very strong factor, i.e., “cancer type”. We already know that different tissues have different gene expression status. Actually, TP53 plays different roles in different cancer types. You can read some papers for more details.
Batch processing
Now, we are moving to the most powerful part of ezcox. Cox analysis is a common analysis technique to link variable to clinical outcome. In the omics era, Cox model batch processing is a basic strategy for screening clinical relevant variables, biomarker discovery and gene signature identification. However, all such analyses have been implemented with homebrew code in research community, thus lack of transparency and reproducibility. ezcox is born to address this issue for open science.
In batch processing, we provide execution in either sequential or parallel ways.
Considering only ~2,000 patients have stage status. The following code will only include age, cancer type, gender as control variables in the model construction, so most of the data (~10,000) can be utilized.
Sequential execution
The only parameter we need to change is covariates
, instead one variable, we pass a vector for batch processing.
<- ezcox(df,
bt covariates = gene_list, controls = c("age", "gender", "type"),
time = "OS.time", status = "OS",
keep_models = TRUE, return_models = FALSE
)# don't return models if you have many models in processing
That’s it.
::datatable(bt$res, caption = "Cox batch processing result for ~500 genes in TCGA dataset")
DT## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html