Shixiang Wang

>上士闻道
勤而行之

一个函数获取 TCGA/CCLE 单基因分子数据

王诗翔 · 2020-09-26

分类: r  
标签: r   单基因分析   Xena   cancer   genomics  

这篇文章很久之前就想写了,但一直因为一些事情拖延了。

在开发 UCSCXenaShiny 的基础上,我将其中支持的 UCSCXena TCGA/CCLE 单基因数据下载函数进行了整理,构建了一个单一的入口。这样即使用户无需加载 Shiny,也能够简单自在的下载 癌症单基因数据了。

这里单独说的 TCGA 不太全面,实际包含了 TCGA TARGET GTEx 3 个数据库,它们是个体水平的数据。而 CCLE 是细胞水平数据。

下载安装包

在国内我们统一推荐下载 Gitee 上的包:

remotes::install_git("https://gitee.com/XenaShiny/UCSCXenaShiny")

如果你已经安装 CRAN 上的 UCSCXenaShiny,也需要进行上面的操作,否则无法使用最新的函数。

参数说明

函数就一个 query_value()

简单看看有哪些参数:

library(UCSCXenaShiny)
args(query_value)
#> function (identifier, data_type = c("gene", "transcript", "protein", 
#>     "mutation", "cnv", "methylation"), database = c("toil", "ccle")) 
#> NULL

非常简单哈,只有 3 个:

使用

了解函数参数后,使用就根据自己所需就行了。如果还不懂,可以不断试错。

我们以 TP53 基因为例下载一些数据看看。

gene_expr <- query_value("TP53")
#> Running mode: client
#> =========================================================================================
#> UCSCXenaTools version 1.3.3
#> Project URL: https://github.com/ropensci/UCSCXenaTools
#> Usages: https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html
#> 
#> If you use it in published research, please cite:
#> Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data
#>   from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq.
#>   Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627
#> =========================================================================================
#>                               --Enjoy it--
#> Try querying data #1
#> -> Checking if the dataset has probeMap...
#> -> Done. ProbeMap is found.
#> Running mode: client
#> More info about dataset please run following commands:
#>   library(UCSCXenaTools)
#>   XenaGenerate(subset = XenaDatasets == "TcgaTargetGtex_rsem_gene_tpm") %>% XenaBrowse()

这个返回结果的结构:

str(gene_expr)
#> List of 2
#>  $ expression: Named num [1:19131] 4.79 5.89 5.52 4.43 2.38 ...
#>   ..- attr(*, "names")= chr [1:19131] "GTEX-S4Q7-0003-SM-3NM8M" "TCGA-19-1787-01" "TCGA-S9-A7J2-01" "GTEX-QV31-1626-SM-2S1QC" ...
#>  $ unit      : chr "log2(tpm+0.001)"

可以查看部分数据:

gene_expr$expression[1:5]
#> GTEX-S4Q7-0003-SM-3NM8M         TCGA-19-1787-01         TCGA-S9-A7J2-01 
#>                    4.79                    5.89                    5.52 
#> GTEX-QV31-1626-SM-2S1QC         TCGA-G3-A3CH-11 
#>                    4.43                    2.38

有了这个数据就可以结合病人的各种表型去做分析啦。

我们再看下它的 CNV 和 突变情况。

# CNV
gene_cnv <- query_value("TP53", data_type = "cnv")
#> Running mode: client
#> Try querying data #1
#> -> Checking if the dataset has probeMap...
#> -> Done. No probeMap found or error happened, use old way...
#> Running mode: client
#> More info about dataset please run following commands:
#>   library(UCSCXenaTools)
#>   XenaGenerate(subset = XenaDatasets == "TCGA.PANCAN.sampleMap/Gistic2_CopyNumber_Gistic2_all_thresholded.by_genes") %>% XenaBrowse()
gene_cnv$data[1:5]
#> TCGA-A5-A0GI-01 TCGA-S9-A7J2-01 TCGA-06-0150-01 TCGA-AR-A1AH-01 TCGA-EK-A2RE-01 
#>               0               0               0              -1               0

# 突变
gene_mut <- query_value("TP53", data_type = "mutation")
#> More info about dataset please run following commands:
#>   library(UCSCXenaTools)
#>   XenaGenerate(subset = XenaDatasets == "mc3.v0.2.8.PUBLIC.nonsilentGene.xena") %>% XenaBrowse()
#> Running mode: client
#> Try querying data #1
#> -> Checking if the dataset has probeMap...
#> -> Done. ProbeMap is found.
#> Running mode: client
gene_mut[1:5]
#> TCGA-02-0003-01 TCGA-02-0033-01 TCGA-02-0047-01 TCGA-02-0055-01 TCGA-02-2470-01 
#>               1               1               0               1               0

简单利用 tidyverse 包,我们可以将几种分子数据整合到一起:

library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
#> ✓ tibble  3.0.3     ✓ dplyr   1.0.2
#> ✓ tidyr   1.1.2     ✓ stringr 1.4.0
#> ✓ readr   1.3.1     ✓ forcats 0.5.0
#> ── Conflicts ─────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

expr <- dplyr::tibble(
  sample = names(gene_expr$expression),
  expr = as.numeric(gene_expr$expression)
)

cnv <- dplyr::tibble(
  sample = names(gene_cnv$data),
  cnv = as.numeric(gene_cnv$data)
)

mut <- dplyr::tibble(
  sample = names(gene_mut),
  mut = as.numeric(gene_mut)
)

tp53 <- purrr::reduce(list(expr, cnv, mut), dplyr::full_join, by = "sample")

最后简单画个图吧:

GGally::ggpairs(tp53[, -1])
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> Warning: Removed 1370 rows containing non-finite values (stat_density).
#> Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
#> Removed 11009 rows containing missing values
#> Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
#> Removed 12038 rows containing missing values
#> Warning: Removed 11009 rows containing missing values (geom_point).
#> Warning: Removed 9656 rows containing non-finite values (stat_density).
#> Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
#> Removed 11608 rows containing missing values
#> Warning: Removed 12038 rows containing missing values (geom_point).
#> Warning: Removed 11608 rows containing missing values (geom_point).
#> Warning: Removed 11397 rows containing non-finite values (stat_density).