Shixiang Wang

Obtain RNAseq Values for a Specific Gene in Xena Database

Shixiang Wang · 2020-07-22

Categories: bioinformatics  
Tags: r   package   UCSCXenaTools  

When using the UCSCXenaTools package, you may want to focus on single-gene analysis. A typical case is shown in my previous blog post, UCSCXenaTools: Retrieve Gene Expression and Clinical Information from UCSC Xena for Survival Analysis. Here I describe in detail how to get values for a single gene (especially RNAseq data).

Let’s load the package.

library(UCSCXenaTools)
#> =========================================================================================
#> UCSCXenaTools version 1.3.3
#> Project URL: https://github.com/ropensci/UCSCXenaTools
#> Usages: https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html
#> 
#> If you use it in published research, please cite:
#> Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data
#>   from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq.
#>   Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627
#> =========================================================================================
#>                               --Enjoy it--

First, Find Your Dataset of Interest

UCSC Xena provides more than 1000 datasets, so to get values for a single gene you must first select a target dataset. You can find one in the following table or on the UCSC Xena datasets page.

DT::datatable(UCSCXenaTools::XenaData)

Pick a dataset and note its XenaHosts and XenaDatasets values, i.e. its data hub host URL and dataset ID. You can copy them by hand, or you can use R to extract them and store them in an object. For example, a reader wanted to study the RNAseq values of a gene in TCGA LUAD.

I can use R:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
ge <- XenaData %>%
  filter(XenaHostNames == "tcgaHub") %>% # select TCGA Hub
  XenaScan("TCGA Lung Adenocarcinoma") %>%
  filter(DataSubtype == "gene expression RNAseq", Label == "IlluminaHiSeq")
str(ge)
#> tibble [1 × 17] (S3: tbl_df/tbl/data.frame)
#>  $ XenaHosts       : chr "https://tcga.xenahubs.net"
#>  $ XenaHostNames   : Named chr "tcgaHub"
#>   ..- attr(*, "names")= chr "https://tcga.xenahubs.net"
#>  $ XenaCohorts     : chr "TCGA Lung Adenocarcinoma (LUAD)"
#>  $ XenaDatasets    : chr "TCGA.LUAD.sampleMap/HiSeqV2"
#>  $ SampleCount     : int 576
#>  $ DataSubtype     : chr "gene expression RNAseq"
#>  $ Label           : chr "IlluminaHiSeq"
#>  $ Type            : chr "genomicMatrix"
#>  $ AnatomicalOrigin: chr "Lung"
#>  $ SampleType      : chr "tumor"
#>  $ Tags            : chr "cancer,non-small cell lung cancer"
#>  $ ProbeMap        : chr "probeMap/hugo_gencode_good_hg19_V24lift37_probemap"
#>  $ LongTitle       : chr "TCGA lung adenocarcinoma (LUAD) gene expression by RNAseq (polyA+ IlluminaHiSeq)"
#>  $ Citation        : chr NA
#>  $ Version         : chr "2017-10-13"
#>  $ Unit            : chr "log2(norm_count+1)"
#>  $ Platform        : chr "IlluminaHiSeq_RNASeqV2"

Or I can simply copy https://tcga.xenahubs.net and TCGA.LUAD.sampleMap/HiSeqV2.
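Because the later query expects a single host URL and a single dataset ID, it can help to confirm that the filter really left exactly one row. The sketch below is only an illustration: pick_one_dataset is a hypothetical helper (not part of UCSCXenaTools), and ge_toy is a stand-in data frame built from the two values printed by str(ge) above.

```r
# Toy one-row table using the values printed by str(ge) above;
# in a real session you would pass `ge` itself.
ge_toy <- data.frame(
  XenaHosts = "https://tcga.xenahubs.net",
  XenaDatasets = "TCGA.LUAD.sampleMap/HiSeqV2",
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of UCSCXenaTools): make sure the filter
# left exactly one dataset, then return its host URL and dataset ID.
pick_one_dataset <- function(tbl) {
  if (nrow(tbl) != 1) {
    stop("Expected exactly 1 dataset, got ", nrow(tbl), call. = FALSE)
  }
  list(host = tbl$XenaHosts, dataset = tbl$XenaDatasets)
}

target <- pick_one_dataset(ge_toy)
target$host
target$dataset
```

If the filter matches zero or several datasets, the helper stops with an error instead of silently passing a vector of hosts to the query.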

Get Your Gene Values

Once you have the dataset information, you can get the expression of a specific gene with fetch_dense_values (it also works for gene-level CNV, mutation, etc., depending on your dataset). Run ?fetch in your R console to see more details.

For example, I will query the gene TP53.

TP53 <- fetch_dense_values(
  host = ge$XenaHosts, # You can also set "https://tcga.xenahubs.net"
  dataset = ge$XenaDatasets, # You can also set "TCGA.LUAD.sampleMap/HiSeqV2"
  identifiers = "TP53",
  use_probeMap = TRUE
) %>%
  .[1, ]
#> -> Checking identifiers...
#> -> use_probeMap is TRUE, skipping checking identifiers...
#> -> Done.
#> -> Checking samples...
#> -> Done.
#> -> Checking if the dataset has probeMap...
#> -> Done. ProbeMap is found.
head(TP53)
#> TCGA-69-7978-01 TCGA-62-8399-01 TCGA-78-7539-01 TCGA-50-5931-11 TCGA-73-4658-01 
#>            9.89            8.31           10.35            9.62           10.02 
#> TCGA-44-6775-01 
#>           10.16
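The Unit field of this dataset is log2(norm_count+1), so the values above are on a log scale. If you need the normalized counts, you can invert the transform. The sketch below rebuilds the six values printed by head(TP53) as a named vector (tp53_head is a name introduced here for illustration); since the printed values are rounded, the back-transformed counts are only approximate.

```r
# The six values printed by head(TP53), rebuilt as a named vector
tp53_head <- c(
  "TCGA-69-7978-01" = 9.89,
  "TCGA-62-8399-01" = 8.31,
  "TCGA-78-7539-01" = 10.35,
  "TCGA-50-5931-11" = 9.62,
  "TCGA-73-4658-01" = 10.02,
  "TCGA-44-6775-01" = 10.16
)

# Invert log2(norm_count + 1): norm_count = 2^x - 1
tp53_norm_count <- 2^tp53_head - 1
round(tp53_norm_count)
```

In a real session you would apply the same expression to the full TP53 vector.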

Typically, a TCGA sample ID has 15 characters, and characters 14-15 encode the sample type: a code below 10 marks a tumor sample, otherwise the sample is normal.

table(as.integer(substr(names(TP53), 14, 15)))
#> 
#>   1   2  11 
#> 515   2  59
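The same rule can label each sample before a tumor versus normal comparison. This is a minimal sketch: classify_tcga_sample is a hypothetical helper (not part of UCSCXenaTools), and tp53_head reuses the six values printed by head(TP53) rather than the full vector.

```r
# The six values printed by head(TP53), rebuilt as a named vector
tp53_head <- c(
  "TCGA-69-7978-01" = 9.89,
  "TCGA-62-8399-01" = 8.31,
  "TCGA-78-7539-01" = 10.35,
  "TCGA-50-5931-11" = 9.62,
  "TCGA-73-4658-01" = 10.02,
  "TCGA-44-6775-01" = 10.16
)

# Hypothetical helper: characters 14-15 of the sample ID hold the
# sample type code; below 10 is tumor, otherwise normal.
classify_tcga_sample <- function(sample_ids) {
  code <- as.integer(substr(sample_ids, 14, 15))
  ifelse(code < 10, "tumor", "normal")
}

# One row per sample, ready for grouping or plotting
tp53_df <- data.frame(
  sample = names(tp53_head),
  expression = unname(tp53_head),
  type = classify_tcga_sample(names(tp53_head)),
  stringsAsFactors = FALSE
)
table(tp53_df$type)
```

With the full TP53 vector, replace tp53_head with TP53 to get one labelled row per sample.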

Now you can start your analysis with this data.

Other Things That May Help

In addition to the fetch_* functions, I generated many low-level API functions for the UCSC Xena database, which are described at https://shixiangwang.github.io/home/en/tools/ucscxenatools-api/. These functions can access different levels of data information in UCSC Xena. Some of them are combined to build the core functionality that UCSCXenaTools currently provides.

NOTE: not all API functions work well. I haven’t tested them all; they are all generated by dynamic code based on XQuery.

The R Shiny package UCSCXenaShiny provides a web-based platform to download datasets and analyze single genes. In addition, we have written some functions to get pan-cancer level single-gene expression, CNV, mutation, etc.

You can install the latest development version from GitHub with:

remotes::install_github("openbiox/UCSCXenaShiny")

After you load this package, you can use the following functions to get data easily.

get_ccle_cn_value: Fetch copy number value from CCLE dataset

get_ccle_gene_value: Fetch gene expression value from CCLE dataset

get_ccle_protein_value: Fetch gene protein expression value from CCLE dataset

get_ccle_mutation_status: Fetch gene mutation info from CCLE dataset

get_pancan_value: Fetch identifier value from pan-cancer dataset

get_pancan_gene_value: Fetch gene expression value from pan-cancer dataset

get_pancan_protein_value: Fetch protein expression value from pan-cancer dataset

get_pancan_mutation_status: Fetch mutation status value from pan-cancer dataset

get_pancan_cn_value: Fetch gene copy number value from pan-cancer dataset processed by GISTIC 2.0

Any questions can be posted online at https://github.com/openbiox/UCSCXenaShiny/issues or https://github.com/ropensci/UCSCXenaTools/issues.
