Obtain RNAseq Values for a Specific Gene in Xena Database
王诗翔 · 2020-07-22
Categories: bioinformatics
Tags: r, package, UCSCXenaTools
When using the UCSCXenaTools package, you may want to focus on single-gene analysis. A typical case is shown in my previous blog post, UCSCXenaTools: Retrieve Gene Expression and Clinical Information from UCSC Xena for Survival Analysis. Here I describe in detail how to get values for a single gene (especially RNAseq data).
Let’s load the package.
library(UCSCXenaTools)
#> =========================================================================================
#> UCSCXenaTools version 1.3.3
#> Project URL: https://github.com/ropensci/UCSCXenaTools
#> Usages: https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html
#>
#> If you use it in published research, please cite:
#> Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data
#> from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq.
#> Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627
#> =========================================================================================
#> --Enjoy it--
First, Find Your Dataset of Interest
UCSC Xena provides more than 1000 datasets, so when you want to get values for a single gene, you must first select a target dataset. You can find datasets in the following table or on the UCSC Xena datasets page.
DT::datatable(UCSCXenaTools::XenaData)
Pick a dataset and get its XenaHosts and XenaDatasets values, i.e. its data hub host URL and dataset ID. You can copy them by hand, or you can use your R skills to get them and store them in an object. For example, a reader of mine wanted to study the RNAseq values of a gene in TCGA LUAD.
I can use R:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
ge <- XenaData %>%
  filter(XenaHostNames == "tcgaHub") %>% # select TCGA Hub
  XenaScan("TCGA Lung Adenocarcinoma") %>%
  filter(DataSubtype == "gene expression RNAseq", Label == "IlluminaHiSeq")
str(ge)
#> tibble [1 × 17] (S3: tbl_df/tbl/data.frame)
#> $ XenaHosts : chr "https://tcga.xenahubs.net"
#> $ XenaHostNames : Named chr "tcgaHub"
#> ..- attr(*, "names")= chr "https://tcga.xenahubs.net"
#> $ XenaCohorts : chr "TCGA Lung Adenocarcinoma (LUAD)"
#> $ XenaDatasets : chr "TCGA.LUAD.sampleMap/HiSeqV2"
#> $ SampleCount : int 576
#> $ DataSubtype : chr "gene expression RNAseq"
#> $ Label : chr "IlluminaHiSeq"
#> $ Type : chr "genomicMatrix"
#> $ AnatomicalOrigin: chr "Lung"
#> $ SampleType : chr "tumor"
#> $ Tags : chr "cancer,non-small cell lung cancer"
#> $ ProbeMap : chr "probeMap/hugo_gencode_good_hg19_V24lift37_probemap"
#> $ LongTitle : chr "TCGA lung adenocarcinoma (LUAD) gene expression by RNAseq (polyA+ IlluminaHiSeq)"
#> $ Citation : chr NA
#> $ Version : chr "2017-10-13"
#> $ Unit : chr "log2(norm_count+1)"
#> $ Platform : chr "IlluminaHiSeq_RNASeqV2"
Or I can just copy https://tcga.xenahubs.net and TCGA.LUAD.sampleMap/HiSeqV2.
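As a minimal sketch of the "store them in an object" route, the snippet below uses a small stand-in data frame with the same column names as XenaData, so it runs without the package or a network connection. The name xena_stub and its second row are hypothetical placeholders, not real Xena records.

```r
# Stand-in for XenaData with only the columns needed here.
# Row 1 copies the values shown above; row 2 is a made-up placeholder.
xena_stub <- data.frame(
  XenaHosts = c("https://tcga.xenahubs.net", "https://example.org"),
  XenaDatasets = c("TCGA.LUAD.sampleMap/HiSeqV2", "hypothetical/dataset"),
  DataSubtype = c("gene expression RNAseq", "phenotype"),
  stringsAsFactors = FALSE
)

# Keep the RNAseq row, then check that exactly one dataset is left.
ge_stub <- xena_stub[xena_stub$DataSubtype == "gene expression RNAseq", ]
stopifnot(nrow(ge_stub) == 1)

# Store the two values that identify the dataset.
host <- ge_stub$XenaHosts
dataset <- ge_stub$XenaDatasets
```

Checking that the filter left a single row before extracting the values helps avoid passing several hosts or dataset IDs to a later query by accident.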
Get Your Gene Values
Once you have the dataset information, you can get the values of a specific gene with fetch_dense_values (this also works for gene-level CNV, mutation, etc., depending on your dataset). Run ?fetch in your R console to see more details.
For example, I will query the gene TP53.
TP53 <- fetch_dense_values(
  host = ge$XenaHosts, # You can also set "https://tcga.xenahubs.net"
  dataset = ge$XenaDatasets, # You can also set "TCGA.LUAD.sampleMap/HiSeqV2"
  identifiers = "TP53",
  use_probeMap = TRUE
) %>%
  .[1, ]
#> -> Checking identifiers...
#> -> use_probeMap is TRUE, skipping checking identifiers...
#> -> Done.
#> -> Checking samples...
#> -> Done.
#> -> Checking if the dataset has probeMap...
#> -> Done. ProbeMap is found.
head(TP53)
#> TCGA-69-7978-01 TCGA-62-8399-01 TCGA-78-7539-01 TCGA-50-5931-11 TCGA-73-4658-01
#> 9.89 8.31 10.35 9.62 10.02
#> TCGA-44-6775-01
#> 10.16
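The .[1, ] step above takes the first row of the fetched result. Assuming the result is a matrix with one row per identifier and one column per sample (my reading of the output, not something verified here), the effect can be sketched on a small stand-in matrix. The second gene and its values are made up for illustration.

```r
# Stand-in for a fetched result: one row per gene, one column per sample.
# The TP53 values are copied from the output above; the second row is hypothetical.
expr <- matrix(
  c(9.89, 8.31, 10.35,
    1.00, 2.00, 3.00),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("TP53", "HYPOTHETICAL_GENE"),
    c("TCGA-69-7978-01", "TCGA-62-8399-01", "TCGA-78-7539-01")
  )
)

# Selecting a single row drops the matrix to a named numeric vector,
# with sample IDs as the names.
tp53 <- expr["TP53", ]
```

Selecting the row by name rather than by position is a small design choice that keeps the code correct if more than one identifier is queried.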
Typically, a TCGA sample ID has 15 characters, and the 14th and 15th characters encode the sample type. When this code is less than 10, it is a tumor sample; codes from 10 to 19 mark a normal sample.
table(as.integer(substr(names(TP53), 14, 15)))
#>
#> 1 2 11
#> 515 2 59
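A minimal sketch of turning that rule into tumor/normal labels, using only the six values printed by head(TP53) above so that it runs offline. The names tp53, type_code, sample_type and by_type are introduced here for illustration, and every code of 10 or above is simply treated as normal, which matches the codes seen in this dataset.

```r
# Values and sample IDs copied from the head(TP53) output above.
tp53 <- c(
  "TCGA-69-7978-01" = 9.89, "TCGA-62-8399-01" = 8.31,
  "TCGA-78-7539-01" = 10.35, "TCGA-50-5931-11" = 9.62,
  "TCGA-73-4658-01" = 10.02, "TCGA-44-6775-01" = 10.16
)

# Characters 14-15 of the sample ID hold the sample type code.
type_code <- as.integer(substr(names(tp53), 14, 15))

# Codes below 10 are tumor samples; the rest are labeled normal here.
sample_type <- ifelse(type_code < 10, "tumor", "normal")

# Split the expression values by sample type for downstream analysis.
by_type <- split(tp53, sample_type)
```

The same three steps apply unchanged to the full TP53 vector returned by the query.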
Now you can start your analysis with this data.
Other Things That May Help
In addition to the fetch_* functions, I generated many low-level API functions for the UCSC Xena database, which are described at https://shixiangwang.github.io/home/en/tools/ucscxenatools-api/. These functions can access different levels of data information in UCSC Xena. Some of them are combined to build the core functionality currently provided by UCSCXenaTools.
NOTE: not all API functions work well. I haven’t tested them all; they are all generated by dynamic code based on XQuery.
The R Shiny package UCSCXenaShiny provides a web-based platform to download datasets and analyze single genes. In addition, we have built some functions to get pan-cancer-level single-gene expression, CNV, mutation, etc.
You can install the latest development version from GitHub with:
remotes::install_github("openbiox/XenaShiny")
After you load this package, you can use the following functions to get data easily.
- get_ccle_cn_value: Fetch copy number value from CCLE dataset
- get_ccle_gene_value: Fetch gene expression value from CCLE dataset
- get_ccle_protein_value: Fetch gene protein expression value from CCLE dataset
- get_ccle_mutation_status: Fetch gene mutation info from CCLE dataset
- get_pancan_value: Fetch identifier value from pan-cancer dataset
- get_pancan_gene_value: Fetch gene expression value from pan-cancer dataset
- get_pancan_protein_value: Fetch protein expression value from pan-cancer dataset
- get_pancan_mutation_status: Fetch mutation status value from pan-cancer dataset
- get_pancan_cn_value: Fetch gene copy number value from pan-cancer dataset processed by GISTIC 2.0
Any questions can be posted online at https://github.com/openbiox/UCSCXenaShiny/issues or https://github.com/ropensci/UCSCXenaTools/issues.
References
- Wang et al., (2019). The UCSCXenaTools R package: a toolkit for accessing genomics data from UCSC Xena platform, from cancer multi-omics to single-cell RNA-seq. Journal of Open Source Software, 4(40), 1627, https://doi.org/10.21105/joss.01627
- Wang, S.; Xiong, Y.; Gu, K.; Zhao, L.; Li, Y.; Zhao, F.; Li, X.; Liu, X. UCSCXenaShiny: An R Package for Exploring and Analyzing UCSC Xena Public Datasets in Web Browser. Preprints 2020, 2020070179 (doi: 10.20944/preprints202007.0179.v1).