UCSCXenaTools is an R package for downloading and exploring data from UCSC Xena data hubs, which are

"A collection of UCSC-hosted public databases such as TCGA, ICGC, TARGET, GTEx, CCLE, and others. Databases are normalized so they can be combined, linked, filtered, explored and downloaded."

-- UCSC Xena

If you use this package in academic research, please cite:

Wang, Shixiang, et al. “APOBEC3B and APOBEC mutational signature as potential predictive markers for immunotherapy response in non-small cell lung cancer.” Oncogene (2018).

Installation

Install stable release from CRAN with:

install.packages("UCSCXenaTools")

You can also install the development version of UCSCXenaTools from GitHub with:

# install.packages("devtools")
devtools::install_github("ShixiangWang/UCSCXenaTools", build_vignettes = TRUE)

Read the vignettes for more details.

Data Hub List

All datasets are available at https://xenabrowser.net/datapages/.

Currently, UCSCXenaTools supports all 9 data hubs of UCSC Xena.

If the API changes, please let me know by email or open an issue on GitHub.

Usage

Downloading UCSC Xena datasets and loading them into R with UCSCXenaTools is a five-step workflow: generate, filter, query, download and prepare. These steps are implemented as XenaGenerate, XenaFilter, XenaQuery, XenaDownload and XenaPrepare, respectively. The functions are easy to use and to combine with other packages such as dplyr.

The following uses the download of clinical data for LUNG, LUAD and LUSC from TCGA (hg19 version) as an example.

XenaData data.frame

Beginning with version 0.2.0, UCSCXenaTools uses a data.frame object built into the package, XenaData, to generate an instance of the XenaHub class, which communicates with the API of the UCSC Xena data hubs.

You can load XenaData after loading UCSCXenaTools into R.

library(UCSCXenaTools)
#> =========================================================================
#> UCSCXenaTools version 0.2.7
#> Github page: https://github.com/ShixiangWang/UCSCXenaTools
#> Documentation: https://github.com/ShixiangWang/UCSCXenaTools
#> If you use it in published research, please cite:
#> Wang, Shixiang, et al. "APOBEC3B and APOBEC mutational signature
#>     as potential predictive markers for immunotherapy
#>     response in non-small cell lung cancer." Oncogene (2018).
#> =========================================================================
#> 
data(XenaData)

head(XenaData)
#>                         XenaHosts XenaHostNames
#> 1 https://ucscpublic.xenahubs.net   UCSC_Public
#> 2 https://ucscpublic.xenahubs.net   UCSC_Public
#> 3 https://ucscpublic.xenahubs.net   UCSC_Public
#> 4 https://ucscpublic.xenahubs.net   UCSC_Public
#> 5 https://ucscpublic.xenahubs.net   UCSC_Public
#> 6 https://ucscpublic.xenahubs.net   UCSC_Public
#>                                     XenaCohorts
#> 1 Acute lymphoblastic leukemia (Mullighan 2008)
#> 2 Acute lymphoblastic leukemia (Mullighan 2008)
#> 3 Acute lymphoblastic leukemia (Mullighan 2008)
#> 4                   Breast Cancer (Caldas 2007)
#> 5                   Breast Cancer (Caldas 2007)
#> 6                   Breast Cancer (Caldas 2007)
#>                                               XenaDatasets
#> 1    mullighan2008_public/mullighan2008_500K_genomicMatrix
#> 2 mullighan2008_public/mullighan2008_public_clinicalMatrix
#> 3    mullighan2008_public/mullighan2008_SNP6_genomicMatrix
#> 4              Caldas2007/chinSF2007_public_clinicalMatrix
#> 5             Caldas2007/chinSFGenomeBio2007_genomicMatrix
#> 6                   Caldas2007/naderi2007Exp_genomicMatrix

Generate a XenaHub object

This is done with the XenaGenerate function, which generates a XenaHub object from the XenaData data frame.

We can set the subset argument to narrow down the datasets.

You can use XenaHub() to generate a XenaHub object for API communication, but it is not recommended.

It’s possible to explore hosts(), cohorts() and datasets().
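A minimal sketch of these steps, assuming the TCGA hub is labelled "TCGA" in the XenaHostNames column (as in the pipe example that follows) and that the subset expression is evaluated against the columns of XenaData:

```r
library(UCSCXenaTools)
data(XenaData)

# Keep only the rows of XenaData that belong to the TCGA hub.
xe <- XenaGenerate(subset = XenaHostNames == "TCGA")
xe

# Accessors named in the text, applied to the generated object.
hosts(xe)
head(cohorts(xe))
head(datasets(xe))
```

The object name xe is only a convention; it is reused by the filtering examples later on.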

The pipe operator %>% can also be used here.

library(tidyverse)
XenaData %>% 
    filter(XenaHostNames == "TCGA", grepl("BRCA", XenaCohorts), grepl("Path", XenaDatasets)) %>% 
    XenaGenerate()
#> class: XenaHub 
#> hosts():
#>   https://tcga.xenahubs.net
#> cohorts() (1 total):
#>   TCGA Breast Cancer (BRCA)
#> datasets() (4 total):
#>   TCGA.BRCA.sampleMap/Pathway_Paradigm_mRNA_And_Copy_Number
#>   TCGA.BRCA.sampleMap/Pathway_Paradigm_RNASeq
#>   TCGA.BRCA.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
#>   TCGA.BRCA.sampleMap/Pathway_Paradigm_mRNA

Filter

There are too many datasets, so we filter them with the XenaFilter function.

Regular expressions can be used to filter the XenaHub object down to what we want.

Then select the three datasets LUAD, LUSC and LUNG.

xe2 <- XenaFilter(xe2, filterDatasets = "LUAD|LUSC|LUNG")
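To see what a pattern like this selects, the matching can be imitated with base R's grepl on a handful of dataset names. This only illustrates the regular expressions, not XenaFilter itself; the clinical dataset names are inferred from the file names shown in the Prepare section, and case-insensitive matching is an assumption suggested by the lowercase pattern in the pipe example.

```r
# Illustrative dataset names (the clinical ones are inferred, see above).
dataset_names <- c(
  "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix",
  "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix",
  "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix",
  "TCGA.BRCA.sampleMap/Pathway_Paradigm_RNASeq",
  "TCGA.BRCA.sampleMap/Pathway_Paradigm_mRNA"
)

# First filter: names that mention "clinical".
is_clinical <- grepl("clinical", dataset_names, ignore.case = TRUE)

# Second filter: "|" means "or", so any of the three project codes matches.
is_lung <- grepl("LUAD|LUSC|LUNG", dataset_names, ignore.case = TRUE)

# Applying both filters in sequence keeps the three lung clinical matrices.
selected <- dataset_names[is_clinical & is_lung]
print(selected)
```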

The pipe can be used here as well.

xe %>% 
    XenaFilter(filterDatasets = "clinical") %>% 
    XenaFilter(filterDatasets = "luad|lusc|lung")
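The Prepare section refers to an object called xe2_download, which comes from the query and download steps of the workflow. A hedged sketch of those two steps, assuming the filters above are applied to a TCGA XenaHub object, that network access to the hub is available, and that XenaPrepare accepts the download record directly (the exact contents of the returned objects may differ between package versions):

```r
library(UCSCXenaTools)

# Generate and filter: clinical datasets for LUAD, LUSC and LUNG from TCGA.
xe <- XenaGenerate(subset = XenaHostNames == "TCGA")
xe2 <- XenaFilter(xe, filterDatasets = "clinical")
xe2 <- XenaFilter(xe2, filterDatasets = "LUAD|LUSC|LUNG")

# Query: look up where the selected datasets can be downloaded from.
xe2_query <- XenaQuery(xe2)

# Download: fetch the files (needs network access) and keep the record
# of what was downloaded and where it was stored.
xe2_download <- XenaDownload(xe2_query)

# Prepare: load the downloaded files into R (assumed to accept the record).
cli <- XenaPrepare(xe2_download)
names(cli)
```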

Prepare

There are 4 ways to prepare data for use in R.

# way1:  directory
cli1 = XenaPrepare("E:/Github/XenaData/test/")
names(cli1)
## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
## [3] "TCGA.LUSC.sampleMap__LUSC_clinicalMatrix.gz"
# way2: local files
cli2 = XenaPrepare("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz")
class(cli2)
## [1] "tbl_df"     "tbl"        "data.frame"

cli2 = XenaPrepare(c("E:/Github/XenaData/test/TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz",
                     "E:/Github/XenaData/test/TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"))
class(cli2)
## [1] "list"
names(cli2)
## [1] "TCGA.LUAD.sampleMap__LUAD_clinicalMatrix.gz"
## [2] "TCGA.LUNG.sampleMap__LUNG_clinicalMatrix.gz"
# way3: urls
cli3 = XenaPrepare(xe2_download$url[1:2])
names(cli3)
## [1] "LUSC_clinicalMatrix.gz" "LUNG_clinicalMatrix.gz"

From v0.2.6, XenaPrepare() can read a file in chunks, which helps when the file is too big and you only need a subset of it.

The following code shows how to subset rows or columns of a file. sample is the name of the first column and can be used directly in a logical expression; x stands for the data frame being subset. More custom operations can be written as a function and passed to the callback option.

# select rows whose sample (gene symbol here) is "HIF3A" or "RNF17"
testRNA = UCSCXenaTools::XenaPrepare("~/Download/HiSeqV2.gz", use_chunk = TRUE, subset_rows = sample %in% c("HIF3A", "RNF17"))
# keep only columns 1 to 3
testRNA = UCSCXenaTools::XenaPrepare("~/Download/HiSeqV2.gz", use_chunk = TRUE, select_cols = colnames(x)[1:3])
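What chunked reading with a callback amounts to can be sketched with readr directly. This is an illustration on a small temporary file, not XenaPrepare's implementation; the file contents and the keep_genes helper are hypothetical, and the callback signature (a chunk x and its position pos) is readr's convention, assumed here to resemble what a custom callback would receive.

```r
library(readr)

# Write a tiny tab-separated expression table to a temporary file.
expr <- data.frame(
  sample = c("HIF3A", "TP53", "RNF17", "EGFR"),
  S1 = c(1.2, 5.1, 0.0, 3.3),
  S2 = c(0.8, 4.9, 0.1, 2.7)
)
path <- tempfile(fileext = ".tsv")
write_tsv(expr, path)

# Callback applied to every chunk: keep only the genes of interest.
keep_genes <- function(x, pos) {
  x[x$sample %in% c("HIF3A", "RNF17"), , drop = FALSE]
}

# Read two rows at a time; DataFrameCallback row-binds the filtered chunks,
# so the whole file never has to be held in memory at once.
filtered <- read_tsv_chunked(
  path,
  DataFrameCallback$new(keep_genes),
  chunk_size = 2,
  col_types = cols(sample = col_character(), .default = col_double())
)
print(filtered)
```

Filtering inside the callback is the point of the design: each chunk is reduced before it is combined, so memory use scales with the subset rather than with the file.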

TCGA Common Data Easy Download

getTCGAdata

getTCGAdata provides an easy way to download TCGA datasets. You can specify multiple options to select which data and corresponding file types to download. By default, this function returns a list containing a XenaHub object and information about the selected datasets. Once you are sure the datasets are exactly what you want, set download to TRUE to download the data.

Check the arguments of getTCGAdata:

Select one or more projects; by default, only clinical datasets are selected:
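Both steps can be sketched as follows, assuming (as the text states) that download is off by default so that nothing is fetched yet; the name res is only for illustration:

```r
library(UCSCXenaTools)

# Inspect the available options and their defaults.
args(getTCGAdata)

# Select two projects; with the defaults only clinical datasets are
# selected and nothing is downloaded.
res <- getTCGAdata(c("UVM", "LUAD"))

# The result is a list holding a XenaHub object and dataset information.
str(res, max.level = 1)
```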

Set download=TRUE to download the data. By default, data are downloaded to the system temp directory (you can specify the path with the destdir option):

# only download clinical data
getTCGAdata(c("UVM", "LUAD"), download = TRUE)

Supported Data Types and Options

  • clinical information: clinical
  • mRNA Sequencing: mRNASeq
  • mRNA microarray: mRNAArray
  • miRNA Sequencing: miRNASeq
  • Exon Sequencing: exonRNASeq
  • RPPA array: RPPAArray
  • DNA Methylation: Methylation
  • Gene mutation: GeneMutation
  • Somatic mutation: SomaticMutation
  • Gistic2 Copy Number: GisticCopyNumber
  • Copy Number Segment: CopyNumberSegment

Other data types supported by Xena cannot be downloaded with this function. Please refer to the downloadTCGA or XenaGenerate function.

NOTE: Sequencing data are all based on the Illumina HiSeq platform; data from other platforms (Illumina GA) supported by Xena cannot be downloaded with this function. This keeps the data download flow consistent. Mutation data use the broad automated version (except PANCAN, which uses the MC3 Public Version). If you want to download other datasets, please refer to the downloadTCGA or XenaGenerate function.

Download any TCGA data by data types and file types

The downloadTCGA function can be used to download any TCGA data supported by Xena, but it works differently from getTCGAdata.

See the arguments:

Apart from the destdir option, you only need to set three arguments to download data. Even though that is far fewer than getTCGAdata has, downloadTCGA is more complex to use.

Before you download data, you need to spend some time figuring out which data types and file types are available and which of them your datasets have.

availTCGA returns all the information you need:

availTCGA()
#> Note not all projects have listed data types and file types, you can use showTCGA function to check if exist
#> $ProjectID
#>  [1] "LAML"     "ACC"      "CHOL"     "BLCA"     "BRCA"     "CESC"    
#>  [7] "COADREAD" "COAD"     "UCEC"     "ESCA"     "FPPP"     "GBM"     
#> [13] "HNSC"     "KICH"     "KIRC"     "KIRP"     "DLBC"     "LIHC"    
#> [19] "LGG"      "GBMLGG"   "LUAD"     "LUNG"     "LUSC"     "SKCM"    
#> [25] "MESO"     "UVM"      "OV"       "PANCAN"   "PAAD"     "PCPG"    
#> [31] "PRAD"     "READ"     "SARC"     "STAD"     "TGCT"     "THYM"    
#> [37] "THCA"     "UCS"     
#> 
#> $DataType
#>  [1] "DNA Methylation"                       
#>  [2] "Gene Level Copy Number"                
#>  [3] "Somatic Mutation"                      
#>  [4] "Gene Expression RNASeq"                
#>  [5] "miRNA Mature Strand Expression RNASeq" 
#>  [6] "Gene Somatic Non-silent Mutation"      
#>  [7] "Copy Number Segments"                  
#>  [8] "Exon Expression RNASeq"                
#>  [9] "Phenotype"                             
#> [10] "PARADIGM Pathway Activity"             
#> [11] "Protein Expression RPPA"               
#> [12] "Transcription Factor Regulatory Impact"
#> [13] "Gene Expression Array"                 
#> [14] "Signatures"                            
#> [15] "iCluster"                              
#> 
#> $FileType
#>  [1] "Methylation27K"                            
#>  [2] "Methylation450K"                           
#>  [3] "Gistic2"                                   
#>  [4] "wustl hiseq automated"                     
#>  [5] "IlluminaGA RNASeq"                         
#>  [6] "IlluminaHiSeq RNASeqV2 in percentile rank" 
#>  [7] "IlluminaHiSeq RNASeqV2 pancan normalized"  
#>  [8] "IlluminaHiSeq RNASeqV2"                    
#>  [9] "After remove germline cnv"                 
#> [10] "PANCAN AWG analyzed"                       
#> [11] "Clinical Information"                      
#> [12] "wustl automated"                           
#> [13] "Gistic2 thresholded"                       
#> [14] "Before remove germline cnv"                
#> [15] "Use only RNASeq"                           
#> [16] "Use RNASeq plus Copy Number"               
#> [17] "bcm automated"                             
#> [18] "IlluminaHiSeq RNASeq"                      
#> [19] "bcm curated"                               
#> [20] "broad curated"                             
#> [21] "RPPA"                                      
#> [22] "bsgsc automated"                           
#> [23] "broad automated"                           
#> [24] "bcgsc automated"                           
#> [25] "ucsc automated"                            
#> [26] "RABIT Use IlluminaHiSeq RNASeqV2"          
#> [27] "RABIT Use IlluminaHiSeq RNASeq"            
#> [28] "RPPA normalized by RBN"                    
#> [29] "RABIT Use Agilent 244K Microarray"         
#> [30] "wustl curated"                             
#> [31] "Use Microarray plus Copy Number"           
#> [32] "Use only Microarray"                       
#> [33] "Agilent 244K Microarray"                   
#> [34] "IlluminaGA RNASeqV2"                       
#> [35] "bcm SOLiD"                                 
#> [36] "RABIT Use IlluminaGA RNASeqV2"             
#> [37] "RABIT Use IlluminaGA RNASeq"               
#> [38] "RABIT Use Affymetrix U133A Microarray"     
#> [39] "Affymetrix U133A Microarray"               
#> [40] "MethylMix"                                 
#> [41] "bcm SOLiD curated"                         
#> [42] "Gene Expression Subtype"                   
#> [43] "Platform-corrected PANCAN12 dataset"       
#> [44] "Batch effects normalized"                  
#> [45] "MC3 Public Version"                        
#> [46] "TCGA Sample Type and Primary Disease"      
#> [47] "RPPA pancan normalized"                    
#> [48] "Tumor copy number"                         
#> [49] "Genome-wide DNA Damage Footprint HRD Score"
#> [50] "TCGA Molecular Subtype"                    
#> [51] "iCluster cluster assignments"              
#> [52] "iCluster latent variables"                 
#> [53] "RNA based StemnessScore"                   
#> [54] "DNA methylation based StemnessScore"       
#> [55] "Pancan Gene Programs"                      
#> [56] "Immune Model Based Subtype"                
#> [57] "Immune Signature Scores"

Note that not all datasets have these properties; showTCGA can help you check. It returns all data in TCGA, and you can use the following code in RStudio to search for your data.
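A small sketch of that check, assuming showTCGA() can be called without arguments and returns a table of the TCGA datasets known to the package (View needs RStudio or another viewer; head works anywhere):

```r
library(UCSCXenaTools)

# Table of TCGA datasets with their data types and file types.
tcga_info <- showTCGA()
head(tcga_info)

# In RStudio, open a searchable viewer:
# View(tcga_info)
```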

Alternatively, you can use the Shiny app provided by UCSCXenaTools to search.

Run shiny by:

Downloading through Shiny is under consideration; I am still learning how to work with Shiny.

Bug Report

I have not had time to test whether every condition is handled correctly and whether all datasets can be downloaded normally. If you have any questions or suggestions, please open an issue on GitHub at https://github.com/ShixiangWang/UCSCXenaTools/issues.

Acknowledgement

This package is based on XenaR; thanks to Martin Morgan for his work.

LICENSE

GPL-3

Please note that code from the XenaR package is under the Apache 2.0 license.

Code of conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.