读入 dbGap 的表型/注释信息
王诗翔 · 2019-08-23
分类:
bioinformatics  
标签:
r  
dbGap  
dbGaP 表型等信息零散地却有规律地分布在 files/
目录下,为了理解哪些文件存储了数据,我对下载的文件类型进行了查看,包括 xml, pdf 以及 gz 文件。大致有以下整理:
- pdf 文件存储的是各种文件的大致说明
- xml 文件存储一些文件的元信息和一些数据
- tar.gz 文件存储的是打包压缩的 xml 文件
- txt.gz 文件基本存储的是数据
xml 文件中的数据很难查看和导入,我也不懂 xml,放弃治疗。
tar.gz 文件存储的 xml 文件包含了 txt.gz 文件列名的描述,如果有这方面的疑问,推荐直接用文本模式看看它们,文件不大,还是比较好查看的。
带 MULTI 字符的 txt 文件虽然汇总了该研究所有的病人或样本信息,但很多信息没有存储!可惜了(如果读者发现自己下载的数据有存储最全面的信息,那还是推荐使用这个,减少麻烦)。
实际包含详细信息的病人文件会带 Subject_Phenotypes 字符,而样本带 Sample_Attributes 字符。而它们还能相互补充,所以我要想办法导入 2 种数据,后面再进行合并处理。
了解规律之后写函数处理:
read_dbGap = function(accession, destdir=getwd(),
type = c("subject", "sample"),
col_types = cols(.default = "c"),
pattern_subject_file = "Subject_Phenotypes",
pattern_sample_file = "Sample_Attributes") {
# col_types see ?readr::read_tsv
# user can set it to dplyr::cols() if no error 'Too many conversion specifiers in format string' returned
type = match.arg(type)
# append / at the end if user not type it
destdir = ifelse(endsWith(destdir, suffix = "/"), destdir, paste0(destdir, "/"))
if (type == "subject") {
fl = dir(destdir, paste0(accession, ".*", pattern_subject_file), full.names = TRUE)
} else {
fl = dir(destdir, paste0(accession, ".*", pattern_sample_file), full.names = TRUE)
}
if (length(fl) < 1) stop("Cannot find any file! Please check your input.")
message("==> Finding out files with type ", type)
print(fl)
if (!require(tidyverse)) {
message("Please install tidyverse package and re-run this function!")
} else {
purrr::map(fl, ~readr::read_tsv(., comment = "#", progress = TRUE,
col_types = col_types)) %>%
dplyr::bind_rows() %>%
unique()
}
}
用户只要指定前 3 个参数,这里设定 col_types = cols(.default = "c")
是通过将所有列都读入为字符串以解决报错 Too many conversion specifiers in format string
,你可以试试 col_types = dplyr::cols()
,如果报错,再使用函数设定的默认参数。
因为我工作要处理的是多个 dbGap Study,所以我再创建一个函数来进行批量处理:
read_dbGapList = function(accession_list, destdir=getwd(),
col_types = cols(.default = "c"),
pattern_subject_file = "Subject_Phenotypes",
pattern_sample_file = "Sample_Attributes") {
if (!require(tidyverse)) {
message("Please install tidyverse package and re-run this function!")
} else {
purrr::map(accession_list, function(x) {
message("=> Processing accession ", x)
type = c("subject", "sample")
purrr::map(type,
~suppressWarnings(read_dbGap(x, destdir = destdir,type = .))) %>%
setNames(type)
}) %>%
setNames(accession_list)
}
}
效果如下:
> df_list = read_dbGapList(accession_list = paste0("phs00",
+ c("0447", "0554", "0909", "0915", "1141")),
+ destdir = "dbGap/phenotype/")
=> Processing accession phs000447
==> Finding out files with type subject
[1] "dbGap/phenotype//phs000447.v1.pht002581.v1.p1.c1.Prostate_CIP_Subject_Phenotypes.GRU.txt.gz"
[2] "dbGap/phenotype//phs000447.v1.pht002581.v1.p1.c2.Prostate_CIP_Subject_Phenotypes.CRO.txt.gz"
==> Finding out files with type sample
[1] "dbGap/phenotype//phs000447.v1.pht002582.v1.p1.c1.Prostate_CIP_Sample_Attributes.GRU.txt.gz"
[2] "dbGap/phenotype//phs000447.v1.pht002582.v1.p1.c2.Prostate_CIP_Sample_Attributes.CRO.txt.gz"
=> Processing accession phs000554
==> Finding out files with type subject
[1] "dbGap/phenotype//phs000554.v1.pht003196.v1.p1.c1.CRPC_Subject_Phenotypes.DS-CA-MDS.txt.gz"
==> Finding out files with type sample
[1] "dbGap/phenotype//phs000554.v1.pht003198.v1.p1.c1.CRPC_Sample_Attributes.DS-CA-MDS.txt.gz"
=> Processing accession phs000909
==> Finding out files with type subject
[1] "dbGap/phenotype//phs000909.v1.pht005250.v1.p1.c1.NEPC_Subject_Phenotypes.GRU.txt.gz"
==> Finding out files with type sample
[1] "dbGap/phenotype//phs000909.v1.pht004876.v1.p1.c1.NEPC_Sample_Attributes.GRU.txt.gz"
=> Processing accession phs000915
==> Finding out files with type subject
[1] "dbGap/phenotype//phs000915.v2.pht004946.v2.p2.c1.mCRPC_SU2C_Subject_Phenotypes.DS-PC-MDS.txt.gz"
[2] "dbGap/phenotype//phs000915.v2.pht004946.v2.p2.c2.mCRPC_SU2C_Subject_Phenotypes.DS-CA-MDS.txt.gz"
==> Finding out files with type sample
[1] "dbGap/phenotype//phs000915.v2.pht004947.v2.p2.c1.mCRPC_SU2C_Sample_Attributes.DS-PC-MDS.txt.gz"
[2] "dbGap/phenotype//phs000915.v2.pht004947.v2.p2.c2.mCRPC_SU2C_Sample_Attributes.DS-CA-MDS.txt.gz"
=> Processing accession phs001141
==> Finding out files with type subject
[1] "dbGap/phenotype//phs001141.v1.pht005601.v1.p1.c1.PROMOTE_Subject_Phenotypes.GRU.txt.gz"
==> Finding out files with type sample
[1] "dbGap/phenotype//phs001141.v1.pht005602.v1.p1.c1.PROMOTE_Sample_Attributes.GRU.txt.gz"
后续的合并就需要更多的 dirty work 了。。。
注:公布的函数是开源的,抄可以,但封装或修改时还是要有点开源精神,以GPL协议附上本人 copyright。如果是发表研究,应当适当引用。