In a nutshell, gcap provides an end-to-end workflow for predicting circular amplicons (also known as ecDNA, extrachromosomal DNA) at the gene level with a machine learning approach, then classifying cancer samples into different focal amplification (fCNA) types. It accepts WES data (tumor-normal paired BAM files), allele-specific copy number data (e.g., results from ASCAT or Sequenza), or even absolute integer copy number data (e.g., results from ABSOLUTE) as input. The former two data sources are preferred as input to gcap.
Advanced users can prepare the reference files by following the instructions at https://github.com/shixiangwang/ascat/tree/v3.0.
We recommend that all users download the reference files directly from the links below:
The prediction model was built on data from the hg38 genome build, so hg38-based BAM input is recommended.
alleleCount is required to run ASCAT on WES BAM data. If you haven't installed conda or miniconda, please install it first, then install alleleCount from the terminal with:
conda create -n cancerit -c bioconda cancerit-allelecount
NOTE: gcap sets the default alleleCounter path to ~/miniconda3/envs/cancerit/bin/alleleCounter. If you installed it in a different conda location or by another approach, please set the path when you call the corresponding functions.
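If you are unsure where the binary lives, one option is to look it up before calling gcap. The sketch below is only illustrative: find_allelecounter() is a hypothetical helper, not a gcap function. It checks the PATH first and then falls back to the default location mentioned above.

```r
# Hypothetical helper (not part of gcap): return the first existing
# alleleCounter binary among the candidate paths, or NA if none exists.
find_allelecounter <- function(candidates = c(
                                 Sys.which("alleleCounter"),
                                 "~/miniconda3/envs/cancerit/bin/alleleCounter"
                               )) {
  # Drop missing or empty entries (Sys.which() returns "" when nothing is found)
  candidates <- candidates[!is.na(candidates) & nzchar(candidates)]
  candidates <- path.expand(candidates)
  hit <- candidates[file.exists(candidates)]
  if (length(hit) == 0) {
    return(NA_character_)
  }
  unname(hit[1])
}

allelecounter_exe <- find_allelecounter()
print(allelecounter_exe)
```

The resulting value can be passed as the allelecounter_exe argument used in the workflow templates later in this document.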
Install ASCAT v3.0 (modified and adapted for the GCAP workflow on HPC) in the R console from GitHub with:
# This is a forked version of ASCAT
remotes::install_github("ShixiangWang/ascat@v3-for-gcap-v1", subdir = "ASCAT")
# An ASCAT version with a loose SAM flag, useful sometimes
# remotes::install_github("ShixiangWang/ascat@v3-f1", subdir = "ASCAT")
# See https://github.com/ShixiangWang/gcap/issues/27
Install gcap in the R console from GitHub with:
remotes::install_github("ShixiangWang/gcap")
If you would like to use the CLI program in a shell terminal, run the following code in your R console after installation:
gcap::deploy()
Two scripts, gcap-bam.R and gcap-ascn.R, will be linked into /usr/local/bin/, which should be on your PATH. Use whichever one matches your input data.
NOTE: For users with package GetoptLong version >= 1.1.0, a main command is implemented and also linked to /usr/local/bin/ when calling deploy(), so you can type gcap as a unified interface.
$ gcap
gcap (v1.0.0)
Usage: gcap [command] [options]
Commands:
bam Run GCAP workflow with tumor-normal paired BAM files
ascn Run GCAP workflow with curated allele-specific copy number data
----------
Citation:
GCAP
URL:
https://github.com/ShixiangWang/gcap
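To confirm that deploy() linked the scripts somewhere visible on your PATH, a quick check from R could look like the following. This is a minimal sketch: check_gcap_cli() is a hypothetical helper written for illustration and relies only on base R's Sys.which().

```r
# Hypothetical helper (not part of gcap): report which CLI entry points
# can be found on the current PATH.
check_gcap_cli <- function(commands = c("gcap", "gcap-bam.R", "gcap-ascn.R")) {
  paths <- unname(Sys.which(commands))
  data.frame(
    command = commands,
    path = paths,
    # Sys.which() gives "" (or NA on some platforms) for missing commands
    found = !is.na(paths) & nzchar(paths),
    stringsAsFactors = FALSE
  )
}

print(check_gcap_cli())
```

If the gcap row is reported as not found while the two .R scripts are found, the GetoptLong version requirement noted above is a likely cause.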
NOTE: gcap uses XGBoost < 1.6. If you have installed a later version, you can install the required version with:
install.packages("https://cran.r-project.org/src/contrib/Archive/xgboost/xgboost_1.5.2.1.tar.gz", repos = NULL)
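Before reinstalling, you may want to check which version is currently installed. The snippet below is a small sketch using base R's utils::packageVersion(); version_below() is a hypothetical helper name, not something gcap provides.

```r
# Hypothetical helper (not part of gcap): TRUE if the installed version of
# `pkg` is lower than `limit`, FALSE if not, NA if the package is missing.
version_below <- function(pkg, limit) {
  installed <- tryCatch(utils::packageVersion(pkg), error = function(e) NULL)
  if (is.null(installed)) {
    return(NA)
  }
  installed < limit
}

status <- version_below("xgboost", "1.6")
if (is.na(status)) {
  message("xgboost is not installed")
} else if (!status) {
  message("xgboost >= 1.6 found; consider installing 1.5.2.1 from the CRAN archive")
} else {
  message("xgboost version is below 1.6")
}
```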
Run the following code to see a quick example:
library(gcap)
data("ascn")
rv <- gcap.ASCNworkflow(ascn, outdir = tempdir(), model = "XGB11")
rv
To run gcap from BAM files, a machine with at least 80GB RAM is required for the allelecount process. If you set multiple threads, note that parallel computation is used in only part of the workflow, so you should balance the nthread setting against the computing power of your machine.
It generally takes ~0.5h to finish one case (tumor-normal pair).
In our practice, when processing multiple cases, setting nthread = 22 and letting gcap handle the cases directly (instead of writing a loop yourself) is good enough.
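As a rough illustration of preparing several cases for a single call, the sketch below builds the per-case arguments from a sample sheet. It assumes that gcap.workflow() accepts equal-length vectors for the per-case arguments, which is our reading of the sentence above; please confirm with ?gcap.workflow. The sample sheet, file paths, naming scheme and the build_case_args() helper are all hypothetical.

```r
# Hypothetical sample sheet: IDs and BAM paths are placeholders.
samples <- data.frame(
  id = c("case1", "case2"),
  tumour = c("/path/to/case1_tumour.bam", "/path/to/case2_tumour.bam"),
  normal = c("/path/to/case1_normal.bam", "/path/to/case2_normal.bam"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assemble per-case arguments as equal-length vectors.
build_case_args <- function(samples, nthread = 22) {
  stopifnot(!anyDuplicated(samples$id)) # job names should be unique
  list(
    tumourseqfile = samples$tumour,
    normalseqfile = samples$normal,
    tumourname = paste0(samples$id, "_tumour"),
    normalname = paste0(samples$id, "_normal"),
    jobname = samples$id,
    nthread = nthread
  )
}

case_args <- build_case_args(samples)
str(case_args)

# Assumed usage (not run here): combine with the reference-file arguments
# from the templates in this document and call the workflow once, e.g.
# do.call(gcap::gcap.workflow, c(case_args, list(outdir = "results")))
```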
A recommended setting for Slurm is given as:
#!/bin/bash
#SBATCH -N 1
#SBATCH -o output-%J.o
#SBATCH -n 22
#SBATCH --mem=102400
Templates of practical calls with the provided hg38 and hg19 annotations are given below:
# hg38 ----------------
gcap.workflow(
  tumourseqfile = tfile, normalseqfile = nfile, tumourname = tn, normalname = nn, jobname = id,
  outdir = outdir,
  allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter",
  g1000allelesprefix = file.path(
    "/data/wsx/data/1000G_loci_hg38/",
    "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"
  ),
  g1000lociprefix = file.path(
    "/data/wsx/data/1000G_loci_hg38/",
    "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"
  ),
  GCcontentfile = "/data/wsx/data/GC_correction_hg38.txt",
  replictimingfile = "/data/wsx/data/RT_correction_hg38.txt",
  skip_finished_ASCAT = TRUE,
  skip_ascat_call = FALSE,
  result_file_prefix = "xxx",
  extra_info = df,
  include_type = FALSE,
  genome_build = "hg38",
  model = "XGB11"
)
# hg19 ----------------
gcap.workflow(
  tumourseqfile = tfile, normalseqfile = nfile, tumourname = tn, normalname = nn, jobname = id,
  outdir = outdir,
  allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter",
  g1000allelesprefix = file.path(
    "/data/wsx/data/1000G_loci_hg19/",
    "1000genomesAlleles2012_chr"
  ),
  g1000lociprefix = file.path(
    "/data/wsx/data/1000G_loci_hg19/",
    "1000genomesloci2012chrstring_chr"
  ),
  GCcontentfile = "/data/wsx/data/GC_correction_hg19.txt",
  replictimingfile = "/data/wsx/data/RT_correction_hg19.txt",
  skip_finished_ASCAT = TRUE,
  skip_ascat_call = FALSE,
  result_file_prefix = "xxx",
  extra_info = NULL,
  include_type = FALSE,
  genome_build = "hg19",
  model = "XGB11"
)
Please refer to ?gcap.ASCNworkflow().
For more customized and advanced control of the analysis, you can read the structured documentation at the package site.
For easier debugging and rechecking, the logging information from your gcap operations is saved to a separate file. You can use the following commands to get the file path and print the log messages. Please note that you have to use ::: to access these functions, as they are not exported from gcap.
> gcap:::get_log_file()
[1] "~/Library/Logs/gcap/gcap.log"
> gcap:::cat_log_file()
Wang et al. Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer. (Submitted)
This software and associated documentation files (the “Software”) are protected by copyright. This Software is provided “as is” (at your own risk) for internal non-commercial academic research purposes only. Please read the Non-Commercial Academic License in detail before downloading a copy. By installing or using this Software, you agree to be bound by the terms and conditions of the Non-Commercial Academic License.
All commercial use of the Software or any modification, manipulation or derivative of the Software, including but not limited to transfer, sale or licence to a commercial third party or use on behalf of a commercial third party (including but not limited to use as part of a service supplied to any third party for financial reward) is strictly prohibited and requires a commercial use licence. This software is protected by P. R. China patent 202211067952.6. For further information, please email wangsx1@sysucc.org.cn or zhaoqi@sysucc.org.cn.