GCAP workflow for gene-level amplicon prediction
gcap.workflow(
  tumourseqfile,
  normalseqfile,
  tumourname,
  normalname,
  jobname = tumourname,
  extra_info = NULL,
  include_type = FALSE,
  genome_build = c("hg38", "hg19"),
  model = "XGB11",
  tightness = 1L,
  gap_cn = 3L,
  overlap = 1,
  only_oncogenes = FALSE,
  outdir = getwd(),
  result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)),
  allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter",
  g1000allelesprefix = file.path("~/data/snp/1000G_loci_hg38",
    "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"),
  g1000lociprefix = file.path("~/data/snp/1000G_loci_hg38",
    "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"),
  GCcontentfile = "~/data/snp/GC_correction_hg38.txt",
  replictimingfile = "~/data/snp/RT_correction_hg38.txt",
  nthreads = 22,
  minCounts = 10,
  BED_file = NA,
  probloci_file = NA,
  chrom_names = 1:22,
  min_base_qual = 20,
  min_map_qual = 35,
  penalty = 70,
  skip_finished_ASCAT = TRUE,
  skip_ascat_call = FALSE
)
Full path to the tumour BAM file.
Full path to the normal BAM file.
Identifier to be used for tumour output files.
Identifier to be used for normal output files.
Job name, typically a unique name for a tumour-normal pair.
(Optional) a data.frame, or a file containing one, with 3 columns: 'sample' (must be identical to the value of parameter jobname), 'age' and 'gender'. Gender should be 'XX' or 'XY'; 0 (for 'XX') and 1 (for 'XY') are also accepted.
If TRUE, a fourth column named 'type' should be included in extra_info. The cancer type should be given as a TCGA cancer type abbreviation.
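As a minimal sketch of the table described in the two entries above, the following builds an extra_info data.frame with the optional 'type' column. The job name, age, cancer type value and the recode_gender helper are hypothetical and chosen only for illustration; only the column names and the accepted gender codes come from the descriptions.

```r
# Hypothetical job name; it must match the value passed as `jobname`.
jobname <- "pair01"

# Illustrative helper (not part of GCAP): map the 0/1 coding to 'XX'/'XY'
# and stop on any value outside the documented codes.
recode_gender <- function(x) {
  x <- as.character(x)
  x[x == "0"] <- "XX"
  x[x == "1"] <- "XY"
  stopifnot(all(x %in% c("XX", "XY")))
  x
}

extra_info <- data.frame(
  sample = jobname,
  age = 60,
  gender = recode_gender(1),
  type = "LUAD", # TCGA abbreviation; only needed when include_type = TRUE
  stringsAsFactors = FALSE
)
```

When include_type is left at FALSE, the 'type' column can simply be omitted.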
Genome build version, one of 'hg38' or 'hg19'.
Model name ("XGB11", "XGB32", "XGB56") or a custom model supplied as input. 'toy' can be used for testing.
A coefficient by which the TCGA somatic CN is multiplied to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to noncircular instead of circular. When it is NA, TCGA somatic CN data is not used as a reference.
A gap copy number value. A gene with copy number above background (ploidy + gap_cn in general) is treated as a focal amplicon. A smaller value yields more amplicons.
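The threshold arithmetic in the gap_cn description can be illustrated with a short sketch. This is not GCAP's internal code: the flag_focal_amp helper, the ploidy value and the gene copy numbers are all hypothetical, and the sketch only applies the stated rule "copy number above ploidy + gap_cn".

```r
# Illustrative helper (not part of GCAP): TRUE where a gene's copy number
# exceeds the background threshold ploidy + gap_cn.
flag_focal_amp <- function(gene_cn, ploidy, gap_cn = 3L) {
  gene_cn > ploidy + gap_cn
}

# Made-up values for one sample.
ploidy <- 2.1
gene_cn <- c(MYC = 12, EGFR = 5.5, TP53 = 2)

# Default gap: threshold is 2.1 + 3 = 5.1, so MYC and EGFR are flagged.
amp_default <- names(gene_cn)[flag_focal_amp(gene_cn, ploidy, gap_cn = 3L)]

# Larger gap: threshold is 2.1 + 4 = 6.1, so only MYC is flagged.
amp_strict <- names(gene_cn)[flag_focal_amp(gene_cn, ploidy, gap_cn = 4L)]
```

Comparing amp_default with amp_strict shows why a smaller gap_cn yields more amplicons.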
The overlap percentage on a gene.
If TRUE, only known oncogenes are kept for circular prediction.
Result output path.
File name prefix (without directory path) for storing the final model prediction file in CSV format. By default, a unique file name is generated from a UUID.
Path to the allele counter executable.
Prefix path to the 1000 Genomes alleles reference files.
Prefix path to the 1000 Genomes SNP reference files.
File containing the GC content around every SNP for increasing window sizes.
File containing replication timing at every SNP for various cell lines (optional).
The number of parallel processes for getting allele counts (optional, default=22).
Minimum depth required in the normal for a SNP to be considered (optional, default=10).
A BED file for only looking at SNPs within specific intervals (optional, default=NA).
A file (chromosome <tab> position; no header) containing specific loci to ignore (optional, default=NA).
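A file in the layout just described (chromosome, tab, position, no header) can be written from R as in the sketch below. The loci are made up, and whether chromosome names need to match a particular naming style (for example with or without a 'chr' prefix) is not stated here, so treat that as an assumption to check against your reference files.

```r
# Hypothetical loci to ignore; positions are illustrative only.
prob_loci <- data.frame(
  chr = c("1", "1", "7"),
  pos = c(1234567L, 2345678L, 55019017L),
  stringsAsFactors = FALSE
)

probloci_path <- file.path(tempdir(), "probloci.txt")

# Tab-separated, no header, no row names, no quoting.
write.table(
  prob_loci, probloci_path,
  sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE
)
```

The resulting path can then be supplied as the file of loci to ignore.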
A vector containing the names of chromosomes to be considered (optional, default=1:22).
Minimum base quality required for a read to be counted (optional, default=20).
Minimum mapping quality required for a read to be counted (optional, default=35).
Penalty for introducing an additional ASPCF breakpoint (expert parameter; do not change it unless you know what you are doing).
If TRUE, skip ASCAT calls that have already finished, to save time.
If TRUE, skip calling ASCAT. This is useful when that step has already been done and you only want to run the subsequent steps.
A list of data.table objects, returned invisibly; the corresponding files are saved to the local machine.
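A call for a single tumour-normal pair might look like the sketch below. All paths and sample names are hypothetical placeholders, and the sketch assumes the function is exported from a package named gcap. The reference-file arguments (allelecounter_exe, g1000allelesprefix, g1000lociprefix, GCcontentfile, replictimingfile) are left at their defaults here, which point to paths under the home directory and will usually need to be overridden.

```r
# All paths and names below are hypothetical placeholders.
args <- list(
  tumourseqfile = "/path/to/tumour.bam",
  normalseqfile = "/path/to/normal.bam",
  tumourname = "pair01_T",
  normalname = "pair01_N",
  jobname = "pair01",
  genome_build = "hg38",
  model = "XGB11",
  outdir = file.path(tempdir(), "gcap_out"),
  nthreads = 4,
  skip_finished_ASCAT = TRUE
)

# Run only when the package and both input files are actually available.
can_run <- requireNamespace("gcap", quietly = TRUE) &&
  file.exists(args$tumourseqfile) &&
  file.exists(args$normalseqfile)

if (can_run) {
  dir.create(args$outdir, showWarnings = FALSE, recursive = TRUE)
  res <- do.call(gcap::gcap.workflow, args)
}
```

Because the result is returned invisibly, assigning it (as with res above) is the way to inspect it afterwards.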