R/seqz_pipeline.R
gcap.workflow.seqz.Rd
GCAP sequenza workflow for gene-level amplicon prediction
gcap.workflow.seqz(
tumourseqfile,
normalseqfile,
jobname,
extra_info = NULL,
include_type = FALSE,
genome_build = c("mm10", "hg38", "hg19"),
model = "XGB11",
tightness = 1L,
gap_cn = 3L,
overlap = 1,
only_oncogenes = FALSE,
ref_file = "path/to/reference.fa",
data_tmp_dir = "~/gcap_data",
outdir = getwd(),
result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)),
util_exe = "~/miniconda3/bin/sequenza-utils",
samtools_exe = "~/miniconda3/bin/samtools",
tabix_exe = "~/miniconda3/bin/tabix",
nthreads = 1,
skip_finished_sequenza = TRUE,
skip_sequenza_call = FALSE
)
Full path to the tumour BAM file.
Full path to the normal BAM file.
Job name, typically a unique name for a tumour-normal pair.
(Optional) a data.frame, or a file containing one, with 3 columns: 'sample' (must be identical to the value of parameter jobname), 'age' and 'gender'. Gender should be 'XX' or 'XY'; it can also be given as 0 for 'XX' and 1 for 'XY'.
If TRUE, a fourth column named 'type' should be included in extra_info; the cancer type should be given as a supported TCGA cancer type abbreviation.
Genome build version; should be one of 'hg38', 'hg19' and 'mm10'.
Model name ("XGB11", "XGB32" or "XGB56") or a custom model supplied as input. 'toy' can be used for testing.
A coefficient by which the TCGA somatic CN is multiplied to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to noncircular instead of circular. When it is NA, TCGA somatic CN data is not used as a reference.
A gap copy number value. A gene with copy number above background (ploidy + gap_cn in general) is treated as a focal amplicon. A smaller value yields more amplicons.
The overlap percentage on a gene.
If TRUE, only known oncogenes are kept for circular prediction.
A reference genome file; it should be consistent with the genome_build option.
A directory path for storing temporary data for reuse when handling multiple samples.
Result output path.
File name prefix (without directory path) for the final model prediction file, which is stored in CSV format. By default, a unique file name is generated with a UUID.
The path to sequenza-utils.
The path to samtools.
The path to tabix.
The number of parallel processes for getting allele counts (optional, default=1).
If TRUE, skip finished sequenza runs.
If TRUE, skip calling sequenza. This is useful when you have already done this step and just want to run the next steps.
A list of data.table objects, returned invisibly; the corresponding files are saved to the local machine.
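The argument descriptions above can be illustrated with a minimal sketch of preparing extra_info and calling the workflow. It is not a definitive recipe. The sample name "pair01", the BAM paths, the age, and the cancer type "LUAD" are hypothetical placeholders. The sketch also assumes the function is exported by a package named gcap. The call itself is guarded so that it only runs when that package and the input files are present.

```r
# Hypothetical job name, used only for illustration.
jobname <- "pair01"

# extra_info needs the columns 'sample', 'age' and 'gender'.
# 'type' is only required when include_type = TRUE and should hold a
# TCGA cancer type abbreviation ("LUAD" is a placeholder value here).
extra_info <- data.frame(
  sample = jobname,
  age = 60,
  gender = "XY",
  type = "LUAD",
  stringsAsFactors = FALSE
)

# Sanity checks that mirror the argument descriptions.
stopifnot(
  identical(extra_info$sample, jobname),
  all(extra_info$gender %in% c("XX", "XY", 0, 1))
)

# Placeholder input paths; replace with real BAM files.
tumour_bam <- "path/to/tumour.bam"
normal_bam <- "path/to/normal.bam"

# Run only when the (assumed) gcap package and the inputs are available.
if (requireNamespace("gcap", quietly = TRUE) &&
    all(file.exists(tumour_bam, normal_bam))) {
  res <- gcap::gcap.workflow.seqz(
    tumourseqfile = tumour_bam,
    normalseqfile = normal_bam,
    jobname = jobname,
    extra_info = extra_info,
    include_type = TRUE,
    genome_build = "hg38",
    model = "XGB11",
    ref_file = "path/to/reference.fa",
    outdir = tempdir(),
    nthreads = 1
  )
}
```

The remaining arguments are left at their defaults. In practice ref_file and the three executable paths would need to point at real files on the local machine.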