GCAP sequenza workflow for gene-level amplicon prediction

gcap.workflow.seqz(
  tumourseqfile,
  normalseqfile,
  jobname,
  extra_info = NULL,
  include_type = FALSE,
  genome_build = c("hg38", "hg19", "mm10"),
  model = "XGB11",
  tightness = 1L,
  gap_cn = 3L,
  overlap = 1,
  only_oncogenes = FALSE,
  ref_file = "path/to/reference.fa",
  data_tmp_dir = "~/gcap_data",
  outdir = getwd(),
  result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)),
  util_exe = "~/miniconda3/bin/sequenza-utils",
  samtools_exe = "~/miniconda3/bin/samtools",
  tabix_exe = "~/miniconda3/bin/tabix",
  nthreads = 1,
  skip_finished_sequenza = TRUE,
  skip_sequenza_call = FALSE
)

Arguments

tumourseqfile: Full path to the tumour BAM file.
normalseqfile: Full path to the normal BAM file.
jobname: job name, typically a unique name for a tumour-normal pair.
extra_info: (optional) a data.frame, or a file containing one, with 3 columns: 'sample' (must be identical to the value of jobname), 'age' and 'gender'. Gender should be 'XX' or 'XY'; 0 for 'XX' and 1 for 'XY' are also accepted.
include_type: if TRUE, extra_info must include a fourth column named 'type' giving the cancer type as a supported TCGA cancer type abbreviation.
genome_build: genome build version, one of 'hg38', 'hg19' or 'mm10'.
model: model name ("XGB11", "XGB32", "XGB56") or a custom model supplied as input. 'toy' can be used for testing.
tightness: a coefficient by which the TCGA somatic CN is multiplied to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to noncircular instead of circular. When it is NA, TCGA somatic CN data is not used as a reference.
gap_cn: a gap copy number value. A gene with copy number above background (in general, ploidy + gap_cn) is treated as a focal amplicon. Smaller values yield more amplicons.
overlap: the overlap percentage on a gene.
only_oncogenes: if TRUE, only known oncogenes are kept for circular prediction.
ref_file: a reference genome file; it should be consistent with the genome_build option.
data_tmp_dir: a directory path for storing temporary data so that it can be reused when handling multiple samples.
outdir: result output path.
result_file_prefix: file name prefix (without directory path) for the final model prediction file, which is stored in CSV format. By default, a unique file name is generated from a UUID.
util_exe: the path to sequenza-utils.
samtools_exe: the path to samtools.
tabix_exe: the path to tabix.
nthreads: The number of parallel processes for getting allele counts (optional, default=1).
skip_finished_sequenza: if TRUE, skip finished sequenza runs.
skip_sequenza_call: if TRUE, skip calling sequenza. This is useful when that step has already been completed and you only want to run the subsequent steps.
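
The extra_info and gap_cn descriptions above can be illustrated with a minimal sketch. The job name "sample01", the age, the cancer type "LUAD" and the ploidy value are made-up values for illustration, and "LUAD" is only assumed to be among the supported TCGA abbreviations. The cutoff arithmetic simply restates the "ploidy + gap_cn" rule of thumb from the gap_cn description; it is not the package's internal code.

```r
# Hypothetical job name; the 'sample' column of extra_info must match it.
jobname <- "sample01"

# extra_info with the three documented columns.
extra_info <- data.frame(
  sample = jobname,
  age = 60,          # made-up value
  gender = "XY",     # 1 is also accepted for 'XY'; use "XX" (or 0) otherwise
  stringsAsFactors = FALSE
)

# When include_type = TRUE, a fourth column 'type' carries the TCGA
# cancer type abbreviation ("LUAD" is an assumed example).
extra_info_typed <- extra_info
extra_info_typed$type <- "LUAD"

# Illustration of the background rule described for gap_cn:
# genes with copy number above ploidy + gap_cn count as focal amplicons.
ploidy <- 2          # assumed sample ploidy
gap_cn <- 3L         # documented default
cn_cutoff <- ploidy + gap_cn

print(extra_info_typed)
print(cn_cutoff)
```

Lowering gap_cn lowers this cutoff, which is why smaller values report more amplicons.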

Value

A list of data.table objects, returned invisibly; the corresponding files are saved to the local machine.
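
Examples

A hedged sketch of a typical call, using only the arguments documented above. All file paths and the job name are hypothetical placeholders, and the call is guarded so that nothing runs unless the gcap package and the input files are actually present. Because the result is returned invisibly, it is assigned to a variable before being inspected.

```r
# Hypothetical input paths; replace them with real files.
tumour_bam <- "/data/sample01_tumour.bam"
normal_bam <- "/data/sample01_normal.bam"
ref_fa <- "/data/reference/hg38.fa"   # should match genome_build

inputs_ready <- all(file.exists(tumour_bam, normal_bam, ref_fa))

if (requireNamespace("gcap", quietly = TRUE) && inputs_ready) {
  res <- gcap::gcap.workflow.seqz(
    tumourseqfile = tumour_bam,
    normalseqfile = normal_bam,
    jobname = "sample01",
    genome_build = "hg38",
    ref_file = ref_fa,
    outdir = tempdir(),
    # A fixed prefix instead of the default UUID-based one,
    # so the CSV output is easy to locate.
    result_file_prefix = "gcap_sample01",
    nthreads = 1
  )
  # Inspect the top-level structure of the returned list.
  str(res, max.level = 1)
} else {
  message("gcap or the input files are not available; skipping the run.")
}
```

The executable paths (util_exe, samtools_exe, tabix_exe) are left at their defaults here; set them explicitly if the tools are installed elsewhere.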