GCAP sequenza workflow for gene-level amplicon prediction

gcap.workflow.seqz(
  tumourseqfile,
  normalseqfile,
  jobname,
  extra_info = NULL,
  include_type = FALSE,
  genome_build = c("mm10", "hg38", "hg19"),
  model = "XGB11",
  tightness = 1L,
  gap_cn = 3L,
  overlap = 1,
  only_oncogenes = FALSE,
  ref_file = "path/to/reference.fa",
  data_tmp_dir = "~/gcap_data",
  outdir = getwd(),
  result_file_prefix = paste0("gcap_", uuid::UUIDgenerate(TRUE)),
  util_exe = "~/miniconda3/bin/sequenza-utils",
  samtools_exe = "~/miniconda3/bin/samtools",
  tabix_exe = "~/miniconda3/bin/tabix",
  nthreads = 1,
  skip_finished_sequenza = TRUE,
  skip_sequenza_call = FALSE
)

Arguments

tumourseqfile

Full path to the tumour BAM file.

normalseqfile

Full path to the normal BAM file.

jobname

job name, typically a unique name for a tumor-normal pair.

extra_info

(optional) a data.frame, or a file containing one, with 3 columns: 'sample' (must be identical to the value of the jobname parameter), 'age' and 'gender'. Gender should be 'XX' or 'XY'; alternatively, use 0 for 'XX' and 1 for 'XY'.
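
A minimal sketch of an extra_info table built with base R. The job name, age and gender values are invented for illustration; the only constraints taken from the description above are that 'sample' matches jobname and that 'gender' is 'XX' or 'XY' (or 0 or 1).

```r
# Hypothetical job name; it must match the `jobname` argument.
jobname <- "patient01"

# One row per tumor-normal pair: sample, age, gender.
extra_info <- data.frame(
  sample = jobname,
  age = 60,
  gender = "XY",
  stringsAsFactors = FALSE
)

# Basic sanity checks before passing the table to the workflow.
stopifnot(
  identical(extra_info$sample, jobname),
  all(extra_info$gender %in% c("XX", "XY", 0, 1))
)
```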

include_type

if TRUE, a fourth column named 'type' should be included in extra_info. The cancer type should be given as a supported TCGA cancer type abbreviation.

genome_build

genome build version; should be one of 'hg38', 'hg19' or 'mm10'.

model

model name ("XGB11", "XGB32", "XGB56") or a custom model supplied as input. 'toy' can be used for testing.

tightness

a coefficient by which the TCGA somatic CN is multiplied to set a stricter threshold for calling a circular amplicon. The larger the value, the more likely an fCNA is assigned to noncircular instead of circular. When it is NA, TCGA somatic CN data is not used as a reference.

gap_cn

a gap copy number value. A gene with copy number above background (in general, ploidy + gap_cn) is treated as a focal amplicon. A smaller value yields more amplicons.
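
The threshold described above can be illustrated with a small base R calculation. The ploidy and per-gene copy numbers are invented values, and the strict comparison is a simplified reading of "above ploidy + gap_cn", not necessarily the package's actual implementation.

```r
# Hypothetical sample ploidy and per-gene total copy numbers.
ploidy <- 2
gene_cn <- c(MYC = 12, EGFR = 5, TP53 = 2)

# Background threshold with the default gap_cn of 3.
gap_cn <- 3L
threshold <- ploidy + gap_cn

# Genes strictly above the threshold are candidate focal amplicons.
is_amplified <- gene_cn > threshold
names(gene_cn)[is_amplified]
```

With these made-up numbers only MYC passes. Lowering gap_cn to 2 would also flag EGFR, which is the sense in which a smaller value yields more amplicons.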

overlap

the overlap percentage on a gene.

only_oncogenes

if TRUE, only known oncogenes are kept for circular prediction.

ref_file

a reference genome file; it should be consistent with the genome_build option.

data_tmp_dir

a directory path for storing temporary data so it can be reused when handling multiple samples.

outdir

result output path.

result_file_prefix

file name prefix (without directory path) for storing the final model prediction file in CSV format. By default, a unique file name is generated using a UUID.

util_exe

the path to sequenza-utils.

samtools_exe

the path to samtools.

tabix_exe

the path to tabix.

nthreads

The number of parallel processes for getting allele counts (optional, default=1).

skip_finished_sequenza

if TRUE, skip finished sequenza runs.

skip_sequenza_call

if TRUE, skip calling sequenza. This is useful when you have done this step and just want to run next steps.

Value

a list of data.table objects, returned invisibly; the corresponding files are saved to the local machine.
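
Examples

A hedged usage sketch, assuming the gcap package is installed and that the BAM and reference files exist. All paths and the job name are placeholders, not files shipped with the package. The arguments are collected in a list so they can be inspected first, and the call is attempted only when the inputs are present.

```r
# Placeholder inputs; replace with real paths before running.
args <- list(
  tumourseqfile = "/data/patient01_tumour.bam",
  normalseqfile = "/data/patient01_normal.bam",
  jobname = "patient01",
  genome_build = "hg38",
  ref_file = "/data/reference/hg38.fa",
  outdir = tempdir(),
  nthreads = 1
)

# Only run when the package and input files are actually available.
inputs_ok <- all(file.exists(
  args$tumourseqfile, args$normalseqfile, args$ref_file
))
if (requireNamespace("gcap", quietly = TRUE) && inputs_ok) {
  res <- do.call(gcap::gcap.workflow.seqz, args)
}
```

The reference file is chosen to match genome_build, as the ref_file description requires; the remaining arguments keep their defaults.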