Chapter 3 Copy Number Signature Identification

3.1 Introduction

Unlike the mutation types catalogued in the current COSMIC database for generating mutational signatures, copy number alterations are hard to represent as features and to encode as a matrix for NMF decomposition.

Macintyre et al. (2018) devised a method to generate a matrix from which signatures can be extracted with the NMF algorithm. The steps are:

  • derive 6 copy number features from the absolute copy number profile.
  • apply mixture modeling to break down each feature distribution into mixtures of Gaussian or mixtures of Poisson distributions.
  • generate a sample-by-component matrix representing the sum of posterior probabilities of each copy-number event being assigned to each component.
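
The last step can be sketched in base R as follows. This is only an illustration of summing posterior probabilities per sample, not the code of Macintyre et al.; the toy events and the two-component Gaussian mixture parameters (mu, sigma, weight) are hypothetical values that would normally come from a mixture-model fit.

```r
# Toy data: one feature value (e.g. segment size in Mb) per copy-number event.
events <- data.frame(
  sample = c("S1", "S1", "S1", "S2", "S2"),
  value  = c(0.5, 1.2, 9.8, 10.5, 11.0)
)

# Hypothetical two-component Gaussian mixture; in practice these
# parameters would be estimated by a mixture-model fit.
mu     <- c(1, 10)
sigma  <- c(1, 2)
weight <- c(0.5, 0.5)

# Weighted density of every event under every component (events x components).
dens <- sapply(seq_along(mu), function(k) {
  weight[k] * dnorm(events$value, mean = mu[k], sd = sigma[k])
})

# Posterior probability of each event belonging to each component.
post <- dens / rowSums(dens)
colnames(post) <- paste0("comp", seq_along(mu))

# Sum posteriors per sample to obtain the sample-by-component matrix.
sample_by_component <- rowsum(post, group = events$sample)
sample_by_component
```

Each row of the result sums to the number of events in that sample, because every event contributes a total posterior probability of 1.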

Building on this work, our group devised a method that discards the statistical modeling and instead creates a fixed number of predefined components from 8 copy number features to generate the input matrix for NMF (Wang et al. 2020). This makes it easier to reproduce results, to apply the method to different cancer types, and to compare results. To test whether the method works, we applied it to prostate cancer and successfully identified 5 copy number signatures.
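
To give a feel for what predefined components mean, here is a minimal base R sketch that bins one feature into fixed intervals and counts segments per sample. The breaks and the labels are hypothetical and chosen for illustration only; they are not the component definitions used by sigminer.

```r
# Toy data: one feature value per segment, here the absolute copy number.
segs <- data.frame(
  sample = c("S1", "S1", "S1", "S2", "S2", "S2"),
  segVal = c(0, 2, 5, 2, 2, 12)
)

# Hypothetical predefined components: fixed intervals, not fitted to the data.
breaks <- c(-Inf, 0, 1, 2, 4, 8, Inf)
labels <- c("CN[0]", "CN[1]", "CN[2]", "CN[3-4]", "CN[5-8]", "CN[>8]")

# Assign every segment to exactly one component.
segs$component <- cut(segs$segVal, breaks = breaks, labels = labels)

# Count segments per sample and component (sample-by-component matrix).
mat <- unclass(table(segs$sample, segs$component))
mat
```

Because the intervals are fixed in advance, the same columns are produced for any cohort, which is what makes results directly comparable.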

Currently, few studies focus on copy number signatures, and there is no reference signature database for matching signatures and explaining their etiologies (the signatures presented in the two papers above can be used as references). If you study copy number signatures, you will need to do extra work to explore and validate them. Furthermore, the input absolute copy number data may be generated by different methods and platforms, so it is normal for the contribution of some copy number feature components to vary slightly, resulting in relatively lower signature similarity when you compare different cohorts or different copy number profile generation methods.
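
As a small illustration of how a modest shift in component contributions lowers similarity, the sketch below compares two signature profiles. Cosine similarity is assumed here as the similarity measure because it is a common choice for signatures; the helper cosine_sim() and both profiles are hypothetical.

```r
# Cosine similarity between two non-negative signature profiles.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical component contributions of one signature in two cohorts.
sig_cohort1 <- c(0.50, 0.30, 0.15, 0.05)
sig_cohort2 <- c(0.45, 0.30, 0.20, 0.05)

# Slightly below 1 because two components differ a little.
cosine_sim(sig_cohort1, sig_cohort2)

# Identical profiles give a similarity of 1.
cosine_sim(sig_cohort1, sig_cohort1)
```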

3.2 Read Data

The input is an absolute copy number profile with the following information:

  • Segment chromosome.
  • Segment start.
  • Segment end.
  • Absolute copy number value for this segment: must be an integer.
  • Sample ID.

The input data can be the result of any software that provides the information above.
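
Before importing, it can help to check that a table really carries this information. The sketch below builds a toy segment table and runs a few sanity checks. The helper check_seg_tab() is hypothetical and not part of sigminer, and the column names are assumptions that follow the names used in this chapter's examples.

```r
# Hypothetical segment table with the five required pieces of information.
seg_tab <- data.frame(
  chromosome = c("chr1", "chr1", "chr2"),
  start      = c(1, 5000001, 1),
  end        = c(5000000, 9000000, 7000000),
  segVal     = c(2, 3, 1),
  sample     = c("sample1", "sample1", "sample2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of sigminer): basic sanity checks.
check_seg_tab <- function(x) {
  required <- c("chromosome", "start", "end", "segVal", "sample")
  missing_cols <- setdiff(required, colnames(x))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (any(x$segVal != round(x$segVal))) {
    stop("Absolute copy number values must be integers.")
  }
  if (any(x$end < x$start)) {
    stop("Segment end must not be smaller than segment start.")
  }
  invisible(TRUE)
}

check_seg_tab(seg_tab)
```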

Software mentioned in this section includes ABSOLUTE, FACETS, and Sequenza.

The import is done by read_copynumber(), which accepts a data.frame or a file, and even a result directory from ABSOLUTE.

The option sigminer.sex controls how sex is handled. If you are not interested in the sex chromosomes (i.e. X and Y), you can ignore this setting after removing the X/Y segments; otherwise the summary in the result cn and the tally process may be biased.

## Default is "female"
## You can ignore the setting if all samples are females
## But we recommend you set it
options(sigminer.sex = "male")
## For a cohort that contains both males and females,
## set a data.frame with two columns, i.e.
## options(sigminer.sex = sex_df),
## where
## sex_df = data.frame(sample = c("sample1", "sample2"),
##                     sex = c("female", "male"))

# Load a toy dataset of absolute copy number profiles
load(system.file("extdata", "toy_segTab.RData",
  package = "sigminer", mustWork = TRUE
))
cn <- read_copynumber(segTabs,
  seg_cols = c("chromosome", "start", "end", "segVal"),
  genome_build = "hg19", complement = FALSE, verbose = TRUE
)
#> ℹ [2021-05-18 23:16:54]: Started.
#> ℹ [2021-05-18 23:16:54]: Genome build  : hg19.
#> ℹ [2021-05-18 23:16:54]: Genome measure: called.
#> ✓ [2021-05-18 23:16:54]: Chromosome size database for build obtained.
#> ℹ [2021-05-18 23:16:54]: Reading input.
#> ✓ [2021-05-18 23:16:54]: A data frame as input detected.
#> ✓ [2021-05-18 23:16:54]: Column names checked.
#> ✓ [2021-05-18 23:16:54]: Column order set.
#> ✓ [2021-05-18 23:16:54]: Chromosomes unified.
#> ✓ [2021-05-18 23:16:54]: Data imported.
#> ℹ [2021-05-18 23:16:54]: Segments info:
#> ℹ [2021-05-18 23:16:54]:     Keep - 467
#> ℹ [2021-05-18 23:16:54]:     Drop - 0
#> ✓ [2021-05-18 23:16:54]: Segments sorted.
#> ℹ [2021-05-18 23:16:54]: Joining adjacent segments with same copy number value. Be patient...
#> ✓ [2021-05-18 23:16:54]: 400 segments left after joining.
#> ✓ [2021-05-18 23:16:54]: Segmental table cleaned.
#> ℹ [2021-05-18 23:16:54]: Annotating.
#> ✓ [2021-05-18 23:16:54]: Annotation done.
#> ℹ [2021-05-18 23:16:54]: Summarizing per sample.
#> ✓ [2021-05-18 23:16:54]: Summarized.
#> ℹ [2021-05-18 23:16:54]: Generating CopyNumber object.
#> ✓ [2021-05-18 23:16:54]: Generated.
#> ℹ [2021-05-18 23:16:54]: Validating object.
#> ✓ [2021-05-18 23:16:54]: Done.
#> ℹ [2021-05-18 23:16:54]: 0.215 secs elapsed.
cn
#> An object of class CopyNumber 
#> =============================
#>                           sample n_of_seg n_of_cnv n_of_amp n_of_del n_of_vchr cna_burden
#>  1: TCGA-DF-A2KN-01A-11D-A17U-01       33        6        5        1         4      0.000
#>  2: TCGA-19-2621-01B-01D-0911-01       33        8        5        3         5      0.099
#>  3: TCGA-B6-A0X5-01A-21D-A107-01       28        8        4        4         2      0.087
#>  4: TCGA-A8-A07S-01A-11D-A036-01       38       11        2        9         4      0.112
#>  5: TCGA-26-6174-01A-21D-1842-01       43       13        8        5         8      0.119
#>  6: TCGA-CV-7432-01A-11D-2128-01       40       16        7        9         9      0.198
#>  7: TCGA-06-0644-01A-02D-0310-01       46       19        5       14         8      0.165
#>  8: TCGA-A5-A0G2-01A-11D-A042-01       39       21        5       16        10      0.393
#>  9: TCGA-99-7458-01A-11D-2035-01       48       26       10       16        13      0.318
#> 10: TCGA-05-4417-01A-22D-1854-01       52       37       33        4        17      0.654

Currently, you can refer to extract_facets_cnv() and extract_seqz_cnv() in https://github.com/ShixiangWang/prad_signature/blob/master/analysis/src/99-functions.R to see how to get tidy data from a result directory of FACETS or Sequenza.

3.3 Tally Components

Currently, there are two methods for generating the sample-by-component matrix.

The option sigminer.copynumber.max controls how maximum copy number values are handled. Run ?sig_tally to learn more.

## Even if you set max_copynumber = 20 in read_copynumber(),
## the segmental copy number may be greater than 20
## because for male samples, the X/Y segmental copy number
## values will be doubled in the tally process.
## This setting ensures that the copy number value of every
## segment is not greater than 20.
options(sigminer.copynumber.max = 20)

# Load copy number object
load(system.file("extdata", "toy_copynumber.RData",
  package = "sigminer", mustWork = TRUE
))

# Use method designed by Wang, Shixiang et al.
cn_tally_W <- sig_tally(cn, method = "W")

You can set options(sigminer.sex = "male", sigminer.copynumber.max = 20) at the top of your code to avoid setting them in two places.
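
The combined effect of the two options can be pictured with a small base R sketch: X/Y values of male samples are doubled, then every value is capped at the maximum. This only illustrates the behavior described in the comments above and is not sigminer's internal code; the toy table and the names sex, max_cn and segVal_tally are hypothetical.

```r
# Toy segments; column names follow the examples in this chapter.
segs <- data.frame(
  sample     = c("sample1", "sample1", "sample2"),
  chromosome = c("chr1", "chrX", "chrX"),
  segVal     = c(25, 12, 12)
)

# Hypothetical per-sample sex and maximum copy number.
sex    <- c(sample1 = "male", sample2 = "female")
max_cn <- 20

# Double X/Y values for male samples only.
is_male_xy <- unname(sex[segs$sample]) == "male" &
  segs$chromosome %in% c("chrX", "chrY")
adjusted <- ifelse(is_male_xy, segs$segVal * 2, segs$segVal)

# Cap every value at the maximum.
segs$segVal_tally <- pmin(adjusted, max_cn)
segs
```

In this toy table, the male chrX segment is doubled from 12 to 24 and then capped at 20, while the female chrX segment keeps its value of 12.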

Of note, the sigminer.copynumber.max option only affects sig_tally() with method "W", whereas the sigminer.sex option affects both read_copynumber() and sig_tally() with method "W".

The tally step returns a list containing information about copy number features, components, the matrix for NMF, etc.

3.4 Extract Signatures

Once you have the matrix, you can extract signatures in the same way as for SBS signatures (see Chapter 2), so we will not go into detail here.

cn_tally_W$nmf_matrix[1:5, 1:5]
#>                              BP10MB[0] BP10MB[1] BP10MB[2] BP10MB[3] BP10MB[4]
#> TCGA-05-4417-01A-22D-1854-01       275        20         5         0         0
#> TCGA-06-0644-01A-02D-0310-01       289         5         4         0         1
#> TCGA-19-2621-01B-01D-0911-01       294         2         3         1         0
#> TCGA-26-6174-01A-21D-1842-01       288         4         7         1         0
#> TCGA-99-7458-01A-11D-2035-01       284         9         5         1         1
# library(NMF)
sig_w <- sig_extract(cn_tally_W$nmf_matrix, n_sig = 2, pConstant = 1e-13)
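
For readers who want to see what an NMF decomposition does to such a matrix, here is a minimal base R sketch using Lee-Seung multiplicative updates. It is not how sig_extract() is implemented; the toy matrix, the random initialization, the iteration count and the small constants are all assumptions made for illustration.

```r
set.seed(1)

# Toy non-negative sample-by-component count matrix (4 samples x 6 components).
V <- matrix(rpois(24, lambda = 5), nrow = 4,
            dimnames = list(paste0("S", 1:4), paste0("C", 1:6)))

# Arrange components in rows and samples in columns, and add a tiny
# constant so that no entry is exactly zero (an assumption of this sketch).
A <- t(V) + 1e-13

n_sig <- 2
W <- matrix(runif(nrow(A) * n_sig), nrow = nrow(A))  # components x signatures
H <- matrix(runif(n_sig * ncol(A)), nrow = n_sig)    # signatures x samples

# Frobenius norm of the reconstruction error.
frob <- function(A, W, H) sqrt(sum((A - W %*% H)^2))
err_start <- frob(A, W, H)

# Lee-Seung multiplicative updates for the Frobenius objective;
# eps guards against division by zero.
eps <- 1e-9
for (i in seq_len(200)) {
  H <- H * (t(W) %*% A) / (t(W) %*% W %*% H + eps)
  W <- W * (A %*% t(H)) / (W %*% H %*% t(H) + eps)
}
err_end <- frob(A, W, H)

c(start = err_start, end = err_end)
```

The columns of W play the role of signatures and the rows of H the role of per-sample exposures; the updates keep both matrices non-negative while reducing the reconstruction error.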