Tally a variation object like MAF, CopyNumber and return a matrix for NMF de-composition and more. This is a generic function, so it can be further extended to other mutation cases. Please read details about how to set sex for identifying copy number signatures. Please read https://osf.io/s93d5/ for the generation of SBS, DBS and ID (INDEL) components.
sig_tally(object, ...)
# S3 method for class 'CopyNumber'
sig_tally(
object,
method = "Wang",
ignore_chrs = NULL,
indices = NULL,
add_loh = FALSE,
feature_setting = sigminer::CN.features,
cores = 1,
keep_only_matrix = FALSE,
...
)
# S3 method for class 'RS'
sig_tally(object, keep_only_matrix = FALSE, ...)
# S3 method for class 'MAF'
sig_tally(
object,
mode = c("SBS", "DBS", "ID", "ALL"),
ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
genome_build = NULL,
add_trans_bias = FALSE,
ignore_chrs = NULL,
use_syn = TRUE,
keep_only_matrix = FALSE,
...
)
a CopyNumber object or MAF object or SV object (from read_sv_as_rs).
custom setting for operating object. Detail see S3 method for
corresponding class (e.g. CopyNumber
).
method for feature classification, can be one of "Wang" ("W"), "S" (for method described in Steele et al. 2019), "X" (for method described in Tao et al. 2023).
Chromsomes to ignore from analysis. e.g. chrX and chrY.
integer vector indicating segments to keep.
flag to add LOH classifications.
a data.frame
used for classification.
Only used when method is "Wang" ("W").
Default is CN.features. Users can also set custom input with "feature",
"min" and "max" columns available. Valid features can be printed by
unique(CN.features$feature)
.
number of computer cores to run this task.
You can use future::availableCores()
function to check how
many cores you can use.
if TRUE
, keep only matrix for signature extraction.
For a MAF
object, this will just return the most useful matrix.
type of mutation matrix to extract, can be one of 'SBS', 'DBS' and 'ID'.
'BSgenome.Hsapiens.UCSC.hg19', 'BSgenome.Hsapiens.UCSC.hg38', 'BSgenome.Mmusculus.UCSC.mm10', 'BSgenome.Mmusculus.UCSC.mm9', etc.
genome build 'hg19', 'hg38', 'mm9' or "mm10", if not set, guess it by ref_genome
.
if TRUE
, consider transcriptional bias categories.
'T:' for Transcribed (the variant is on the transcribed strand);
'U:' for Un-transcribed (the variant is on the untranscribed strand);
'B:' for Bi-directional (the variant is on both strand and is transcribed either way);
'N:' for Non-transcribed (the variant is in a non-coding region and is untranslated);
'Q:' for Questionable.
NOTE: the result counts of 'B' and 'N' labels are a little different from
SigProfilerMatrixGenerator, the reason is unknown (may be caused by annotation file).
Logical. If TRUE
, include synonymous variants in analysis.
a list
contains a matrix
used for NMF de-composition.
For identifying copy number signatures, we have to derive copy number features firstly. Due to the difference of copy number values in sex chromosomes between male and female, we have to do an extra step if we don't want to ignore them.
I create two options to control this, the default values are shown as the following, you can use the same way to set (per R session).
options(sigminer.sex = "female", sigminer.copynumber.max = NA_integer_)
If your cohort are all females, you can totally ignore this.
If your cohort are all males, set sigminer.sex
to 'male' and
sigminer.copynumber.max
to a proper value (the best is consistent
with read_copynumber).
If your cohort contains both males and females, set sigminer.sex
as a data.frame
with two columns "sample" and "sex". And
set sigminer.copynumber.max
to a proper value (the best is consistent
with read_copynumber).
sig_tally(CopyNumber)
: Returns copy number features, components and component-by-sample matrix
sig_tally(RS)
: Returns genome rearrangement sample-by-component matrix
sig_tally(MAF)
: Returns SBS mutation sample-by-component matrix and APOBEC enrichment
Wang, Shixiang, et al. "Copy number signature analyses in prostate cancer reveal distinct etiologies and clinical outcomes." medRxiv (2020).
Steele, Christopher D., et al. "Undifferentiated sarcomas develop through distinct evolutionary pathways." Cancer Cell 35.3 (2019): 441-456.
Mayakonda, Anand, et al. "Maftools: efficient and comprehensive analysis of somatic variants in cancer." Genome research 28.11 (2018): 1747-1756.
Roberts SA, Lawrence MS, Klimczak LJ, et al. An APOBEC Cytidine Deaminase Mutagenesis Pattern is Widespread in Human Cancers. Nature genetics. 2013;45(9):970-976. doi:10.1038/ng.2702.
Bergstrom EN, Huang MN, Mahto U, Barnes M, Stratton MR, Rozen SG, Alexandrov LB: SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 2019, 20:685 https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6041-2
sig_estimate for estimating signature number for sig_extract, sig_auto_extract for extracting signatures using automatic relevance determination technique.
# Load copy number object
load(system.file("extdata", "toy_copynumber.RData",
package = "sigminer", mustWork = TRUE
))
# \donttest{
# Use method designed by Wang, Shixiang et al.
cn_tally_W <- sig_tally(cn, method = "W")
#> ℹ [2024-08-04 14:45:37.36984]: Started.
#> ℹ [2024-08-04 14:45:37.37339]: Step: getting copy number features.
#> ℹ [2024-08-04 14:45:37.436448]: Getting breakpoint count per 10 Mb...
#> ℹ [2024-08-04 14:45:37.693916]: Getting breakpoint count per chromosome arm...
#> ℹ [2024-08-04 14:45:37.865468]: Getting copy number...
#> ℹ [2024-08-04 14:45:37.872483]: Getting change-point copy number change...
#> ℹ [2024-08-04 14:45:38.017213]: Getting length of chains of oscillating copy number...
#> ℹ [2024-08-04 14:45:38.170242]: Getting (log10 based) segment size...
#> ℹ [2024-08-04 14:45:38.177954]: Getting the minimal number of chromosome with 50% CNV...
#> ℹ [2024-08-04 14:45:38.260884]: Getting burden of chromosome...
#> ✔ [2024-08-04 14:45:38.277201]: Gotten.
#> ℹ [2024-08-04 14:45:38.279085]: Step: generating copy number components.
#> ✔ [2024-08-04 14:45:38.280563]: `feature_setting` checked.
#> ℹ [2024-08-04 14:45:38.283993]: Step: counting components.
#> ✔ [2024-08-04 14:45:38.61093]: Counted.
#> ℹ [2024-08-04 14:45:38.612959]: Step: generating components by sample matrix.
#> ✔ [2024-08-04 14:45:38.61555]: Matrix generated.
#> ℹ [2024-08-04 14:45:38.617147]: 1.247 secs elapsed.
# }
# Use method designed by Steele et al.
# See example in read_copynumber
# \donttest{
# Prepare SBS signature analysis
laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
laml <- read_maf(maf = laml.maf)
#> -Reading
#> -Validating
#> -Silent variants: 475
#> -Summarizing
#> -Processing clinical data
#> --Missing clinical data
#> -Finished in 0.170s elapsed (0.224s cpu)
if (require("BSgenome.Hsapiens.UCSC.hg19")) {
mt_tally <- sig_tally(
laml,
ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
use_syn = TRUE
)
mt_tally$nmf_matrix[1:5, 1:5]
## Use strand bias categories
mt_tally <- sig_tally(
laml,
ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
use_syn = TRUE, add_trans_bias = TRUE
)
## Test it by enrichment analysis
enrich_component_strand_bias(mt_tally$nmf_matrix)
enrich_component_strand_bias(mt_tally$all_matrices$SBS_24)
} else {
message("Please install package 'BSgenome.Hsapiens.UCSC.hg19' firstly!")
}
#> ℹ [2024-08-04 14:45:38.792151]: Started.
#> ℹ [2024-08-04 14:45:38.79634]: We would assume you marked all variants' position in + strand.
#> ✔ [2024-08-04 14:45:38.798139]: Reference genome loaded.
#> ✔ [2024-08-04 14:45:38.80251]: Variants from MAF object queried.
#> ✔ [2024-08-04 14:45:38.805144]: Chromosome names checked.
#> ✔ [2024-08-04 14:45:38.808707]: Sex chromosomes properly handled.
#> ✔ [2024-08-04 14:45:38.810566]: Only variants located in standard chromosomes (1:22, X, Y, M/MT) are kept.
#> ✔ [2024-08-04 14:45:38.816245]: Variant start and end position checked.
#> ✔ [2024-08-04 14:45:38.820674]: Variant data for matrix generation preprocessed.
#> ℹ [2024-08-04 14:45:38.822259]: SBS matrix generation - start.
#> ℹ [2024-08-04 14:45:38.824213]: Extracting 5' and 3' adjacent bases.
#> ℹ [2024-08-04 14:45:39.15201]: Extracting +/- 20bp around mutated bases for background C>T estimation.
#> ℹ [2024-08-04 14:45:39.508023]: Estimating APOBEC enrichment scores.
#> ℹ [2024-08-04 14:45:39.510094]: Performing one-way Fisher's test for APOBEC enrichment.
#> ✔ [2024-08-04 14:45:39.585581]: APOBEC related mutations are enriched in 3.297% of samples (APOBEC enrichment score > 2; 6 of 182 samples)
#> ℹ [2024-08-04 14:45:39.589088]: Creating SBS sample-by-component matrices.
#> ✔ [2024-08-04 14:45:39.593516]: SBS-6 matrix created.
#> ✔ [2024-08-04 14:45:39.598554]: SBS-96 matrix created.
#> ✔ [2024-08-04 14:45:39.613017]: SBS-1536 matrix created.
#> ℹ [2024-08-04 14:45:39.614946]: Return SBS-96 as major matrix.
#> ✔ [2024-08-04 14:45:39.617526]: Done.
#> ℹ [2024-08-04 14:45:39.618997]: 0.827 secs elapsed.
#> ℹ [2024-08-04 14:45:39.620424]: Started.
#> ℹ [2024-08-04 14:45:39.624358]: We would assume you marked all variants' position in + strand.
#> ✔ [2024-08-04 14:45:39.626189]: Reference genome loaded.
#> ✔ [2024-08-04 14:45:39.630789]: Variants from MAF object queried.
#> ✔ [2024-08-04 14:45:39.633415]: Chromosome names checked.
#> ✔ [2024-08-04 14:45:39.636646]: Sex chromosomes properly handled.
#> ✔ [2024-08-04 14:45:39.638425]: Only variants located in standard chromosomes (1:22, X, Y, M/MT) are kept.
#> ✔ [2024-08-04 14:45:39.643932]: Variant start and end position checked.
#> ✔ [2024-08-04 14:45:39.648027]: Variant data for matrix generation preprocessed.
#> ℹ [2024-08-04 14:45:39.649469]: SBS matrix generation - start.
#> ℹ [2024-08-04 14:45:39.651333]: Extracting 5' and 3' adjacent bases.
#> ℹ [2024-08-04 14:45:39.975314]: Extracting +/- 20bp around mutated bases for background C>T estimation.
#> ℹ [2024-08-04 14:45:40.337088]: Estimating APOBEC enrichment scores.
#> ℹ [2024-08-04 14:45:40.339154]: Performing one-way Fisher's test for APOBEC enrichment.
#> ✔ [2024-08-04 14:45:40.426178]: APOBEC related mutations are enriched in 3.297% of samples (APOBEC enrichment score > 2; 6 of 182 samples)
#> ℹ [2024-08-04 14:45:40.429766]: Creating SBS sample-by-component matrices.
#> ✔ [2024-08-04 14:45:40.434235]: SBS-6 matrix created.
#> ✔ [2024-08-04 14:45:40.43916]: SBS-96 matrix created.
#> ✔ [2024-08-04 14:45:40.453318]: SBS-1536 matrix created.
#> ✔ [2024-08-04 14:45:40.498037]: SBS-24 (6x4) matrix created.
#> ✔ [2024-08-04 14:45:40.545156]: SBS-384 (96x4) matrix created.
#> ✔ [2024-08-04 14:45:40.66629]: SBS-6144 (1536x4) matrix created.
#> ℹ [2024-08-04 14:45:40.668228]: Return SBS-192 as major matrix.
#> ✔ [2024-08-04 14:45:40.686931]: Done.
#> ℹ [2024-08-04 14:45:40.689836]: 1.069 secs elapsed.
#> sample component N_T N_U Total_T Total_U Enrichment p_value
#> <char> <char> <int> <int> <int> <int> <num> <num>
#> 1: TCGA-AB-2838 C>T 3 7 12 8 0.4285714 0.01976661
#> 2: TCGA-AB-3002 C>T 5 11 13 14 0.4545455 0.05423790
#> 3: TCGA-AB-2950 C>T 4 2 4 6 2.0000000 0.07619048
#> 4: TCGA-AB-2878 C>T 3 4 8 4 0.7500000 0.08080808
#> 5: TCGA-AB-2863 T>C 3 0 8 9 Inf 0.08235294
#> ---
#> 1154: TCGA-AB-2908 C>G 1 1 8 10 1.0000000 1.00000000
#> 1155: TCGA-AB-2912 C>G 1 1 11 10 1.0000000 1.00000000
#> 1156: TCGA-AB-2912 T>A 1 1 11 10 1.0000000 1.00000000
#> 1157: TCGA-AB-2920 C>G 1 0 7 2 Inf 1.00000000
#> 1158: TCGA-AB-2920 C>T 6 2 7 2 3.0000000 1.00000000
#> fdr
#> <num>
#> 1: 0.1185997
#> 2: 0.3254274
#> 3: 0.4571429
#> 4: 0.4848485
#> 5: 0.4941176
#> ---
#> 1154: 1.0000000
#> 1155: 1.0000000
#> 1156: 1.0000000
#> 1157: 1.0000000
#> 1158: 1.0000000
# }