The function performs a signature decomposition of a given mutational catalogue V with known signatures W by solving the minimization problem
min(||W*H - V||)
where W and V are known and the exposure matrix H is estimated.
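As a rough illustration of the problem above, the following base R sketch estimates H one sample at a time with a box-constrained optimizer (optim with method "L-BFGS-B" and lower = 0). It is not how sig_fit is implemented: the helper fit_one_sample is a hypothetical name introduced here, and the toy W and V mirror the ones used in the Examples.

```r
# Toy signatures (columns sum to 1) and exposures, as in the Examples section.
W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
W <- apply(W, 2, function(x) x / sum(x))
H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4)
V <- W %*% H

# Hypothetical helper (not part of sigminer): fit one sample v by minimizing
# the squared error ||W h - v||^2 subject to h >= 0.
fit_one_sample <- function(v, W) {
  obj <- function(h) sum((W %*% h - v)^2)
  grad <- function(h) as.vector(2 * crossprod(W, W %*% h - v))
  optim(rep(1, ncol(W)), obj, grad, method = "L-BFGS-B", lower = 0)$par
}

# One column of estimated exposures per sample.
H_hat <- apply(V, 2, fit_one_sample, W = W)

# Largest reconstruction residual; expected to be small because V was built from W.
max(abs(W %*% H_hat - V))
```

Minimizing the squared norm gives the same minimizer as the norm itself, and the lower bound of 0 keeps the exposures non-negative.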
sig_fit(
catalogue_matrix,
sig,
sig_index = NULL,
sig_db = c("legacy", "SBS", "DBS", "ID", "TSB", "SBS_Nik_lab", "RS_Nik_lab",
"RS_BRCA560", "RS_USARC", "CNS_USARC", "CNS_TCGA", "CNS_TCGA176", "CNS_PCAWG176",
"SBS_hg19", "SBS_hg38", "SBS_mm9", "SBS_mm10", "DBS_hg19", "DBS_hg38", "DBS_mm9",
"DBS_mm10", "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ", "latest_SBS_GRCh37",
"latest_DBS_GRCh37", "latest_ID_GRCh37", "latest_SBS_GRCh38", "latest_DBS_GRCh38",
"latest_SBS_mm9", "latest_DBS_mm9", "latest_SBS_mm10", "latest_DBS_mm10",
"latest_SBS_rn6", "latest_DBS_rn6", "latest_CN_GRCh37",
"latest_RNA-SBS_GRCh37", "latest_SV_GRCh38"),
db_type = c("", "human-exome", "human-genome"),
show_index = TRUE,
method = c("QP", "NNLS", "SA"),
auto_reduce = FALSE,
type = c("absolute", "relative"),
return_class = c("matrix", "data.table"),
return_error = FALSE,
rel_threshold = 0,
mode = c("SBS", "DBS", "ID", "copynumber"),
true_catalog = NULL,
...
)
a numeric matrix V with rows representing components and columns representing samples. Typically, you can get nmf_matrix from sig_tally() and transpose it with t().
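For instance, assuming a samples-by-components matrix shaped like nmf_matrix (the values and names below are made up for illustration), the transposition step might look like this:

```r
# Hypothetical stand-in for the nmf_matrix obtained from sig_tally():
# rows are samples, columns are components.
nmf_matrix <- matrix(
  c(10, 4, 7,
     3, 8, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("samp1", "samp2"), c("comp1", "comp2", "comp3"))
)

# The fitting input needs components in rows and samples in columns.
catalogue_matrix <- t(nmf_matrix)
dim(catalogue_matrix)
```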
a Signature object obtained from either sig_extract or sig_auto_extract, or just a raw signature matrix/data.frame with rows representing components (motifs) and columns representing signatures.
a vector for signature index. "ALL" for all signatures.
default 'legacy'. It can be 'legacy' (for COSMIC v2 'SBS'),
'SBS', 'DBS', 'ID' and 'TSB' (for COSMIC v3.1 signatures)
for small scale mutations.
For more specific details, it can also be 'SBS_hg19', 'SBS_hg38',
'SBS_mm9', 'SBS_mm10', 'DBS_hg19', 'DBS_hg38', 'DBS_mm9', 'DBS_mm10' to use
COSMIC v3 reference signatures from Alexandrov, Ludmil B., et al. (2020) (reference #1).
In addition, it can be one of "SBS_Nik_lab_Organ", "RS_Nik_lab_Organ",
"SBS_Nik_lab", "RS_Nik_lab" to refer to reference signatures from
Degasperi, Andrea, et al. (2020) (reference #2);
"RS_BRCA560", "RS_USARC" to refer to reference signatures from the BRCA560 and USARC cohorts;
"CNS_USARC" (40 categories), "CNS_TCGA" (48 categories) to refer to copy number signatures from the USARC cohort and TCGA;
"CNS_TCGA176" (176 categories) and "CNS_PCAWG176" (176 categories) to refer to copy number signatures from TCGA and PCAWG, respectively.
Update: the latest version of the reference signatures can be automatically
downloaded and loaded from https://cancer.sanger.ac.uk/signatures/downloads/
when an option with the latest_ prefix is specified (e.g. "latest_SBS_GRCh37").
Note: the signature profiles for different genome builds are basically the same,
and a specific database (e.g. 'SBS_mm10') contains fewer signatures than the full COSMIC
set (because some signatures were not detected in Alexandrov, Ludmil B., et al. (2020)).
For all available options, check the parameter setting.
only used when sig_db is enabled.
"" for keeping default, "human-exome" for transforming to exome frequency of component,
and "human-genome" for transforming to whole genome frequency of component.
Currently only works for 'SBS'.
if TRUE, show valid indices.
method to solve the minimization problem. 'NNLS' for non-negative least squares; 'QP' for quadratic programming; 'SA' for simulated annealing.
if TRUE, try reducing the input reference signatures to increase
the cosine similarity of the reconstructed profile to the observed profile.
'absolute' for signature exposure and 'relative' for signature relative exposure.
string, 'matrix' or 'data.table'.
if TRUE, also return sample error (Frobenius norm) and cosine
similarity between observed sample profile (a.k.a. spectrum) and reconstructed profile. NOTE:
it is better to obtain the error when the type is 'absolute', because the error is
affected by relative exposure accuracy.
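A minimal sketch of these two quantities for a single sample, assuming the error is the Euclidean (Frobenius) norm of the difference between the observed and reconstructed profiles. The vectors below are made up, and sig_fit may differ in detail:

```r
# Illustrative observed and reconstructed profiles for one sample.
observed      <- c(10, 20, 30)
reconstructed <- c(12, 18, 30)

# Frobenius norm of the residual (for a vector, the Euclidean norm).
fit_error <- sqrt(sum((observed - reconstructed)^2))

# Cosine similarity between the two profiles.
cosine_sim <- sum(observed * reconstructed) /
  (sqrt(sum(observed^2)) * sqrt(sum(reconstructed^2)))
```

The norm depends on the scale of the profiles while the cosine similarity does not, which is one reason to look at both.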
numeric vector. A signature with relative exposure lower than or equal to (i.e. <=) this value will be set to 0
(both absolute exposure and relative exposure).
In this case, the sum of signature contributions may not equal 1.
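The thresholding rule described above can be sketched in base R as follows. The exposure values and variable names are illustrative, and the code only mirrors the documented behaviour rather than the package's internal implementation:

```r
# Absolute exposures: signatures in rows, samples in columns.
expo_abs <- matrix(c(95, 5, 40, 60), nrow = 2,
                   dimnames = list(c("sig1", "sig2"), c("samp1", "samp2")))

# Relative exposures: each sample (column) scaled to sum to 1.
expo_rel <- sweep(expo_abs, 2, colSums(expo_abs), "/")

rel_threshold <- 0.1

# Entries with relative exposure <= threshold are zeroed in both matrices.
below <- expo_rel <= rel_threshold
expo_abs[below] <- 0
expo_rel[below] <- 0

# After thresholding, samp1 no longer sums to 1.
colSums(expo_rel)
```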
signature type for plotting, now supports 'copynumber', 'SBS', 'DBS', 'ID' and 'RS' (genome rearrangement signature).
used by sig_fit_bootstrap; users should never set it.
control parameters passed to the argument control of the GenSA function when method 'SA' is used.
The exposure result, in either matrix or data.table format.
If return_error is set to TRUE, a list is returned.
The method 'NNLS' solves the minimization problem with non-negative least-squares constraints. The methods 'QP' and 'SA' are modified from the SignatureEstimation package. See references for details. Of note, when fitting exposures for copy number signatures, only components of feature CN are used.
Daniel Huebschmann, Zuguang Gu and Matthias Schlesner (2019). YAPSA: Yet Another Package for Signature Analysis. R package version 1.12.0.
Huang X, Wojtowicz D, Przytycka TM. Detecting presence of mutational signatures in cancer with confidence. Bioinformatics. 2018;34(2):330–337. doi:10.1093/bioinformatics/btx604
Kim, Jaegil, et al. "Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors." Nature genetics 48.6 (2016): 600.
# \donttest{
W <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2)
colnames(W) <- c("sig1", "sig2")
W <- apply(W, 2, function(x) x / sum(x))
H <- matrix(c(2, 5, 3, 6, 1, 9, 1, 2), ncol = 4)
colnames(H) <- paste0("samp", 1:4)
V <- W %*% H
V
#> samp1 samp2 samp3 samp4
#> [1,] 1.666667 2.1 2.566667 0.7
#> [2,] 2.333333 3.0 3.333333 1.0
#> [3,] 3.000000 3.9 4.100000 1.3
if (requireNamespace("quadprog", quietly = TRUE)) {
H_infer <- sig_fit(V, W, method = "QP")
H_infer
H
H_dt <- sig_fit(V, W, method = "QP", auto_reduce = TRUE, return_class = "data.table")
H_dt
## Show results
show_sig_fit(H_infer)
show_sig_fit(H_dt)
## Get clusters/groups
H_dt_rel <- sig_fit(V, W, return_class = "data.table", type = "relative")
z <- get_groups(H_dt_rel, method = "k-means")
show_groups(z)
}
#> ℹ [2024-03-13 10:53:39.49291]: Started.
#> ℹ [2024-03-13 10:53:39.494659]: Signature index not detected.
#> ✔ [2024-03-13 10:53:39.496121]: Signature matrix/data.frame detected.
#> ✔ [2024-03-13 10:53:39.497574]: Database and index checked.
#> ✔ [2024-03-13 10:53:39.499181]: Signature normalized.
#> ℹ [2024-03-13 10:53:39.500638]: Checking row number for catalog matrix and signature matrix.
#> ✔ [2024-03-13 10:53:39.502008]: Checked.
#> ✔ [2024-03-13 10:53:39.503396]: Method 'QP' detected.
#> ✔ [2024-03-13 10:53:39.50475]: Corresponding function generated.
#> ℹ [2024-03-13 10:53:39.506074]: Calling function.
#> ℹ [2024-03-13 10:53:39.507678]: Fitting sample: samp1
#> ℹ [2024-03-13 10:53:39.509166]: Fitting sample: samp2
#> ℹ [2024-03-13 10:53:39.510616]: Fitting sample: samp3
#> ℹ [2024-03-13 10:53:39.512077]: Fitting sample: samp4
#> ✔ [2024-03-13 10:53:39.51349]: Done.
#> ℹ [2024-03-13 10:53:39.514842]: Generating output signature exposures.
#> ✔ [2024-03-13 10:53:39.516775]: Done.
#> ℹ [2024-03-13 10:53:39.518198]: 0.025 secs elapsed.
#> ℹ [2024-03-13 10:53:39.519676]: Started.
#> ℹ [2024-03-13 10:53:39.521026]: Signature index not detected.
#> ✔ [2024-03-13 10:53:39.522402]: Signature matrix/data.frame detected.
#> ✔ [2024-03-13 10:53:39.523793]: Database and index checked.
#> ✔ [2024-03-13 10:53:39.525249]: Signature normalized.
#> ℹ [2024-03-13 10:53:39.526602]: Checking row number for catalog matrix and signature matrix.
#> ✔ [2024-03-13 10:53:39.527973]: Checked.
#> ✔ [2024-03-13 10:53:39.529308]: Method 'QP' detected.
#> ✔ [2024-03-13 10:53:39.530639]: Corresponding function generated.
#> ℹ [2024-03-13 10:53:39.532011]: Calling function.
#> ℹ [2024-03-13 10:53:39.533556]: Fitting sample: samp1
#> ✔ [2024-03-13 10:53:39.535044]: The cosine similarity is very high, just return result.
#> ℹ [2024-03-13 10:53:39.536447]: Fitting sample: samp2
#> ✔ [2024-03-13 10:53:39.537874]: The cosine similarity is very high, just return result.
#> ℹ [2024-03-13 10:53:39.53924]: Fitting sample: samp3
#> ✔ [2024-03-13 10:53:39.540682]: The cosine similarity is very high, just return result.
#> ℹ [2024-03-13 10:53:39.542042]: Fitting sample: samp4
#> ✔ [2024-03-13 10:53:39.543491]: The cosine similarity is very high, just return result.
#> ✔ [2024-03-13 10:53:39.544858]: Done.
#> ℹ [2024-03-13 10:53:39.546206]: Generating output signature exposures.
#> ✔ [2024-03-13 10:53:39.556133]: Done.
#> ℹ [2024-03-13 10:53:39.557641]: 0.038 secs elapsed.
#> ℹ [2024-03-13 10:53:39.559424]: Started.
#> ℹ [2024-03-13 10:53:39.560828]: Checking input format.
#> ✔ [2024-03-13 10:53:39.569675]: Checked.
#> ℹ [2024-03-13 10:53:39.57115]: Checking filters.
#> ℹ [2024-03-13 10:53:39.572593]: Checked.
#> ℹ [2024-03-13 10:53:39.576726]: Plotting.
#> ℹ [2024-03-13 10:53:39.623562]: 0.064 secs elapsed.
#> ℹ [2024-03-13 10:53:39.625447]: Started.
#> ℹ [2024-03-13 10:53:39.627068]: Checking input format.
#> ✔ [2024-03-13 10:53:39.628686]: Checked.
#> ℹ [2024-03-13 10:53:39.630128]: Checking filters.
#> ℹ [2024-03-13 10:53:39.631566]: Checked.
#> ℹ [2024-03-13 10:53:39.636372]: Plotting.
#> ℹ [2024-03-13 10:53:39.682562]: 0.057 secs elapsed.
#> ℹ [2024-03-13 10:53:39.68447]: Started.
#> ℹ [2024-03-13 10:53:39.68604]: Signature index not detected.
#> ✔ [2024-03-13 10:53:39.68758]: Signature matrix/data.frame detected.
#> ✔ [2024-03-13 10:53:39.689054]: Database and index checked.
#> ✔ [2024-03-13 10:53:39.69057]: Signature normalized.
#> ℹ [2024-03-13 10:53:39.691981]: Checking row number for catalog matrix and signature matrix.
#> ✔ [2024-03-13 10:53:39.693363]: Checked.
#> ✔ [2024-03-13 10:53:39.694733]: Method 'QP' detected.
#> ✔ [2024-03-13 10:53:39.696114]: Corresponding function generated.
#> ℹ [2024-03-13 10:53:39.69754]: Calling function.
#> ℹ [2024-03-13 10:53:39.69912]: Fitting sample: samp1
#> ℹ [2024-03-13 10:53:39.700654]: Fitting sample: samp2
#> ℹ [2024-03-13 10:53:39.7021]: Fitting sample: samp3
#> ℹ [2024-03-13 10:53:39.703554]: Fitting sample: samp4
#> ✔ [2024-03-13 10:53:39.704977]: Done.
#> ℹ [2024-03-13 10:53:39.706329]: Generating output signature exposures.
#> ✔ [2024-03-13 10:53:39.716386]: Done.
#> ℹ [2024-03-13 10:53:39.717954]: 0.033 secs elapsed.
#> ℹ [2024-03-13 10:53:39.719489]: Started.
#> ✔ [2024-03-13 10:53:39.720892]: A 'data.table' detected.
#> ✔ [2024-03-13 10:53:39.722264]: Method checked.
#> ✔ [2024-03-13 10:53:39.723783]: Exposure should be relative checked.
#> ℹ [2024-03-13 10:53:39.725341]: Running k-means with 2 clusters...
#> ℹ [2024-03-13 10:53:39.727329]: Generating a table of group and signature contribution (stored in 'map_table' attr):
#> sig1 sig2
#> 1 0.31746 0.68254
#> 2 0.10000 0.90000
#> ℹ [2024-03-13 10:53:39.728767]: Assigning a group to a signature with the maximum fraction...
#> ℹ [2024-03-13 10:53:39.732564]: Summarizing...
#> group #1: 3 samples with sig2 enriched.
#> group #2: 1 samples with sig2 enriched.
#> ! [2024-03-13 10:53:39.734486]: The 'enrich_sig' column is set to dominant signature in one group, please check and make it consistent with biological meaning (correct it by hand if necessary).
#> ℹ [2024-03-13 10:53:39.735925]: 0.016 secs elapsed.
# if (requireNamespace("GenSA", quietly = TRUE)) {
# H_infer <- sig_fit(V, W, method = "SA")
# H_infer
# H
#
# H_dt <- sig_fit(V, W, method = "SA", return_class = "data.table")
# H_dt
#
# ## Modify arguments to method
# sig_fit(V, W, method = "SA", maxit = 10, temperature = 100)
#
# ## Show results
# show_sig_fit(H_infer)
# show_sig_fit(H_dt)
# }
# }