`R/sig_auto_extract.R`

`sig_auto_extract.Rd`

A Bayesian variant of the NMF algorithm that enables optimal inference of the
number of signatures through the automatic relevance determination technique.
This function delivers highly interpretable and sparse representations for
both signature profiles and attributions, balancing data fitting against
model complexity. Note that this method may introduce more signatures than expected,
especially for copy number signatures, so **I don't recommend using this feature
to extract copy number signatures**. See the details part and the references for more.

    sig_auto_extract(
      nmf_matrix = NULL,
      result_prefix = "BayesNMF",
      destdir = tempdir(),
      method = c("L1W.L2H", "L1KL", "L2KL"),
      strategy = c("stable", "optimal", "ms"),
      ref_sigs = NULL,
      K0 = 25,
      nrun = 10,
      niter = 2e+05,
      tol = 1e-07,
      cores = 1,
      optimize = FALSE,
      skip = FALSE,
      recover = FALSE
    )

Argument | Description |
---|---|
nmf_matrix | a |
result_prefix | prefix for result data files. |
destdir | path to save data runs, default is `tempdir()`. |
method | default is "L1W.L2H", which uses an exponential prior for W and a half-normal prior for H (this method is used by the PCAWG project, see reference #3). You can also use "L1KL" to set exponential priors for both W and H, and "L2KL" to set half-normal priors for both W and H. The latter two methods were originally implemented in the SignatureAnalyzer software. |
strategy | the selection strategy for returned data. Set 'stable' to get the optimal result from the most frequent K. Set 'optimal' to get the optimal result from all Ks. Set 'ms' to get the result with the maximum mean cosine similarity with the provided reference signatures. See |
ref_sigs | a Signature object, matrix or string specifying the reference signatures, only used when `strategy = "ms"`. |
K0 | number of initial signatures. |
nrun | number of independent simulations. |
niter | the maximum number of iterations. |
tol | tolerance for convergence. |
cores | number of CPU cores to run NMF. |
optimize | if |
skip | if |
recover | if |

A `list` with `Signature` class.

There are three methods available in this function: "L1W.L2H", "L1KL" and "L2KL".
They use different priors for the Bayesian variant of the NMF algorithm
(see the `method` parameter), which was described in reference #1 and implemented
in the SignatureAnalyzer software (reference #2).
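To make the difference between the priors more concrete, the sketch below compares the negative log-density of an exponential prior with that of a half-normal prior using base R only. It is an illustration, not the code used by this function: the rate and standard deviation of 1 are arbitrary assumptions, and the variable names are hypothetical.

```r
# Illustration only: rate = 1 and sd = 1 are arbitrary assumptions.
x <- seq(0, 5, by = 0.5)

# Exponential prior: the negative log-density grows linearly in x,
# which acts like an L1-type penalty on the entries.
neg_log_exp <- -dexp(x, rate = 1, log = TRUE)

# Half-normal prior: the density is 2 * dnorm(x) for x >= 0, so the
# negative log-density grows quadratically in x (an L2-type penalty).
neg_log_halfnorm <- -(log(2) + dnorm(x, mean = 0, sd = 1, log = TRUE))

penalties <- data.frame(
  x = x,
  exponential = neg_log_exp,
  half_normal = neg_log_halfnorm
)
print(penalties)
```

The linear versus quadratic growth is one way to read the "L1" and "L2" parts of the method names.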

I copied the source code for the three methods from the Broad Institute and the supplementary
files of reference #3, and wrote this higher-level function. Combined with other functions
in the **sigminer** package, it makes it easier for users to extract, visualize and analyze
signatures. I also implemented parallel computation to speed up the calculation, and gave
the function an input and output structure similar to `sig_extract()`.
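The three `strategy` values can be pictured with a small base R sketch. This is not the package's internal code. The `runs` table, its `score` column (assumed to be a quality measure where higher is better) and every helper name are hypothetical, and the way `mean_best_similarity()` matches each extracted signature to its closest reference is only one plausible reading of "maximum mean cosine similarity".

```r
# Hypothetical per-run summary: K is the number of signatures each run
# converged to; `score` is an assumed quality measure (higher is better).
runs <- data.frame(
  run   = 1:6,
  K     = c(4, 5, 4, 4, 5, 6),
  score = c(-120, -100, -110, -130, -105, -140)
)

# "optimal": take the best-scoring run across all K.
pick_optimal <- function(runs) {
  runs$run[which.max(runs$score)]
}

# "stable": restrict to the most frequent K, then take the best-scoring run.
pick_stable <- function(runs) {
  k_counts <- table(runs$K)
  top_k <- as.numeric(names(k_counts)[which.max(k_counts)])
  candidates <- runs[runs$K == top_k, ]
  candidates$run[which.max(candidates$score)]
}

# Cosine similarity between two signature profiles (numeric vectors).
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# "ms": for one run's signature matrix (components x signatures), match each
# extracted signature to its most similar reference and average the values.
# The run with the largest average would then be selected.
mean_best_similarity <- function(sigs, refs) {
  best <- apply(sigs, 2, function(s) max(apply(refs, 2, cosine, b = s)))
  mean(best)
}

pick_optimal(runs)
pick_stable(runs)
```

With this toy table the two strategies disagree: 'optimal' picks run 2 (the best score overall), while 'stable' picks run 3 (the best score among the runs with the most frequent K, which is 4).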

Tan, Vincent YF, and Cédric Févotte. "Automatic relevance determination in nonnegative matrix factorization with the β-divergence." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.7 (2012): 1592-1605.

Kim, Jaegil, et al. "Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors." Nature genetics 48.6 (2016): 600.

Alexandrov, Ludmil, et al. "The repertoire of mutational signatures in human cancer." BioRxiv (2018): 322859.

`sig_tally` for getting the variation matrix,
`sig_extract` for extracting signatures using the **NMF** package, `sig_estimate` for
estimating the signature number for `sig_extract`.

Shixiang Wang

    library(sigminer)

    load(system.file("extdata", "toy_copynumber_tally_M.RData",
      package = "sigminer", mustWork = TRUE
    ))
    res <- sig_auto_extract(cn_tally_M$nmf_matrix, result_prefix = "Test_copynumber", nrun = 1)
    # By default, all run files are stored in tempdir()
    dir(tempdir(), pattern = "Test_copynumber")
    # \donttest{
    laml.maf <- system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
    laml <- read_maf(maf = laml.maf)
    mt_tally <- sig_tally(
      laml,
      ref_genome = "BSgenome.Hsapiens.UCSC.hg19",
      use_syn = TRUE
    )

    x <- sig_auto_extract(mt_tally$nmf_matrix,
      strategy = "ms", nrun = 3, ref_sigs = "legacy"
    )
    x
    # }