alexandrovlab / sigprofilersimulator

SigProfilerSimulator allows realistic simulations of mutational patterns and mutational signatures in cancer genomes. The tool can be used to simulate signatures of single point mutations, double point mutations, and insertion/deletions. Further, the tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

License: BSD 2-Clause "Simplified" License

Python 100.00%
bioinformatics mutational-signatures somatic-variants cancer-genomics

sigprofilersimulator's Introduction

SigProfilerSimulator

SigProfilerSimulator allows realistic simulations of mutational signatures in cancer genomes. The tool can be used to simulate signatures of single point mutations, double point mutations, and insertion/deletions. Further, the tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

INTRODUCTION

The purpose of this document is to provide a guide for using SigProfilerSimulator to simulate mutational signatures in cancer. This tool allows for realistic simulations of single point mutations, double point mutations, and insertions/deletions with the goal of providing a background model for statistical analysis. The simulations are performed in an unbiased fashion, with random chance as the main distribution, and can be performed across the entire genome or limited to user-provided ranges. This tool currently supports the GRCh37, GRCh38, mm9, and mm10 assemblies; additional genomes may be installed. In addition, this tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting. An extensive Wiki page detailing the usage of this tool can be found at https://osf.io/usxjz/wiki/home/.

For users that prefer working in an R environment, a wrapper package is provided and can be found and installed from: https://github.com/AlexandrovLab/SigProfilerSimulatorR

PREREQUISITES

The framework is written in Python; however, it also requires the following additional software with the given versions (or newer) and access to BASH:

-PYTHON (version 3.4 or newer)

-FASTRAND (Python module: https://github.com/lemire/fastrand/blob/master/README.md )

-SigProfilerMatrixGenerator (Current version: https://github.com/AlexandrovLab/SigProfilerMatrixGenerator )

-Desired reference genome (Follow the installation process on the SigProfilerMatrixGenerator README )
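
The prerequisites listed above can be checked before starting a simulation. The following is a minimal sketch and is not part of SigProfilerSimulator: the helper name check_prerequisites is hypothetical, and it assumes the two Python modules are importable under the names fastrand and SigProfilerMatrixGenerator. It does not check whether a reference genome has been installed.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 4),
                        modules=("fastrand", "SigProfilerMatrixGenerator")):
    """Return a dict mapping each prerequisite to True (found) or False."""
    # Compare only (major, minor) against the minimum version from the README
    status = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    # The README also asks for access to BASH; look for it on the PATH
    status["bash"] = shutil.which("bash") is not None
    return status


if __name__ == "__main__":
    for item, ok in check_prerequisites().items():
        print(item, "OK" if ok else "MISSING")
```

Any entry reported as MISSING should be installed before following the quick start guide.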

While the code was developed for application on a local computer, access to a cluster with greater computational power may be required for simulating a large number of mutations/samples.

QUICK START GUIDE

This section will guide you through the minimum steps required to begin simulating mutations:

  1. First, install the Python package using pip. The R wrapper still requires the Python package:
                          pip install SigProfilerSimulator
  2. Place your vcf files in your desired output folder. It is recommended that you name this folder based on your project's name.

  3. From within a Python3 session, you can now simulate mutational patterns/signatures as follows:

$ python3
>>> from SigProfilerSimulator import SigProfilerSimulator as sigSim
>>> sigSim.SigProfilerSimulator("BRCA", "/Users/ebergstr/Desktop/BRCA/", "GRCh37", contexts=["96"], exome=None, simulations=100, updating=False, bed_file=None, overlap=False, gender='female', chrom_based=False, seed_file=None, noisePoisson=False, noiseAWGN=0, cushion=100, region=None, vcf=False)

The layout of the required parameters is as follows:

  SigProfilerSimulator ( project, project_path, genome, contexts)

where project, project_path, and genome must be strings (surrounded by quotation marks, ex: "test"), and contexts is a list of the desired contexts to simulate (ex: contexts=["96", "ID"]). Optional parameters include:

  exome=None:       [boolean] Simulates on the exome of the reference genome.
  simulations=1:       [integer] Number of desired iterations per sample. Default is 1 iteration.
  updating=False:       [boolean] Updates the chromosome with each mutation. Default is FALSE.
  bed_file=None:       [string] Path to a BED file. Simulates on custom regions of the genome. Requires the full path to the BED file.
  overlap=False:       [boolean] Allows overlapping of mutations along the chromosome. Default is FALSE.
  gender='female':       [string] Simulates male or female genomes. Default is 'female'.
  chrom_based=False:       [boolean] Maintains the same catalogs of mutations on a per-chromosome basis.
  seed_file=None:       [string] Path to user-defined seeds. One seed is required per processor. Uses a built-in file by default.
  noisePoisson=False:       [boolean] Adds Poisson noise to the simulations. Default is FALSE.
  noiseAWGN=0:       [float] Adds noise within a +/- allowance (ex: noiseAWGN=5 allows +/-2.5% of mutations for each mutation type). Default is 0 (no noise).
  cushion=100:       [integer] Allowable cushion when simulating on the exome or a targeted panel. Default is 100 base pairs.
  region=None:       [string] Path to a targeted region panel for simulating on a user-defined region. Default is whole-genome simulations.
  vcf=False:       [boolean] Outputs simulated samples as vcf files with one file per iteration per sample. By default, the tool outputs all samples from an iteration into a single maf file.
  mask=None:       [string] Path to a probability mask file. The mask file is tab-separated and requires the following header fields: Chromosome, Start, End, Probability.
                                          Note: The mask parameter does not support exome data where the bed_file parameter is set.
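
The mask file described above can be checked before it is passed to the tool. The following is a minimal sketch and is not part of SigProfilerSimulator: the helper name validate_mask is hypothetical, and the checks that End is not smaller than Start and that Probability lies between 0 and 1 are assumptions, since only the four required header fields are stated above.

```python
import csv
import io

REQUIRED_COLUMNS = ["Chromosome", "Start", "End", "Probability"]


def validate_mask(handle):
    """Parse a tab-separated mask file and return its rows as tuples.

    Raises ValueError if a required column is missing or a row is malformed.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    rows = []
    # Data rows start on line 2 because line 1 holds the header
    for line_no, row in enumerate(reader, start=2):
        start, end = int(row["Start"]), int(row["End"])
        prob = float(row["Probability"])
        if end < start:  # assumption: intervals run from Start to End
            raise ValueError("line {}: End is smaller than Start".format(line_no))
        if not 0.0 <= prob <= 1.0:  # assumption: probabilities lie in [0, 1]
            raise ValueError("line {}: Probability outside [0, 1]".format(line_no))
        rows.append((row["Chromosome"], start, end, prob))
    return rows


if __name__ == "__main__":
    example = "Chromosome\tStart\tEnd\tProbability\n1\t100\t200\t0.5\n"
    print(validate_mask(io.StringIO(example)))
```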

INPUT FILE FORMAT

This tool currently supports maf, vcf, simple text file, and ICGC formats. The user must provide variant data adhering to one of these four formats. If the user's files are in vcf format, each sample must be saved as a separate file. For an example input file, please download the simple text file "example.txt" from the following link: example.txt
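
Because vcf input requires one file per sample, variants that are held in a single table need to be split first. The following is a minimal sketch and is not part of SigProfilerSimulator: the helper name write_per_sample_vcfs and the record layout (sample, chrom, pos, ref, alt) are hypothetical, and it writes only the eight fixed VCF columns with placeholder values. Whether the tool needs additional header lines or fields is not stated here, so treat the output as a starting point.

```python
import os
import tempfile
from collections import defaultdict

VCF_COLUMNS = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"


def write_per_sample_vcfs(records, out_dir):
    """Write one minimal VCF per sample and return the paths written.

    records: iterable of (sample, chrom, pos, ref, alt) tuples.
    """
    by_sample = defaultdict(list)
    for sample, chrom, pos, ref, alt in records:
        by_sample[sample].append((chrom, pos, ref, alt))
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for sample, variants in sorted(by_sample.items()):
        path = os.path.join(out_dir, sample + ".vcf")
        with open(path, "w") as fh:
            fh.write("##fileformat=VCFv4.2\n")
            fh.write(VCF_COLUMNS)
            for chrom, pos, ref, alt in variants:
                # ID, QUAL, FILTER and INFO are left as "." placeholders
                fields = [str(chrom), str(pos), ".", ref, alt, ".", ".", "."]
                fh.write("\t".join(fields) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo = [("S1", "1", 12345, "C", "T"), ("S2", "2", 999, "G", "A")]
    print(write_per_sample_vcfs(demo, tempfile.mkdtemp()))
```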

OUTPUT FILE STRUCTURE

The output structure is divided into three folders: input, output, and logs. The input folder contains copies of the user-provided input files. The output folder contains a DBS, SBS, ID, and simulations folder. The matrices are saved into the appropriate folders, and the simulations are found within a project specific folder under simulations. The logs folder contains the error and log files for the submitted job.
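
The folder layout described above can be checked after a run. The following is a minimal sketch and is not part of SigProfilerSimulator: the helper names are hypothetical, and the exact on-disk folder names (including capitalization) are assumed to match the names used in the paragraph above. The project-specific folder under simulations is not checked because its name is not given here.

```python
from pathlib import Path


def expected_folders(project_path):
    """Return the folders the README describes, keyed by a short label."""
    root = Path(project_path)
    output = root / "output"
    return {
        "input": root / "input",
        "logs": root / "logs",
        "DBS": output / "DBS",
        "SBS": output / "SBS",
        "ID": output / "ID",
        "simulations": output / "simulations",
    }


def missing_folders(project_path):
    """List the labels of expected folders that do not exist on disk."""
    return [label for label, path in expected_folders(project_path).items()
            if not path.is_dir()]


if __name__ == "__main__":
    print(missing_folders("."))
```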

SUPPORTED GENOMES

This tool currently supports the following genomes:

GRCh38.p12 [GRCh38] (Genome Reference Consortium Human Reference 38), INSDC Assembly GCA_000001405.27, Dec 2013. Released July 2014. Last updated January 2018. This genome was downloaded from ENSEMBL database version 93.38.

GRCh37.p13 [GRCh37] (Genome Reference Consortium Human Reference 37), INSDC Assembly GCA_000001405.14, Feb 2009. Released April 2011. Last updated September 2013. This genome was downloaded from ENSEMBL database version 93.37.

GRCm38.p6 [mm10] (Genome Reference Consortium Mouse Reference 38), INSDC Assembly GCA_000001635.8, Jan 2012. Released July 2012. Last updated March 2018. This genome was downloaded from ENSEMBL database version 93.38.

GRCm37 [mm9] (Release 67, NCBIM37), INSDC Assembly GCA_000001635.18. Released Jan 2011. Last updated March 2012. This genome was downloaded from ENSEMBL database version release 67.

rn6 (Rnor_6.0) INSDC Assembly GCA_000001895.4, Jul 2014. Released Jun 2015. Last updated Jan 2017. This genome was downloaded from ENSEMBL database version 96.6.

yeast (Saccharomyces cerevisiae S288C; assembly R64-2-1). Released Nov 2014.

LOG FILES

All errors and progress checkpoints are saved into sigProfilerSimulator_[project]_[genome].err and sigProfilerSimulator_[project]_[genome].out, respectively. For all errors, please email the error and progress log files to the primary contact under CONTACT INFORMATION.
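
The log file names follow directly from the project and genome names. The following is a minimal sketch of how to locate both files before emailing them; the helper name log_paths and the logs_dir argument are hypothetical and not part of the tool.

```python
from pathlib import Path


def log_paths(logs_dir, project, genome):
    """Return (error_log, progress_log) paths using the naming pattern above."""
    stem = "sigProfilerSimulator_{}_{}".format(project, genome)
    logs = Path(logs_dir)
    return logs / (stem + ".err"), logs / (stem + ".out")


if __name__ == "__main__":
    err, out = log_paths("logs", "BRCA", "GRCh37")
    print(err)
    print(out)
```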

CITATION

Erik N. Bergstrom, Mark Barnes, Iñigo Martincorena, Ludmil B. Alexandrov bioRxiv 2020.02.13.948422; doi: https://doi.org/10.1101/2020.02.13.948422 https://www.biorxiv.org/content/10.1101/2020.02.13.948422v1

COPYRIGHT

Copyright (c) 2020, Erik Bergstrom [Alexandrov Lab] All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

CONTACT INFORMATION

Please address any queries or bug reports to Erik Bergstrom at [email protected]

sigprofilersimulator's People

Contributors

mdbarnesucsd

sigprofilersimulator's Issues

Lower mutation number in simulation than observed

Hello,

Thanks for a great tool! I have been using it to randomize the position of mutations in samples while controlling for mutational signatures. However, I have noticed that I am not getting the same amount of mutations in the simulated files as in the input, which I would expect.

I am running the tool like this, including some specific regions to simulate in (based on coverage):
#!/usr/bin/env python

import sys
from SigProfilerSimulator import SigProfilerSimulator as sigSim

name = sys.argv[1]
vcf_dir = sys.argv[2]
bed = sys.argv[3].strip()

sigSim.SigProfilerSimulator(name,
    vcf_dir,
    "GRCh37",
    contexts=["96", "ID"],
    exome=None,
    simulations=10,
    updating=False,
    bed_file=bed,
    overlap=False,
    gender='female',
    chrom_based=False,
    seed_file=None,
    noisePoisson=False,
    cushion=100,
    region=None,
    vcf=True)

Here is the output of the log file:
======================================
SigProfilerSimulator
Checking for all reference files and relevant matrices...
Creating a chromosome proportion file for the given BED file ranges...Completed!
SigProfilerSimulator_cell_output/C1_05_24/output/ID/C1_05_24.ID83.region does not exist. Creating the matrix file now.
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 116.12 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 106.96 seconds.
Matrices generated for 1 samples with 0 errors. Total of 597 SNVs, 1 DINUCs, and 15 INDELs were successfully analyzed.
The context distribution file does not exist. This file needs to be created before simulating. This may take several hours...
Chromosome X done
Chromosome 1 done
Chromosome 2 done
Chromosome 3 done
Chromosome 4 done
Chromosome 5 done
Chromosome 6 done
Chromosome 7 done
Chromosome 8 done
Chromosome 9 done
Chromosome 10 done
Chromosome 11 done
Chromosome 12 done
Chromosome 13 done
Chromosome 14 done
Chromosome 15 done
Chromosome 16 done
Chromosome 17 done
Chromosome 18 done
Chromosome 19 done
Chromosome 20 done
Chromosome 21 done
The context distribution file has been created!
The context distribution file does not exist. This file needs to be created before simulating. This may take several hours...
Chromosome X done
Chromosome 1 done
Chromosome 2 done
Chromosome 3 done
Chromosome 4 done
Chromosome 5 done
Chromosome 6 done
Chromosome 7 done
Chromosome 8 done
Chromosome 9 done
Chromosome 10 done
Chromosome 11 done
Chromosome 12 done
Chromosome 13 done
Chromosome 14 done
Chromosome 15 done
Chromosome 16 done
Chromosome 17 done
Chromosome 18 done
Chromosome 19 done
Chromosome 20 done
Chromosome 21 done
The context distribution file has been created!

Files successfully read and mutations collected. Mutation assignment starting now.
Mutations have been distributed. Starting simulation now...
Chromosome 21 done
Chromosome 22 done
Chromosome 19 done
Chromosome 20 done
Chromosome 18 done
Chromosome 13 done
Chromosome 17 done
Chromosome 15 done
Chromosome 16 done
Chromosome 14 done
Chromosome 9 done
Chromosome 11 done
Chromosome 10 done
Chromosome 12 done
Chromosome X done
Chromosome 8 done
Chromosome 7 done
Chromosome 6 done
Chromosome 4 done
Chromosome 5 done
Chromosome 3 done
Chromosome 1 done
Chromosome 2 done
Simulation completed
Job took 2381.3919444084167 seconds

In my input file I have 613 mutations, but in the simulated output I only get 313 mutations. This issue is not specific to this file. Any help with this? I have tried to circumvent this by creating many more simulations than I need and sampling them to get the matching number of mutations as in my real sample.

Thanks,
Ronnie
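
The workaround described in this issue (pooling records from extra simulations and sampling them down to the input count) could be sketched as follows. This is only an illustration, not a fix from the tool's authors: the helper names are hypothetical, and it assumes the simulated records have already been read into a list. It matches the total count only and does not guarantee that per-context counts match the input exactly.

```python
import random


def count_vcf_records(lines):
    """Count variant records, skipping blank lines and '#' header lines."""
    return sum(1 for line in lines
               if line.strip() and not line.startswith("#"))


def downsample(pooled_records, target, seed=0):
    """Randomly pick `target` records, without replacement, from the pool."""
    pool = list(pooled_records)
    if target > len(pool):
        raise ValueError("pool holds {} records, need {}".format(len(pool), target))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(pool, target)


if __name__ == "__main__":
    vcf_text = ["##fileformat=VCFv4.2", "#CHROM\tPOS", "1\t100", "2\t200"]
    n_input = count_vcf_records(vcf_text)
    print(n_input, downsample(["a", "b", "c", "d"], n_input))
```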

"bed_file" and "region" parameters

Dear author.

Thanks for making this wonderful tool.
I have questions regarding this tool.

I would like to simulate mutations within genes.
In order to do this, I inputted gene regions in "bed_file".
(sigSim.SigProfilerSimulator("SIMUL", "./up_gene", "GRCh38", contexts=["96"], simulations=120, bed_file="./updown_bed_canonincal_hg38"))
However, this tool seemed to simulate a subset of mutations with this parameter.
(e.g., the number of total mutations within genes is around 10,000, but, each maf file contains about 700 mutations.)

Next, I inputted "BED_GRCh38_proportions", which seemed to be generated when using the "bed_file" parameter, in the region parameter. (sigSim.SigProfilerSimulator("SIMUL", "./up_gene", "GRCh38", contexts=["96"], simulations=120, region="BED_GRCh38_proportions"))

However, this approach caused the following error.
Could you tell me the best way to simulate mutations within genes?

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/site-packages/SigProfilerSimulator/mutational_simulator.py", line 2349, in simulator
mutNuc = ''.join([tsb_ref[base][1] for base in sequence[random_number - mut_start:random_number + mut_start+1]])
File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/site-packages/SigProfilerSimulator/mutational_simulator.py", line 2349, in
mutNuc = ''.join([tsb_ref[base][1] for base in sequence[random_number - mut_start:random_number + mut_start+1]])
KeyError: 51
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/site-packages/SigProfilerSimulator/SigProfilerSimulator.py", line 479, in SigProfilerSimulator
r.get()
File "/lustre/scratch117/casm/team78/hj6/anaconda3/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
KeyError: 51

Reference genome installation and wget

Hi @ebergstr,

Thanks for the great tool! A couple of suggestions from me:

  1. The argument c("96") of contexts variable in the example sigSim.SigProfilerSimulator command in Readme throws an error in my environment: NameError: name 'c' is not defined
    contexts=["96"] works though, so I suggest changing example command to that.

  2. If the reference genome is not downloaded, nothing seems to happen when Chromosome proportion file does not exist. Creating now... message appears. Manual download as per SigProfilerMatrixGenerator Readme works, so perhaps it's worth mentioning it in SigProfilerSimulator readme, too.

  3. When installing the reference genome, The ensembl ftp site is not currently responding. error appears even when WGET is not installed. Although this tool is mentioned in Readme as a prerequisite, perhaps it's worth adding an exception here to check if WGET is available.

Best,
Sergey
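
The check suggested in point 3 could look like the following. This is a minimal sketch of the idea and not code from SigProfilerSimulator or SigProfilerMatrixGenerator; the helper name require_tool is hypothetical.

```python
import shutil


def require_tool(name):
    """Return the full path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            "'{}' was not found on PATH; install it before downloading "
            "a reference genome.".format(name))
    return path


if __name__ == "__main__":
    try:
        print(require_tool("wget"))
    except RuntimeError as exc:
        print(exc)
```

Running this check before the download step would let a missing wget be reported as such, rather than as an unresponsive FTP site.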

Hard-coded chromosomes make it impossible to run code with "chr" prefix for chromosome IDs or "MT" chromosomes

https://github.com/AlexandrovLab/SigProfilerSimulator/blob/ced39a81e38098fc9c42b90255e5f70144ac52aa/SigProfilerSimulator/SigProfilerSimulator.py#L162C1-L163C67

Because the chromosomes are hardcoded without the chr prefix, it is impossible to run SigProfilerSimulator on non-Ensembl GRCh38 genomes that have the "chr" prefix before the chromosome number. Additionally, there is no MT chromosome listed so the code also breaks on vcf files containing MT chromosomes.
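
One possible workaround on the user side is to normalize chromosome names before running the tool. The following is a minimal sketch, not the tool's behavior and not a patch from its authors: the helper names are hypothetical, and the SUPPORTED set below is an assumed human chromosome list for illustration, not a copy of the hard-coded list in the linked source.

```python
# Assumed set of chromosome IDs without a "chr" prefix (illustration only)
SUPPORTED = {str(i) for i in range(1, 23)} | {"X", "Y"}


def normalize_chrom(name):
    """Strip a leading 'chr' and map mitochondrial names to 'MT'."""
    chrom = name.strip()
    if chrom[:3].lower() == "chr":
        chrom = chrom[3:]
    upper = chrom.upper()
    if upper in ("M", "MT"):
        return "MT"
    if upper in ("X", "Y"):
        return upper
    return chrom


def split_by_support(chrom_names, supported=SUPPORTED):
    """Normalize names and separate those in `supported` from the rest."""
    kept, skipped = [], []
    for name in chrom_names:
        chrom = normalize_chrom(name)
        (kept if chrom in supported else skipped).append(chrom)
    return kept, skipped


if __name__ == "__main__":
    print(split_by_support(["chr1", "chrM", "chrX", "7"]))
```

Records on chromosomes that end up in the skipped list (for example MT) would need to be filtered out of the input until the tool handles them.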

Update SigProfilerSimulator for Changes in SigProfilerMatrixGenerator Imports

Description:
Recent changes in SigProfilerMatrixGenerator v1.3.18 have implications for SigProfilerSimulator:

  • File SigProfilerMatrixGenerator.py has been renamed to MutationMatrixGenerator.py.
  • The import statement has transitioned:
    • From:
      from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGenerator
    • To:
      from SigProfilerMatrixGenerator.scripts import MutationMatrixGenerator

These modifications introduce import errors in SigProfilerSimulator (line 20, line 24).
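
One way to stay compatible with both module names is a guarded import. The following is a minimal sketch using only the two import paths quoted above; the alias matGen is an arbitrary choice, and the final fallback to None is an assumption added so the snippet also runs where SigProfilerMatrixGenerator is not installed.

```python
try:
    # Module name after the rename in SigProfilerMatrixGenerator v1.3.18
    from SigProfilerMatrixGenerator.scripts import MutationMatrixGenerator as matGen
except ImportError:
    try:
        # Module name used by earlier versions
        from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGenerator as matGen
    except ImportError:
        # Neither name could be imported (for example, package not installed)
        matGen = None

if matGen is None:
    print("SigProfilerMatrixGenerator could not be imported")
else:
    print("Using", matGen.__name__)
```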

Chromosome based error

Hello,

I am running your tool using:

sigSim.SigProfilerSimulator(name, \
  vcf_dir, \
  "GRCh37", \
  contexts=["96", "ID"], \
  exome=None, \
  simulations=1000, \
  updating=False, \
  bed_file=bed, \
  overlap=False, \
  gender='female', \
  chrom_based=True, \
  seed_file=None, \
  noisePoisson=False, \
  cushion=100, \
  region=None, \
  vcf=True)

But I am getting the following error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/site-packages/SigProfilerSimulator/mutational_simulator.py", line 973, in simulator
    random_sample = random.sample(list(mutation_tracker[context]),1)[0]
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/random.py", line 430, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gs/gsfs0/shared-lab/vijg-lab/2023-Ronnie/231009_multiple_ENU_analysis/SigProfilerSimulator/merged/runSigProfilerSimulator.py", line 10, in <module>
    sigSim.SigProfilerSimulator(name, \
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/site-packages/SigProfilerSimulator/SigProfilerSimulator.py", line 479, in SigProfilerSimulator
    r.get()
  File "/gs/gsfs0/users/rcutler/.conda/envs/SigProfilerSimulator/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
ValueError: Sample larger than population or is negative

When I run your tool with chrom_based=False I am able to get results. So this makes me think it is an error when wanting to have mutations simulated by chromosome. Since some of my samples don't have many mutations, I think this may be due to some chromosomes having 0 mutations. Any help with this?

Thanks,
Ronnie
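
The ValueError in the traceback is what random.sample raises when asked for one item from an empty population. The following is a minimal sketch of a guard around that call; it is not a patch from the tool's authors. The helper name pick_context_entry is hypothetical, and it assumes mutation_tracker maps each context to a collection of candidates. Whether an empty collection is the actual cause here (for example, a chromosome with 0 mutations) is the reporter's hypothesis and has not been confirmed.

```python
import random


def pick_context_entry(mutation_tracker, context, rng=random):
    """Return one random entry for `context`, or None if there are none."""
    candidates = list(mutation_tracker.get(context, ()))
    if not candidates:
        # random.sample(candidates, 1) would raise ValueError on an empty list
        return None
    return rng.sample(candidates, 1)[0]


if __name__ == "__main__":
    tracker = {"A[C>A]A": [], "A[C>T]G": ["sample_1"]}
    print(pick_context_entry(tracker, "A[C>A]A"))
    print(pick_context_entry(tracker, "A[C>T]G"))
```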
