alexandrovlab / sigprofilersinglesample

SigProfilerSingleSample allows attributing a known set of mutational signatures to an individual sample. The tool identifies the activity of each signature in the sample and assigns the probability for each signature to cause a specific mutation type in the sample. The tool makes use of SigProfilerMatrixGenerator and SigProfilerPlotting.

Python 100.00%

sigprofilersinglesample's People

Contributors

marcos-diazg, mishugeb, reddyashrith, rptashkin

sigprofilersinglesample's Issues

Input differs from output of SigProfilerExtractor

Hi, I hope I'm not missing anything, but the paper that came out earlier this year mentioned two steps for extracting the COSMIC signatures: SigProfilerExtractor and SigProfilerSingleSample. None of the outputs of SigProfilerExtractor seem to match the possible input formats of SigProfilerSingleSample. I tried running SigProfilerSingleSample on the VCF files I have. The issues I'm having are:

  1. I am not getting any errors, but it seems like only a fraction of the samples are being extracted.
  2. I am not getting these files in my output folder, which I believe I should be getting:
    -exposure.txt
    -signature.txt
    -probabilities.txt
    -signature plot pdf
    -dendrogram plot
    -decomposition profile.csv
    I'm only getting decomposition_profile.csv and files that match the output of SigProfilerMatrixGeneratorFunc.

Thank you in advance!
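
A quick way to see which of the files listed in this post are actually present in an output folder is sketched below. The file names are copied from the poster's list and may not match what the tool really writes; `report_outputs` is a hypothetical helper and not part of sigproSS.

```python
from pathlib import Path

# File names copied from the list in the post above; they reflect the
# poster's expectation, not a confirmed specification of the tool's output.
EXPECTED = [
    "exposure.txt",
    "signature.txt",
    "probabilities.txt",
    "decomposition profile.csv",
]


def report_outputs(output_dir, expected=EXPECTED):
    """Return {file name: True/False} for files found anywhere under output_dir.

    Names are compared case-insensitively, and sub-folders are searched too.
    """
    present = {p.name.lower() for p in Path(output_dir).rglob("*") if p.is_file()}
    return {name: name.lower() in present for name in expected}
```

Calling `report_outputs("spss_output")` on a results folder returns a dictionary that makes it easy to report exactly which files are missing.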

Error when running SigProfilerSingleSample.

Hi. @marcos-diazg @mishugeb @itsvenu
I want to assign activities of known COSMIC signatures to each sample. I prepared the data (with MatrixGenerator) and the sig_database (from the COSMIC website) according to the example input file and imported them using the following code.

from sigproSS import spss
import pandas as pd
sig_db = pd.read_csv('COSMIC_v3.2_SBS_GRCh38.csv')
data=pd.read_csv("/home/lcj/lincj/CBCGA/SigProfiler220414/input/CBCGA.SBS96.all.csv")

Below are the first few rows of each data frame.

data
   Mutation type Trinucleotide  ACEJ  ACKR  ACSK  ACYZ  AEFC  AEUJ  AEXF  AGNS  AGRL  AGVN  AILX  AIWT  AJEH  AJYK  AKVJ  ...  ZRET  ZROW  ZSRB  ZSRD  ZTPX  ZUCD  ZUXJ  ZVCO  ZVDR  ZWCY  ZXHO  ZXLK  ZYFK  ZYIG  ZYQH  ZYTB  ZYWC
0            C>A           ACA     1     0     0     0     0     0     0     4     1     1     7     1     1     0     1  ...     0     1     2     3     0     0     1     2     0     0     1     0     0     0     0     0     0
1            C>A           ACC     2     0     0     1     0     3     1     2     1     0     9     6     0     0     0  ...     0     0     3     3     0     2     1     0     1     1     3     0     1     0     0     0     1
2            C>A           ACG     0     0     1     0     0     0     1     1     0     0     2     0     0     1     0  ...     0     0     0     4     0     0     0     0     0     0     0     1     0     0     0     0     0
3            C>A           ACT     0     0     0     0     0     3     0     1     0     0     6     5     0     0     0  ...     0     0     0     4     2     0     1     0     0     0     2     0     0     0     1     0     0
4            C>G           ACA     1     0     0     1     0     0     0     1     0     0     9     5     0     2     0  ...     0     0     0    11     0     2     0     0     0     0     0     0     0     1     1     0     0
..           ...           ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
91           T>C           TTT     1     0     0     0     1     2     1     1     1     0     5     7     1     1     1  ...     0     0     0     2     0     0     1     0     0     0     0     0     0     0     0     0     0
92           T>G           TTA     0     0     0     0     0     0     0     0     1     0     2     2     0     0     0  ...     1     0     0     0     0     0     1     0     0     0     0     0     1     0     0     0     0
93           T>G           TTC     0     0     0     0     0     0     0     0     0     1     1     1     0     0     0  ...     0     1     0     0     0     0     0     0     0     0     1     0     0     0     0     1     0
94           T>G           TTG     0     0     0     0     0     0     2     1     3     0     1     0     0     0     0  ...     0     0     0     2     0     0     0     0     0     0     0     1     0     0     0     0     0
95           T>G           TTT     0     0     1     0     0     1     0     0     0     0     5     1     0     3     0  ...     0     0     0     1     0     0     0     0     0     1     2     0     1     2     1     1     0

sig_db
   Type SubType          SBS1          SBS2      SBS3      SBS5      SBS6          SBS8      SBS9         SBS13        SBS17a    SBS17b     SBS18         SBS20     SBS26     SBS30     SBS37     SBS40     SBS41
0   C>A     ACA  8.760230e-04  5.790000e-07  0.020920  0.012052  0.000425  4.431064e-02  0.000561  1.816879e-03  2.072799e-03  0.000608  0.051688  6.242480e-04  0.000877  0.001811  0.003963  0.028323  0.002120
1   C>A     ACC  2.220120e-03  1.455050e-04  0.016343  0.009337  0.000516  4.729956e-02  0.004047  7.088420e-04  9.052930e-04  0.000127  0.015617  1.380514e-03  0.000522  0.000501  0.001433  0.013254  0.001207
2   C>A     ACG  1.797270e-04  5.360000e-05  0.001808  0.001908  0.000053  4.767276e-03  0.000440  2.706560e-04  4.890000e-05  0.000060  0.002505  2.260000e-05  0.000118  0.000094  0.001092  0.003012  0.000063
3   C>A     ACT  1.265053e-03  9.760000e-05  0.012265  0.006636  0.000180  4.720459e-02  0.003063  3.472570e-04  6.190000e-05  0.000456  0.021469  1.249985e-03  0.000621  0.000559  0.001855  0.014858  0.001336
4   C>G     ACA  1.839055e-03  2.230000e-16  0.019813  0.010144  0.000471  4.350682e-03  0.004863  3.863364e-03  1.011366e-03  0.000146  0.001736  8.844347e-03  0.000429  0.001076  0.034416  0.012253  0.005355
..  ...     ...           ...           ...       ...       ...       ...           ...       ...           ...           ...       ...       ...           ...       ...       ...       ...       ...       ...
91  T>C     TTT  4.274201e-03  3.570000e-05  0.013957  0.018550  0.001738  4.584279e-03  0.038518  5.292180e-04  2.099382e-02  0.000998  0.003377  1.943161e-02  0.057560  0.000447  0.061507  0.010228  0.046241
92  T>G     TTA  2.170000e-16  1.640000e-05  0.007161  0.005149  0.000103  2.190000e-16  0.064829  1.803960e-04  2.180000e-16  0.000012  0.000686  2.200000e-16  0.001411  0.000117  0.018033  0.008345  0.041343
93  T>G     TTC  5.520000e-05  7.120000e-05  0.006401  0.006677  0.000291  1.160874e-03  0.008777  2.250000e-16  1.177210e-04  0.008864  0.002136  2.270000e-16  0.001751  0.000098  0.019830  0.011604  0.015783
94  T>G     TTG  5.776140e-04  9.540000e-05  0.008113  0.006984  0.000325  3.111109e-03  0.010974  3.670000e-05  9.231280e-04  0.004788  0.001458  2.819407e-03  0.002858  0.000819  0.030364  0.008716  0.019531
95  T>G     TTT  2.200000e-16  2.220000e-16  0.010543  0.013536  0.001009  9.991120e-04  0.064097  1.880000e-05  4.578653e-03  0.121753  0.005170  1.520297e-03  0.009476  0.008927  0.029151  0.025068  0.088168

I ran spss on these two data frames, but it did not complete:

spss.single_sample(data, "spss_output", ref="GRCh38", exome=True, sig_database=sig_db)
##########################################################
Exacting Profile for Sample 1
>>>

I'm wondering whether the input files were not formatted correctly, but I prepared them according to the example files, including the column names.
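
Before suspecting the tool, it can help to confirm that both data frames describe the same mutation contexts in the same order. The sketch below is only a sanity check on the inputs, using the column names shown in this post; it does not explain why the run stops, and `compare_contexts` is a hypothetical helper, not part of sigproSS. The toy values are made up.

```python
import pandas as pd


def context_keys(df, type_col, subtype_col):
    """Build one 'C>A:ACA' style key per row from the two label columns."""
    return (df[type_col].astype(str) + ":" + df[subtype_col].astype(str)).tolist()


def compare_contexts(data, sig_db):
    """Report contexts the two frames do not share, and whether row order matches."""
    data_keys = context_keys(data, "Mutation type", "Trinucleotide")
    sig_keys = context_keys(sig_db, "Type", "SubType")
    return {
        "only_in_data": sorted(set(data_keys) - set(sig_keys)),
        "only_in_sig_db": sorted(set(sig_keys) - set(data_keys)),
        "same_order": data_keys == sig_keys,
    }


# Toy frames with the same column names as in the post (values are made up).
data = pd.DataFrame(
    {"Mutation type": ["C>A", "C>A"], "Trinucleotide": ["ACA", "ACC"], "S1": [1, 2]}
)
sig_db = pd.DataFrame(
    {"Type": ["C>A", "C>A"], "SubType": ["ACC", "ACA"], "SBS1": [0.6, 0.4]}
)
print(compare_contexts(data, sig_db))
```

For the toy frames both difference lists are empty but `same_order` is False, which is the kind of silent mismatch worth ruling out on the real 96-row inputs.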

New signatures on COSMIC website

Hi,

I noticed that new signatures have been defined and published on the COSMIC website. I want to compute scores for one of them (SBS88), but I have exome data only. A signature definition can be downloaded, but I believe it is the genome definition, and that exome signature definitions differ from genome ones (the exome and genome signature text files are different). How are the genome signature definitions normalized to obtain the exome signature definitions?

In other words, if I wanted to use SigProfilerSingleSample to find scores for SBS88 with my exome data, would I be able to use the definition that I can download from the COSMIC website or would it require some normalization?

Thanks in advance for your help!
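
One common way to adapt a genome-derived signature to exome data is to reweight each channel by the ratio of trinucleotide abundance in the exome to that in the genome, then renormalize so the channels sum to 1. The sketch below illustrates only that arithmetic. Whether COSMIC or SigProfilerSingleSample derives exome signatures this way is not stated in this thread, and the counts used are made-up placeholders, not real genome or exome frequencies.

```python
def genome_to_exome(signature, genome_counts, exome_counts):
    """Reweight a genome signature by trinucleotide opportunity and renormalize.

    signature: {(mutation_type, trinucleotide): probability}
    genome_counts / exome_counts: {trinucleotide: number of occurrences}
    """
    weighted = {
        key: prob * exome_counts[key[1]] / genome_counts[key[1]]
        for key, prob in signature.items()
    }
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("signature has no weight after reweighting")
    return {key: value / total for key, value in weighted.items()}


# Placeholder numbers for illustration only (two channels instead of 96).
sig = {("C>A", "ACA"): 0.5, ("C>A", "ACG"): 0.5}
genome = {"ACA": 100, "ACG": 100}
exome = {"ACA": 10, "ACG": 30}
print(genome_to_exome(sig, genome, exome))
```

In this toy case ACG is three times more frequent than ACA in the placeholder exome, so the reweighted signature puts roughly 0.75 on the ACG channel and 0.25 on ACA.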

KeyError: '7_pvalue'

Hi,

To obtain indel signatures (ID), I ran like this:
sigdb = pd.read_csv("indel_catalog_downloaded_from_cosmic.txt", sep="\t")
spss.single_sample(input_dir, out_dir, ref='GRCh37', exome=False, sig_database=sigdb)

Works fine for SBS with sig_database='default'.
But it fails for ID with the error in the title (raised from line 560 in spss.py).

Can you please help me with this?

SigProfilerSingleSample not used by CNV signature

Hi, I read the CNV signature paper (s41586-022-04738-6).
As I understand it, computing CNV signatures requires SigProfilerMatrixGenerator --> SigProfilerExtractor --> SigProfilerSingleSample. Did I get that right?
sigproSS was created on "Nov 7 13:33:06 2018", so it may not be able to run with CNV signatures.

minimum threshold for reported signatures?

Hi,

I've been comparing results from SigProfilerSingleSample with those of deconstructSigs to get a better grasp of what this tool does differently. I notice that the number of signatures reported by SigProfiler is much lower.

Is there currently a minimum proportion threshold in place for reporting whether a signature exists in SigProfiler?
If so, what is that threshold?

Is there a way to adjust SigProfiler to report every signature regardless of threshold?

Thanks again!
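
To make the question concrete, a "minimum proportion threshold" in a refitting tool is typically a post-processing step like the one sketched below: exposures under a cutoff are dropped and the rest are rescaled. This only illustrates the concept with made-up numbers. It is not taken from SigProfilerSingleSample, and the `cutoff` default is an arbitrary assumption.

```python
def apply_cutoff(exposures, cutoff=0.05):
    """Drop signatures whose share of mutations is below cutoff, then rescale.

    exposures: {signature name: number of attributed mutations}
    Returns proportions that sum to 1 over the surviving signatures.
    """
    total = sum(exposures.values())
    if total == 0:
        return {}
    kept = {name: n for name, n in exposures.items() if n / total >= cutoff}
    if not kept:
        return {}
    kept_total = sum(kept.values())
    return {name: n / kept_total for name, n in kept.items()}


# Made-up exposures: SBS18 contributes 2% and is removed at cutoff=0.05.
print(apply_cutoff({"SBS1": 60, "SBS5": 38, "SBS18": 2}))
```

Setting `cutoff=0` in this sketch keeps every signature with a non-negative exposure, which is the behaviour the last question asks about.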

DBS and ID mutational signatures?

Hi,
Thanks for your work on this program.
I've successfully run this for SBS mutational signatures on my dataset and was wondering:
How do I get results for DBS and ID mutational signatures?

Thanks,
Tim

ID decomposition support?

Hi Mishu,

As you might have noticed from the emails with Ludmil and you regarding ID6 and SBS3, I'd like to test some hypotheses on our data. Is ID decomposition supported in the SS module?

A somewhat related question: would you recommend running SigProfilerExtractor on, let's say, 20 samples?

thanks a lot for the help.

implementation for mutational catalogue as input

Hi,

Is there an implementation of SigProfilerSingleSample that takes the already generated mutational catalogues as input?
I only find the option of using the mutation data themselves, but maybe I just don't see it....

Thank you!

IndexError when run with GRCh38

Hi,

An error occurred when running with GRCh38, but the same run worked with GRCh37. Could you please help?

Thanks,
Qiang

>>> from sigproSS import spss
>>> spss.single_sample("results", "results", exome=True, ref="GRCh38")

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 1.61 seconds.
Starting matrix generation for INDELs...Completed! Elapsed time: 1.39 seconds.
Matrices generated for 1 samples with 0 errors. Total of 24 SNVs, 0 DINUCs, and 1 INDELs were successfully analyzed.
##########################################################
Exacting Profile for Sample 1
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/site-packages/sigproSS/signatures_optimizer.py", line 158, in remove_signatures
    sol = optimize.minimize(parameterized_objective2_custom, x0, args=(W1, genomes),  bounds=bnds, constraints =cons1, tol=1e-15)
  File "/usr/local/lib/python3.8/site-packages/scipy/optimize/_minimize.py", line 684, in minimize
    bounds = standardize_bounds(bounds, x0, meth)
  File "/usr/local/lib/python3.8/site-packages/scipy/optimize/_minimize.py", line 947, in standardize_bounds
    bounds = new_bounds_to_old(bounds.lb, bounds.ub, x0.shape[0])
IndexError: tuple index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "spss.py", line 13, in <module>
    spss.single_sample("results", "results", exome=True, ref="GRCh38")
  File "/usr/local/lib/python3.8/site-packages/sigproSS/spss.py", line 620, in single_sample
    results = analysis_individual_samples(samples,
  File "/usr/local/lib/python3.8/site-packages/sigproSS/spss.py", line 441, in analysis_individual_samples
    results = parallel_for_loop(iSample, inputSamples, allGenomeSignatures, allExomeSignatures, cancerType,
  File "/usr/local/lib/python3.8/site-packages/sigproSS/spss.py", line 237, in parallel_for_loop
    output = [p.get() for p in results]
  File "/usr/local/lib/python3.8/site-packages/sigproSS/spss.py", line 237, in <listcomp>
    output = [p.get() for p in results]
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
IndexError: tuple index out of range

Can this tool be used with a different genome?

Hello,

I am working on calling mutational signatures in dogs, and would like to use this tool. Is there a way to use SigProfilerSingleSample with the CanFam3.1 reference genome?

Thank you!

Best,
Kate

gzip VCF files

Hi,
I use a directory of *.vcf.gz files as input but get "File format not supported".
Would it be possible to support compressed VCF files as input?
Thank you
TG
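
As a workaround while compressed input is not accepted, the files can be decompressed into a separate folder first. A minimal sketch using only the Python standard library; `gunzip_vcfs` and the folder names are hypothetical, and the resulting folder would then be passed to the tool in place of the compressed one.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_vcfs(src_dir, dst_dir):
    """Write an uncompressed copy of every *.vcf.gz in src_dir into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(Path(src_dir).glob("*.vcf.gz")):
        out_path = dst / gz_path.name[: -len(".gz")]  # sample.vcf.gz -> sample.vcf
        # Stream the data so large VCFs are never loaded fully into memory.
        with gzip.open(gz_path, "rb") as fin, open(out_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(out_path)
    return written
```

The original compressed files are left untouched, so the uncompressed copies can be deleted after the analysis.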

Vcf input doesn't work with other signatures than default ones

Hi,

I'm currently using SigProfilerSingleSample to refit the mutational catalog of my samples against the 2013 signatures.
I'm now trying to do the same with a VCF instead of a mutational catalog. This works when I refit against the 2020 signatures, i.e. the ones the tool uses by default.

>>> spss.single_sample("/Users/romain/Doctorat/data/formatted/mutational_signature/sigprofiler-vcf_test", "/Users/romain/Doctorat/data/results/mutational_signature", ref="GRCH38", exome=False)
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 28.03 seconds.
Matrices generated for 1 samples with 0 errors. Total of 1369083 SNVs, 55110 DINUCs, and 0 INDELs were successfully analyzed.
##########################################################
Exacting Profile for Sample 1


CONGRATULATIONS! THE SIGPROFILER SINGLE SAMPLE ANALYSIS ENDED SUCCESSFULLY

My problem is, the tool stops after the creation of decomposition profile.csv if I try to refit the VCF against the signatures from 2013. The decomposition profile.csv is empty as well as the .err file.

>>> import pandas as pd
>>> cols = list(pd.read_csv("/Users/romain/Doctorat/data/brut/mutational_signature/signatures.txt", delimiter='\t', nrows=1))
>>> signatures = pd.read_csv("/Users/romain/Doctorat/data/brut/mutational_signature/signatures.txt", delimiter='\t', usecols=[i for i in cols if i not in ['Substitution Type', 'Trinucleotide']], index_col=0)
>>> spss.single_sample("/Users/romain/Doctorat/data/formatted/mutational_signature/sigprofiler-vcf_test/2013_sig", "/Users/romain/Doctorat/data/results/mutational_signature/sigprofiler-vcf_test/2013_sig", sig_database=signatures, ref="GRCH38", exome=False)
Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 27.19 seconds.
Matrices generated for 1 samples with 0 errors. Total of 1369083 SNVs, 55110 DINUCs, and 0 INDELs were successfully analyzed.
##########################################################
Exacting Profile for Sample 1
>>> # The prompt appears almost immediately after previous line

Is it a bug or is it not possible to use SigProfilerSingleSample with a VCF to refit against other signatures than the ones used by default?

copy number signature

Hi ,
I am trying to get CN signatures from a CNV48 copy number matrix, but I am not sure whether I missed something:

  • I generated the CN matrix with CNVMatrixGenerator.py:
MutationType | TCGA-OR-A5J4-01 | TCGA-OR-A5J5-01 | TCGA-OR-A5J6-01 | TCGA-OR-A5J7-01 | TCGA-OR-A5J9-01 | TCGA-OR-A5JC-01
-- | -- | -- | -- | -- | -- | --
0:homdel:0-100kb | 1 | 1 | 0 | 1 | 0 | 0
0:homdel:100kb-1Mb | 1 | 1 | 0 | 0 | 1 | 1
0:homdel:>1Mb | 0 | 0 | 0 | 0 | 0 | 0
1:LOH:0-100kb | 12 | 1 | 4 | 6 | 2 | 6
1:LOH:100kb-1Mb | 35 | 0 | 0 | 9 | 9 | 5
1:LOH:1Mb-10Mb | 14 | 0 | 0 | 11 | 14 | 2
1:LOH:10Mb-40Mb | 4 | 0 | 0 | 5 | 2 | 3
1:LOH:>40Mb | 0 | 0 | 0 | 0 | 1 | 1

##########################################################
Exacting Profile for Sample 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 5
      2 input_f = "/home/UTHSCSA/hef/2.Project/6.WGD/2.analysis/8.signature/CN_sig/CN_WGD.CNV48.matrix.tsv"
      3 data = pd.read_table(input_f)
----> 5 spss.single_sample(data, wdir +  "/results", ref="GRCh38.p12", exome=False)


File ~/Tools/miniconda3/lib/python3.8/site-packages/sigproSS/signatures_optimizer.py:239, in add_signatures(W, genome, cutoff, presentSignatures, toBeAdded)
    237     #print(len(notToBeAdded))
    238 originalSimilarity = -1 # it can be also written as oldsimilarity
--> 239 maxmutation = round(np.sum(genome))
    240 init_listed_idx = presentSignatures 
    241 init_nonlisted_idx = list(set(list(range(W.shape[1])))-set(init_listed_idx)-set(notToBeAdded))

TypeError: type str doesn't define __round__ method

I changed round() to np.round() in "signatures_optimizer.py", but then got another error:

File ~/Tools/miniconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py:43, in _wrapit(obj, method, *args, **kwds)
     41 except AttributeError:
     42     wrap = None
---> 43 result = getattr(asarray(obj), method)(*args, **kwds)
     44 if wrap:
     45     if not isinstance(result, mu.ndarray):

TypeError: ufunc 'rint' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Thanks in advance
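
The first traceback says `round()` received a `str`, which is consistent with text values being included in the sum. One plausible cause, not confirmed in this thread, is that the `MutationType` label column is still part of the data being summed. The sketch below only shows how to detect non-numeric columns in a matrix loaded with pandas and separate them from the counts; the toy rows are copied from the table above, and `split_labels` is a hypothetical helper, not part of sigproSS.

```python
import io

import pandas as pd

# Two rows and two samples copied from the matrix shown in the post.
tsv = (
    "MutationType\tTCGA-OR-A5J4-01\tTCGA-OR-A5J5-01\n"
    "0:homdel:0-100kb\t1\t1\n"
    "1:LOH:0-100kb\t12\t1\n"
)
data = pd.read_csv(io.StringIO(tsv), sep="\t")


def split_labels(df):
    """Return (label columns, numeric count columns) of a loaded matrix."""
    numeric = df.select_dtypes(include="number")
    labels = df.drop(columns=numeric.columns)
    return labels, numeric


labels, counts = split_labels(data)
print(list(labels.columns))     # columns holding text rather than counts
print(counts.sum().to_dict())   # per-sample totals computed on numbers only
```

If any column other than the label column shows up as text, the input file probably contains stray non-numeric cells that need cleaning before any refitting.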

Missing Subfolders

Hello,

I ran spss.single_sample() on a batch of samples. The resulting /VCF/output/ folder is supposed to have 4 sub-folders: TSB, SBS, DBS and ID. But for some samples, some of the sub-folders are missing.

For example, sample1 has all 4 output sub-folders, sample2 is missing the DBS sub-folder, and sample3 is missing the TSB, SBS and DBS sub-folders. It seems quite random: there is no pattern and no error. The number of mutations in the input VCF file varies for each sample. Could that be one of the reasons?

Could you please tell me what might be causing this issue and how I can fix it? I would appreciate your help.

Thanks
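
To test the hypothesis that a missing sub-folder corresponds to a sample with no mutations of that class, one can tally variant classes per input file. The sketch below assumes a standard tab-separated VCF with REF and ALT in columns 4 and 5. The classification rule is a simplification and may not match how SigProfilerMatrixGenerator classifies variants; `tally_variants` is a hypothetical helper.

```python
from collections import Counter


def tally_variants(vcf_lines):
    """Count variant classes by REF/ALT length in the lines of one VCF file.

    Doublets are counted only when written as a single 2-base record;
    doublets given as two adjacent SNV records are counted as SNVs here.
    """
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the header line
        fields = line.rstrip("\n").split("\t")
        ref = fields[3]
        for alt in fields[4].split(","):
            if alt in {".", "*"} or alt.startswith("<"):
                counts["OTHER"] += 1  # missing, spanning or symbolic allele
            elif len(ref) == 1 and len(alt) == 1:
                counts["SNV"] += 1
            elif len(ref) == 2 and len(alt) == 2:
                counts["DBS"] += 1
            elif len(ref) != len(alt):
                counts["INDEL"] += 1
            else:
                counts["OTHER"] += 1
    return counts
```

Running it with `tally_variants(open(path))` on each sample and comparing the counts against the sub-folders that were produced would show whether zero-count classes line up with the missing folders.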

S should be uppercase for filename "signatures.txt" in spss.py script

Hello Mishu,

In the script spss.py, for function single_sample(), in the plotting command, I think the name of the input file should be "Signatures.txt" and not "signatures.txt".

This is because the file that is written to the results folder is "Signatures.txt". The script looks for "signatures.txt" and, since it cannot find it, throws the error "SORRY! THE MUTATION CONTEXT YOU PROVIDED COULD NOT BE PLOTTED"

Line 688 plot.plotSBS(output+"/signatures.txt", output+"/Signature_plot", "", "96", True, custom_text_upper= " ")

I was able to get the signature plot when I ran the same sigplt.plotSBS command using "Signatures.txt" as the input file.

Please let me know if it sounds right. Thank you for your help.

Regards,
Pankhuri
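
If the mismatch is indeed only in capitalization, a caller can resolve the real file name on disk instead of hard-coding one spelling. A minimal sketch, assuming the file sits directly in the results folder; `find_case_insensitive` is a hypothetical helper and not part of sigproSS.

```python
from pathlib import Path


def find_case_insensitive(folder, filename):
    """Return the file in folder whose name equals filename ignoring case, or None."""
    wanted = filename.lower()
    for path in Path(folder).iterdir():
        if path.is_file() and path.name.lower() == wanted:
            return path
    return None
```

The returned path could then be handed to the plotting call in place of the hard-coded `output+"/signatures.txt"` string, which matters on case-sensitive file systems such as those typical on Linux.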
