genometric / mspc

Using combined evidence from replicates to evaluate ChIP-seq peaks

Home Page: https://genometric.github.io/MSPC/

License: GNU General Public License v3.0

C# 83.65% JavaScript 2.92% CSS 1.02% Python 3.95% Shell 0.42% R 2.25% PowerShell 0.22% Jupyter Notebook 5.57%
next-generation-sequencing chip-seq ngs-analysis genome-analysis peak enriched-regions overlapping-peaks analysis mspc peaks

mspc's Introduction

MSPC

Quick Start | Documentation | Download | Publication

About

The analysis of ChIP-seq samples outputs a number of enriched regions, each indicating a protein-DNA interaction or a specific chromatin modification. Enriched regions (commonly known as "peaks") are called when the read distribution is significantly different from the background and its corresponding significance measure (p-value) is below a user-defined threshold.

When replicate samples are analysed, overlapping enriched regions are expected. This repeated evidence can therefore be used to locally lower the minimum significance required to accept a peak. Here, we propose a method for joint analysis of weak peaks.

Given a set of peaks from (biological or technical) replicates, the method combines the p-values of overlapping enriched regions: users can choose a threshold on the combined significance of overlapping peaks and set a minimum number of replicates in which the overlapping peaks must be present. The method allows the "rescue" of weak peaks occurring in more than one replicate and outputs a new set of enriched regions for each replicate.

In general, the method groups enriched regions as background, weak, or stringent based on user-defined weak and stringency thresholds. The method then confirms or discards the weak and stringent enriched regions if their combined stringency is at least as significant as a user-defined threshold. The method then performs a multiple testing correction on confirmed enriched regions at a user-defined false-discovery rate, identifying true-positive and false-positive regions. See the following figure as an example, and you may refer to MSPC publications, slides on slideshare, or documentation page for more details.
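
The steps described above can be sketched in a few lines. This is a minimal illustration, not MSPC's implementation: it assumes Fisher's method for combining p-values (suggested by the xSqrd and right-tail probability columns in the output files), assumes Benjamini-Hochberg for the multiple testing correction, and treats the threshold boundaries as inclusive. All function names are hypothetical.

```python
import math


def chi2_right_tail_even_df(x, df):
    """Right-tail probability of a chi-squared variable with even df.

    For df = 2k the tail has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def classify(p, weak=1e-4, stringent=1e-8):
    """Label a peak by its p-value (inclusive boundaries are an assumption)."""
    if p <= stringent:
        return "stringent"
    if p <= weak:
        return "weak"
    return "background"


def fisher_combine(pvalues):
    """Fisher's method: X^2 = -2 * sum(ln p), with 2 * n degrees of freedom."""
    x_squared = -2.0 * sum(math.log(p) for p in pvalues)
    return x_squared, chi2_right_tail_even_df(x_squared, 2 * len(pvalues))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return one boolean per p-value: True if accepted at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    accepted = [False] * m
    for rank, idx in enumerate(order, start=1):
        accepted[idx] = rank <= cutoff_rank
    return accepted


# Two overlapping weak peaks, one from each of two replicates.
x_squared, combined_p = fisher_combine([1e-5, 1e-6])
confirmed = combined_p <= 1e-8  # gamma, the combined-significance threshold
```

In this example neither peak is stringent on its own, yet their combined significance passes the threshold, which is the "rescue" of weak peaks described above.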


Download and Run

MSPC is distributed as a cross-platform console application, a .NET library, and a Bioconductor R package.

mspc's People

Contributors

dependabot[bot], fernandopalluzzi, marziacremona, meriembahda, vjalili

mspc's Issues

Automatic filter of ENCODE blacklist

I think it would be great to add an automatic filtering of known artifacts, such as the ones present in ENCODE blacklists.

Functional genomics experiments based on next-gen sequencing (e.g. ChIP-seq, MNase-seq, DNase-seq, FAIRE-seq) that measure biochemical activity of various elements in the genome often produce artifact signal in certain regions of the genome. It is important to keep track of and filter artifact regions that tend to show artificially high signal (excessive unstructured anomalous reads mapping).

MSPC could have an additional argument (say, --blacklist) so that the user can decide not to filter anything (--blacklist none), or to filter out every peak overlapping the ENCODE blacklist for the reference genome (--blacklist hg19, --blacklist GRCh38, and so on). The complete list of available blacklists can be found here.
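
The filtering requested here could be sketched as follows. This is only an illustration of the idea, assuming half-open BED-style coordinates; the --blacklist option is a proposal in this issue rather than an existing flag, and the function names are hypothetical.

```python
from bisect import bisect_left


def build_blacklist_index(blacklist):
    """Index (chrom, start, end) regions per chromosome for fast overlap tests."""
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        # prefix_max_end[i] is the largest end among the first i + 1 intervals.
        prefix_max_end, running = [], 0
        for _, end in intervals:
            running = max(running, end)
            prefix_max_end.append(running)
        index[chrom] = (starts, prefix_max_end)
    return index


def overlaps_blacklist(index, chrom, start, end):
    """True if the half-open peak [start, end) overlaps any indexed region."""
    if chrom not in index:
        return False
    starts, prefix_max_end = index[chrom]
    n = bisect_left(starts, end)  # number of regions starting before the peak ends
    return n > 0 and prefix_max_end[n - 1] > start


def filter_peaks(peaks, blacklist):
    """Keep only the peaks that do not overlap a blacklisted region."""
    index = build_blacklist_index(blacklist)
    return [p for p in peaks if not overlaps_blacklist(index, p[0], p[1], p[2])]


blacklist = [("chr1", 100, 200)]
peaks = [("chr1", 150, 250), ("chr1", 200, 300), ("chr2", 150, 160)]
kept = filter_peaks(peaks, blacklist)
```

Only the first peak is dropped: the second merely touches the blacklisted region at its end coordinate, and the third is on another chromosome.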

Problem about "Invalid culture info"

Hi! I recently tried to run MSPC to merge the peaks generated by MACS2 from DAP-Seq data. However, when I tested the program using the demo data, I encountered an "Invalid culture info" error, as follows:
image

I tried adding a parser configuration using the "-p" parameter:
image
However, it did not work. Please help me figure it out. Thanks.

Add MSPC to `PATH`

  • Update documentation on how to add MSPC to each OS's PATH;
  • Add a check in the wrapper to assert if MSPC is installed and callable (whether binaries are locally available or the installation directory is added to PATH).
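
The check proposed in the second item could look roughly like the sketch below, written in Python for illustration even though the wrapper itself is not Python. It assumes the executables of interest are named mspc and dotnet, as they appear in the commands on this page; the helper name is hypothetical.

```python
import shutil


def find_first_executable(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# Prefer a self-contained mspc binary, then fall back to the dotnet host.
runner = find_first_executable(["mspc", "dotnet"])
if runner is None:
    message = "Neither 'mspc' nor 'dotnet' was found on PATH."
else:
    message = f"Using {runner}"
```

Failing early with a message like this is usually clearer to users than a low-level "command not found" error raised from a child process.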

Obtaining the actual values of 'inf's

Hello,

I wonder whether there is a way to obtain the actual p-values of the peaks that are reported as inf in the resulting consensus files.

Thank you.

What value to set for -c?

What would you recommend setting for the -c argument? I have 24 scATAC datasets. Each dataset has anywhere from 1000 to 10000 cells. Is 50% a good choice? The cell types are neurons (excitatory/inhibitory), astrocytes, microglia, oligodendrocytes, and OPCs. I am running MSPC on peaks called separately for each cell type (MACS2).

Error while running the v5.0.0 and v5.1.0 releases of MSPC

Hi,

I downloaded the two latest releases of MSPC (v5.0.0 and v5.1.0) using Method B: Self-Contained, as shown in the MSPC installation tutorial.
I then downloaded the data from the demo.zip file.
I then ran mspc using the command .\mspc.exe -i .\rep*.bed -r bio -w 1e-4 -s 1e-8

The MSPC program worked and returned the expected output, as shown in the attached picture:
MSPC output

But it also returned the following error:
Error

The files in my MSPC folder when I run the program are the following:

  • clrcompression.dll
  • clrjit.dll
  • Core.pdb
  • coreclr.dll
  • mscordaccore.dll
  • mspc.exe
  • mspc.pdb
  • rep1
  • rep2

Thank you,
Meriem

Number of consensus peaks and TruePositives are not changing in different runs

Hello,

I have two questions that I think are related.

When I run mspc with various values for the -w and -s options, such as -w 1e-4 -s 1e-8, -w 1e-4 -s 1e-6, and -w 1e-1 -s 1e-3, I always get the same number of consensus peaks.

Also, across different runs, the percentages of Background, Weak, Discarded, and FalsePositive peaks are always %0,000, while those of Stringent, Confirmed, and TruePositive peaks are %100,000. (Only Read peaks# changes, as expected.)

I would greatly appreciate any help in figuring out the reasons.

Curious scores equal 0 (zero) in MSPC consensus file for replicated stringent peaks

Hello
I am using mspc version 4.0.1
I am seeing unexpected behavior in the results of the MSPC analysis: peaks (ATAC-seq, called by MACS2) that are of very high confidence get a score of zero, while much lower-confidence consensus peaks get an above-zero score. I would like an explanation to help me understand whether this is the expected behavior or whether I am doing something wrong.

The command I used with MSPC was

dotnet mspc.dll \
-i Sample_01_peaks.mspcBED \
Sample_02.mspcBED \
Sample_03.mspcBED \
Sample_04.mspcBED \
Sample_05.mspcBED \
Sample_06.mspcBED \
-r bio -w 1e-5 -s 1e-10 -d 6 \
--output MSPC_output

Peaks are called with MACS2 with p-value cutoff left at default value. The distribution of scores calculated in R for one of my 6 replicates is as follows:

       V7                V8                 V9                V10        
 Min.   :  1.251   Min.   :   3.132   Min.   :   1.374   Min.   :   0.0  
 1st Qu.:  4.029   1st Qu.:  10.598   1st Qu.:   8.617   1st Qu.:  56.0  
 Median :  6.339   Median :  24.337   Median :  22.179   Median : 106.0  
 Mean   :  9.752   Mean   :  93.057   Mean   :  90.696   Mean   : 150.1  
 3rd Qu.: 13.196   3rd Qu.:  89.245   3rd Qu.:  86.703   3rd Qu.: 189.0  
 Max.   :110.091   Max.   :3742.200   Max.   :3735.730   Max.   :1555.0  

V7: fold-change at peak summit
V8: -log10pvalue at peak summit
V9: -log10qvalue at peak summit
V10: relative summit position to peak start

In the consensus peaks:

chr9    69304509        69304862        MSPC_Peak_116502        0
chr9    69305100        69305173        MSPC_Peak_116501        12.332
chr9    69305437        69306418        MSPC_Peak_116500        0

Same peaks, in the ConsensusPeaks_mspc_peaks.txt file:

chr     start   stop    name    -1xlog10(p-value)       xSqrd   -1xlog10(Right-Tail Probability)        -1xlog10(AdjustedP-value)
chr9    69304509        69304862        MSPC_Peak_116502        0       3470.525        0       0
chr9    69305100        69305173        MSPC_Peak_116501        12.332  56.79   12.332  12.312
chr9    69305437        69306418        MSPC_Peak_116500        0       8933.281        0       0

The corresponding peaks in the first of the 6 input files from MACS2 are:

chr9	69304514	69304845	peak_152259	1257	.	4.95997	128.515	125.789	57
chr9	69305100	69305173	peak_152260	123	.	2.02626	14.4003	12.3318	49
chr9	69305445	69306393	peak_152261	3642	.	8.83607	367.663	

thus -log10(q-values) of about 125, 12 and 367.

Here is a visual example to explain:
image

A plot of the xSqrd vs -1xlog10(Right-Tail Probability) gives me this, so at values of chi-square above ~1500, the scores get very bad. Is this a problem with the program not being able to calculate an accurate probability with huge xSqrd values?

image

I don't understand how such a great peak (MSPC_Peak_116500, on the right on the graph) can get a poorer "score" than a very short peak. I must be getting something wrong?

Thanks for your help

Alex
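
The pattern reported above is what one would expect from floating-point underflow: in double precision, exp(-x/2) evaluates to exactly 0 once x exceeds roughly 1490, which is close to the ~1500 threshold seen in the plot. The sketch below shows one way to keep such a score finite by working in log space. It assumes the right-tail probability comes from a chi-squared distribution with an even number of degrees of freedom, as in Fisher's method; it is not MSPC's code, and the function name is hypothetical.

```python
import math


def neg_log10_chi2_right_tail(x, df):
    """-log10 of the chi-squared right-tail probability, computed in log space.

    Uses the closed form for even df = 2k:
    tail = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 0.0  # the tail probability is 1
    half = x / 2.0
    log_terms = [i * math.log(half) - math.lgamma(i + 1) for i in range(df // 2)]
    largest = max(log_terms)
    log_sum = largest + math.log(sum(math.exp(t - largest) for t in log_terms))
    return (half - log_sum) / math.log(10)


# A single peak (df = 2) with xSqrd = 56.79, as in MSPC_Peak_116501 above.
small = neg_log10_chi2_right_tail(56.79, 2)

# xSqrd = 8933.281 as in MSPC_Peak_116500; df = 12 is an assumption
# (six replicates all contributing), not a value taken from the output.
naive_tail = math.exp(-8933.281 / 2)  # underflows to 0.0
large = neg_log10_chi2_right_tail(8933.281, 12)
```

For the first case the result agrees with the 12.332 reported in the output file, while the second stays finite instead of collapsing to 0.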

Is it possible to get original pvalue in consensus peak file?

Dear developer,

Thank you for providing this brilliant package. I have a question about the p-values in the ConsensusPeaks.bed file. The fifth column is -log(p-value), and I got many inf values in this column, which I take to mean that the peaks are extremely significant (if I am right). I showed these values to my collaborators, and they want the original p-values of these peaks. Is that possible? Thank you in advance, and have a good day.
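
When a finite -log10(p-value) is still available (for example in the peak caller's output rather than in a consensus file that already shows inf), the p-value can be written out in scientific notation without ever forming the underflowing number. A minimal sketch with a hypothetical helper name:

```python
import math


def pvalue_string(neg_log10_p, digits=3):
    """Format 10 ** (-neg_log10_p) as 'm.mmmE-xxx' without underflow."""
    exponent = math.floor(-neg_log10_p)
    mantissa = 10 ** (-neg_log10_p - exponent)  # always in [1, 10)
    if round(mantissa, digits) >= 10:  # guard against rounding up to 10.000
        mantissa /= 10
        exponent += 1
    return f"{mantissa:.{digits}f}E{exponent:+d}"


direct = 10 ** -367.663            # too small for a double: becomes 0.0
text = pvalue_string(367.663)      # keeps mantissa and exponent separately
```

The value 367.663 is the -log10(p-value) of the third MACS2 peak quoted in the issue above; a direct conversion gives 0, whereas the string form preserves the magnitude.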

Which value is considered the peak enrichment in the output?

Hi,
I have a question about the output from MSPC.
Should I use the p-value (I think this is -log10(p-value)?) or the value in xSquare as the enrichment score for peak intensity? I would like to determine the peak enrichment in a list of genes. Thank you!

Best,
Ellie

The number of consensus peaks is greater than that of individual replicates?

Dear developers,

First, thanks for providing us with such a great package. I wanted to find the "common" peaks across 5 biological replicates of a sample using the sample command presented in the manual: $MSPC/mspc -i ATAC_D3_*.narrowPeak -w 1e-4 -s 1e-8 -r bio -o D3. However, I am confused by the consensus peak file, which contains more peaks than any individual replicate. Typically, the number of common peaks should be smaller than the number in any individual replicate, right? I don't know whether this is a normal result or whether I used inappropriate parameters. Could you please give me some suggestions to solve the problem? Thanks in advance!

Number of peaks (5 replicates)
53639 ATAC_D3_1_peaks.narrowPeak
69989 ATAC_D3_2_peaks.narrowPeak
152633 ATAC_D3_3_peaks.narrowPeak
68203 ATAC_D3_4_peaks.narrowPeak
95279 ATAC_D3_5_peaks.narrowPeak

Number of common peaks (MSPC results)
171405 ConsensusPeaks.bed

Here is some background info:
ref genome: hg19
peak calling: macs2

C argument does not accept percentage

While running MSPC, I wanted to specify 50% for the c argument on rmspc (from Bioconductor), as described here. However, the program threw me a warning Warning: Invalid C=50, it is set to C=5. and proceeded to use 5 as the parameter value for the run.

Context: I had 5 biological replicated input peak files. The code I use to run rmspc specifically was as follows:

results <- mspc(input = input_paths, replicateType = "Bio", stringencyThreshold = 1e-8, weakThreshold = 1e-4, gamma = 1e-8, keep = TRUE, multipleIntersections = "Lowest", c = 50, alpha = 0.05, inputParserConfiguration = parser_json, outputPath = output_path)

Is it because I did not use % symbol? I removed it because my VSCode linter disliked that symbol there so I thought it is going to error out if I run the command.

Tangential question: Considering the robustness of the MSPC analysis under the hood, do you have any recommendation for the value of C? Would setting it to the number of input samples be too stringent/remove potentially true peaks? The default setting is 1; is there any justification for this design decision that I am not seeing?
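
For reference, the relationship between a percentage and an absolute value of C can be sketched as below. The clamping to the number of samples mirrors the warning quoted above (C=50 became C=5 with five inputs); rounding the percentage up and enforcing a minimum of 1 are assumptions, and the helper is hypothetical rather than part of rmspc or MSPC.

```python
import math


def resolve_c(c, sample_count):
    """Turn an absolute count or a percentage string such as '50%' into a count."""
    if isinstance(c, str) and c.strip().endswith("%"):
        fraction = float(c.strip()[:-1]) / 100.0
        value = math.ceil(fraction * sample_count)  # rounding up is an assumption
    else:
        value = int(c)
    # Clamp to the valid range: at least 1, at most the number of samples.
    return max(1, min(value, sample_count))


as_number = resolve_c(50, 5)       # treated as an absolute count, clamped to 5
as_percent = resolve_c("50%", 5)   # half of five replicates, rounded up to 3
```

Under these assumptions, passing 50 without the percent sign asks for 50 overlapping peaks, which cannot be satisfied with five samples and is therefore reduced to 5.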

MSPC cannot continue error

Hello,

Thanks for making such useful software!
I am using MSPC to generate consensus peaks for my ATAC-Seq experiments. I want to generate consensus peaks per treatment type which I was successfully able to do for one treatment type; however, it fails for the other treatment types after a certain point. Below is an example of the error message:

 .::...Parsing Samples....::. 
    #	            Filename	Read peaks#	Min p-value	Mean p-value	Max p-value	
------	--------------------	-----------	-----------	------------	-----------	
  1/11	        273E_IC_peak	    116,172	 1.908E-191	  6.663E-004	 9.993E-003	
  2/11	        280D_IC_peak	    137,711	 9.204E-185	  1.290E-003	 9.993E-003	
  3/11	        519C_IC_peak	    139,766	 3.152E-190	  1.465E-003	 9.993E-003	
  4/11	        551A_IC_peak	    151,793	 3.873E-182	  6.876E-004	 9.993E-003	
  5/11	        595D_IC_peak	    131,618	 2.306E-168	  5.484E-004	 9.993E-003	
  6/11	        643C_IC_peak	    148,040	 9.996E-185	  1.167E-003	 9.993E-003	
  7/11	         704_IC_peak	    135,992	 3.030E-191	  6.284E-004	 9.993E-003	
Error: error parsing data: An item with the same key has already been added. Key: -2061531644
You may run mspc with either of [-? | -h | --help] tags for help.
MSPC cannot continue. 

I have tried a couple of things but I am getting the same error and I am not really sure how to proceed. This is the basic command (without options) I submit:
mspc -i *IC_peak.bed -r bio -w 1e-4 -s 1e-8

Is it possible to get some insight on how I can solve this please? 🙂

Many thanks,
Lara

Unhandled Exception:

Thanks for this software! It runs super fast!
I got a message saying that dotnet quit unexpectedly.
I thought it had run through, since all the folders were generated. But when I checked the log output, this message appeared in the middle, twice.
I reran it and got the same result.
Can you help me with it? Thank you!

Only the "Summary statistics" was missing from these 2 files, so I guess it worked?

Unhandled Exception: System.ArgumentOutOfRangeException: StartIndex cannot be less than zero.
Parameter name: startIndex
   at System.String.Substring(Int32 startIndex, Int32 length)
   at Genometric.MSPC.CLI.SummaryStats.TruncateString(String value, Int32 maxLength)
   at Genometric.MSPC.CLI.SummaryStats.RenderRow(Int32 columnwidth, String[] columns)
   at Genometric.MSPC.CLI.SummaryStats.Create(List`1 samples, Dictionary`2 samplesDict, ReadOnlyDictionary`2 results, ReadOnlyDictionary`2 consensusPeaks, List`1 exportedAttributes)
   at Genometric.MSPC.CLI.Program.Main(String[] args)
./6.14.19_MSPC_mac2_reps.sh: line 28: 33841 Abort trap: 6           dotnet /Users/ellie/Downloads/mspc_v3.3/CLI.dll -i ${line}*.bed -r bio -s 1E-8 -w 1E-4 -g 1E-8 -c 2 -a 0.05 -m lowest

output path

Hi,
I like MSPC very much. I am using it in R, and one thing I can't get to work is the output path. No matter what I set here, MSPC defaults to saving the output into this folder:
C:\Users\username\AppData\Local\Temp

Do you know what causes this, and how to make MSPC accept my outputpath ?
Thanks,
Oliver

Incorrect decimal separator

Using the software with different language configurations, I have detected a new issue.

When I use my PC in Spanish (with decimal separator ","), MSPC is not able to detect the typical "." separator, which is used in the output of MACS2. If a -log10(pvalue) is 5.454, for example, MSPC reads it as 5454, i.e., a p-value of 10^-5454. Moreover, since MSPC does not report an error, you could end up using the incorrect result without noticing. The only way to make it work properly is to change the system language or to change the separator in the BED files.

I don't know anything about .NET, but it would be great if you could force the separator to always be a point, regardless of the system language.

Best regards and thanks for the software.
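
The misreading described in this issue can be reproduced with a small simulation of culture-sensitive number parsing. The parser below is a hypothetical stand-in written for illustration, not MSPC or .NET code; it assumes that a culture-aware parser drops group separators and then interprets the culture's decimal separator.

```python
def parse_number(text, decimal_sep=".", group_sep=","):
    """Mimic culture-sensitive parsing with configurable separators."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# Separator conventions matching the "." used in MACS2 output.
invariant_style = parse_number("5.454")

# Spanish-style conventions: "," is the decimal mark and "." groups thousands,
# so the same text is silently read as five thousand four hundred fifty-four.
spanish_style = parse_number("5.454", decimal_sep=",", group_sep=".")
```

Python's built-in float() always expects "." regardless of the system locale. In .NET, the analogous approach is typically to parse with CultureInfo.InvariantCulture; whether and how MSPC does so is not stated here.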

Error while running the recent version of MSPC

Hi,
After downloading the most recent version of MSPC, I tried running it but I got the following error message :

Error: An assembly specified in the application dependencies manifest (mspc.deps.json) was not found: package: 'System.Security.Cryptography.ProtectedData', version: '4.5.0' path: 'runtimes/win/lib/netstandard2.0/System.Security.Cryptography.ProtectedData.dll'

Consensus peaks, clarification on their definition and impact of the -c option

Hello
I discovered MSPC and it seems to be quite useful for our work with peak calling in ATAC-seq datasets.

I have a question regarding the definition of "consensus peaks". It is not very clearly explained on the website:
https://genometric.github.io/MSPC/docs/method/consensus

The definition says

A consensus peak is created on a position of genome where is covered by at least one peak from the output sets of the processed replicates.

But in the output folders (one folder is created for each of my replicate peak calls), I see multiple outputs ("confirmed", "stringent", "true positives", etc.). I am wondering which of those is used to generate the consensus list.

Secondly, I noted the -c option, and its usage is again not very clearly explained.

-c Sets minimum number of overlapping peaks before combining p-values.

Does the -c value mean the number of replicates where the peak is in the "confirmed" output bed file? And is this a criterion for sending (or not sending) a given peak to the "consensus" bed file, based on whether it passes this filter?

Thanks
Alex
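
Read literally, the quoted definition describes a union: every position covered by at least one output peak belongs to some consensus peak. The sketch below illustrates that reading only. Which per-replicate output set feeds the union is exactly what this issue asks, so the inputs here are placeholders, and merging only strictly overlapping (not merely adjacent) intervals is an assumption.

```python
def merge_union(replicates):
    """Merge (chrom, start, end) peaks from all replicates into union regions."""
    all_peaks = sorted(peak for replicate in replicates for peak in replicate)
    merged = []
    for chrom, start, end in all_peaks:
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start < last[2]:
            # Overlaps the previous region on the same chromosome: extend it.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 250), ("chr1", 900, 1000)]
consensus = merge_union([rep1, rep2])
```

Each replicate contributes two peaks, yet the union contains three regions. This also shows why a consensus set built this way can be larger than any single replicate's peak set.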

Problems running rmspc

When I run the program for peak calling in RStudio on Windows, it reports an error. Does this program not support Windows? Thanks!

library(rmspc)
> results <- mspc(input = input, replicateType = "Technical",
+                 stringencyThreshold = 1e-8,
+                 weakThreshold = 1e-4, gamma = 1e-8,
+                 keep = FALSE,GRanges = TRUE,
+                 multipleIntersections = "Lowest",
+                 c = 2,alpha = 0.05)

Error in process_initialize(self, private, command, args, stdin, stdout, …:
! Native call to processx_exec failed
Caused by error in chain_call(c_processx_exec, command, c(command, args), pty, pty_options, …:
! Command 'dotnet' not found @win/processx.c:982 (processx_exec)
Type .Last.error to see the more details.

DOTNET CIFS version issues

Environment: Rocky Linux 8.5 (and 8.4)
The bundled version of Dotnet in the linux-x64.zip has an issue with CIFS calls depending on how they are called, resulting in empty files being created followed by access denied upon writing:

.::....Saving Results....::.
Error: Access to the path '/mnt/redacted/ConsensusPeaks_mspc_peaks.txt' is denied.

See here for the bug: dotnet/runtime#42790

The workaround I used was from here: https://forums.docker.com/t/a-workaround-for-net-writing-to-cifs-volume-yielding-empty-file-and-access-to-the-path-is-denied/110872

How do I run the self-contained version?

Hi authors! Thank you for developing this neat little program. However, I have a problem that is not sufficiently documented in the program's wiki (at least not that I could find). I am not able to install .NET on the machine I am planning to run MSPC on, because I will be working on a remote HPC cluster. I followed the instructions for installing the self-contained version here https://genometric.github.io/MSPC/docs/installation/ but I am unsure how to run the program. There are only .pdb files in the generated directory. How do I invoke the command on the CLI?
