encode-dcc / chip-seq-pipeline Goto Github PK
View Code? Open in Web Editor NEWENCODE Uniform processing pipeline for ChIP-seq
License: MIT License
ENCODE Uniform processing pipeline for ChIP-seq
License: MIT License
Hi,
I'm curious about the function of finding overlapped peaks, expecially the scripts 'overlap_peaks.py'. But the explanation about how to excute this function is barely mentioned in your document? I even don't know what kind of files should be submitted. Could you help me to use it? I really want to get a more precise results.
Hanwen
Hi,
I assumed you're using the following in the pipeline to calculate metrics for library complexity:
bedtools bamtobed -bedpe -i tmp.bam | awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}'
| grep -v 'chrM' | sort | uniq -c | awk 'BEGIN{mt=0;m0=0;m1=0;m2=0}($1==1){m1=m1+1}
($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1}
END{printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n", mt,m0,m1,m2,m0/mt,m1/m0,m1/m2}' > ${sample}.pbc.qc
rm tmp.bam
where mt = # TotalReadPairs, m0 = # DistinctReadPairs, m1 = # OneReadPair, m2 = #TwoReadPairs, m0/mt = NRF=Distinct/Total, PBC1 = m1/m0 = OnePair/Distinct, PBC2 = m1/m2 = OnePair/TwoPair
Then if you remove duplicates mt becomes equal to m0 and NRF will be 1.
As I see it, the line "uniq -c" prefixes lines by the number of occurrences, so it adds prefix 1 if the lines is unique i.e. m1, then prefix 2 for a second occurrence if the line is repeated. However, identical lines are usually removed during remooval of duplicates. If we would use the definition of distinct genomic location then the code should not search for identical occurrences lines to classify them as m2 but for lines that map to the same location (partially overlapping fragments that originates from a different dna molecule)
I have tried to use these calculation after removing duplicates and the NRF does not look right. It is always. Maybe you can explain why this step is always after removing duplicates in the pipeline which cause NRF to be always 1
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.