multimotifmaker's Introduction

MultiMotifMaker

Overview

MultiMotifMaker is a multi-threaded tool for identifying DNA methylation motifs from PacBio reads. It is an efficient, multi-threaded implementation of MotifMaker (https://github.com/PacificBiosciences/MotifMaker) that takes full advantage of multi-core processors to achieve significant acceleration.

DNA methylation is the most common form of DNA modification in the genomes of prokaryotes and eukaryotes and plays a vital role in many critical biological processes. The methylation of DNA bases is catalyzed by DNA methyltransferases, which bind to specific DNA sequence motifs. Recently, third-generation sequencing technologies such as PacBio SMRT have provided a new way to identify base methylation in the genome. For each methylated site, the SMRT pipeline outputs a sequence of 41 bases centered on the methylated base. The number of methylated sites ranges from tens of thousands for E. coli to tens of millions for human. Identifying methylation motifs from the output of the SMRT pipeline differs from the problem solved by previous de novo motif-finding algorithms. See the publication for more background on modification detection: (http://nar.oxfordjournals.org/content/early/2011/12/07/nar.gkr1146.full)
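To make the 41-base context concrete, here is a minimal Java sketch that cuts such a window out of a reference sequence. The class name `ContextWindow`, the forward-strand-only handling, and the padding with `N` near the sequence ends are assumptions made for illustration; they are not taken from the SMRT pipeline or from MultiMotifMaker.

```java
/** Hypothetical helper: cut the 41-base context around a 1-based position. */
final class ContextWindow {
    // 20 bases on each side plus the center base gives 41 bases in total.
    static final int FLANK = 20;

    /**
     * Returns the 41-base forward-strand window centered on the given
     * 1-based position. Positions outside the sequence are padded with 'N'
     * (an assumption of this sketch, not documented pipeline behavior).
     */
    static String extract(String genome, int position) {
        StringBuilder sb = new StringBuilder(2 * FLANK + 1);
        int center = position - 1; // convert to a 0-based index
        for (int i = center - FLANK; i <= center + FLANK; i++) {
            boolean inside = i >= 0 && i < genome.length();
            sb.append(inside ? genome.charAt(i) : 'N');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Toy reference: "ACGT" repeated 15 times (60 bases).
        StringBuilder genome = new StringBuilder();
        for (int i = 0; i < 15; i++) {
            genome.append("ACGT");
        }
        String ctx = extract(genome.toString(), 30);
        System.out.println(ctx + " length=" + ctx.length());
    }
}
```

The base at index 20 of the returned string is the site itself, which is what lets a motif finder anchor every candidate motif on the modified base.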

Algorithm

Existing motif-finding algorithms include MEME, Gibbs Motif Sampler, and MEpigram. None of these methods takes base-centered sequences as input, and none of them includes base modification signals in its algorithm.

To find methylation motifs in the methylation sequences output by SMRT, PacBio developed a tool, MotifMaker.
Given a list of modification detections and a genome sequence, MotifMaker searches all possible motifs using a motif score. The search proceeds gradually from shorter to longer motifs using a branch-and-bound method. However, MotifMaker generally runs single-threaded, and the search process is very time-consuming (MotifMaker).

Here, we give a rough overview of the algorithm used by MultiMotifMaker. The branch-and-bound search step, which searches for motifs in the modification sequences through a series of iterations, is the most time-consuming procedure in the overall MotifMaker workflow. Since the subtree below every expansion node of the solution space tree can be computed independently, multiple expansion nodes can be processed at the same time. Therefore, following the branching rule of the branch-and-bound method, we first compute the candidate live nodes in the first k layers of the implicit tree, in which every expansion node has n child nodes. The branch-and-bound search is then applied to each of these candidate live nodes, so the resulting computing tasks can be submitted to a thread pool and run in parallel. Every thread searches its own local solution space tree independently to obtain a local optimal solution. Additionally, we keep two global variables that represent the current optimal solution and the paths already searched. After each thread finishes, these two global variables are updated, and subsequent task threads use their values as parameters to prune their own searches. Finally, when all tasks have finished, we obtain the final optimal motif. In this way MultiMotifMaker takes full advantage of multi-core concurrent computation, which noticeably accelerates the search.
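The scheme described above can be sketched in Java as follows. This is a toy illustration under stated assumptions, not MultiMotifMaker's code. The score (number of sequences starting with a prefix, times the prefix length) is an invented stand-in for the real motif score. The class and method names are hypothetical. The sketch keeps only one shared variable (the current best) and updates it during the search rather than after each thread ends.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

/** Toy parallel branch-and-bound; NOT MultiMotifMaker's scoring or code. */
final class ParallelBranchAndBound {
    static final char[] BASES = {'A', 'C', 'G', 'T'};

    /** Immutable (motif, score) pair. */
    static final class Best {
        final String motif;
        final int score;
        Best(String motif, int score) { this.motif = motif; this.score = score; }
    }

    final List<String> seqs;
    final int maxLen;
    // Shared across threads: the best solution found so far, used for pruning.
    final AtomicReference<Best> globalBest = new AtomicReference<>(new Best("", 0));

    ParallelBranchAndBound(List<String> seqs, int maxLen) {
        this.seqs = seqs;
        this.maxLen = maxLen;
    }

    int count(String prefix) {
        int n = 0;
        for (String s : seqs) if (s.startsWith(prefix)) n++;
        return n;
    }

    /** Atomically replaces the shared best if this node scores strictly higher. */
    void offer(String node, int count) {
        Best cand = new Best(node, count * node.length());
        globalBest.accumulateAndGet(cand, (cur, c) -> c.score > cur.score ? c : cur);
    }

    /** Depth-first search below one node, pruning against the shared best. */
    void search(String node) {
        int c = count(node);
        // Bound: extending a prefix never gains matches and length is capped.
        if (c * maxLen <= globalBest.get().score) return;
        offer(node, c);
        if (node.length() == maxLen) return;
        for (char b : BASES) search(node + b);
    }

    /** Expands the first `layers` levels, then runs each subtree as a pool task. */
    Best run(int layers, int threads) {
        List<String> roots = new ArrayList<>();
        roots.add("");
        for (int d = 0; d < layers; d++) {
            List<String> next = new ArrayList<>();
            for (String r : roots) {
                offer(r, count(r)); // score shallow nodes before expanding them
                for (char b : BASES) next.add(r + b);
            }
            roots = next;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (String r : roots) futures.add(pool.submit(() -> search(r)));
            for (Future<?> f : futures) f.get(); // wait for every subtree
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel search failed", e);
        } finally {
            pool.shutdown();
        }
        return globalBest.get();
    }

    public static void main(String[] args) {
        List<String> seqs = Arrays.asList("GATCAA", "GATCTT", "GATCGG", "ACGTAC");
        Best best = new ParallelBranchAndBound(seqs, 6).run(1, 4);
        System.out.println(best.motif + " score=" + best.score);
    }
}
```

In this sketch the `layers` and `threads` arguments play roles analogous to the `--layer` and `--thread` options: a larger `layers` value produces more, smaller tasks, which can balance load better at the cost of more scheduling overhead.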

Usage

The jar supplied in artifacts/MultiMotifMaker.jar bundles all dependencies and should be runnable on most systems. Sample datasets can be found in ./resources. The original datasets are as follows:

  • Geobacter metallireducens data: https://github.com/PacificBiosciences/MotifMaker/tree/master/src/test/resources

  • E. coli data: https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-Bacterial-Assembly

  • Arabidopsis data: NCBI (PRJNA314706), https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA314706&go=go

For command-line motif finding, run the 'find' sub-command and pass the reference FASTA file and the modifications.gff(.gz) file emitted by the PacBio modification detection workflow (SMRT Analysis: https://www.pacb.com/products-and-services/analytical-software/smrt-analysis/).
The 'reprocess' sub-command annotates the GFF file with motif information for better genome browsing.

$ java -jar artifacts/MultiMotifMaker.jar

Usage: MultiMotifMaker [options] [command] [command options]
  Options:
    -h, --help
                 Default: false
  Commands:
    find      Run motif finding
      Usage: find [options]
        Options:
        * -f, --fasta      Reference fasta file
        * -g, --gff        modifications.gff or .gff.gz file
          -m, --minScore   Minimum Qmod score to use in motif finding
                           Default: 30.0
        * -o, --output     Output motifs csv file
          -x, --xml        Output motifs xml file
          -l, --layer      Search depth used to Parallelize motif finder
                           Default: 1
          -t, --thread     The concurrency to parallelize motif finder,this parameter should be the number of CPU
                           Default: 16

    reprocess      Reprocess gff file with motif information
      Usage: reprocess [options]
        Options:
          -c, --csv           Raw modifications.csv file
        * -f, --fasta         Reference fasta file
        * -g, --gff           original modifications.gff or .gff.gz file
              --minFraction   Only use motifs above this methylated fraction
                              Default: 0.75
        * -m, --motifs        motifs csv
        * -o, --output        Reprocessed modifications.gff file

For example:

$ java -jar ./artifacts/MultiMotifMaker.jar find --fasta ./resources/lambda/lambda.fasta --gff ./resources/lambda/lambda_modifications.gff --minScore 30.0 --output ./resources/lambda/lambda_motifs.csv --xml ./resources/lambda/lambda_motifs.xml --layer 1 --thread 8

or

$ java -jar ./artifacts/MultiMotifMaker.jar reprocess --fasta ./resources/lambda/lambda.fasta --gff ./resources/lambda/lambda_modifications.gff --motifs ./resources/lambda/lambda_motifs.csv --output ./resources/lambda/lambda_modifications_new.gff
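Because --minScore sets the minimum Qmod score used in motif finding (default 30.0), it can help to check how many records in a modifications.gff file meet a given threshold before running 'find'. The following Java sketch does that. It assumes the score is stored in column 6, as in standard GFF3, and that the comparison is "greater than or equal"; how MultiMotifMaker applies the threshold internally is not stated here. The class name `GffScoreFilter` and the sample records are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-check: which GFF3 records meet a minimum score? */
final class GffScoreFilter {
    /** Keeps feature lines whose score (GFF3 column 6) is >= minScore. */
    static List<String> filter(List<String> lines, double minScore) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Skip blank lines, directives ("##...") and comments ("#...").
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] cols = line.split("\t");
            // Skip malformed records and records without a score (".").
            if (cols.length < 9 || cols[5].equals(".")) continue;
            if (Double.parseDouble(cols[5]) >= minScore) kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample records, not taken from any real dataset.
        List<String> lines = Arrays.asList(
            "##gff-version 3",
            "chr1\tkinModCall\tm6A\t100\t100\t45\t+\t.\tcoverage=50",
            "chr1\tkinModCall\tm6A\t200\t200\t21\t-\t.\tcoverage=12");
        int passing = filter(lines, 30.0).size();
        System.out.println(passing + " of 2 records pass minScore 30.0");
    }
}
```

If few or no records pass, lowering --minScore or increasing sequencing coverage may be worth considering before concluding that no motifs are present.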

Output file descriptions:

Using the find command:

  • Output csv file: This file follows the same format as the standard "Fields included in motif_summary.csv" described in the Methylome Analysis White Paper (https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Methylome-Analysis-Technical-Note).

  • Output xml file: This output is used by SMRT Portal and is not needed when running from the command line; to skip it, simply omit the -x option. The information in this file is used to fill in the standard motif report table in SMRT Portal and is redundant with the CSV output file.

Using the reprocess command: (Reprocessing will update a modifications.gff file with information based on new Modification QV thresholds)

DISCLAIMER

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

multimotifmaker's People

Contributors

litao-csu

Forkers

chenzixi07

multimotifmaker's Issues

I use MultiMotifMaker for identifying motifs but get nothing

Hi there,
I tried MultiMotifMaker for identifying motifs but got no results. Here are my FASTA file and the GFF file produced by ipdSummary:

>scaffold00014
TTTTACTATTATCGGAAAGTTAATCTTACTTCTTCCTAATATATTATATAATCTTAATAA
ACGTTATTCCCTTAGCTATATTATCTTTATAAATAGCTGAATAAAGTATCTATTATCTTT
ATTAATATAGATAATAATATTTATAACTTTCTCTTCCTCCTTAAAGCTAGCTAAGTAATA
AATATCTTTATTAATTAATTAAGAACTCCTAGAAATTATCCCCTATAAACCTAAAAGCAC
TATTTAACTTTAATATTAATATTATCGCTAATCCCTTATTATTTATTACCTTCTTAAGGA
TATTTTCGAGGATTAATATAAATTAATTAAAAGAAAGGCCTTAAATCGGGATATACTATA
CCTCTCAAGTAAATAGGTAATTAATTATATTCTTATAAAGTACTATTCTAGGTTAAAGAG
GTTAAATATTCTTATAAACTAGTAGGAAAATAGGTCTTAGAGGCCGGCTATTATTAAGAA
GGAGTCTAACTTTATTATCGTTATTATATTAAAGTAGAGAGGGTTAGAAATAGGCGAGGA
TTTACTATCTTTACGAGAATGCTCCCTATAAATAGTAAAAGCCCCTTTTATAAATAAACT
TAGGTATTAGCTATAGTAAATATAGGTTTATTCTCTTAATTAATTAATATAAGGGTTCCG
AGGCTAGATTATTTACTTTACTTTCTTTAATTATTAATTAGGGTCTATCGTTTATTATTA
TTTCTTATTATTTTATTTTATATATATTAATTTTTACTAAGGAAGAAGGGGGGTTATAGA
AGAGGGAGGTCGAAGCTATTTTTAAGGAATCGAGTCGCTTTATATAAGCTTCGCTATAGC
CTCCCCTATCTTAGTATTATATAACCTATTATAGGGTCCTAATACTATAATTAGGGGAAT
ATAATTATTATAGATTAATTATATTATATATATAGCTAGTTAGCTATATATATTATTTAT
ATATAAAGCTTATCTATTATTTATATTATTTATATATAAAGCTTATCTATTATTTATATT
ATTTATATATAAAGCTTATCTATTATTTATATTATTTATATTATATATATAGCTAATTAG
CTATATATATTATTCGTAAATCTCGTATATTTTATAAAATTATAGTATTACTTTAGTAAT
.....

gff file:

##gff-version 3
##source ipdSummary v2.0
##source-commandline /home/sujitmaiti/miniconda3/envs/kinectictools/bin/ipdSummary ../raw.merged.bam --reference ./scaffolds.reduced.fa --gff basemods.gff --csv kinetics.csv --bigwig IpdRatio.bw --identify m6A,m4C --methylFraction --numWorkers 10 --pvalue 0.01 --minCoverage 3
##sequence-region scaffold00014 1 418576
##sequence-region scaffold00015 1 394874
##sequence-region scaffold00016 1 350982
##sequence-region scaffold00017 1 156563
##sequence-region scaffold00010 1 1520983
##sequence-region scaffold00011 1 1050304
##sequence-region scaffold00012 1 933454
##sequence-region scaffold00013 1 900483
##sequence-region scaffold00018 1 123238
##sequence-region scaffold00019 1 83344
##sequence-region scaffold00024 1 11475
##sequence-region scaffold00007 1 2329946
##sequence-region scaffold00006 1 3148538
##sequence-region scaffold00009 1 1587396
##sequence-region scaffold00008 1 3559532
##sequence-region scaffold00023 1 14512
##sequence-region scaffold00003 1 6610137
##sequence-region scaffold00002 1 8563036
##sequence-region scaffold00001 1 17280981
##sequence-region scaffold00004 1 3506857
##sequence-region scaffold00021 1 42785
##sequence-region scaffold00020 1 47391
##sequence-region scaffold00005 1 2967496
##sequence-region scaffold00022 1 15686
scaffold00014   kinModCall      modified_base   5013    5013    27      -       .       coverage=5;context=ACGCCGGCGCCCAGACCGAGGCCGCGCTCGCCGTGTACCGG;IPDRatio=2.09
scaffold00014   kinModCall      modified_base   5748    5748    21      +       .       coverage=6;context=GGGCGTCCGACATGGCGAGCAGCCAGGGTGGTCGTTCGAGG;IPDRatio=2.16
scaffold00014   kinModCall      modified_base   7639    7639    21      -       .       coverage=13;context=AGGCCGTGCGCGCCATCGGCCGGTGTGCGCAGGCCGATGCG;IPDRatio=3.71
scaffold00014   kinModCall      m4C     14765   14765   20      +       .       coverage=11;context=TAGATATACTCTAAACTAAACTATAGATCGACGAAGTACTC;IPDRatio=1.96;frac=1.000;fracLow=0.367;fracUp=1.000;identificationQv=3
scaffold00014   kinModCall      m6A     16816   16816   23      +       .       coverage=16;context=GTCCGATTAAGAGGAAATCGACTTACTCTTAATAGCTAGAG;IPDRatio=2.67;frac=0.814;fracLow=0.347;fracUp=1.000;identificationQv=8
scaffold00014   kinModCall      m4C     19652   19652   22      -       .       coverage=18;context=GTTCATGCCGGCATAGACGTCCTGGGTAGCTTTCTTCATAT;IPDRatio=1.94;frac=1.000;fracLow=0.728;fracUp=1.000;identificationQv=4
scaffold00014   kinModCall      modified_base   19967   19967   23      +       .       coverage=23;context=AGTATGAACGCAGGATGGAGAAGGGACGAAAGCCCGCCGCA;IPDRatio=3.11

And this is the command I ran:

java -jar ~/biosoftware/MultiMotifMaker/artifacts/MultiMotifMaker.jar find -f /home/sujitmaiti/ver/vlgenome/quickmerge/ca_ne/redundans/scaffolds.reduced.fa -g basemods.gff -o motif.csv -t 12

Any help would be appreciated.
Fei
