LRScaf: improving draft genomes using long noisy reads

A hybrid assembly strategy is a reasonable and promising approach that combines the strengths and offsets the weaknesses of Next-Generation Sequencing (NGS) and Third-Generation Sequencing (TGS) technologies. Following this principle, we present a new toolkit named LRScaf (Long Reads Scaffolder), which applies TGS data to improve draft genome assemblies. Its main features are short running time, accuracy, and contiguity. Scaffolding a rice genome can be done in 20 minutes with the minimap mapper. In human, LRScaf improved the draft assembly NG50 from 127.5 Kb to 10.4 Mb on the 20x PacBio CHM1 dataset and from 115.7 Kb to 17.4 Mb on the ~35x Nanopore NA12878 dataset.


################################################################################
Requirements
################################################################################
Java version: 1.8+.

################################################################################
Building LRScaf project
################################################################################
There are two ways to build and run this project:
  • Use the prebuilt jar package named LRScaf-<version>.jar, found under the target folder in the releases. Run it with the command: "java -jar LRScaf-<version>.jar -x <configure.xml>". The configuration XML file is described below.
  • Compile the source code yourself: download the source code, then compile and build the project with Maven <https://maven.apache.org/> in the following steps:
    # 1. Download the latest release and unzip the package
    unzip lrscaf-<version>.zip
    # 2. Change the working folder
    cd lrscaf-<version>
    # 3. Compile the source code and package the project; a jar package named LRScaf-<version>.jar will be created under the target folder
    mvn package

    ################################################################################
    Quick start
    ################################################################################
    # XML configuration style
    java -jar LRScaf-<version>.jar -x <configure.xml>
    # or command-line in short style
    java -jar LRScaf-<version>.jar -c <draft_assembly.fasta> -a <alignment.m4> -t <m4> -o <output_folder> [options]
    # or command-line in long style
    java -jar LRScaf-<version>.jar --contig <draft_assembly.fasta> --alignedFile <alignment.m4> -t <m4> --output <output_folder> [options]
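
When LRScaf is launched from a script, the short-style command above can be assembled programmatically. The following is a minimal Python sketch, not part of LRScaf: the helper names `build_lrscaf_command` and `run_lrscaf` and the file names in the usage lines are hypothetical, and only the flags shown in the Quick start (-c, -a, -t, -o) are used.

```python
import subprocess


def build_lrscaf_command(jar, contigs, alignment, aln_type, output_dir):
    """Assemble the short-style LRScaf command line as an argument list.

    Hypothetical helper: it only mirrors the flags listed in the
    Quick start and does not validate them against LRScaf itself.
    """
    return [
        "java", "-jar", jar,
        "-c", contigs,      # draft assembly in fasta format
        "-a", alignment,    # alignment file produced by the mapper
        "-t", aln_type,     # alignment type, e.g. "m4" as in the Quick start
        "-o", output_dir,   # output folder
    ]


def run_lrscaf(command):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


# Placeholder file names; replace them with real paths before calling run_lrscaf.
command = build_lrscaf_command(
    "LRScaf-<version>.jar", "draft_assembly.fasta", "alignment.m4", "m4", "scaffolds_out"
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces. The sketch only prints the command; call `run_lrscaf(command)` once the jar and input files exist.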

    ################################################################################
    Parameters of LRScaf
    ################################################################################

    LRScaf supports setting parameters through an XML configuration file or on the command line. Using the XML configuration file is recommended. A template configuration file in XML format, named "scafconf.xml", is included in the project. On the command line, LRScaf supports GNU-like options in long (double-dash) and short (single-dash) styles. The following table lists the meaning of each parameter and its default value, if available.

    The first and second columns are the command-line parameters in long style and the corresponding short style.

    The third column is the code in the XML configuration file. NA means the parameter is not available in the XML configuration file.

    The fourth column gives the details and the default value of the option, if available.

    | Parameter | Abbreviation | XML Code | Details |
    |---|---|---|---|
    | xml | x | NA | The XML configuration file. All command-line parameters are ignored if this is set. |
    | contig | c | contig | The contigs file of the draft assembly in fasta format. |
    | m5 | m5 | m5 | The alignment file in -m 5 format of BLASR. |
    | m4 | m4 | m4 | The alignment file in -m 4 format of BLASR. |
    | sam | sam | sam | The alignment file in sam format of BLASR. |
    | mm | mm | mm | The alignment file in PAF format of Minimap. |
    | output | o | output | The output folder. |
    | miniCntLen | micl | min_contig_length | The minimum contig length to be included for scaffolding. Default: <500> bp. |
    | identity | i | identity | The identity threshold for filtering invalid alignments. Default: <0.8>. This value must be modified according to the mapper. For a BLASR alignment file, a higher value means a higher identity. For a Minimap alignment file, the value should not be larger than 0.3 and could be set to 0.1. |
    | miniOLLen | mioll | min_overlap_length | The minimum overlap length of contig. Default: <400> bp. |
    | miniOLRatio | miolr | min_overlap_ratio | The minimum overlap length ratio of contig. Default: <0.8>. If the overlap length is larger than miniOLLen, the overlap ratio is computed as overlap_length/contig_length. |
    | maOHLen | maohl | max_overhang_length | The maximum overhang length of contig. Default: <500> bp. |
    | maOHRatio | maohr | max_overhang_ratio | The maximum overhang ratio of contig. Default: <0.1>. If the overhang length is less than maohl, the overhang ratio is computed as overhang_length/contig_length. |
    | maELen | mael | max_end_length | The maximum ending length of long read. Default: <500> bp. |
    | maERatio | maer | max_end_ratio | The maximum ending ratio of long read. Default: <0.1>. The ending length (ending_len) is computed as long_read_length * maer; then def_ending_len = (mael >= ending_len ? ending_len : mael). |
    | miSLN | misl | min_supported_links | The minimum number of supporting links. Default: <2>. If the depth of long reads is less than 10x, misl could be set to 1. |
    | ratio | r | ratio | The ratio for deleting error-prone edges in divergence nodes. Default: <0.2>. |
    | mr | mr | repeat_mask | The indicator for masking repeats. Default: <true>. Setting it to true is recommended. |
    | tiplength | tl | tip_length | The maximum tip length. Default: <1500> bp. |
    | iqrtime | iqrt | iqr_time | The IQR multiple for classifying contigs as repeats by their coverage. Default: <3>. |
    | mmcm | mmcm | mmcm | The parameter to filter invalid Minimap alignments. Default: <8>. Only for Minimap alignments. |
    | help | h | NA | Print this help information. |
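
The ending-length rule and the two ratio definitions in the table can be written out as a short worked example. This is an illustrative Python sketch based only on the formulas stated in the Details column; the function names are hypothetical, and how LRScaf combines these values when filtering alignments is not shown here.

```python
def effective_end_length(read_length, mael=500, maer=0.1):
    """Apply def_ending_len = (mael >= ending_len ? ending_len : mael).

    ending_len is long_read_length * maer, so the result is the smaller
    of the absolute limit (mael) and the ratio-based limit.
    """
    ending_len = read_length * maer
    return ending_len if mael >= ending_len else mael


def length_ratio(part_length, contig_length):
    """Return part_length / contig_length, as used for overlap and overhang ratios."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return part_length / contig_length


# A 3,000 bp read: the ratio-based limit (3000 * 0.1) is below mael, so it is used.
print(effective_end_length(3000))
# A 20,000 bp read: the ratio-based limit (2000) exceeds mael, so mael is used.
print(effective_end_length(20000))
# Overlap ratio of a 450 bp overlap on a 500 bp contig (overlap_length/contig_length).
print(length_ratio(450, 500))
```

With the defaults, short reads are limited by the ratio and long reads by the absolute length, which is the effect of taking the smaller of the two limits.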

    ################################################################################
    XML Configuration File Content
    ################################################################################
    <?xml version="1.0" encoding="UTF-8"?>
    <scaffold>
      <!--The input file for scaffolding, including contigs and aligned files (i.e. m5, m4 or mm file) -->
      <input>
        <contig>Draft assembly in fasta format.</contig>
        <m4>The aligned file in BLASR -m 4 format.</m4>
      </input>
      <!-- The output folder for scaffolding -->
      <output>The output folder.</output>
      <!-- The parameters for scaffolding-->
      <paras>
        <!--More details are shown in README.md-->
        <min_contig_length>500</min_contig_length>
        <identity>0.8</identity>
        <min_overlap_length>400</min_overlap_length>
        <min_overlap_ratio>0.8</min_overlap_ratio>
        <max_overhang_length>500</max_overhang_length>
        <max_overhang_ratio>0.1</max_overhang_ratio>
        <max_end_length>500</max_end_length>
        <max_end_ratio>0.1</max_end_ratio>
        <min_supported_links>2</min_supported_links>
        <tips_length>1500</tips_length>
        <ratio>0.2</ratio>
        <repeat_mask>true</repeat_mask>
        <iqr_time>3</iqr_time>
        <mmcm>8</mmcm> <!--only for Minimap Alignment.-->
      </paras>
    </scaffold>
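
Before starting a long run, it can help to check that a configuration file is well-formed and holds the intended values. The following Python sketch reads a shortened copy of the layout above with the standard library; it is not part of LRScaf, the function name `read_config` and the file names inside the sample are hypothetical, and it does not check whether LRScaf accepts the values.

```python
import xml.etree.ElementTree as ET

# Shortened sample following the layout of the configuration file above.
CONFIG_XML = """<?xml version="1.0" encoding="UTF-8"?>
<scaffold>
  <input>
    <contig>draft_assembly.fasta</contig>
    <m4>alignment.m4</m4>
  </input>
  <output>scaffolds_out</output>
  <paras>
    <!-- comments are skipped by the default parser -->
    <min_contig_length>500</min_contig_length>
    <identity>0.8</identity>
    <min_supported_links>2</min_supported_links>
    <repeat_mask>true</repeat_mask>
  </paras>
</scaffold>
"""


def read_config(xml_text):
    """Parse the XML text and return input files, output folder, and parameters.

    All values are returned as strings, exactly as written in the file.
    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))

    def children_as_dict(section):
        element = root.find(section)
        if element is None:
            return {}
        return {child.tag: (child.text or "").strip() for child in element}

    return {
        "input": children_as_dict("input"),
        "output": (root.findtext("output") or "").strip(),
        "paras": children_as_dict("paras"),
    }


config = read_config(CONFIG_XML)
for name, value in config["paras"].items():
    print(name, "=", value)
```

To check a real file, read its contents into a string and pass that to `read_config`; a malformed file fails at the parsing step rather than partway through scaffolding.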

    ################################################################################
    License
    ################################################################################

    LRScaf is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
    You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.


    If you have any questions, please feel free to contact me <[email protected]>.
