dna-align-dataset

This repository contains notes on how to generate a DNA string alignment dataset from real sequencing datasets in NCBI BioProject, using Ubuntu.

Getting started

First, download the NCBI SRA Toolkit, which is needed to fetch datasets from NCBI BioProject. Here we use version 3.0.0; for a newer version, check the sra-tools repository.

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-ubuntu64.tar.gz
tar -xvf sratoolkit.3.0.0-ubuntu64.tar.gz
cd sratoolkit.3.0.0-ubuntu64/bin
echo "export PATH=\${PATH}:$(pwd)" >> ~/.bashrc
source ~/.bashrc

You may want to run vdb-config --interactive once to configure the toolkit before testing the installation with prefetch.
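
A quick way to confirm that the PATH change took effect is to check that the toolkit commands can be found. The following is a minimal sketch; the helper name missing_tools is hypothetical and not part of the SRA Toolkit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given commands are not on PATH.
# Prints one "missing: NAME" line per command that cannot be found and
# returns 1 if at least one is missing, 0 otherwise.
missing_tools() {
    local rc=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

Running missing_tools prefetch fastq-dump vdb-config in a new shell prints nothing when all three commands are found.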

Download SRA datasets

After opening a project page (we use PRJNA178613 as an example), find the Project Data table and click the number in the Number of Links column to get a list of links to runs. Click one of the links to see an accession ID starting with SRR. Copy that ID (e.g. SRR611076) and run

prefetch SRR611076

Downloading this dataset takes around 1.5 hours. When it finishes, the .sra file is in ./SRR611076. Convert it to FASTQ with

cd SRR611076/
fastq-dump --split-files SRR611076.sra

We use --split-files because this dataset has a PAIRED layout. After some time, two FASTQ files are generated, one per mate (SRR611076_1.fastq and SRR611076_2.fastq).
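
Since the two FASTQ files hold the two mates of each pair, they should contain the same number of reads. The sketch below checks this, assuming every FASTQ record spans exactly four lines; the function names count_reads and check_pairs are hypothetical.

```shell
#!/usr/bin/env bash
# Count reads in a FASTQ file, assuming each record is exactly 4 lines
# (header, sequence, '+' separator, qualities).
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Succeed only if both mate files contain the same number of reads.
check_pairs() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}
```

For the files above, check_pairs SRR611076_1.fastq SRR611076_2.fastq && echo "pairs match" gives a quick sanity check before mapping.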

Create draft alignment

We use BWA as the sequence mapper. First download a reference genome of the species (sequence.fasta here), then index it and map the paired reads against it as follows.

bwa index -p test sequence.fasta
bwa mem -M -t 1 test SRR611076_1.fastq SRR611076_2.fastq > SRR611076.sam

This creates a .sam file, which records the candidate mapping positions of the reads on the reference. We can then use it to generate a string alignment dataset with the script generate_dataset.sh, as follows:

chmod +x generate_dataset.sh
./generate_dataset.sh [sam_file] [fasta_file] > [output_directory]
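
As a sanity check on the .sam file produced by bwa mem, it can help to see how many alignment records are mapped. The sketch below is not part of generate_dataset.sh; the function name sam_mapping_counts is hypothetical. It relies only on the SAM format: header lines start with @, the FLAG is the second tab-separated column, and bit 0x4 of the FLAG marks an unmapped record.

```shell
#!/usr/bin/env bash
# Print "MAPPED UNMAPPED" record counts for a SAM file.
# Header lines (starting with '@') are skipped. FLAG is column 2;
# bit 0x4 set means the record is unmapped. The bit is tested with
# integer arithmetic so the script also works with awk variants that
# lack bitwise functions.
sam_mapping_counts() {
    awk -F'\t' '
        /^@/ { next }
        {
            if (int($2 / 4) % 2 == 1) unmapped++
            else mapped++
        }
        END { printf "%d %d\n", mapped, unmapped }
    ' "$1"
}
```

Calling sam_mapping_counts SRR611076.sam prints the two counts on one line. These are counts of records, not of reads: a read with secondary or supplementary alignments appears in more than one record.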

Contributors

gzhoffie