
Autoturbo-DNA: Turbo-Autoencoders for the DNA data storage channel

Autoturbo-DNA is a comprehensive autoencoder framework designed for the specific challenges of DNA data storage. It leverages the principles of Turbo Autoencoders (Jiang et al., 2019) [1] while integrating critical components for DNA data storage applications.

Key Features:

  • End-to-End Integration: Combines Turbo Autoencoder principles with DNA data storage channel simulation.
  • Modular Architecture: Supports a wide range of Neural Network architectures which can be easily integrated.
  • Configurable Components: Components can be customized using a configuration file.
  • Flexible Parameter Adjustments: User-centric design allows for easy adjustments of DNA data storage channel settings and constraint adherence parameters.

Installation:

Install Python 3.7 or later if not already installed (tested under Python 3.7 and 3.11). To use the GPU, a CUDA-capable system and the corresponding dependencies are required. Clone or download this repository and install the dependencies:

git clone https://github.com/MW55/autoturbo_dna.git

# install packages (tested under python 3.7 and 3.11)
pip install torch numpy scipy regex pytorch-ignite

Usage

Train:

./sDNA.py --wdir models/simple_train/ --train 

During training, a config will be generated in the model folder containing all the additional parameters used.

Encode:

./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA.txt -o test_data/MOSLA_encoded.fasta -e 
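The default index size is 16 bits and the default block length is 64 bits (see the configuration table below); the index must be a multiple of 8, and larger files need bigger indexes. A rough sketch of picking a sufficient index size, assuming one index per block-length-sized payload (hypothetical helper, not part of the tool — the real pipeline may pack data differently):

```python
import math

def min_index_size(file_bytes: int, block_length: int = 64) -> int:
    """Smallest index size (a multiple of 8 bits) that can number every
    block of the file. Hypothetical sketch: assumes one index per
    block-length payload of raw file bits."""
    n_blocks = math.ceil(file_bytes * 8 / block_length)
    bits = max(8, math.ceil(math.log2(max(n_blocks, 2))))
    return ((bits + 7) // 8) * 8  # round up to a multiple of 8
```

Under these assumptions, the 16-bit default covers files of up to 2^16 blocks (512 KiB at 64-bit blocks); a 1 MB file would already need a 24-bit index.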

Decode:

./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA_encoded.fasta -o test_data/MOSLA_decoded.txt -d
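After decoding, the roundtrip can be verified by comparing the decoded file against the original with standard tooling (assuming the trained model reconstructs the data losslessly):

```shell
cmp test_data/MOSLA.txt test_data/MOSLA_decoded.txt && echo "roundtrip OK"
```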

Configuration:

Several arguments influence the model and define the specific DNA data storage channel. Training can be canceled at any time and resumed later. A training call with the default values can be extended with any of the following arguments:
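For example, a training call overriding a few of the defaults might look like this (hypothetical flag values, drawn from the option table below):

```shell
./sDNA.py --wdir models/custom_train/ --train \
    --encoder rnn --enc-rnn LSTM \
    --block-length 128 --epochs 50 --gpu
```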

Option Strings Type Default Help
-h, --help None None Show this help message and exit.
-v, --version None None Show the program's version number and exit.
--wdir str None Path to the working directory, if not existing the model will be saved here, if already existing the model will be loaded.
--train bool False Create and train the desired model.
--bitenc None None Encode a bit string into a code using a model.
--bitdec None None Decode a code back into a bit string using a model.
--encode, -e None None Encode a file using a model.
--decode, -d None None Decode a code back into a file using a model.
--input, -i str None Path to the file to be en-/decoded.
--output, -o str None Path to the output file.
--index_size, -is int 16 Size (in bits) of the added index; larger files need bigger index sizes. Has to be a multiple of 8.
--simulate None None Simulate errors on a generated code.
--ids None False Show a list of the default ids of the different options for DNA synthesis, storage and sequencing simulation.
--seed int 0 Specify an integer seed; this allows results to be reproduced.
--gpu None False Whether the calculations of the models should run on the GPU (using CUDA).
--parallel None False Whether to run the calculations on multiple GPUs, if there are more than one.
--threads int 8 If using the CPU, how many threads should be used.
--rate str onethird Rate of the code; supported are 1/3 (argument=onethird) and 1/2 (argument=onehalf)
--block-length int 64 Length of the bitstreams to be used
--block-padding int 18 Length of the padding by which the bitstream is extended
--encoder str cnn Choose which encoder to use: RNN, SRNN, CNN, SCNN or RNNatt
--enc-units int 64 The number of expected features in the hidden layer for the encoder
--enc-actf str elu Choose which activation function should be applied to the encoder: tanh, elu, relu, selu, sigmoid or identity
--enc-dropout float 0.0 Dropout probability for the encoder
--enc-layers int 5 Number of recurrent layers per RNN/CNN structure in the encoder
--enc-kernel int 5 Size of the kernels for the CNN in the encoder
--enc-rnn str GRU Choose which structure to use for the RNN in the encoder: GRU or LSTM
--vae-beta float 0.0 The beta multiplier of the Kullback–Leibler divergence if using a VAE.
--decoder str cnn Choose which decoder to use: RNN or CNN
--dec-units int 64 The number of expected features in the hidden layer for the decoder
--dec-actf str identity Choose which activation function should be applied to the decoder: tanh, elu, relu, selu, sigmoid or identity
--dec-dropout float 0.0 Dropout probability for the decoder
--dec-layers int 5 Number of recurrent layers per RNN/CNN structure in the decoder
--dec-inputs int 5 The number of expected input features for the decoder
--dec-iterations int 6 Number of iterative loops to be made in the decoder
--dec-kernel int 5 Size of the kernels for the CNN in the decoder
--dec-rnn str GRU Choose which structure to use for the RNN in the decoder: GRU or LSTM
--not-extrinsic None True Whether extrinsic information should be applied to the decoder each iteration
--coder str cnn Choose which coder to use: MLP, CNN or RNN
--coder-units int 64 The number of expected features in the hidden layer for the coder
--coder-actf str elu Choose which activation function should be applied to the coder: tanh, elu, relu, selu, sigmoid or identity
--coder-dropout float 0.0 Dropout probability for the coder
--coder-layers int 5 Number of recurrent layers per RNN/CNN structure in the coder
--coder-kernel int 5 Size of the kernels for the CNN in the coder
--coder-rnn str GRU Choose which structure to use for the RNN in the coder: GRU or LSTM
--init-weights str None Choose which method to use to initialize the linear layers of the model: normal, uniform, constant, xavier_normal, xavier_uniform, kaiming_normal or kaiming_uniform
--lat-redundancy int 0 Redundancy of the final encoder layer (and first decoder layer), required to account for constraints. Has to be divisible by 2
--ens-models int 3 If ensemble coders are used, defines the number of coder instances in the ensemble.
--padding-style str constant If padding should be constant values or a circular copy of the input.
--blocks int 1024 Number of the bitstreams to be used
--batch-size int 256 Size of the batch to be used during training
--epochs int 100 Number of epochs the whole model should be trained
--enc-lr float 0.00001 Value of the learning rate to be used for the encoder
--enc-optimizer str adam Choose which optimizer to use for the encoder: Adam, SGD or Adagrad
--enc-steps int 1 Number of training steps to be performed per epoch for the encoder
--dec-lr float 0.00001 Value of the learning rate to be used for the decoder
--dec-optimizer str adam Choose which optimizer to use for the decoder: Adam, SGD or Adagrad
--dec-steps int 2 Number of training steps to be performed per epoch for the decoder
--coder-lr float 0.001 Value of the learning rate to be used for the coder
--coder-optimizer str adam Choose which optimizer to use for the coder: Adam, SGD or Adagrad
--coder-steps int 5 Number of training steps to be performed per epoch for the coder
--simultaneously None False Whether the encoder and decoder are to be trained at the same time, if so, the learning parameters from the encoder are used
--batch-norm bool False Whether to use batch normalization or not.
--separate-coder-training None False If the coder should be split into 3 separate instances during training.
--all-errors None False Train each part of the model with all error types.
--channel str dna Which channel model should be used for training.
--continuous-coder None False Toggles whether the intermediate decoder (coder) passes continuous values to the decoder.
--constraint-training None False If the code should also be trained to adhere to constraints.
--loss-beta float 1.0 Beta parameter for the smooth L1 loss.
--coder-train-target str encoded_data How the coder should be trained: for best reconstruction accuracy or to match the encoder output as closely as possible.
--simultaneously-warmup int 0 If using simultaneous training, how many warmup epochs should be trained separately before switching to simultaneous training.
--synthesis None (1, None) Specify the id of the synthesis method
--pcr-cycles int 30 Number of cycles to be used for the PCR
--pcr None (14, None) Specify the id of the PCR type
--storage-months int 24 Months of storage to be simulated
--storage None (1, None) Specify the id of the storage host
--sequencing None (2, None) Specify the id of the sequencing method
--amplifier float 5.0 Factor by which the simulated errors should be made more pronounced.
--probabilities str probabilities.json Path to json file for error probabilities
--useq str undesired_sequences.json Path to json file for undesired sequences
--gc-window int 50 Size of the window to be used for the GC-Content error probability detection
--kmer-window int 10 Size of the window to be used for the Kmer error probability detection

Docker:

Autoturbo-DNA can also be run inside a Docker container. A Dockerfile for running Autoturbo-DNA on the CPU is provided in the repository. To run Autoturbo-DNA inside a container while utilizing the host's CUDA GPU, install the nvidia-container-toolkit, configure the Docker daemon to recognize the toolkit, use a CUDA base image, and run the container with the --gpus flag. An extensive tutorial on how to allow Docker to utilize the host GPU can be found at https://saturncloud.io/blog/how-to-install-pytorch-on-the-gpu-with-docker/.

Build container (while being in the root of Autoturbo-DNA):

docker build -t autoturbo_dna . -f Dockerfile 

Train a model:

docker run -v </full/path/to/the/autoturbo-dna/project/>models:/autoturbo_dna/models:z -it autoturbo_dna ./sDNA.py --wdir models/simple_train/ --train 
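Assuming the image has been rebuilt from a CUDA base image and the nvidia-container-toolkit is set up as described above, the same training run might use the host GPU like this (hypothetical sketch):

```shell
docker run --gpus all \
    -v </full/path/to/the/autoturbo-dna/project/>models:/autoturbo_dna/models:z \
    -it autoturbo_dna ./sDNA.py --wdir models/simple_train/ --train --gpu
```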

The training parameters can be adjusted by utilizing the parameters described above.

To encode data using a trained model:

docker run -v </full/path/to/the/autoturbo-dna/project/>models:/autoturbo_dna/models:z -v </full/path/to/the/autoturbo-dna/project/>test_data:/autoturbo_dna/test_data:z -it autoturbo_dna ./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA.txt -o test_data/MOSLA_encoded.fasta -e

To decode the data:

docker run -v </full/path/to/the/autoturbo-dna/project/>models:/autoturbo_dna/models:z -v </full/path/to/the/autoturbo-dna/project/>test_data:/autoturbo_dna/test_data:z -it autoturbo_dna ./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA_encoded.fasta -o test_data/MOSLA_decoded.txt -d

[1]: Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, “Turbo autoencoder: Deep learning based channel codes for point-to-point communication channels,” in Advances in Neural Information Processing Systems, pp. 2754–2764, 2019.
