Autoturbo-DNA: Turbo-Autoencoders for the DNA data storage channel

Autoturbo-DNA is an autoencoder framework designed for the specific challenges of DNA data storage. It builds on the principles of Turbo Autoencoders (Jiang et al., 2019 [1]) and integrates components that are critical for DNA data storage applications.

Key Features:

End-to-End Integration: Combines Turbo Autoencoder principles with DNA data storage channel simulation.
Modular Architecture: Supports a wide range of Neural Network architectures which can be easily integrated.
Configurable Components: Components can be customized using a configuration file.
Flexible Parameter Adjustments: User-centric design allows for easy adjustments of DNA data storage channel settings and constraint adherence parameters.

Installation:

Install Python 3.7.x if it is not already installed. To use the GPU, a CUDA-capable system and the corresponding dependencies are required. Clone or download this repository, then install the dependencies:

git clone https://github.com/MW55/autoturbo_dna.git

# install packages (tested under python 3.7 and 3.11)
pip install torch numpy scipy regex pytorch-ignite

Usage:

Train:

./sDNA.py --wdir models/simple_train/ --train

During training, a config will be generated in the model folder containing all the additional parameters used.
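
The training call can be extended with any of the arguments listed under Configuration. As a minimal sketch, such a call can be assembled in Python. Note that build_train_command is a hypothetical helper that is not part of the repository, and the sketch assumes that options with a default of False (such as --gpu) are plain switches that take no value.

```python
import subprocess
from typing import Dict, List, Union

OptionValue = Union[str, int, float, bool]


def build_train_command(wdir: str, options: Dict[str, OptionValue]) -> List[str]:
    """Assemble an sDNA.py training call (hypothetical helper, not part of the repository)."""
    command = ["./sDNA.py", "--wdir", wdir, "--train"]
    for flag, value in options.items():
        if value is True:
            # Assumption: boolean options are switches without a value.
            command.append(flag)
        elif value is False:
            # A disabled switch is simply left out of the call.
            continue
        else:
            command.extend([flag, str(value)])
    return command


command = build_train_command(
    "models/simple_train/",
    {"--rate": "onehalf", "--epochs": 50, "--gpu": True},
)
print(" ".join(command))

# To actually start the training from the repository root, uncomment:
# subprocess.run(command, check=True)
```

Passing the option strings exactly as they appear in the table avoids guessing at spellings, since most options use hyphens while --index_size uses an underscore.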

Encode:

./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA.txt -o test_data/MOSLA_encoded.fasta -e
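
Larger files need a bigger --index_size (see Configuration), and the value has to be a multiple of 8. The following rough sketch estimates a suitable value. It assumes that every encoded block carries one index value, so that at most 2^index_size blocks can be told apart, and it takes the number of payload bits per block as an input, because how the index and the payload share a block is not described here. The function name and the value of 48 payload bits are purely illustrative.

```python
import math


def min_index_size(file_size_bytes: int, payload_bits_per_block: int) -> int:
    """Smallest multiple of 8 (in bits) that can number every block of a file.

    Assumption: one index value per block; not taken from the repository.
    """
    if payload_bits_per_block <= 0:
        raise ValueError("payload_bits_per_block must be positive")
    n_blocks = max(1, math.ceil(file_size_bytes * 8 / payload_bits_per_block))
    index_bits = 8
    # Grow the index one byte at a time until all blocks are addressable.
    while 2 ** index_bits < n_blocks:
        index_bits += 8
    return index_bits


# Illustrative payload of 48 bits per block (an assumed value).
for size in (100, 10_000, 1_000_000):
    print(size, "bytes ->", min_index_size(size, 48), "index bits")
```

The result can then be passed to the encode and decode calls with -is.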

Decode:

./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA_encoded.fasta -o test_data/MOSLA_decoded.txt -d

Configuration:

Several arguments influence the model and define the specific DNA data storage channel. Training can be canceled at any time and resumed later. The training call shown under Usage relies on the default values and can be extended with the following arguments:

Option Strings	Type	Default	Help
-h, --help	None	==SUPPRESS==	show this help message and exit
-v, --version	None	==SUPPRESS==	show program's version number and exit
--wdir	str	None	Path to the working directory. If it does not exist, the model will be saved there; if it already exists, the model will be loaded from it.
--train	bool	False	Create and train the desired model.
--bitenc	None	None	Encode a bit string into a code using a model.
--bitdec	None	None	Decode a code back into a bit string using a model.
--encode, -e	None	None	Encode a file using a model.
--decode, -d	None	None	Decode a code back into a file using a model.
--input, -i	str	None	Path to the file to be en-/decoded.
--output, -o	str	None	Path to the output file.
--index_size, -is	int	16	Size (in bits) of the added index. Larger files need bigger index sizes; the value has to be a multiple of 8.
--simulate	None	None	Simulate errors on a generated code.
--ids	None	False	Shows a list of the default ids of the different options for DNA synthesis, storage and sequencing simulation.
--seed	int	0	Specify an integer seed to make the results reproducible.
--gpu	None	False	Whether the calculations of the models should run on the GPU (using CUDA).
--parallel	None	False	Whether to run the calculations on multiple GPUs, if there are more than one.
--threads	int	8	If using the CPU, how many threads should be used.
--rate	str	onethird	Rate of the code, supported are 1/3 (argument=onethird) and 1/2 (argument=onehalf)
--block-length	int	64	Length of the bitstreams to be used
--block-padding	int	18	Length of the padding by which the bitstream is extended
--encoder	str	cnn	Choose which encoder to use: RNN, SRNN, CNN, SCNN or RNNatt
--enc-units	int	64	The number of expected features in the hidden layer for the encoder
--enc-actf	str	elu	Choose which activation function should be applied to the encoder: tanh, elu, relu, selu, sigmoid or identity
--enc-dropout	float	0.0	Dropout probability for the encoder
--enc-layers	int	5	Number of recurrent layers per RNN/CNN structure in the encoder
--enc-kernel	int	5	Size of the kernels for the CNN in the encoder
--enc-rnn	str	GRU	Choose which structure to use for the RNN in the encoder: GRU or LSTM
--vae-beta	float	0.0	The beta multiplier of the Kullback–Leibler divergence if using a VAE.
--decoder	str	cnn	Choose which decoder to use: RNN or CNN
--dec-units	int	64	The number of expected features in the hidden layer for the decoder
--dec-actf	str	identity	Choose which activation function should be applied to the decoder: tanh, elu, relu, selu, sigmoid or identity
--dec-dropout	float	0.0	Dropout probability for the decoder
--dec-layers	int	5	Number of recurrent layers per RNN/CNN structure in the decoder
--dec-inputs	int	5	The number of expected input features for the decoder
--dec-iterations	int	6	Number of iterative loops to be made in the decoder
--dec-kernel	int	5	Size of the kernels for the CNN in the decoder
--dec-rnn	str	GRU	Choose which structure to use for the RNN in the decoder: GRU or LSTM
--not-extrinsic	None	True	Whether extrinsic information should be applied to the decoder each iteration
--coder	str	cnn	Choose which coder to use: MLP, CNN or RNN
--coder-units	int	64	The number of expected features in the hidden layer for the coder
--coder-actf	str	elu	Choose which activation function should be applied to the coder: tanh, elu, relu, selu, sigmoid or identity
--coder-dropout	float	0.0	Dropout probability for the coder
--coder-layers	int	5	Number of recurrent layers per RNN/CNN structure in the coder
--coder-kernel	int	5	Size of the kernels for the CNN in the coder
--coder-rnn	str	GRU	Choose which structure to use for the RNN in the coder: GRU or LSTM
--init-weights	str	None	Choose which method to use to initialize the linear layers of the model: normal, uniform, constant, xavier_normal, xavier_uniform, kaiming_normal or kaiming_uniform
--lat-redundancy	int	0	Redundancy of the final encoder layer (and first decoder layer), required to account for constraints. Has to be divisible by 2
--ens-models	int	3	If ensemble coders are used, defines the number of coder instances in the ensemble.
--padding-style	str	constant	If padding should be constant values or a circular copy of the input.
--blocks	int	1024	Number of the bitstreams to be used
--batch-size	int	256	Size of the batch to be used during training
--epochs	int	100	Number of epochs the whole model should be trained
--enc-lr	float	0.00001	Value of the learning rate to be used for the encoder
--enc-optimizer	str	adam	Choose which optimizer to use for the encoder: Adam, SGD or Adagrad
--enc-steps	int	1	Number of training steps to be performed per epoch for the encoder
--dec-lr	float	0.00001	Value of the learning rate to be used for the decoder
--dec-optimizer	str	adam	Choose which optimizer to use for the decoder: Adam, SGD or Adagrad
--dec-steps	int	2	Number of training steps to be performed per epoch for the decoder
--coder-lr	float	0.001	Value of the learning rate to be used for the coder
--coder-optimizer	str	adam	Choose which optimizer to use for the coder: Adam, SGD or Adagrad
--coder-steps	int	5	Number of training steps to be performed per epoch for the coder
--simultaneously	None	False	Whether the encoder and decoder are to be trained at the same time, if so, the learning parameters from the encoder are used
--batch-norm	bool	False	Whether to use batch normalization or not.
--separate-coder-training	None	False	Whether the coder should be split into 3 separate instances during training.
--all-errors	None	False	Always train each part of the model with all error types.
--channel	str	dna	Which channel model should be used for training
--continuous-coder	None	False	If set, the intermediate decoder (coder) passes continuous values to the decoder.
--constraint-training	None	False	If the code should also be trained to adhere to constraints.
--loss-beta	float	1.0	Beta parameter for the smooth L1 loss.
--coder-train-target	str	encoded_data	How the coder should be trained: for best reconstruction accuracy or to be as close to the encoder output as possible
--simultaneously-warmup	int	0	If using simultaneous training, how many warmup epochs should be trained separately before switching to simultaneous training.
--synthesis	None	(1, None)	Specify the id of the synthesis method
--pcr-cycles	int	30	Number of cycles to be used for the PCR
--pcr	None	(14, None)	Specify the id of the PCR type
--storage-months	int	24	Months of storage to be simulated
--storage	None	(1, None)	Specify the id of the storage host
--sequencing	None	(2, None)	Specify the id of the sequencing method
--amplifier	float	5.0	Value by how much more distinct the errors should be
--probabilities	str	probabilities.json	Path to json file for error probabilities
--useq	str	undesired_sequences.json	Path to json file for undesired sequences
--gc-window	int	50	Size of the window to be used for the GC-Content error probability detection
--kmer-window	int	10	Size of the window to be used for the Kmer error probability detection
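
The --gc-window option sets the window size used when checking the GC content of a sequence. As an illustration of the general idea only, the sketch below computes the GC fraction of every sliding window over a DNA sequence. It is a generic sliding-window calculation under assumed names (gc_content_windows is hypothetical), not the implementation used by Autoturbo-DNA, whose exact error probability model is defined by the probabilities file.

```python
from typing import List


def gc_content_windows(sequence: str, window: int = 50) -> List[float]:
    """Return the GC fraction of each sliding window of the given size.

    Generic illustration; not taken from the Autoturbo-DNA code base.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seq = sequence.upper()
    if len(seq) < window:
        return []  # sequence is shorter than a single window
    # Count G and C bases in the first window.
    gc_count = sum(1 for base in seq[:window] if base in "GC")
    fractions = [gc_count / window]
    # Slide the window by one base, updating the count incrementally.
    for i in range(window, len(seq)):
        gc_count += (seq[i] in "GC") - (seq[i - window] in "GC")
        fractions.append(gc_count / window)
    return fractions


print(gc_content_windows("GGCCAATT", window=4))
# prints [1.0, 0.75, 0.5, 0.25, 0.0]
```

Windows whose GC fraction deviates strongly from 0.5 are the ones a GC-content constraint would typically penalize. A larger window smooths out local fluctuations, while a smaller one reacts to short GC-rich or GC-poor stretches.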

Docker:

Autoturbo-DNA can also be run inside a Docker container. A Dockerfile for running Autoturbo-DNA on the CPU is provided in the repository. To run Autoturbo-DNA inside a container while using the host CUDA GPU, install the nvidia-container-toolkit, configure the Docker daemon to recognize the toolkit, use a CUDA base image, and run the container with Docker's --gpus flag. An extensive tutorial on how to allow Docker to use the host GPU can be found at https://saturncloud.io/blog/how-to-install-pytorch-on-the-gpu-with-docker/.

Build container (while being in the root of Autoturbo-DNA):

docker build -t autoturbo_dna . -f Dockerfile

Train a model:

docker run -v </full/path/to/the/autoturbo-dna/project/>models:/autoturbo_dna/models:z -it autoturbo_dna ./sDNA.py --wdir models/simple_train/ --train

The training parameters can be adjusted by utilizing the parameters described above.

To encode data using a trained model:

docker run -v </full/path/to/the/autoturbo-dna/project/>models:/autoturbo_dna/models:z -v </full/path/to/the/autoturbo-dna/project/>test_data:/autoturbo_dna/test_data:z -it autoturbo_dna ./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA.txt -o test_data/MOSLA_encoded.fasta -e

To decode the data:

docker run -v </full/path/to/the/autoturbo-dna/project/>models:/autoturbo_dna/models:z -v </full/path/to/the/autoturbo-dna/project/>test_data:/autoturbo_dna/test_data:z -it autoturbo_dna ./sDNA.py --wdir models/simple_train/ -i test_data/MOSLA_encoded.fasta -o test_data/MOSLA_decoded.txt -d

[1]: Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, “Turbo autoencoder: Deep learning based channel codes for point-to-point communication channels,” in Advances in Neural Information Processing Systems, pp. 2754–2764, 2019.
