m5imunovic / curly-octo-train

0.0 0.0 0.0 9.47 MB

Python 96.09% Shell 3.91%

curly-octo-train's Issues

Start dataset pipeline guide document

Introduce structured configs for independent scripts

Set up pre-commit and add instructions

Support hydra multirun threading

Allow to copy profile instead of creating symbolic link

On mounted storage, creating symlinks might not be allowed, so it would be good to be able to switch easily between linking and copying the simulation profile:
https://superuser.com/questions/1337257/clients-cant-create-symlinks-on-samba-share

When copying, check whether the destination file already exists and compare MD5 hashes to avoid unnecessary work.
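
A minimal sketch of the copy-or-link switch described above, assuming the simulation profile is a single file. The names `install_profile` and `md5sum` and the `copy` flag are hypothetical and not taken from the repository.

```python
import hashlib
import shutil
from pathlib import Path


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_profile(src: Path, dst: Path, copy: bool = False) -> bool:
    """Link or copy `src` to `dst` (hypothetical helper).

    Returns True if the file system was modified, False if the copy was skipped
    because an identical file was already in place.
    """
    if not copy:
        # Linking mode: replace whatever is at dst with a fresh symlink.
        if dst.is_symlink() or dst.exists():
            dst.unlink()
        dst.symlink_to(src.resolve())
        return True

    # Copy mode: skip the work when a regular file with the same hash exists.
    if dst.is_file() and not dst.is_symlink() and md5sum(dst) == md5sum(src):
        return False
    if dst.is_symlink():
        # A leftover symlink may point at src itself, which would make the copy fail.
        dst.unlink()
    shutil.copy2(src, dst)
    return True
```

Comparing hashes still reads both files in full, so it saves the write to the mounted storage rather than the read.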

Move cmake to mamba dependencies and update readme

Create a script for random sampling from FASTQ file using the specified percentage of reads
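
One possible shape for such a script, sketched under the assumption that the FASTQ file uses unwrapped four-line records. The function names and the `seed` parameter are hypothetical. Each read is kept independently with probability `percentage / 100`, so the number of sampled reads is approximately, not exactly, the requested percentage.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def read_fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into 4-line FASTQ records (assumes no line wrapping)."""
    iterator = iter(lines)
    while True:
        record = list(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield record


def sample_fastq(
    lines: Iterable[str], percentage: float, seed: Optional[int] = None
) -> Iterator[List[str]]:
    """Yield each record with probability percentage / 100 (Bernoulli sampling)."""
    if not 0.0 <= percentage <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    threshold = percentage / 100.0
    for record in read_fastq_records(lines):
        if rng.random() < threshold:
            yield record
```

Because the input is streamed record by record, an open file handle can be passed as `lines` and the sampled records written out with `writelines` without loading the whole file into memory.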

Upgrade pre commit package versions

We want to upgrade the pre-commit packages, as the file is somewhat out of date.

Implement filtering by chromosome in sequencing simulation

Similar to the filtering in the assembler and graph dataset creation, we should be able to create a sequencing experiment using only the chromosomes that match a filter.

Use hardlink instead of symlink for simulator profile

It turns out that using hardlinks solves the problem for mounted storage much more simply than the intermediate copy.
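
A minimal sketch of the hardlink approach, using the standard `os.link` call. The function name `link_profile` is hypothetical, and the fallback to a plain copy is an assumption added here, since hard links cannot span file systems.

```python
import os
import shutil
from pathlib import Path


def link_profile(src: Path, dst: Path) -> str:
    """Hard-link `src` to `dst`; fall back to a copy if linking fails.

    Returns "hardlink" or "copy" so the caller can log which path was taken.
    """
    # Remove a stale file or symlink so os.link does not fail on an existing target.
    if dst.is_symlink() or dst.exists():
        dst.unlink()
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError:
        # Assumed fallback: e.g. src and dst live on different file systems.
        shutil.copy2(src, dst)
        return "copy"
```

A hard link needs no extra disk space and behaves like a regular file for the simulator, which is why no intermediate copy is required when both paths are on the same mounted storage.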

Creating the dataset in discrete stages is computationally expensive. We need to update the global pipeline config so that the dataset can be created in one continuous step, with cleanup after every dataset is created. Appropriate metadata should be stored along the way to enable reconstruction of the data, evaluation, and cleanup.

Raw ZIP should be removed from the dataset and the raw directory used for the actual PT files

Previously, all the input data from which the PT graph file was constructed was saved in the "raw/" directory as a ZIP file.
We want to remove it for a couple of reasons:

Practice shows that we have no use for it.
It only duplicates the data from the assembly step (maybe we can keep some metadata to record the origin, but the entire package is unnecessary).
It does not fit well with how datasets in PyTorch and PyG work (there, we want to place the PT files in the raw directory).

Download CHM13 uses hydra config

Use a Hydra config to simplify downloading CHM13 with custom options.
