curly-octo-train's Issues
Improve logging
Unify logger and hydra logs
Remove all prints
Check that all artifacts are stored in the files locally
Adapt to new LJA interface
LJA switched to different IDs for nodes and edges; adopt the new interface.
Move cmake to mamba dependencies and update readme
Create sampling based read simulator
Add detailed tests for assembly
The assembly step is missing detailed tests, both when called from the command line and when called from the scheduler.
Add the keep function
For testing and evaluation, we need to be able to keep intermediate results:
- reads files,
- mappings between reads and edge IDs in the graphs,
- eval scripts.
We also want to keep some metadata around for better reproducibility and to prevent duplicates.
Fix LJA filesystem errors
LJA uses an experimental version of the filesystem library, which throws when a copy operation is performed on an SMB drive.
The reason is probably the attempt to copy the file permissions (based on forum discussions; we did not debug this in detail).
Replace copying of the old logs with a move operation.
For the second copy operation, simply ignore the error (it is caused by copying the assemblies file from pipeline step 03
into the top-level directory), because the file is still available in the subfolder.
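The two fixes described above could be sketched roughly as follows; the function names and paths are illustrative, not the actual project code:

```python
import shutil
from pathlib import Path


def relocate_logs(old_log: Path, archive_dir: Path) -> None:
    """Move old logs instead of copying them; a move does not attempt to
    replicate file permissions, which is what fails on SMB mounts."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(old_log), str(archive_dir / old_log.name))


def copy_assembly_ignore_errors(src: Path, dst: Path) -> None:
    """Best-effort copy of the step-03 assembly file to the top-level
    directory; on failure the file is still available in the subfolder."""
    try:
        shutil.copy2(src, dst)
    except OSError:
        pass  # permission-copy errors on SMB shares are non-fatal here
```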
Remove digraph support
Remove digraph related code
Add assembly as a step into scheduler
Also add missing tests for sequencing step
Add executor that will run pipeline steps
Eval should catch errors from subprocess and continue on fail
Download CHM13 uses hydra config
Use a hydra config to simplify downloading CHM13 with custom options.
Raw zip should be removed from dataset and used for the actual PT files
Previously, all the input data from which the PT graph file was constructed was saved in "raw/" as a ZIP file.
We want to remove it for a couple of reasons:
- Practice shows that we don't have any use for it.
- It only duplicates the data from the assembly step (maybe we can keep some metadata to know the origin, but the entire package is unnecessary).
- It does not fit well with how datasets in PyTorch and PyG work (here, we want to place the PT file in the raw directory).
Benchmark LJA hashing algo speed against sha-256
We want to check whether the standard library hash is faster than the rolling hash implementation (as we don't actually need a rolling hash).
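A minimal benchmark sketch for this comparison; the polynomial rolling hash below is a generic stand-in for illustration, not LJA's actual implementation:

```python
import hashlib
import timeit


def rolling_hash(data: bytes, base: int = 257, mod: int = (1 << 61) - 1) -> int:
    """Generic polynomial rolling hash (a stand-in, not LJA's implementation)."""
    h = 0
    for b in data:
        h = (h * base + b) % mod
    return h


def benchmark(kmer: bytes, repeats: int = 1000) -> dict:
    """Time both hashes on the same k-mer and return seconds per variant."""
    return {
        "sha256": timeit.timeit(lambda: hashlib.sha256(kmer).digest(), number=repeats),
        "rolling": timeit.timeit(lambda: rolling_hash(kmer), number=repeats),
    }
```

Running `benchmark` on representative k-mer lengths would give a first signal before touching the C++ side.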
Add tests to pre commit hooks
Or run them only when pushing to upstream.
Create a script for collecting eval results in a table.
Set up pre-commit and add instructions
Create at least two more profiles and move them to wandb artifacts
Streamline the creation of PbSim profiles and add uploading to W&B
Reads simulator should only accept single sample definition
We want to move the responsibility of deciding which genomes should be processed out of the reads simulator.
Having the single responsibility of producing reads makes the code simpler.
The execution part goes to the future scheduler/pipeline code, which will offload jobs to each respective pipeline step.
Document mapping of the CHM13 reads
Add documentation describing how the reads are mapped to the reference
Update LJA to latest commit
Contains fixes for ID stability and Jumbo outputs
Start dataset pipeline guide document
Add vendor installation script
It is easier to manage this with an external script, and it offers more transparency about what is being installed.
Use temp directory for all sample operations and move the result to the storage at the end
This avoids copying from the mounted storage and should hopefully prevent hiccups with shutil operations when using Samba storage.
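The temp-directory-then-move flow could be sketched as below; `build_sample` is a hypothetical callback standing in for the actual sample-creation code:

```python
import shutil
import tempfile
from pathlib import Path


def run_sample_in_tmp(build_sample, storage_dir: Path) -> Path:
    """Run all sample operations in a local temp directory, then move the
    finished result to the (possibly Samba-mounted) storage in one step.
    `build_sample` writes the sample into the directory it receives and
    returns the sample's name; both names here are illustrative."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp) / "sample"
        work_dir.mkdir()
        name = build_sample(work_dir)
        target = storage_dir / name
        # Single move at the end: only one operation touches the mount.
        shutil.move(str(work_dir), str(target))
        return target
```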
Allow to copy profile instead of creating symbolic link
On mounted storages, creating symlinks might not be allowed, so it would be good to easily switch between linking and copying the simulation profile:
https://superuser.com/questions/1337257/clients-cant-create-symlinks-on-samba-share
When copying, check whether the file already exists and compare MD5 hashes to prevent unnecessary work.
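A possible shape for this switch, with the MD5 short-circuit on the copy path (function names are illustrative):

```python
import hashlib
import shutil
from pathlib import Path


def md5sum(path: Path) -> str:
    """MD5 of a file, read in 1 MiB chunks to bound memory use."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def place_profile(src: Path, dst: Path, use_symlink: bool = False) -> None:
    """Link or copy the simulation profile. When copying, skip the work if
    an identical file (same MD5) is already present."""
    if use_symlink:
        if not dst.exists():
            dst.symlink_to(src)
        return
    if dst.exists() and md5sum(dst) == md5sum(src):
        return  # identical profile already in place, nothing to do
    shutil.copy2(src, dst)
```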
Implement filtering by chromosome in sequencing simulation
Similar to the filtering in the assembler and graph dataset creation, we should be able to create a sequencing experiment with a chromosome-matching filter.
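The filter itself could look roughly like this, mirroring the pattern-matching idea already used elsewhere in the pipeline (the interface is illustrative, not the project's actual API):

```python
import re
from typing import Dict


def filter_by_chromosome(sequences: Dict[str, str], pattern: str) -> Dict[str, str]:
    """Keep only reference sequences whose name matches the chromosome
    filter regex; e.g. r"chr[12]$" keeps chr1 and chr2."""
    rx = re.compile(pattern)
    return {name: seq for name, seq in sequences.items() if rx.search(name)}
```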
Verify that eval works with new LJA interface
Use threads from LJA config instead of configuring from schedule
This makes it easier to configure execution on the server.
Upgrade pre commit package versions
We want to upgrade the pre-commit packages, as the config file is a bit dated.
Add more tracking information to the dataset
Random number
Sampling setting
Introduce structured configs for independent scripts
Verify new LJA interface
Check that graph/dataset is properly created
Check that eval still works
Support hydra multirun threading
Use hardlink instead of symlink for simulator profile
It turns out that using hardlinks solves the problem for mounted storage much more simply than the intermediate copy.
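Assuming source and destination live on the same filesystem (which hardlinks require), the swap is a one-liner; the function name is illustrative:

```python
import os
from pathlib import Path


def link_profile(src: Path, dst: Path) -> None:
    """Hardlink the simulator profile. Unlike symlinks, hardlinks are
    generally permitted on Samba shares, and no data is copied."""
    if dst.exists():
        dst.unlink()
    os.link(src, dst)  # both paths must be on the same filesystem
```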
Create a script for random sampling from FASTQ file using the specified percentage of reads
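Since a FASTQ record is a fixed four-line unit, the sampling script could be as simple as a per-record coin flip; names, the percentage interface, and the seed default are all assumptions:

```python
import random
from pathlib import Path


def sample_fastq(src: Path, dst: Path, percentage: float, seed: int = 42) -> int:
    """Randomly keep `percentage` percent of reads from a FASTQ file.
    Each read is a four-line record; the keep/drop decision is made per
    record. Returns the number of reads kept. Seeding keeps runs reproducible."""
    rng = random.Random(seed)
    kept = 0
    with src.open() as fin, dst.open("w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < percentage / 100.0:
                fout.writelines(record)
                kept += 1
    return kept
```

This streams the file, so it handles large read sets without loading them into memory.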
Serialize steps and add cleanup step
Remove read and assembly folders after the sample is created to save the space.
For now convert sample by sample.
Restore evaluation functionality
Due to the refactoring, the evaluation of LJA is not working anymore
Add detailed tests for graph
Add variable coverage in dataset scenario
Update the pipeline config
Creating the dataset in discrete stages is computationally expensive. We need to update the global pipeline config to run it as one continuous step and to clean up after every dataset is created. Store appropriate metadata along each step to enable reconstruction of the data, evaluation, and cleanup.
Add download URL to species_info for downloaded datasets
Unify logging over entire project
Add pytest-xdist to speed up testing
Add graph as a step in the scheduler
Add H002 dataset
Add CHM13 as an option for reference.py
Instead of having a standalone script, a config option should determine whether we are using a simulated or real dataset.
Add profile as W&B artifact
Ability to use profile from W&B registry
Adopt newest LJA interface
Finish adopting latest interface from LJA
Avoid duplication by comparing the already created data with the scenario
Separate sample creation and collection into concurrent operations
This way we set the stage for concurrent uploads after each sample is created.