curly-octo-train's Issues
Improve logging
Unify logger and hydra logs
Remove all prints
Check that all artifacts are stored in the files locally
Adapt to new LJA interface
LJA switched to different IDs for nodes and edges; adopt the new interface.
Move cmake to mamba dependencies and update readme
Create sampling based read simulator
Add detailed tests for assembly
The assembly step is missing detailed tests, both when called from the command line and when called from the scheduler.
Add the keep function
For testing and evaluation, we need to be able to keep intermediate results:
- reads files,
- mappings between reads and edge IDs in the graphs,
- eval scripts.
We also want to keep some metadata around for better reproducibility and to prevent duplicates.
Fix LJA filesystem errors
LJA uses an experimental version of the filesystem library, which throws when a copy operation is performed on an SMB drive.
The reason is probably the attempt to copy the file permissions (based on forum discussions; we did not debug this in detail).
Replace copying of the old logs with a move operation.
For the second copy operation, simply ignore the error (it is caused by copying the assemblies file from pipeline step 03
into the top-level directory), because the file is still available in the subfolder.
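The two fixes described above could be sketched roughly as follows; the function names and paths are illustrative, not the actual project code:

```python
import shutil
from pathlib import Path


def relocate_logs(old_log: Path, archive_dir: Path) -> None:
    """Move old logs instead of copying them; a move does not attempt to
    replicate file permissions, which is what fails on SMB mounts."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(old_log), str(archive_dir / old_log.name))


def copy_assembly_ignore_errors(src: Path, dst: Path) -> None:
    """Best-effort copy of the step-03 assembly file to the top-level
    directory; on failure the file is still available in the subfolder."""
    try:
        shutil.copy2(src, dst)
    except OSError:
        pass  # permission-copy errors on SMB shares are non-fatal here
```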
Remove digraph support
Remove digraph related code
Add assembly as a step into scheduler
Also add missing tests for sequencing step
Add executor that will run pipeline steps
Eval should catch errors from subprocess and continue on fail
Download CHM13 uses hydra config
Use a hydra config to simplify downloading CHM13 with custom options.
Raw zip should be removed from dataset and used for the actual PT files
Previously, all the input data from which the PT graph file was constructed was saved in "raw/" as a ZIP file.
We want to remove it for a couple of reasons:
- Practice shows that we don't have any use for it.
- It only duplicates the data from the assembly step (maybe we can keep some metadata to know the origin, but the entire package is unnecessary).
- It does not fit well with how datasets in PyTorch and PyG work (here, we want to place the PT file in the raw directory).
Benchmark LJA hashing algo speed against sha-256
We want to check whether the standard library hash is faster than the rolling hash implementation (as we don't actually need a rolling hash).
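A minimal benchmark sketch for this comparison; the polynomial rolling hash below is a generic stand-in for illustration, not LJA's actual implementation:

```python
import hashlib
import timeit


def rolling_hash(data: bytes, base: int = 257, mod: int = (1 << 61) - 1) -> int:
    """Generic polynomial rolling hash (a stand-in, not LJA's implementation)."""
    h = 0
    for b in data:
        h = (h * base + b) % mod
    return h


def benchmark(kmer: bytes, repeats: int = 1000) -> dict:
    """Time both hashes on the same k-mer and return seconds per variant."""
    return {
        "sha256": timeit.timeit(lambda: hashlib.sha256(kmer).digest(), number=repeats),
        "rolling": timeit.timeit(lambda: rolling_hash(kmer), number=repeats),
    }
```

Running `benchmark` on representative k-mer lengths would give a first signal before touching the C++ side.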
Add tests to pre commit hooks
Or run them only when pushing to upstream.
Create a script for collecting eval results in a table.
Set up pre-commit and add instructions
Create at least two more profiles and move them to wandb artifacts
Streamline the creation of PbSim profiles and add uploading to W&B
Reads simulator should only accept single sample definition
We want to move the responsibility of deciding which genomes should be processed out of the reads simulator.
Having the single responsibility of producing reads makes the code simpler.
The execution part goes to the future scheduler/pipeline code, which will offload jobs to each respective pipeline step.
Document mapping of the CHM13 reads
Add documentation describing how the reads are mapped to the reference
Update LJA to latest commit
Contains fixes for ID stability and Jumbo outputs
Start dataset pipeline guide document
Add vendor installation script
It is easier to manage this with an external script, and it offers more transparency about what is being installed.
Use temp directory for all sample operations and move the result to the storage at the end
This avoids copying from the mounted storage and should hopefully prevent hiccups with shutil operations when using Samba storage.
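The temp-directory-then-move flow could be sketched as below; `build_sample` is a hypothetical callback standing in for the actual sample-creation code:

```python
import shutil
import tempfile
from pathlib import Path


def run_sample_in_tmp(build_sample, storage_dir: Path) -> Path:
    """Run all sample operations in a local temp directory, then move the
    finished result to the (possibly Samba-mounted) storage in one step.
    `build_sample` writes the sample into the directory it receives and
    returns the sample's name; both names here are illustrative."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp) / "sample"
        work_dir.mkdir()
        name = build_sample(work_dir)
        target = storage_dir / name
        # Single move at the end: only one operation touches the mount.
        shutil.move(str(work_dir), str(target))
        return target
```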
Allow to copy profile instead of creating symbolic link
On mounted storages, creating symlinks might not be allowed, so it would be good to easily switch between linking and copying the simulation profile:
https://superuser.com/questions/1337257/clients-cant-create-symlinks-on-samba-share
When copying, check whether the file already exists and compare MD5 hashes to prevent unnecessary work.
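A possible shape for this switch, with the MD5 short-circuit on the copy path (function names are illustrative):

```python
import hashlib
import shutil
from pathlib import Path


def md5sum(path: Path) -> str:
    """MD5 of a file, read in 1 MiB chunks to bound memory use."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def place_profile(src: Path, dst: Path, use_symlink: bool = False) -> None:
    """Link or copy the simulation profile. When copying, skip the work if
    an identical file (same MD5) is already present."""
    if use_symlink:
        if not dst.exists():
            dst.symlink_to(src)
        return
    if dst.exists() and md5sum(dst) == md5sum(src):
        return  # identical profile already in place, nothing to do
    shutil.copy2(src, dst)
```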
Implement filtering by chromosome in sequencing simulation
Similar to the filtering in the assembler and graph dataset creation, we should be able to create a sequencing experiment with a chromosome-matching filter.
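The filter itself could look roughly like this, mirroring the pattern-matching idea already used elsewhere in the pipeline (the interface is illustrative, not the project's actual API):

```python
import re
from typing import Dict


def filter_by_chromosome(sequences: Dict[str, str], pattern: str) -> Dict[str, str]:
    """Keep only reference sequences whose name matches the chromosome
    filter regex; e.g. r"chr[12]$" keeps chr1 and chr2."""
    rx = re.compile(pattern)
    return {name: seq for name, seq in sequences.items() if rx.search(name)}
```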
Verify that eval works with new LJA interface
Use threads from LJA config instead of configuring from schedule
This makes it easier to configure execution on the server.
Upgrade pre commit package versions
We want to upgrade the pre-commit packages, as the config file is a bit dated.
Add more tracking information to the dataset
Random number
Sampling setting
Introduce structured configs for independent scripts
Verify new LJA interface
Check that graph/dataset is properly created
Check that eval still works
Support hydra multirun threading
Use hardlink instead of symlink for simulator profile
It turns out that using hardlinks solves the problem for mounted storage much more simply than the intermediate copy.
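Assuming source and destination live on the same filesystem (which hardlinks require), the swap is a one-liner; the function name is illustrative:

```python
import os
from pathlib import Path


def link_profile(src: Path, dst: Path) -> None:
    """Hardlink the simulator profile. Unlike symlinks, hardlinks are
    generally permitted on Samba shares, and no data is copied."""
    if dst.exists():
        dst.unlink()
    os.link(src, dst)  # both paths must be on the same filesystem
```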
Create a script for random sampling from FASTQ file using the specified percentage of reads
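Since a FASTQ record is a fixed four-line unit, the sampling script could be as simple as a per-record coin flip; names, the percentage interface, and the seed default are all assumptions:

```python
import random
from pathlib import Path


def sample_fastq(src: Path, dst: Path, percentage: float, seed: int = 42) -> int:
    """Randomly keep `percentage` percent of reads from a FASTQ file.
    Each read is a four-line record; the keep/drop decision is made per
    record. Returns the number of reads kept. Seeding keeps runs reproducible."""
    rng = random.Random(seed)
    kept = 0
    with src.open() as fin, dst.open("w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < percentage / 100.0:
                fout.writelines(record)
                kept += 1
    return kept
```

This streams the file, so it handles large read sets without loading them into memory.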
Serialize steps and add cleanup step
Remove read and assembly folders after the sample is created to save the space.
For now convert sample by sample.
Restore evaluation functionality
Due to the refactoring, the evaluation of LJA is not working anymore
Add detailed tests for graph
Add variable coverage in dataset scenario
Update the pipeline config
Creating the dataset in discrete stages is computationally expensive. We need to update the global pipeline config to run it as one continuous step and to clean up after every dataset is created. Store appropriate metadata along each step to enable reconstruction of the data, evaluation, and cleanup.
Add download URL to species_info for downloaded datasets
Unify logging over entire project
Add pytest-xdist to speed up testing
Add graph as a step in the scheduler
Add H002 dataset
Add CHM13 as an option for reference.py
Instead of having a standalone script, a config option should determine whether we are using a simulated or real dataset.
Add profile as W&B artifact
Ability to use profile from W&B registry
Adopt newest LJA interface
Finish adopting latest interface from LJA
Avoid duplication by comparing the already created data with the scenario
Separate sample creation and collection into concurrent operations
This way we set the stage for concurrent uploads after each sample is created.