
chitin's People

Contributors

samstudio8

Forkers

unix0000 gsc0107

chitin's Issues

Dashboard

  • Ongoing experiments
  • Latest errors and file handler warnings

Is multiprocessing a good idea anyway?

I'm not even sure if we want a multiprocessed shell. Sure, it can run jobs in parallel and is trivial to script, and also needs no configuration. But does anyone want this? Is it because I just don't want to hammer at GNU Parallel? I don't know.

Apply command to...

I've been generating big runs of data in directories named with UUIDs, plus a lookup file that maps each UUID to the parameters that were used to generate those files. This has been pretty handy because all the files are uniquely identifiable and don't feature parameters that later become unhelpful, deprecated, etc. This also means I'm not messing about with stupid folder hierarchies: the filesystem is a crap abstraction for representing experiment properties.

Because of this layout, I've found myself just applying operations to lists of UUID-named directories, so why not make this a part of chitin? We could just provide a script (or series of commands) and a bunch of UUIDs.
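For reference, this is roughly the loop I run by hand today; a minimal sketch outside of chitin, where the helper and the example bowtie2 command are made up for illustration:

```python
# Not chitin code -- just the manual workflow this issue wants to absorb:
# run the same command template inside each UUID-named directory.
import os
import subprocess


def apply_to_uuids(uuids, command_template, base_dir="."):
    """Run command_template (with {uuid} substituted) inside each UUID directory."""
    for u in uuids:
        workdir = os.path.join(base_dir, str(u))
        cmd = command_template.format(uuid=u)
        subprocess.check_call(cmd, shell=True, cwd=workdir)


# e.g. apply_to_uuids(open("uuids.txt").read().split(),
#                     "bowtie2 -x ref -U reads.fq -S {uuid}.sam")
```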

Should we be using `abspath`?

Currently chitin converts ALL paths to absolute paths, but if you wanted to wrap up your workspace and throw it onto another system, all your paths would suddenly be incorrect...
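One possible middle ground, sketched below: record paths relative to a workspace root and only resolve them when they are used. `workspace_root` is an assumed setting here, not an existing chitin option.

```python
# A minimal sketch of portable path handling, assuming some workspace_root
# setting exists; these helpers are illustrative, not part of chitin.
import os


def to_portable(path, workspace_root):
    """Record a path relative to the workspace so the workspace can move."""
    return os.path.relpath(os.path.abspath(path), start=workspace_root)


def to_local(portable_path, workspace_root):
    """Resolve a recorded path against the workspace's current location."""
    return os.path.normpath(os.path.join(workspace_root, portable_path))
```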

How to address the problem of multiple users?

File metadata is stored per user, in their local database, and changes to files are only monitored when they happen through chitin. So we have two points:

  • How to share file metadata between a group of users?
  • What happens when non-chitin users cause changes in the file system?

The first point falls in line with the future plan of permitting the database to live on a server instead of just locally. I suspect we may have some work to do to ensure that history is processed and stored in the correct order if multiple users do things at once, but I think this will be fine. Caching may also be necessary so users have some history data for when they are offline? But we are a while from this right now anyway.

The second is unlikely to be addressable in a fashion I would like. Right now, chitin will always raise a warning about files that have changed outside of its knowledge, which is reasonable. After all, that is what I care about more than the history: a user will now know if somebody has messed with a file. A potential way to catch this (on a shared computer at least) is to have an additional daemon or kernel module that captures some information - but the reason chitin works the way it does is because it seemed to be a rather easy way of getting this data in the first place! ;)

I would love to try and make a ZFS extension for this, but that's a long time away and possibly beyond my time and ability anyway.
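For context, the warning above boils down to a hash comparison against the last recorded state; a rough sketch of the idea (the helper names and the recorded-hash lookup are illustrative, not chitin's actual API):

```python
# Illustrative only: compare a file's current hash against the hash recorded
# the last time chitin touched it, and complain if they differ.
import hashlib
import os


def hash_path(path, block_size=65536):
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def check_integrity(path, recorded_hash):
    """Return a warning string if a file changed outside of chitin's knowledge."""
    if not os.path.exists(path):
        return "%s has disappeared since it was last seen" % path
    current = hash_path(path)
    if current != recorded_hash:
        return "%s has changed outside of chitin (was %s, now %s)" % (
            path, recorded_hash, current)
    return None
```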

sqlalchemy error upon launch

Getting an error when launching for the first time. Install (as --user) went smoothly.

```
Traceback (most recent call last):
  File "/homes/ccole/.local/bin/chitin", line 9, in <module>
    load_entry_point('chitin==0.0.1', 'console_scripts', 'chitin')()
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "build/bdist.linux-x86_64/egg/chitin/__init__.py", line 18, in <module>
    'BufrStubImagePlugin',
  File "build/bdist.linux-x86_64/egg/chitin/util.py", line 11, in <module>
  File "build/bdist.linux-x86_64/egg/chitin/record.py", line 6, in <module>
  File "/cluster/gjb_lab/ccole/.local/lib/python2.7/site-packages/Flask_SQLAlchemy-2.1-py2.7.egg/flask_sqlalchemy/__init__.py", line 25, in <module>
    from sqlalchemy import orm, event, inspect
ImportError: cannot import name inspect
```

Could be a dependency issue. What version requirements are there?
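Quite possibly: `sqlalchemy.inspect` only exists from SQLAlchemy 0.8, so an older system-wide SQLAlchemy would break Flask-SQLAlchemy 2.1 exactly like this. A hypothetical `setup.py` excerpt pinning sensible floors might look like the following (these bounds are a guess, not chitin's declared requirements):

```python
# Hypothetical setup.py excerpt -- the version floors are an assumption.
from setuptools import setup, find_packages

setup(
    name="chitin",
    version="0.0.1",
    packages=find_packages(),
    install_requires=[
        "Flask-SQLAlchemy>=2.1",
        "SQLAlchemy>=0.8",  # provides sqlalchemy.inspect
    ],
)
```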

multiprocessing tracking bug

For some stupid reason I thought adding multiprocessing to the mix was a great idea.
Now there is a ton of wonky shit going on simultaneously:

  • Potential to run sequential commands out-of-order
  • stderr and stdout interrupt the console
  • Zombie processes (ish)
  • Gross code, specifically to handle apparently incorrect attributes
  • Bash script variable capturing is broken (the context to pass back data is lost)
  • CPU usage wasted by empty zombies that poll and time out
  • Files edited by multiple processes will not be trackable to particular Events

`chitin` should be a lab book instance

Seeing as one can't change directory in this shell, it could be considered a manager for a given top-level analysis directory. We could store the JSON (soon to be an sqlite schema) in the same directory and switch to relative paths?
[See #4 ]

chitin3 Roadmap

In case you hadn't noticed, I trashed the entire chitin repo to make chitin2. It's about half as much garbage as last time and moves a little away from the idea of replacing your shell, towards wrapping a script to keep track of what happens inside it. I got overexcited in the last version and made chitin a clever, parallelised shell that sent commands to a remote machine and allowed any chitin-capable shell to download and process the jobs. At this point I realised I'd made a grid engine, so I've nuked the code base and started over: this time trying to remember that the goal of chitin is to be a watchful guardian of your filesystem.
For a reminder of my November 2016 tirade that caused chitin to come into existence, check my blog.

Pretty much all the cool features of chitin1 are missing, but I plan to bring them back:

  • filetype handlers (eg. catching the number of alignments in a BAM)
  • command handlers (eg. remembering the alignment rate from bowtie2 stderr)
  • fetch or put a resource (file) to any of your other machines if a command needs it

I've finally made a business decision about the metadata storage part of chitin. I don't like sqlite: the database gets big, slow and locked. Originally the metadata was to be presented in the terminal (and it was), but we've outgrown this by necessity (commands and resources are linked together and I want you to be able to click on them to find out stuff). Thus we're in your browser. The current version of chitin2 has an integrated webserver using Flask and SQLAlchemy, but this is troublesome for migration, and it was never my intention to bundle the shell-part and web-part together. Thus my roadmap includes:

  • Extracting all the web crap and deploying it to Django instead; I'll be making a chitin-server repo soon. Django is definitely OTT for this, but it's also wonderfully crafted, extendable, well-supported and has an excellent database migration system.

Additional ideas of things that are to come:

  • An arbitrary resource monitor can ping the server with CPU/RAM info every minute or so; we'll then graph these between the start and end of a command on its detail page (a rough sketch follows this list). Alternatively, we could keep track of the PID for a command and try to keep more specific numbers.
  • Leverage all of your hashes and sell chitin-coins
  • Associate ENA ids with the chitin:\ resource locator
  • Dumping generated graphs and some text data for an experiment to the server
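As promised above, a rough sketch of the resource monitor idea; the server endpoint and payload shape are made up, since chitin-server does not define this API yet:

```python
# Illustrative monitor: report CPU/RAM to the server once a minute.
import socket
import time

import psutil    # pip install psutil
import requests  # pip install requests

SERVER = "http://chitin-server.example/api/resource"  # hypothetical endpoint


def report_forever(interval=60):
    while True:
        payload = {
            "node": socket.gethostname(),
            "cpu_percent": psutil.cpu_percent(interval=1),
            "ram_percent": psutil.virtual_memory().percent,
            "timestamp": time.time(),
        }
        try:
            requests.post(SERVER, json=payload, timeout=10)
        except requests.RequestException:
            pass  # the monitor should never take a node down with it
        time.sleep(interval)


if __name__ == "__main__":
    report_forever()
```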

Early 2019 Stories

  • User should be able to see most recent commands run on a node
  • User should be able to see most recent resource changes on a node
  • User should be able to see a list of over-arching top-level "projects"
  • User should be able to see a list of "experiments" that belong to a project (eg. all the assemblies), ideally an API would be able to generate tables to present parameters/results

Late 2019 Stories

`EventSet` to house executions of `%script`

Seeing as we can now run bash scripts, it would be nice to group all of the Event objects (that is, the commands executed individually) together under some container. An EventSet seems like a reasonable solution. We can attach the input parameters, name, path and MD5 (or even a copy?) of the script in question to the EventSet such that it is available to all ItemEvents.

We could also have total_wall metadata and such, too.
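A minimal sketch of what the container could look like as plain SQLAlchemy declarative models; the table and column names are guesses at the shape, not the actual chitin schema:

```python
# Illustrative models only -- not chitin's real schema.
from sqlalchemy import Column, Float, ForeignKey, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class EventSet(Base):
    __tablename__ = "event_set"

    id = Column(Integer, primary_key=True)
    script_name = Column(String)
    script_path = Column(String)
    script_md5 = Column(String(32))
    params = Column(String)     # the input parameters passed to %script
    total_wall = Column(Float)  # e.g. summed wall time of the member Events

    events = relationship("Event", back_populates="event_set")


class Event(Base):
    __tablename__ = "event"

    id = Column(Integer, primary_key=True)
    cmd_str = Column(String)
    event_set_id = Column(Integer, ForeignKey("event_set.id"), nullable=True)

    event_set = relationship("EventSet", back_populates="events")
```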

Allow re-running of an experiment automatically

I've just had to re-run an experiment; I don't need to generate new data, but rather just an umbrella for a "new set of runs". It would be helpful if there was a CLI/API/Web option to request a new UUID to do this. Bonus points if we held a "parent" experiment or something.

We might need an intermediate class where Experiments have RunGroups with Runs, rather than Runs immediately belonging to an experiment.
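Sketching the intermediate layer with throwaway dataclasses (all names hypothetical): Experiments own RunGroups, RunGroups own Runs, and re-running just mints a new RunGroup that can point back at the group it repeats.

```python
# Illustrative structure only -- not how chitin models this today.
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    params: dict = field(default_factory=dict)


@dataclass
class RunGroup:
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    parent: Optional["RunGroup"] = None  # the group this one re-runs, if any
    runs: List[Run] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    groups: List[RunGroup] = field(default_factory=list)

    def rerun(self, previous: Optional[RunGroup] = None) -> RunGroup:
        """Request a fresh umbrella UUID for a new set of runs."""
        group = RunGroup(parent=previous)
        self.groups.append(group)
        return group
```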

[tracking] Pain points

This is really a tracking issue for me that outlines the issues I have using chitin in my own workflow, but please feel free to add your own.

Make directory `Item`s more useful

Directory Items are somewhat useless. The hash of an Item that is also a directory was designed to detect changes in directories outside of chitin, but serves little purpose outside of the integrity check.

It would be much more useful if we could tie the hash of a directory to a group of Items. So I propose something like an ItemSet object that can represent a directory (or even a group of files belonging to a project, etc.). An Item could easily be in multiple ItemSets.

My ideas for "protecting" files could instead be applied to ItemSets (ie. flat out prevent clobbering).

An example of where this would be much more useful is tar: we already capture the directory hash, but cannot easily work out what the file state was at that particular hash.
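A sketch of how a directory hash could be tied to its members: hash each file, then hash the sorted (name, hash) pairs, so a directory-level hash always corresponds to a recoverable set of Item hashes. This is not the hashing scheme chitin currently uses.

```python
# Illustrative ItemSet-style hashing: the directory hash is derived from the
# member file hashes, so the file state at that hash can be reconstructed.
import hashlib
import os


def file_md5(path, block_size=65536):
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def itemset_hash(directory):
    """Return (directory_hash, {filename: file_hash}) for the files in a directory."""
    members = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            members[name] = file_md5(path)

    h = hashlib.md5()
    for name, digest in sorted(members.items()):
        h.update(("%s\t%s\n" % (name, digest)).encode())
    return h.hexdigest(), members
```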

Have a FUSE backed file system

  • Enforce integrity rules with FUSE (ie. actually prevent writing, truncating) - see the sketch after this list
  • Catch copies and moves with less hassle
  • Provide useful magic files that can automatically output the results of %history, or create tar archives
  • Mirrored versions of directories
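A very rough sketch of the first point using fusepy (`pip install fusepy`); everything here is illustrative, chitin has no FUSE layer today and the mount paths are made up:

```python
# Pass reads through to a real directory, refuse anything that mutates it.
import errno
import os

from fuse import FUSE, FuseOSError, Operations


class GuardedFS(Operations):
    def __init__(self, root):
        self.root = root

    def _real(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def getattr(self, path, fh=None):
        st = os.lstat(self._real(path))
        return {key: getattr(st, key) for key in (
            "st_mode", "st_size", "st_uid", "st_gid",
            "st_atime", "st_mtime", "st_ctime", "st_nlink")}

    def readdir(self, path, fh):
        return [".", ".."] + os.listdir(self._real(path))

    def read(self, path, size, offset, fh):
        with open(self._real(path), "rb") as fh_real:
            fh_real.seek(offset)
            return fh_real.read(size)

    # The integrity rules: flat out refuse writes, truncates and deletes.
    def write(self, path, data, offset, fh):
        raise FuseOSError(errno.EACCES)

    def truncate(self, path, length, fh=None):
        raise FuseOSError(errno.EACCES)

    def unlink(self, path):
        raise FuseOSError(errno.EACCES)


if __name__ == "__main__":
    # Mirror /data/experiment read-only at /mnt/chitin (both paths made up).
    FUSE(GuardedFS("/data/experiment"), "/mnt/chitin", foreground=True)
```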

How to work with scripts?

Could we "load in" and parse a script such that we can read all the commands it contains? Do we need to?
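A crude sketch of "loading in" a script: read each non-comment line and split it into tokens so the commands could be inspected before running anything. Real shell parsing (pipes, subshells, heredocs) would need something like bashlex; the filename below is hypothetical.

```python
# Illustrative only: list the simple commands in a shell script.
import shlex


def commands_in_script(path):
    """Yield (lineno, argv) for each non-comment line in a shell script."""
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            yield lineno, shlex.split(line, comments=True)


if __name__ == "__main__":
    for lineno, argv in commands_in_script("pipeline.sh"):
        print(lineno, argv)
```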
