livinthelookingglass / job-splitter

Package research code and split it among several machines in parallel

License: MIT License
If there is, we should definitely support it, to help guarantee that disk space doesn't become an issue. Needs to be investigated.
Ideally this would be written based on questions someone else asks me about how it works, given the current Readme
Potential API could be:
def run_cjobs(job_name, *args, converters=()): ...
`job_name` would be the name of the C function, and possibly also its file. `converters` would be a list of functions to convert each given Python object to a cffi object or similar.
This would probably be easiest with Cython, but I don't yet know how.
If this does work, a similar API should be provided for other languages that can interact with Python or be called from C. I know you can do Go, and I'm fairly sure Rust would work if C does.
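A minimal sketch of what the `converters` machinery might look like. The compiled-library lookup is stubbed out with a plain dict of Python callables, since the actual cffi/ctypes binding doesn't exist yet; `run_cjobs` and the `"add"` function are illustrative only.

```python
def run_cjobs(job_name, *args, converters=()):
    """Hypothetical wrapper: convert each Python argument, then dispatch
    to the named C function.  A dict of Python callables stands in for
    the compiled library here."""
    # Pad converters so arguments without one pass through unchanged.
    convs = list(converters) + [None] * (len(args) - len(converters))
    converted = [c(a) if c is not None else a for a, c in zip(args, convs)]
    # In a real build this would resolve `job_name` in a cffi/ctypes
    # library object instead of this stand-in dict.
    _library = {"add": lambda a, b: a + b}
    return _library[job_name](*converted)

run_cjobs("add", "2", "3", converters=(int, int))  # → 5
```

The converter-per-argument shape keeps the call site free of manual casting while leaving arguments without a converter untouched.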
It's possible that some jobs will be I/O-bound in a way that makes reporting a progress amount asynchronously useful.
The API should just be a `+=` wrapped by a `.increment(amount: float, base: float = 100.0)`.
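The proposed API could be sketched like this; the `increment` signature is from the issue, while the `Progress` class name, the `fraction` attribute, and the rest of the reporting machinery are assumptions.

```python
class Progress:
    """Sketch of the proposed progress API."""
    def __init__(self):
        self.fraction = 0.0  # completed share, 0.0 .. 1.0

    def increment(self, amount: float, base: float = 100.0) -> None:
        # Report `amount` units of progress out of `base` total units.
        self.fraction = min(1.0, self.fraction + amount / base)

    def __iadd__(self, amount: float) -> "Progress":
        # `p += 5` is sugar for p.increment(5) against the default base.
        self.increment(amount)
        return self

p = Progress()
p += 25                  # 25 out of the default 100 → 0.25
p.increment(1, base=4)   # 1 out of 4 → another 0.25
```

Overloading `+=` via `__iadd__` keeps the common case terse while `increment` covers jobs that count in something other than percent.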
Currently the Job ID printed to screen just uses `enumerate()` on the shuffled portion of the array you're assigned. This means that it does not match the Job ID reported in the CSV, and is essentially arbitrary.
It's not obvious that this should be the default. Maybe we need to support both behaviors?
Going to open a separate branch to slowly work on this, probably while I'm on a plane
Populating it with `[1, number of cores]` is a sensible default, and would let people run faster if they don't need to distribute across multiple hosts.
This should be the ID of the job as given by its arguments' index before being shuffled
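One way to get stable IDs is to enumerate before shuffling, so the index travels with the arguments through the shuffle and the per-host slice. The function name, the seeded shuffle, and the stride-based sharding below are all illustrative, not the package's actual scheme.

```python
import random

def assign_jobs(job_args, n_hosts, host_index, seed=0):
    """Tag each job with its pre-shuffle index, then shuffle and slice,
    so the printed Job ID matches the CSV regardless of distribution."""
    indexed = list(enumerate(job_args))    # stable IDs assigned first
    random.Random(seed).shuffle(indexed)   # same deterministic order on every host
    return indexed[host_index::n_hosts]    # this host's slice, IDs intact

for job_id, args in assign_jobs(["a", "b", "c", "d"], n_hosts=2, host_index=0):
    pass  # job_id here is the pre-shuffle index, not an arbitrary counter
```

Because every host shuffles the same seeded order, the union of the slices covers every job exactly once and each `job_id` still points back at the original argument list.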
I have a suspicion that my queue injection only works because of `fork()`, but I don't have a machine to test it on. Help testing this would be appreciated.
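A small harness for checking the suspicion might look like this: put an item on the queue before the worker starts, then see whether the worker observes it under a given start method. The function names are made up for the sketch.

```python
import multiprocessing as mp

def drain(q, out):
    # Worker: report whatever was pre-loaded into the queue before start.
    out.put(q.get(timeout=5))

def check(start_method):
    """Start a worker under `start_method` ("fork" or "spawn") and return
    the item it saw, if the pre-start injection survived."""
    ctx = mp.get_context(start_method)
    q, out = ctx.Queue(), ctx.Queue()
    q.put("injected")                    # inject before the worker starts
    p = ctx.Process(target=drain, args=(q, out))
    p.start()
    p.join(10)
    return out.get(timeout=5)
```

Comparing `check("fork")` against `check("spawn")` on a machine where the worker function lives in an importable module (spawn re-imports the main module, so it won't work from a bare REPL) should answer whether the behavior is fork-specific.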
It would be useful if the idea of a job could be abstracted a little. To this end, I propose that a job be abstracted into a collection of "tasks". Each task must declare which other tasks it depends on, and is deferred until each of those has completed.
This ensures that shuffling jobs will not cause a dependent task to cross host boundaries. It also has the benefit of allowing some amount of shared memory between tasks, though of course some limitations will need to be imposed on usage.
So a generic version of this might look like:
from dataclasses import dataclass
from multiprocessing.pool import AsyncResult

@dataclass
class Job:
    tasks: list[Task]
    dependencies: dict[Task, list[Task]]

    def __post_init__(self):
        self.callbacks: dict[Task, AsyncResult] = {}
        self.context = {}

    def execute(self, pool):
        while not self.done():
            for task in self.tasks:
                # Dispatch each task once its dependencies are satisfied.
                if task.ready() and task.inactive():
                    self.callbacks[task] = pool.apply_async(task.execute, (self.context,))
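To make the sketch above concrete, here is a runnable toy version of the same loop with a hypothetical `Task` carrying the `ready`/`started` bookkeeping inline, run on a thread pool so the context is genuinely shared. Everything here is illustrative, not the package's real types.

```python
from multiprocessing.pool import ThreadPool

class Task:
    """Hypothetical Task with the bookkeeping the scheduling loop assumes."""
    def __init__(self, name, deps=()):
        self.name, self.deps = name, list(deps)
        self.started = self.finished = False

    def execute(self, context):
        context[self.name] = "done"   # record result in the shared context
        self.finished = True

def run_job(tasks):
    context, results = {}, []
    with ThreadPool(2) as pool:
        while not all(t.finished for t in tasks):
            for t in tasks:
                ready = all(d.finished for d in t.deps)
                if ready and not t.started:   # dispatch each task exactly once
                    t.started = True
                    results.append(pool.apply_async(t.execute, (context,)))
            for r in results:
                r.wait()                      # settle before re-scanning
    return context

a = Task("a")
b = Task("b", deps=[a])
run_job([a, b])  # b waits for a, then both write into the shared context
```

Note that a task that raises would never set `finished` and would hang the loop; a real implementation would need failure handling, but the toy keeps the dependency-deferral idea visible.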
Potential idea for a general case, assuming no external I/O.
Provide a context object and wrap the requested job in a function that starts a helper thread. This helper thread should attempt to serialize this context object.
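A sketch of that wrapper, assuming the context is a plain picklable dict; the function name, the snapshot interval, and the copy-then-pickle approach are all assumptions for illustration.

```python
import pickle
import threading

def with_checkpointing(job, context, path, interval=1.0):
    """Hypothetical wrapper: run job(context) while a helper thread
    periodically serializes a snapshot of the shared context to path."""
    stop = threading.Event()

    def snapshot():
        while not stop.is_set():
            with open(path, "wb") as f:
                pickle.dump(dict(context), f)  # copy first, then serialize
            stop.wait(interval)

    helper = threading.Thread(target=snapshot, daemon=True)
    helper.start()
    try:
        return job(context)
    finally:
        stop.set()
        helper.join()
        with open(path, "wb") as f:            # final snapshot on exit
            pickle.dump(dict(context), f)
```

Copying the dict before pickling keeps the serialization attempt from racing too badly with the job's own writes, though an unpicklable value in the context would still make a snapshot fail.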
Open questions:
A better world would let me do this using this package, but I'm not sure you can use it this way given its API. I'd like this to be as non-invasive as possible.