livinthelookingglass / job-splitter

Package research code and split it among several machines in parallel

License: MIT License
If there is, we should definitely support it, to help guarantee that disk space doesn't become an issue. Needs to be investigated.
Ideally this would be written based on questions someone else asks me about how it works, given the current Readme
Potential API could be:
def run_cjobs(job_name, *args, converters=()): ...
`job_name` would be the name of the C function, and possibly also its file. `converters` would be a list of functions to convert each given Python object to a cffi object or similar.
This would probably be easiest with Cython, but I don't yet know how.
If this does work, a similar API should be provided for other languages that can interact with Python or be called from C. I know you can do Go, and I'm fairly sure Rust would work if C does.
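A minimal sketch of what the `converters` machinery might look like. The compiled-library lookup is stubbed out with a plain dict of Python callables, since the actual cffi/ctypes binding doesn't exist yet; `run_cjobs` and the `"add"` function are illustrative only.

```python
def run_cjobs(job_name, *args, converters=()):
    """Hypothetical wrapper: convert each Python argument, then dispatch
    to the named C function.  A dict of Python callables stands in for
    the compiled library here."""
    # Pad converters so arguments without one pass through unchanged.
    convs = list(converters) + [None] * (len(args) - len(converters))
    converted = [c(a) if c is not None else a for a, c in zip(args, convs)]
    # In a real build this would resolve `job_name` in a cffi/ctypes
    # library object instead of this stand-in dict.
    _library = {"add": lambda a, b: a + b}
    return _library[job_name](*converted)

run_cjobs("add", "2", "3", converters=(int, int))  # → 5
```

The converter-per-argument shape keeps the call site free of manual casting while leaving arguments without a converter untouched.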
It's possible that some jobs will be I/O-bound in a way that makes reporting a progress amount asynchronously useful.
The API should just be a `+=` wrapped by a `.increment(amount: float, base: float = 100.0)`.
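The proposed API could be sketched like this; the `increment` signature is from the issue, while the `Progress` class name, the `fraction` attribute, and the rest of the reporting machinery are assumptions.

```python
class Progress:
    """Sketch of the proposed progress API."""
    def __init__(self):
        self.fraction = 0.0  # completed share, 0.0 .. 1.0

    def increment(self, amount: float, base: float = 100.0) -> None:
        # Report `amount` units of progress out of `base` total units.
        self.fraction = min(1.0, self.fraction + amount / base)

    def __iadd__(self, amount: float) -> "Progress":
        # `p += 5` is sugar for p.increment(5) against the default base.
        self.increment(amount)
        return self

p = Progress()
p += 25                  # 25 out of the default 100 → 0.25
p.increment(1, base=4)   # 1 out of 4 → another 0.25
```

Overloading `+=` via `__iadd__` keeps the common case terse while `increment` covers jobs that count in something other than percent.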
Currently the Job ID printed to screen just uses `enumerate()` on the shuffled portion of the array you're assigned. This means that it does not match the Job ID reported in the CSV, and is essentially arbitrary.
It's not obvious that this should be the default. Maybe we need to support both behaviors?
Going to open a separate branch to slowly work on this, probably while I'm on a plane
Populating it with `[1, number of cores]` is a sensible default, and would let people run faster if they don't need to distribute across multiple hosts.
This should be the ID of the job as given by its arguments' index before being shuffled
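One way to get stable IDs is to enumerate before shuffling, so the index travels with the arguments through the shuffle and the per-host slice. The function name, the seeded shuffle, and the stride-based sharding below are all illustrative, not the package's actual scheme.

```python
import random

def assign_jobs(job_args, n_hosts, host_index, seed=0):
    """Tag each job with its pre-shuffle index, then shuffle and slice,
    so the printed Job ID matches the CSV regardless of distribution."""
    indexed = list(enumerate(job_args))    # stable IDs assigned first
    random.Random(seed).shuffle(indexed)   # same deterministic order on every host
    return indexed[host_index::n_hosts]    # this host's slice, IDs intact

for job_id, args in assign_jobs(["a", "b", "c", "d"], n_hosts=2, host_index=0):
    pass  # job_id here is the pre-shuffle index, not an arbitrary counter
```

Because every host shuffles the same seeded order, the union of the slices covers every job exactly once and each `job_id` still points back at the original argument list.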
I have a suspicion that my queue injection only works because of `fork()`, but I don't have a machine to test it on. Help testing this would be appreciated.
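A small harness for checking the suspicion might look like this: put an item on the queue before the worker starts, then see whether the worker observes it under a given start method. The function names are made up for the sketch.

```python
import multiprocessing as mp

def drain(q, out):
    # Worker: report whatever was pre-loaded into the queue before start.
    out.put(q.get(timeout=5))

def check(start_method):
    """Start a worker under `start_method` ("fork" or "spawn") and return
    the item it saw, if the pre-start injection survived."""
    ctx = mp.get_context(start_method)
    q, out = ctx.Queue(), ctx.Queue()
    q.put("injected")                    # inject before the worker starts
    p = ctx.Process(target=drain, args=(q, out))
    p.start()
    p.join(10)
    return out.get(timeout=5)
```

Comparing `check("fork")` against `check("spawn")` on a machine where the worker function lives in an importable module (spawn re-imports the main module, so it won't work from a bare REPL) should answer whether the behavior is fork-specific.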
It would be useful if the idea of a job could be abstracted a little. To this end, I propose that a job be abstracted into a collection of "tasks". Each task must declare which other tasks it depends on, and is deferred until each of those has completed.
This ensures that shuffling jobs will not cause a dependent task to cross host boundaries. It also has the benefit of allowing some amount of shared memory between tasks, though of course some limitations will need to be imposed on usage.
So a generic version of this might look like:
from dataclasses import dataclass
from multiprocessing.pool import AsyncResult

@dataclass
class Job:
    tasks: list[Task]
    dependencies: dict[Task, list[Task]]

    def __post_init__(self):
        self.callbacks: dict[Task, AsyncResult] = {}
        self.context = {}

    def execute(self, pool):
        while not self.done():
            for task in self.tasks:
                # Dispatch each task once its dependencies are satisfied.
                if task.ready() and task.inactive():
                    self.callbacks[task] = pool.apply_async(task.execute, (self.context,))
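To make the sketch above concrete, here is a runnable toy version of the same loop with a hypothetical `Task` carrying the `ready`/`started` bookkeeping inline, run on a thread pool so the context is genuinely shared. Everything here is illustrative, not the package's real types.

```python
from multiprocessing.pool import ThreadPool

class Task:
    """Hypothetical Task with the bookkeeping the scheduling loop assumes."""
    def __init__(self, name, deps=()):
        self.name, self.deps = name, list(deps)
        self.started = self.finished = False

    def execute(self, context):
        context[self.name] = "done"   # record result in the shared context
        self.finished = True

def run_job(tasks):
    context, results = {}, []
    with ThreadPool(2) as pool:
        while not all(t.finished for t in tasks):
            for t in tasks:
                ready = all(d.finished for d in t.deps)
                if ready and not t.started:   # dispatch each task exactly once
                    t.started = True
                    results.append(pool.apply_async(t.execute, (context,)))
            for r in results:
                r.wait()                      # settle before re-scanning
    return context

a = Task("a")
b = Task("b", deps=[a])
run_job([a, b])  # b waits for a, then both write into the shared context
```

Note that a task that raises would never set `finished` and would hang the loop; a real implementation would need failure handling, but the toy keeps the dependency-deferral idea visible.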
Potential idea for a general case, assuming no external I/O.
Provide a context object and wrap the requested job in a function that starts a helper thread. This helper thread should attempt to serialize this context object.
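A sketch of that wrapper, assuming the context is a plain picklable dict; the function name, the snapshot interval, and the copy-then-pickle approach are all assumptions for illustration.

```python
import pickle
import threading

def with_checkpointing(job, context, path, interval=1.0):
    """Hypothetical wrapper: run job(context) while a helper thread
    periodically serializes a snapshot of the shared context to path."""
    stop = threading.Event()

    def snapshot():
        while not stop.is_set():
            with open(path, "wb") as f:
                pickle.dump(dict(context), f)  # copy first, then serialize
            stop.wait(interval)

    helper = threading.Thread(target=snapshot, daemon=True)
    helper.start()
    try:
        return job(context)
    finally:
        stop.set()
        helper.join()
        with open(path, "wb") as f:            # final snapshot on exit
            pickle.dump(dict(context), f)
```

Copying the dict before pickling keeps the serialization attempt from racing too badly with the job's own writes, though an unpicklable value in the context would still make a snapshot fail.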
Open questions:
A better world would let me do this using this package, but I'm not sure you can use it this way given its API. I'd like this to be as non-invasive as possible.