cdelcastillo21 / taccjm
TACC Job Manager Library
License: MIT License
Investigate why sessions are becoming inactive and add more logging around it. Implement specific JM checks in the heartbeat:
ERROR:taccjm.TACCJobManager:list_files - Unknown error trying to access .: SSH session not active
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/wsgiref/handlers.py", line 137, in run
    self.result = application(self.environ, self.start_response)
  File "/usr/local/lib/python3.9/site-packages/falcon/api.py", line 269, in __call__
    responder(req, resp, **params)
  File "/usr/local/lib/python3.9/site-packages/hug/interface.py", line 947, in __call__
    raise exception
  File "/usr/local/lib/python3.9/site-packages/hug/interface.py", line 918, in __call__
    self.call_function(input_parameters), context, request, response, **kwargs
  File "/usr/local/lib/python3.9/site-packages/hug/interface.py", line 840, in call_function
    return self.interface(**parameters)
  File "/usr/local/lib/python3.9/site-packages/hug/interface.py", line 129, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/taccjm/taccjm_server.py", line 163, in list_files
    files = JM[jm_id].list_files(path=path)
  File "/usr/local/lib/python3.9/site-packages/taccjm/TACCJobManager.py", line 492, in list_files
    raise e
  File "/usr/local/lib/python3.9/site-packages/taccjm/TACCJobManager.py", line 459, in list_files
    with self._client.open_sftp() as sftp:
  File "/usr/local/lib/python3.9/site-packages/paramiko/client.py", line 558, in open_sftp
    return self._transport.open_sftp_client()
  File "/usr/local/lib/python3.9/site-packages/paramiko/transport.py", line 1142, in open_sftp_client
    return SFTPClient.from_transport(self)
  File "/usr/local/lib/python3.9/site-packages/paramiko/sftp_client.py", line 164, in from_transport
    chan = t.open_session(
  File "/usr/local/lib/python3.9/site-packages/paramiko/transport.py", line 920, in open_session
    return self.open_channel(
  File "/usr/local/lib/python3.9/site-packages/paramiko/transport.py", line 1014, in open_channel
    raise SSHException("SSH session not active")
paramiko.ssh_exception.SSHException: SSH session not active
127.0.0.1 - - [30/Sep/2022 21:40:19] "GET /l1/files/list HTTP/1.1" 500 59
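A rough sketch of what a per-JM heartbeat check could look like, assuming each job manager holds a paramiko-style client. The functions `session_is_active` and `heartbeat_check`, and the `jm_id -> client` mapping, are hypothetical names for illustration and not part of taccjm. Only `get_transport()` and `is_active()` are relied on, which paramiko's `SSHClient` and `Transport` provide.

```python
import logging

logger = logging.getLogger("heartbeat_sketch")


def session_is_active(client):
    """Return True if the client's SSH transport exists and reports active.

    `client` is expected to behave like paramiko.SSHClient: it exposes
    get_transport(), whose result (if not None) exposes is_active().
    """
    transport = client.get_transport()
    return transport is not None and transport.is_active()


def heartbeat_check(job_managers):
    """Check every job manager and return the ids whose session is inactive.

    `job_managers` maps a jm_id to an SSH client (hypothetical layout).
    """
    inactive = []
    for jm_id, client in job_managers.items():
        if session_is_active(client):
            logger.debug("JM %s: SSH session active", jm_id)
        else:
            # Logging here gives a timestamp for when the session dropped.
            logger.warning("JM %s: SSH session not active", jm_id)
            inactive.append(jm_id)
    return inactive
```

Logging the inactive ids on every heartbeat would at least narrow down when a session goes stale, before a request such as list_files fails with a 500.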
TACCJM throws an error on downloading a file from any public directory that a user has read but not write access to:
stdout : tar (child): /work2/06307/clos21/pub/adcirc/inputs/ShinnecockInlet/mesh/def.tar.gz: Cannot open: Permission denied
This bug happens because TACCJM tars whatever you want to download before downloading it. However, it creates the tar file in the same directory as the data, which here is a public directory that only I have write access to.
The fix is to have TACCJM tar the contents into a separate temp directory (let's say the JM's trash directory) and then download the tarred file from there. This is also a good implementation change because partial files from failed download attempts are then cleaned up automatically with the trash.
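A minimal sketch of the proposed fix, assuming the archive is built with a shell `tar` command on the remote system. `build_tar_command` and the example paths are hypothetical; the point is only that the archive is written to the trash directory while `-C` changes into the source's parent directory, so nothing is written next to the data.

```python
import posixpath
import shlex


def build_tar_command(remote_path, trash_dir):
    """Build a tar command that writes the archive into trash_dir.

    Returns (command, archive_path). The -C option makes tar change into
    the parent directory of the source before archiving, so the source
    directory itself is only read, never written to.
    """
    remote_path = remote_path.rstrip("/")
    parent, name = posixpath.split(remote_path)
    archive_path = posixpath.join(trash_dir, name + ".tar.gz")
    command = "tar -czf {} -C {} {}".format(
        shlex.quote(archive_path), shlex.quote(parent), shlex.quote(name)
    )
    return command, archive_path


# Hypothetical paths, for illustration only.
cmd, archive = build_tar_command("/work2/pub/mesh/def", "/scratch/trash")
print(cmd)
```

The returned `archive_path` is what would then be downloaded, and it already lives in the directory that gets emptied during trash clean-up.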
run_script error
---------------------------------------------------------------------------
TACCJMError Traceback (most recent call last)
Input In [63], in <module>
----> 1 res = tjm.run_script('l1', 'adcirc_compile', args=["v55.01", "https://github.com/cdelcastillo21/adcirc-cg.git", "1"])
File ~/repos/taccjm/src/taccjm/taccjm.py:1297, in run_script(jm_id, script_name, job_id, args)
1295 e.message = f"run_script error"
1296 logger.error(e.message)
-> 1297 raise e
1299 return res
File ~/repos/taccjm/src/taccjm/taccjm.py:1293, in run_script(jm_id, script_name, job_id, args)
1291 data = {'script_name': script_name, 'job_id': job_id, 'args': args}
1292 try:
-> 1293 res = api_call('PUT', f"{jm_id}/scripts/run", data)
1294 except TACCJMError as e:
1295 e.message = f"run_script error"
File ~/repos/taccjm/src/taccjm/taccjm.py:177, in api_call(http_method, end_point, data)
175 return json.loads(res.text)
176 else:
--> 177 raise TACCJMError(res)
TACCJMError: args : 'list' object is not callable
TACCJM version 0.0.2
TACCJobManager class, list_files routine:
The routine returns a list of dictionaries with file info, but says it returns a list of strings, which is what list_files in taccjm_server.py expects.
Getting useless 'unable to parse json errors' messages whenever a TACCJMError occurs. Fix this so the messages are more informative.
>>> job = tjm.deploy_job('l1', local_job_dir='/Users/carlos/repos/pyadcirc/apps/adcirc',
...                      proj_config_file='/Users/carlos/repos/pyadcirc/apps/adcirc/ls6.ini')
deploy_job error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/carlos/repos/taccjm/src/taccjm/taccjm.py", line 857, in deploy_job
raise e
File "/Users/carlos/repos/taccjm/src/taccjm/taccjm.py", line 853, in deploy_job
res = api_call('POST', f"{jm_id}/jobs/deploy", data)
File "/Users/carlos/repos/taccjm/src/taccjm/taccjm.py", line 177, in api_call
raise TACCJMError(res)
taccjm.exceptions.TACCJMError: <Response [500]> unable to parse json errors
<Response [500]> unable to parse json errors
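One possible direction for better messages, sketched under assumptions: try to decode the response body as JSON, and fall back to the status code plus raw text when that fails. `describe_error_response` is a hypothetical helper, and the `"errors"` key is an assumption about how the server formats error bodies, not something confirmed here.

```python
import json


def describe_error_response(status_code, text):
    """Turn an HTTP error response into a readable message.

    Tries to decode the body as JSON; falls back to the raw text when the
    body is not valid JSON (for example an HTML or plain-text 500 page).
    """
    try:
        body = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        body = None

    if isinstance(body, dict) and "errors" in body:
        # Assumed layout: {"errors": {...}}; adjust to the real server format.
        detail = json.dumps(body["errors"])
    elif body is not None:
        detail = json.dumps(body)
    else:
        detail = text.strip() if text else "<empty response body>"
    return f"HTTP {status_code}: {detail}"
```

With something like this, a 500 whose body is not JSON would still surface the status code and whatever text the server sent, instead of only reporting that the JSON could not be parsed.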
There is currently no way to clean up stale or dead job managers, and the error message shown to the user is not helpful (it is just a generic 500 server error). This needs to be fixed.
Make the jm_id parameter of the taccjm init CLI command optional.
Implement DAG Simulation Framework.
1.) Base simulation class should have a 'parent' field, with parent simulation.
2.) _dag object -> Graph, _sims dictionary -> Maps simulation ID/name to DAG point.
3.) Entrypoint for sim is a call to a task list, that has DAG built into dependencies.
4.) run() into job.
Ideally want a framework that allows for the following (example):
sim = ADCPREP(parent=None)
sim = PADCIRC(parent=sim)
sim = ADCIRCOutputCompress(parent=sim)
sim.run()
i.e., simulations can be chained through their parent.
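A small sketch of how the chaining in the example could work, assuming a single `parent` per simulation (a linear chain, which is the simplest case of a DAG). The `Simulation` base class, `chain()`, and `execute()` are hypothetical names; the three subclasses reuse the names from the example above and do no real work here.

```python
class Simulation:
    """Base class: each simulation optionally depends on a parent simulation."""

    def __init__(self, parent=None):
        self.parent = parent
        self.name = type(self).__name__

    def chain(self):
        """Return the simulations from the root ancestor down to this one."""
        sims = []
        node = self
        while node is not None:
            sims.append(node)
            node = node.parent
        return list(reversed(sims))

    def execute(self):
        """Placeholder for the real work; subclasses would override this."""
        return f"ran {self.name}"

    def run(self):
        """Run every ancestor first (in dependency order), then this one."""
        return [sim.execute() for sim in self.chain()]


class ADCPREP(Simulation):
    pass


class PADCIRC(Simulation):
    pass


class ADCIRCOutputCompress(Simulation):
    pass


sim = ADCPREP(parent=None)
sim = PADCIRC(parent=sim)
sim = ADCIRCOutputCompress(parent=sim)
results = sim.run()
print(results)
```

Supporting multiple parents would require a proper graph and a topological sort instead of walking a single `parent` link, which is where the `_dag` and `_sims` structures from points 2 and 3 would come in.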
Error when trying to initialize tjm processes from within jupyter environment. This example is from trying to init JM within designsafe:
1819 if errno_num != 0:
1820 err_msg = os.strerror(errno_num)
-> 1821 raise child_exception_type(errno_num, err_msg, err_filename)
1822 raise child_exception_type(err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'hug'
It works if, in a separate terminal window within Jupyter, one activates the conda environment where taccjm is installed and then initializes via a Python REPL.
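The error suggests the `hug` executable is not on the PATH seen by the Jupyter kernel. One possible workaround, sketched here as an assumption rather than how taccjm currently starts its server, is to look for the executable next to the running interpreter before falling back to PATH. `find_env_executable` is a hypothetical helper.

```python
import os
import shutil
import sys


def find_env_executable(name, python_exe=sys.executable):
    """Locate `name` next to the given interpreter, else fall back to PATH.

    Console scripts installed into a conda or virtualenv environment
    usually sit in the same directory as that environment's python.
    Returns a path, or None if the executable cannot be found.
    """
    bin_dir = os.path.dirname(os.path.abspath(python_exe))
    candidate = os.path.join(bin_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return shutil.which(name)


hug_path = find_env_executable("hug")
print(hug_path if hug_path else "hug not found; is the right environment active?")
```

Passing the resolved path to the subprocess call, or failing early with a clear message when it is `None`, would avoid the bare FileNotFoundError.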