Comments (12)
@jan-janssen I opened a pull request. pyiron/pysqa#248
@hujay2019 Thank you for testing pyiron. We might be a little slow to respond during the Christmas break.
To understand a bit better what is causing your issue, can you post your `queue.yaml` file for both your local workstation and the HPC? I would also suggest logging in to the remote cluster via SSH and submitting a calculation directly on the login node; once that works, try the same from your local workstation. This helps us identify whether the file is corrupted during transfer, or whether something goes wrong before that.
@jan-janssen Thanks for your response.
`queue.yaml` on the local workstation:

```yaml
queue_type: REMOTE
queue_primary: slurm
ssh_host: *******
ssh_port: ****
ssh_username: hujie
known_hosts: /home/hu/.ssh/known_hosts
ssh_key: /home/hu/.ssh/***
ssh_remote_config_dir: /home/hujie/pyiron/resources/queues/
ssh_remote_path: /home/hujie/pyiron/pro/
ssh_local_path: /home/hu/pyiron/projects/
ssh_continous_connection: True
#ssh_delete_file_on_remote: False
queues:
  slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440}
```
`queue.yaml` on the HPC:

```yaml
queue_type: SLURM
queue_primary: slurm
queues:
  slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440, script: slurm.sh}
```
Of course, `job.server.queue = "slurm"` and `job.server.cores = 64`.
Submitting calculations on the login node works fine.
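For context, a minimal sketch of the submission path that triggers the problem, assuming the usual pyiron job API (the project name, job type, and structure are placeholders, not my actual workflow):

```python
from pyiron import Project

pr = Project("remote_test")                      # placeholder project name
job = pr.create.job.Lammps("lmp_test")           # placeholder job type
job.structure = pr.create.structure.bulk("Al", cubic=True)
job.server.queue = "slurm"                       # queue defined in queue.yaml above
job.server.cores = 64
job.run()                                        # submitted through the REMOTE adapter
pr.wait_for_job(job)                             # polls the remote queue; the error shows up here
```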
In the source file `pyiron_base/jobs/job/extension/server/queuestatus.py`:

```python
def wait_for_job(job, interval_in_s=5, max_iterations=100):
    """
    Sleep until the job is finished but maximum interval_in_s * max_iterations seconds.

    Args:
        job (pyiron_base.job.utils.GenericJob): Job to wait for
        interval_in_s (int): interval when the job status is queried from the database - default 5 sec.
        max_iterations (int): maximum number of iterations - default 100

    Raises:
        ValueError: max_iterations reached, job still running
    """
    if job.status.string not in job_status_finished_lst:
        if (
            state.queue_adapter is not None
            and state.queue_adapter.remote_flag
            and job.server.queue is not None
        ):
            finished = False
            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )
        else:
            finished = False
            for _ in range(max_iterations):
                if state.database.database_is_disabled:
                    job.project.db.update()
                job.refresh_job_status()
                if job.status.string in job_status_finished_lst:
                    finished = True
                    break
                elif isinstance(job.server.future, Future):
                    job.server.future.result(timeout=interval_in_s)
                    finished = job.server.future.done()
                    break
                else:
                    time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )
```
In this loop:

```python
for _ in range(max_iterations):
    if not queue_check_job_is_waiting_or_running(item=job):
        state.queue_adapter.transfer_file_to_remote(
            file=job.project_hdf5.file_name,
            transfer_back=True,
        )
        status_hdf5 = job.project_hdf5["status"]
        job.status.string = status_hdf5
    else:
        status_hdf5 = job.status.string
    if status_hdf5 in job_status_finished_lst:
        job.transfer_from_remote()
        finished = True
        break
    time.sleep(interval_in_s)
```
When the status of the remote job changes to `finished`, `state.queue_adapter.transfer_file_to_remote()` is called in the first `if` branch, and the remote `.h5` file is deleted after being copied back. Then `job.transfer_from_remote()` is called in the second `if` branch, but the remote `.h5` file has already been deleted. I think that is where the mistake is.
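Spelled out as a timeline (assuming `ssh_delete_file_on_remote` is left at its default, so the transfer back removes the remote copy):

```python
# 1. The queue reports the job is no longer waiting/running:
state.queue_adapter.transfer_file_to_remote(
    file=job.project_hdf5.file_name,
    transfer_back=True,
)  # copies the remote .h5 to the local machine, then deletes the remote copy

# 2. The local .h5 now reads "finished", so the second branch runs:
job.transfer_from_remote()  # transfers the .h5 again, but the remote file is
                            # gone, so the good local copy gets overwritten
```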
If I comment out these lines in `transfer_from_remote(self)` in `pyiron_base/jobs/job/generic.py`:

```python
# state.queue_adapter.transfer_file_to_remote(
#     file=self.project_hdf5.file_name,
#     transfer_back=True,
# )
```

the error no longer happens.
@hujay2019 Thank you for your feedback - I am happy you got it working. I am still confused why this happens or when the behaviour changed. Basically, when the local job has a queuing system ID, then there should never be a new transfer of the local job to the remote location as it is happening here. But I need to take a deeper look at this and that might take a bit more time.
As the `transfer_back` parameter is set to `True`, it should already copy the correct job, so I am surprised that the job seems to be empty. Can you try to disable the deletion on the remote location in your `.pyiron` settings file and check the content of the HDF5 file on the remote location? I just want to make sure the job is executed correctly on the remote location and the file only gets corrupted during the transfer.
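For reference, a quick way to inspect the status stored in the job's HDF5 file on the remote side (the file name is a placeholder; the top-level group should be the job name):

```python
import h5py

# Open the job's .h5 file on the remote cluster (path is a placeholder):
with h5py.File("lmp_test.h5", "r") as f:
    job_name = list(f.keys())[0]       # top-level group is the job name
    print(f[job_name]["status"][()])   # e.g. b'finished'
```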
I made sure that the remote job is executed correctly, its status is `finished`, and the file can also be parsed by h5py. I think the remote `.h5` was deleted before the second transfer of the `.h5` file, so the correctly received local `.h5` file is overwritten with empty content.
In the `_transfer_files` method of `pysqa/ext/remote.py`:
```python
try:
    sftp_client.get(file_dst, file_src)
except FileNotFoundError:
    pass
```
If the remote `.h5` was previously deleted, `sftp_client.get(file_dst, file_src)` throws a `FileNotFoundError`. However, before the error is raised, the local `.h5` file is opened with `mode='wb'`. So when the `FileNotFoundError` is raised, the local `.h5` has already been truncated to empty.
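Roughly, this is what `SFTPClient.get` does in paramiko (a simplified sketch, not the verbatim source; the exact signature varies by version):

```python
# Simplified sketch of paramiko's SFTPClient.get:
def get(self, remotepath, localpath, callback=None):
    # The local target is opened with "wb" (truncating it) *before* the
    # remote file is opened, so a missing remote file still empties the
    # local one when getfo() raises FileNotFoundError.
    with open(localpath, "wb") as fl:
        size = self.getfo(remotepath, fl, callback)
    return size
```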
I changed the code to:

```python
try:
    sftp_client.stat(file_dst)
    sftp_client.get(file_dst, file_src)
except FileNotFoundError:
    pass
```
It works: the `FileNotFoundError` is now thrown by `sftp_client.stat(file_dst)` before `sftp_client.get(file_dst, file_src)` runs, so the local `.h5` file is never opened. The downside is an extra SSH round trip to check whether the remote file exists.
In summary, the fix is to skip files that do not exist on the remote side. I think there should be solutions that do not add extra time overhead, but I have not found one yet.
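One possible variant without the extra round trip, as a sketch only (the `safe_get` helper is hypothetical, not part of pysqa; it keeps pysqa's convention that `file_dst` is the remote path and `file_src` the local one): download into a temporary file and only replace the local file once the transfer succeeds.

```python
import os
import tempfile

def safe_get(sftp_client, file_dst, file_src):
    # Hypothetical helper: download the remote file (file_dst) into a
    # temporary file next to the local target (file_src), so a missing
    # remote file can never truncate the existing local copy.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(file_src)))
    os.close(fd)
    try:
        sftp_client.get(file_dst, tmp_path)
    except FileNotFoundError:
        os.remove(tmp_path)             # remote file missing: local copy untouched
    else:
        os.replace(tmp_path, file_src)  # atomic rename on POSIX
```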
It's a behavior of `paramiko`. I tested `paramiko`:

```python
import paramiko.client

client = paramiko.SSHClient()
client.connect(*******)
sftp = client.open_sftp()
try:
    sftp.get("/home/**/1.txt", "1.txt")
except FileNotFoundError:
    print("File Not Found")
```
1. Remote non-empty file `1.txt` exists and local non-empty `1.txt` exists. Local `1.txt` is overwritten by remote `1.txt`. Correct.
2. Delete remote `1.txt`, while local non-empty `1.txt` exists. The `FileNotFoundError` is caught, and local `1.txt` becomes empty. Unexpected behavior.
So `pyiron` needs some work to avoid overwriting local files with empty content.
> I changed the code to:
>
> ```python
> try:
>     sftp_client.stat(file_dst)
>     sftp_client.get(file_dst, file_src)
> except FileNotFoundError:
>     pass
> ```

That sounds like a good idea, do you want to open a pull request to prevent this issue in future?
> It's a behavior of `paramiko`. [...] So `pyiron` needs some work to avoid overwriting local files with empty content.
This also sounds like a bug in paramiko to me. Consider reporting just this snippet to them as well.
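A self-contained reproduction that could accompany such a report might look like this (host, credentials, and the remote path are placeholders; it assumes the remote file does not exist):

```python
import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.connect("example.com", username="user")  # placeholder host/credentials
sftp = client.open_sftp()

# Pre-condition: local "1.txt" has content, remote "/tmp/1.txt" does not exist.
with open("1.txt", "w") as f:
    f.write("local content\n")

try:
    sftp.get("/tmp/1.txt", "1.txt")  # remote file is absent
except FileNotFoundError:
    pass

# Expected: "1.txt" still contains "local content".
# Observed: "1.txt" has been truncated to 0 bytes.
print(repr(open("1.txt").read()))
```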
@pmrv Yes, I'll consider reporting it to `paramiko`.