
Comments (12)

hujay2019 commented on June 11, 2024

@jan-janssen I opened a pull request: pyiron/pysqa#248

jan-janssen commented on June 11, 2024

@hujay2019 Thank you for testing pyiron. We might be a little slow to respond during the Christmas break.

To understand a bit better what is causing your issue, could you post your queue.yaml files for both your local workstation and the HPC? I would also suggest logging in to the remote cluster via SSH and submitting a calculation directly on the login node; once that works, try the same from your local workstation. This helps us identify whether the file gets corrupted during the transfer or whether something goes wrong before that.
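For example, a quick test with pysqa directly on the login node could look like this (just a sketch; the directory is a placeholder for wherever the resources/queues configuration lives on the cluster):

from pysqa import QueueAdapter

# Load the queue configuration that lives on the cluster itself.
qa = QueueAdapter(directory="/path/to/pyiron/resources/queues/")

# Submit a trivial test job; if this returns a queue ID on the login
# node, the SLURM setup on the remote side is working.
queue_id = qa.submit_job(
    queue="slurm",
    job_name="pysqa_test",
    working_directory=".",
    cores=1,
    command="sleep 10",
)
print(queue_id)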

hujay2019 commented on June 11, 2024

@jan-janssen Thanks for your response.
queue.yaml on the local workstation:

queue_type: REMOTE
queue_primary: slurm
ssh_host: *******
ssh_port: ****
ssh_username: hujie
known_hosts: /home/hu/.ssh/known_hosts
ssh_key: /home/hu/.ssh/***
ssh_remote_config_dir: /home/hujie/pyiron/resources/queues/
ssh_remote_path: /home/hujie/pyiron/pro/
ssh_local_path: /home/hu/pyiron/projects/
ssh_continous_connection: True
#ssh_delete_file_on_remote: False
queues:
    slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440}

queue.yaml on the HPC:

queue_type: SLURM
queue_primary: slurm
queues:
    slurm: {cores_max: 64, cores_min: 1, run_time_max: 1440, script: slurm.sh}

Of course, job.server.queue = "slurm" and job.server.cores = 64.
Submitting calculations on the login node works fine.
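For completeness, the jobs are set up roughly like this (a sketch; the job type, structure and names are only examples):

from pyiron import Project

pr = Project("remote_test")
job = pr.create.job.Lammps("lmp_test")  # any job type shows the same behavior
job.structure = pr.create.structure.bulk("Al", cubic=True)
job.potential = job.list_potentials()[0]  # pick any available potential
job.server.queue = "slurm"  # queue name from queue.yaml
job.server.cores = 64
job.run()  # submitted to the remote cluster through the REMOTE adapter
pr.wait_for_job(job, interval_in_s=5, max_iterations=1000)  # where the error shows up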

hujay2019 commented on June 11, 2024

In the source file pyiron_base/jobs/job/extension/server/queuestatus.py:

def wait_for_job(job, interval_in_s=5, max_iterations=100):
    """
    Sleep until the job is finished but maximum interval_in_s * max_iterations seconds.

    Args:
        job (pyiron_base.job.utils.GenericJob): Job to wait for
        interval_in_s (int): interval when the job status is queried from the database - default 5 sec.
        max_iterations (int): maximum number of iterations - default 100

    Raises:
        ValueError: max_iterations reached, job still running
    """
    if job.status.string not in job_status_finished_lst:
        if (
            state.queue_adapter is not None
            and state.queue_adapter.remote_flag
            and job.server.queue is not None
        ):
            finished = False
            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )
        else:
            finished = False
            for _ in range(max_iterations):
                if state.database.database_is_disabled:
                    job.project.db.update()
                job.refresh_job_status()
                if job.status.string in job_status_finished_lst:
                    finished = True
                    break
                elif isinstance(job.server.future, Future):
                    job.server.future.result(timeout=interval_in_s)
                    finished = job.server.future.done()
                    break
                else:
                    time.sleep(interval_in_s)
            if not finished:
                raise ValueError(
                    "Maximum iterations reached, but the job was not finished."
                )

In this loop:

            for _ in range(max_iterations):
                if not queue_check_job_is_waiting_or_running(item=job):
                    state.queue_adapter.transfer_file_to_remote(
                        file=job.project_hdf5.file_name,
                        transfer_back=True,
                    )
                    status_hdf5 = job.project_hdf5["status"]
                    job.status.string = status_hdf5
                else:
                    status_hdf5 = job.status.string
                if status_hdf5 in job_status_finished_lst:
                    job.transfer_from_remote()
                    finished = True
                    break
                time.sleep(interval_in_s)

When the status of the remote job changes to finished, state.queue_adapter.transfer_file_to_remote() is called in the first if statement and the remote .h5 file is deleted afterwards. Then job.transfer_from_remote() is called in the second if statement, but at that point the remote .h5 file is already gone.

I think that is probably where the bug comes from.
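To make the call sequence concrete, here is a minimal stand-in in plain Python (not the pyiron API):

# The remote side modeled as a dict; transfer_back() mimics
# transfer_file_to_remote(..., transfer_back=True), which copies the
# file back and then deletes the remote copy.
remote = {"job.h5": b"finished job data"}
local = {}

def transfer_back(name):
    local[name] = remote[name]
    del remote[name]

transfer_back("job.h5")  # first transfer inside wait_for_job(): works
transfer_back("job.h5")  # second transfer via transfer_from_remote():
                         # KeyError, the remote file is already gone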

hujay2019 commented on June 11, 2024

If I comment out the following lines in the transfer_from_remote() method of pyiron_base/jobs/job/generic.py:

        # state.queue_adapter.transfer_file_to_remote(
        #     file=self.project_hdf5.file_name,
        #     transfer_back=True,
        # )

then the error no longer occurs.

jan-janssen commented on June 11, 2024

@hujay2019 Thank you for your feedback, I am happy you got it working. I am still confused about why this happens and when the behaviour changed. Basically, when the local job already has a queuing system ID, there should never be a new transfer of the local job to the remote location, as is happening here. But I need to take a deeper look at this, and that might take a bit more time.

jan-janssen commented on June 11, 2024

As the transfer_back parameter is set to True, it should already copy back the correct job, so I am surprised that the job seems to be empty. Can you try to disable the deletion on the remote location in your .pyiron settings file and check the content of the HDF5 file on the remote location? I just want to make sure the job is executed correctly on the remote location and that the file only gets corrupted during the transfer.
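To check the content, something like this directly on the cluster should be enough (the path and job name are placeholders; the exact layout depends on how the file was written):

import h5py

# Inspect the job's HDF5 file on the remote cluster before it is
# deleted or transferred back.
with h5py.File("/home/hujie/pyiron/pro/my_project/job_name.h5", "r") as f:
    f.visit(print)  # list all groups and datasets
    print(f["job_name/status"][()])  # pyiron stores the status in the job group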

hujay2019 commented on June 11, 2024

I made sure that the remote task is executed correctly, that its status is finished, and that the remote .h5 file can be parsed by h5py. I think the remote .h5 file was deleted before the second transfer, so the correctly received local .h5 file was then overwritten with empty content.

In the _transfer_files method of pysqa/ext/remote.py:

                try:
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

If the remote .h5 file was previously deleted, sftp_client.get(file_dst, file_src) throws a FileNotFoundError. However, before the error is raised, the local file is already opened with mode='wb'. So when the `FileNotFoundError` is raised, the local .h5 file has been truncated to empty.
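The effect is easy to reproduce without paramiko, since it only depends on the order of the open() call and the error (plain Python simulation):

import os

# A local file with valid content.
with open("local.h5", "wb") as f:
    f.write(b"valid job data")

def naive_get(remote_exists, local_path):
    # Like sftp_client.get(): the local target is opened with mode="wb"
    # (truncating it) before the remote file is even checked.
    with open(local_path, "wb") as f:
        if not remote_exists:
            raise FileNotFoundError("remote file was already deleted")
        f.write(b"remote content")

try:
    naive_get(remote_exists=False, local_path="local.h5")
except FileNotFoundError:
    pass

print(os.path.getsize("local.h5"))  # 0 -> the valid local file is now empty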

I changed the code to:

                try:
                    sftp_client.stat(file_dst)
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

It works: the FileNotFoundError is now thrown by sftp_client.stat(file_dst) before sftp_client.get(file_dst, file_src) runs, so the local .h5 file is never opened. The downside is some extra time overhead, since the stat() call adds one more SFTP round trip per file to check whether the remote file exists.

In summary, the fix is to skip files that do not exist on the remote side. I think there should be a solution without the extra overhead, but I have not implemented one yet.
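One direction (just an untested sketch, not what pysqa currently does) would be to download into a temporary file and only replace the local file on success; this avoids both the truncation and the extra stat() round trip:

import os
import tempfile

def safe_get(sftp_client, file_dst, file_src):
    # Download into a temporary file next to the target, so a
    # FileNotFoundError from sftp_client.get() can never truncate an
    # existing local copy of file_src.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(file_src)))
    os.close(fd)
    try:
        sftp_client.get(file_dst, tmp_path)
    except FileNotFoundError:
        os.remove(tmp_path)  # remote file is gone: keep the local file untouched
    else:
        os.replace(tmp_path, file_src)  # atomic rename on POSIX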

hujay2019 commented on June 11, 2024

It's a behavior of paramiko.

I tested paramiko:

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()  # so connect() accepts the known host
client.connect("*******")
sftp = client.open_sftp()

try:
    sftp.get("/home/**/1.txt", "1.txt")
except FileNotFoundError:
    print("File Not Found")

1. A remote non-empty `1.txt` exists and a local non-empty `1.txt` exists: the local `1.txt` is overwritten by the remote one. Correct.

2. The remote `1.txt` is deleted while a local non-empty `1.txt` exists: the `FileNotFoundError` is caught, and the local `1.txt` becomes empty. Unexpected behavior.

So pyiron needs a safeguard against overwriting local files with empty content.

jan-janssen commented on June 11, 2024

I changed the code to:

                try:
                    sftp_client.stat(file_dst)
                    sftp_client.get(file_dst, file_src)
                except FileNotFoundError:
                    pass

That sounds like a good idea. Do you want to open a pull request to prevent this issue in the future?

pmrv commented on June 11, 2024

It's a behavior of paramiko.

I tested paramiko:

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()  # so connect() accepts the known host
client.connect("*******")
sftp = client.open_sftp()

try:
    sftp.get("/home/**/1.txt", "1.txt")
except FileNotFoundError:
    print("File Not Found")

1. A remote non-empty `1.txt` exists and a local non-empty `1.txt` exists: the local `1.txt` is overwritten by the remote one. Correct.

2. The remote `1.txt` is deleted while a local non-empty `1.txt` exists: the `FileNotFoundError` is caught, and the local `1.txt` becomes empty. Unexpected behavior.

So pyiron needs a safeguard against overwriting local files with empty content.

This also sounds like a bug in paramiko to me. Consider reporting just this snippet to them as well.

hujay2019 commented on June 11, 2024

@pmrv Yes, I'll consider reporting it to paramiko.
