Git Product home page Git Product logo

Comments (11)

t-reents avatar t-reents commented on August 25, 2024 2

Great! If you don't mind, I can open a PR in the upcoming days.

from aiida-quantumespresso.

sphuber avatar sphuber commented on August 25, 2024 1

Thanks for the analysis @t-reents . I would personally be hesitant to put scheduler based logic in the protocols or even make scheduler specific protocols. You would have to start to do this everywhere and what happens if there is yet another scheduler plugin that you didn't cover yet.

If the scheduler would implement some method to provide default resources that the workflow/protocol could just call blindly, that could work. But I don't think that exists yet in the scheduler interface.

It is true that the current implementation is already scheduler specific due to our own SLURM background, but I think we should rather remove that and make it completely agnostic than start adding more scheduler specific logic. The protocols automate almost everything already. I think it is more than fair to ask the user to think at least about the resources they want to use and specify those. As long as that is properly documented of course.

from aiida-quantumespresso.

t-reents avatar t-reents commented on August 25, 2024

Hi @stevenbos123
AiiDA does support SGE schedulers: https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html#sge

I've never used SGE myself but only Slurm. However, based on your error message and the documentation (https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html#parenvjobresource-sge-like), you would need to specify the attributes parallel_env and tot_num_mpiprocs in the metadata.options.resources of your CalcJob.
The following should do the trick:

builder = PwBandsWorkChain.get_builder_from_protocol(
    code=code,
    structure=node.inputs.structure,
    options={'resources': {'parallel_env': 'mpi', 'tot_num_mpiprocs': 64}}
)

The provided values are only used to demonstrate the idea, you would need to adapt them accordingly, based on your system.

Hope that helps!

from aiida-quantumespresso.

stevenbos123 avatar stevenbos123 commented on August 25, 2024

Hi @t-reents,

Thank you for the reply. It seems I did not provide enough information on my problem. Just like you suggested I specified the variables parallel_env and tot_num_mpiprocs. However this poses another problem, namely the variable num_machines is now automatically used as it is part of the default settings. It can be found here:
src/aiida_quantumespresso/workflows/protocols/pw/base.yaml.

This poses a problem for SGE based systems as you do not provide the number of machines for them.
How do I proceed from here? I provided the full stack trace below again for the updated problem.

Kind regards,

Steven

ValueError                                Traceback (most recent call last)
Cell In[58], [line 16](vscode-notebook-cell:?execution_count=58&line=16)
     [14](vscode-notebook-cell:?execution_count=58&line=14)     scf_parms = node.inputs.parameters
     [15](vscode-notebook-cell:?execution_count=58&line=15)     nscf_parms = parameters_builder_relaxed_nscf()
---> [16](vscode-notebook-cell:?execution_count=58&line=16)     builder = PwBandsWorkChain.get_builder_from_protocol(
     [17](vscode-notebook-cell:?execution_count=58&line=17)         code=code,
     [18](vscode-notebook-cell:?execution_count=58&line=18)         structure=node.inputs.structure,
     [19](vscode-notebook-cell:?execution_count=58&line=19)         options={'resources': {'parallel_env': 'mpi', 'tot_num_mpiprocs': 32}}
     [20](vscode-notebook-cell:?execution_count=58&line=20)     )
     [22](vscode-notebook-cell:?execution_count=58&line=22) #     inputs = {
     [23](vscode-notebook-cell:?execution_count=58&line=23) #     'code': code,
     [24](vscode-notebook-cell:?execution_count=58&line=24) #     'structure': structure,
   (...)
     [40](vscode-notebook-cell:?execution_count=58&line=40) #     bands_processes.append(process)
     [41](vscode-notebook-cell:?execution_count=58&line=41) #     bands_pks.append(process.pk)

File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139), in PwBandsWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, options, **kwargs)
    [136](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:136) inputs = cls.get_protocol_inputs(protocol, overrides)
    [138](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:138) args = (code, structure, protocol)
--> [139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139) relax = PwRelaxWorkChain.get_builder_from_protocol(
    [140](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:140)     *args, overrides=inputs.get('relax', None), options=options, **kwargs
    [141](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:141) )
    [142](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:142) scf = PwBaseWorkChain.get_builder_from_protocol(
    [143](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:143)     *args, overrides=inputs.get('scf', None), options=options, **kwargs
    [144](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:144) )
    [145](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:145) bands = PwBaseWorkChain.get_builder_from_protocol(
    [146](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:146)     *args, overrides=inputs.get('bands', None), options=options, **kwargs
    [147](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:147) )

File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160), in PwRelaxWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, relax_type, options, **kwargs)
    [157](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:157)         raise ValueError(f'Structures with periodic boundary conditions `{structure.pbc}` are not supported.')
    [159](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:159) builder = cls.get_builder()
--> [160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160) builder.base = base
    [161](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:161) builder.base_final_scf = base_final_scf
    [162](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:162) builder.structure = structure

File [~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120), in ProcessBuilderNamespace.__setattr__(self, attr, value)
    [118](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:118)     validation_error = port.validate(value)  # type: ignore[union-attr]
    [119](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:119)     if validation_error:
--> [120](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120)         raise ValueError(f'invalid attribute value {validation_error.message}')
    [122](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:122) # If the attribute that is being set corresponds to a port that is a ``PortNamespace`` we need to make sure
    [123](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:123) # that the nested value remains a ``ProcessBuilderNamespace``. Otherwise, the nested namespaces will become
    [124](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:124) # plain dictionaries and no longer have the properties of the ``ProcessBuilderNamespace`` that provide all
    [125](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:125) # the autocompletion and validation when values are being set. Therefore we first construct a new instance
    [126](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:126) # of a ``ProcessBuilderNamespace`` for the port of the attribute that is being set and than iteratively set
    [127](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:127) # all the values within the mapping that is being assigned to the attribute.
    [128](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:128) if isinstance(port, PortNamespace):

ValueError: invalid attribute value input `metadata.options.resources` is not valid for the `SgeScheduler` scheduler: these parameters were not recognized: num_machines```

from aiida-quantumespresso.

t-reents avatar t-reents commented on August 25, 2024

Thanks for the additional details, @stevenbos123! You're right, the problem comes from the default protocols, that specify num_machines.
As a first fix, to get your jobs running, you could simply comment out/remove the following two lines:

If you provide the options as discussed before, this error shouldn't occur anymore.

Important to note: Apparently, the current protocols are designed to work with Slurm out of the box, so you wouldn't necessarily need to provide the options. If you comment out the lines above, you would also always need to provide the corresponding slurm resources via the options (in case you want to run a slurm job).

Another option would be to define scheduler dependent protocols, e.g. default_sge and default_slurm. To add the new protocols, you would need to further adjust the protocol file:

# Additional lines
...
protocols:
    moderate:
        description: 'Protocol to perform the computation at normal precision at moderate computational cost.'
    default_slurm:
        description: 'Default moderate protocol for slurm.'
        pw:
            metadata:
                options:
                    resources:
                        num_machines: 1
    default_sge:
        description: 'Default moderate protocol for sge.'
        pw:
            metadata:
                options:
                    resources:
                        parallel_env: mpi
                        tot_num_mpiprocs: 64
    precise:
...
# Additional lines

You would also need to add the new protocols to the protocol files for relax and bands. In that case, it would be sufficient to add the protocol name and the description, as the resources are only relevant in the base file.

I couldn't test the SGE protocol, but this would be the idea. This way, you should be able to simply call

builder = PwBandsWorkChain.get_builder_from_protocol(
    code=code,
    structure=node.inputs.structure,
    protocol='default_sge'
)

even without specifying the options (as I moved them to the protocol in this example).
Again, the first option should already work, by just removing the corresponding lines and providing the options to the get_builder_from_protocol method.

from aiida-quantumespresso.

t-reents avatar t-reents commented on August 25, 2024

@mbercx Without thinking too much about it, might it make sense to introduce some logic in the get_builder_from_protocol, that chooses the default resources based on the scheduler of the Code and corresponding Computer? Otherwise, one would need to have these kind of scheduler dependent protocols, if the user should be able to use these kind of protocols, no matter which scheduler is used.

Maybe I'm also overlooking something.

from aiida-quantumespresso.

t-reents avatar t-reents commented on August 25, 2024

@stevenbos123 I realized that I made a small mistake in the first version of my comment, I just edited it. To make it easier, I adjusted the protocols on a local branch, in case you would like to use the second option with the new protocols. You can check it out here:

https://github.com/t-reents/aiida-quantumespresso/tree/fix/sge_protocol

If you clone it, you can change to the directory (remember to git checkout the correct branch) and install it via pip install -e . in your virtual environment.

from aiida-quantumespresso.

t-reents avatar t-reents commented on August 25, 2024

Thank you for joining @sphuber.

I would personally be hesitant to put scheduler based logic in the protocols or even make scheduler specific protocols

Yes, this is definitely true. The examples were mainly intended to show how @stevenbos123 could adapt the protocols, in case he prefers that, and to get his setup running. In general, I would also not vote for creating scheduler-specific protocols as one would also need to duplicate them for the different "levels", e.g. precision and fast.

If the scheduler would implement some method to provide default resources that the workflow/protocol could just call blindly, that could work. But I don't think that exists yet in the scheduler interface.

This is what I was thinking about in my comment to Marnik, but I also don't think that it'd be worth to implement this/make things more complicated, for the single purpose of providing initial resources.

As long as that is properly documented of course.

I think this is the most important point of this issue. So far, I've also expected that the protocols work just out of the box, and I've never thought about any scheduler dependency (as I think it's not explicitly mentioned anywhere, especially for new users).

So to sum up, we could think of these two (simple) options, right?

  1. Keep the current protocols but clearly state and document that those are SLURM based and document what the user would need to adjust for other schedulers
  2. Make the protocols independent of the schedulers, so to think of the protocols as a set of tested and reasonable parameters for calculations, while the specification of the resources is a responsibility of the user (which makes sense in a way, as systems will vary anyway). In this case, we would need to remove the inputs for metadata.options.resource from the protocols and probably make the options argument in get_builder_from_protocol non-default, right?

from aiida-quantumespresso.

sphuber avatar sphuber commented on August 25, 2024

Fully agree with the two options. Since they are not really mutually exclusive, we could already start with the first and simply improve the docs. We can then add a deprecation warning id no explicit resources are.provided saying this will start to error in the future. And then at some point we can remove the resource defaults to make the protocols scheduler agnostic

from aiida-quantumespresso.

JPchico avatar JPchico commented on August 25, 2024

Hi! Any update on this? We are also having this problem, I commented out the offending line in the protocol, but I wanted to know if there is a more long term solution in the pipeline.

from aiida-quantumespresso.

t-reents avatar t-reents commented on August 25, 2024

Hi @JPchico! I just continued my work on the corresponding PR introducing the warnings, sorry for the delay!
I guess we will wait for some time after merging it and then remove the Slurm specific line in a future release, as indicated by @sphuber above.

from aiida-quantumespresso.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.