archer2-hpc / archer2-docs

Repository for ARCHER2 documentation

License: Other

archer2-docs's Introduction

ARCHER2 Documentation

ARCHER2 is the next generation UK National Supercomputing Service. You can find more information on the service and the research it supports on the ARCHER2 website.

This repository contains the documentation for the service and is linked to a rendered version currently hosted on GitHub Pages.

This documentation is drawn from the Cirrus documentation, Sheffield Iceberg documentation and the ARCHER documentation.

Rendered documentation

ARCHER2 Documentation (HTML)

How to contribute

We welcome contributions from the ARCHER2 community and beyond. Contributions can take many different forms. Some examples are:

Raising Issues if you spot a mistake or something that could be improved
Adding/updating material via a Pull Request
Adding your thoughts and ideas to any open issues

All people who contribute and interact via this GitHub repository undertake to abide by the ARCHER2 Code of Conduct so that we, as a community, provide a welcoming and supportive environment for all people, regardless of background or identity.

To contribute content to this documentation, first fork the repository on GitHub and clone it to your machine. See Fork a Repo for the GitHub documentation on this process.

Once you have the git repository locally on your computer, you will need to install Material for mkdocs to be able to build the documentation. This can be done using a local installation or using a Docker container.

Once you have made your changes and updated your Fork on GitHub you will need to Open a Pull Request.

Building the documentation on a local machine

Once Material for mkdocs is installed, you can preview the site locally using the instructions in the Material for mkdocs documentation.

Making changes and style guide

The documentation consists of a series of Markdown files which have the .md extension. These files are automatically converted to HTML and combined into the web version of the documentation by mkdocs. It is important to follow Markdown syntax when editing the files. If there are any errors in your changes, the build will fail and the documentation will not update. You can test your build locally by running mkdocs serve. The easiest way to learn what files should look like is to read the Markdown files already in the repository.

A short list of style guidance:

Headings should be in sentence case
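The sentence-case guidance can be checked roughly before opening a Pull Request. The sketch below is a heuristic only, and check_sentence_case is a hypothetical helper that is not part of the repository. It flags any Markdown heading in which a word after the first looks capitalised, so it will also flag legitimate proper nouns (for example "Slurm") and shell comments inside code blocks that start with "# ".

```shell
#!/bin/bash
# Heuristic check: report Markdown headings that look like Title Case.
# check_sentence_case is a hypothetical helper, not part of the repository.
check_sentence_case() {
  awk '
    /^#+ / {
      n = split($0, word, " ")
      # word[1] is the run of "#" characters, word[2] is the first word,
      # which is allowed to start with a capital letter.
      for (i = 3; i <= n; i++) {
        if (word[i] ~ /^[A-Z][a-z]+$/) {
          printf "%s:%d: %s\n", FILENAME, FNR, $0
          break
        }
      }
    }' "$@"
}
```

It could be run as check_sentence_case docs/user-guide/*.md, where the path is an assumption about the repository layout. All-capital words such as ARCHER2 or MPI are not flagged because the pattern requires lower-case letters after the initial capital.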

archer2-docs's Issues

Add Quickstart for developers

The ARCHER2 Quickstart for developers needs to be added. Overview of the content required can be found at:

https://docs.archer2.ac.uk/quick-start/overview.html

Add detailed hardware information in a new User and Best Practice Guide section

It would be useful to have a new section in the User and Best Practice Guide that covers the ARCHER2 hardware and architecture in more detail. This could go after Overview but before Connecting. This should cover:

System overview: node types, storage types, interconnect, external networking
Compute node details: layout, interconnect
Processor details: cores, core complexes, infinity core, NUMA regions, FP unit and instruction sets, cache
Memory details: type, speed, volume, bandwidth/latency (theoretical and measured)
Interconnect details: topology, features, bandwidth/latency (theoretical and measured)
Point to IO section for more details on storage

Performance tuning and best practice for OpenMP

We need a section on getting the most out of OpenMP: both generic (e.g. top ten tips for OpenMP, pointing to further documentation) and specific to ARCHER2 and AMD EPYC Zen2 (will need at least the TDS for this). It should also cover what functionality is and is not available in the various PrgEnv modules.

Modify MITgcm documentation for clarity on ECCOv4-r4 process

Based on user feedback, I need to:

Mention that after using 'wget' to obtain the forcing data, the files need to be copied from their default directory
For clarity and redundancy, copy the compilation instructions into the ECCOv4-r4 case

(Please feel free to assign this issue to me)

Update STAT documentation

The following document:
https://docs.archer2.ac.uk/user-guide/debug/#stat
does not reflect the following issue:
https://docs.archer2.ac.uk/known-issues/#stat-view-not-working
It'd be useful to link the issue from the STAT section.

Add generic profiling information (CrayPat)

Add information on workaround for MPMD jobs

As Slurm MPMD is not yet working correctly we should document the current workaround in the Scheduler chapter.

Document use of hybrid MPI+OpenMP

The current documentation has an example MPI+OpenMP script but no documentation describing the background of how to run these jobs, more advanced placement information and a description of the best layout to match onto the ARCHER2 NUMA structure. This should be added in the Scheduler chapter. Point to the Tuning chapter for more advanced information on OpenMP.
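As a starting point for the layout discussion, the arithmetic behind matching tasks and threads to NUMA regions can be sketched as below. The node geometry (128 cores in 8 NUMA regions) is an assumption that is not stated in this issue and should be checked against the real system. hybrid_layout is a hypothetical helper that accepts a thread count only if a whole number of MPI tasks fits inside one NUMA region.

```shell
#!/bin/bash
# Assumed node geometry (not stated in the text above): adjust to the real system.
CORES_PER_NODE=128
NUMA_REGIONS=8

# hybrid_layout THREADS
# Print the tasks per node for THREADS OpenMP threads per MPI task,
# or fail if a task would straddle a NUMA region boundary.
hybrid_layout() {
  local threads=$1
  local cores_per_numa=$(( CORES_PER_NODE / NUMA_REGIONS ))
  if (( threads < 1 || cores_per_numa % threads != 0 )); then
    echo "threads=$threads does not divide a NUMA region of $cores_per_numa cores" >&2
    return 1
  fi
  echo "tasks_per_node=$(( CORES_PER_NODE / threads )) cpus_per_task=$threads"
}

# Show the layouts that keep every task inside a single NUMA region.
for t in 1 2 4 8 16; do
  hybrid_layout "$t"
done
```

The two printed values would correspond to the Slurm options --ntasks-per-node and --cpus-per-task, with OMP_NUM_THREADS set to the same value as cpus_per_task. Which binding options give the correct thread placement still needs to be confirmed on the system.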

Add UAN fingerprint to connection section

Are "here" documents useful for job submission on ARCHER2?

In the NERSC scheduler best practice, they use here documents to potentially reduce load on compute nodes and make jobs more efficient. See:

https://docs.nersc.gov/jobs/best-practices/#improve-efficiency-by-preparing-user-environment-before-running

There they describe creating a script such as:

#!/bin/bash -l

# Submit this script as: "./prepare-env.sh" instead of "sbatch prepare-env.sh"

# Prepare user env needed for Slurm batch job
# such as module load, setup runtime environment variables, or copy input files, etc.
# Basically, these are the commands you usually run ahead of the srun command 

module load cray-netcdf
export OMP_NUM_THREADS=4

# Generate the Slurm batch script below with the here document, 
# then when sbatch the script later, the user env set up above will run on the login node
# instead of on a head compute node (if included in the Slurm batch script),
# and inherited into the batch job.

cat << EOF > prepare-env.sl 
#!/bin/bash
#SBATCH -t 30:00
#SBATCH -N 8
#SBATCH -q debug
#SBATCH -C haswell

srun -n 16 -c 32 --cpu_bind=cores ./myapp.exe 

# Other commands needed after srun, such as copy your output files,
# should still be included in the Slurm script.
cp <my_output_file> <target_location>/.
EOF

# Now submit the batch job
sbatch prepare-env.sl

@kevinstratford commented

Not sure I like that here document business; if the preparatory work is really
significant, it could be a separate job with the main job as dependency. This
prevents conflating scripts (does prepare-env.sh here document overwrite
the submitted prepare-env.sh??)

What do people think, should we include this advice or not?
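One detail worth weighing in this discussion is how the here-document delimiter is quoted, because it decides whether variables are expanded when the batch script is generated or when it runs. The sketch below illustrates only that shell behaviour; it writes two small files to a temporary directory and does not call sbatch. The file names are hypothetical.

```shell
#!/bin/bash
# Sketch only: generates two small files and does not call sbatch.
workdir=$(mktemp -d)
export OMP_NUM_THREADS=4

# Unquoted delimiter: $OMP_NUM_THREADS is expanded now, while generating the file.
cat << EOF > "$workdir/expanded.sl"
#!/bin/bash
echo "threads=$OMP_NUM_THREADS"
EOF

# Quoted delimiter: the text is written literally and is expanded only
# when the generated script itself runs.
cat << 'EOF' > "$workdir/literal.sl"
#!/bin/bash
echo "threads=$OMP_NUM_THREADS"
EOF

cat "$workdir/expanded.sl" "$workdir/literal.sl"
```

The NERSC example uses the unquoted form, so any variable reference inside its here document would be fixed at generation time on the login node. It also writes to prepare-env.sl rather than prepare-env.sh, so the generating script is not overwritten.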

Initial version of Debugging section

Need to create the basic documentation for using gdb4hpc. Could base on docs at:

https://www.alcf.anl.gov/support-center/theta/gdb

Could then be improved once we have experience on the system.

Update NAMD page with information on how to get good performance

Early access users identified that particular run options and configuration are required to get good performance using NAMD on ARCHER2. The NAMD page needs to be updated to include this information.

Create initial entries in Data Analysis and Tools section

We need to look at creating some content here. Some initial pages could be:

VisiData
R (Cray R)

Complete Scheduler chapter

Complete initial version of scheduler chapter for initial beta release

Update MITgcm documentation to include new build options file

As part of an eCSE, we are developing a new build options file. We would like to link to a preliminary version of the new file.

(Please feel free to assign this issue to me.)

archer-migration/data-migration

There is currently a duplication of material in

archer-migration/data-migration

and

user-guide/data-migration

This needs to be rationalised.

Update "Debugging" section based on TDS access

List of Issues and Items for review prior to main system going live

Review of ARCHER2 Docs and identifying issues with move to main system (7/JUL/21)

Changes completed

https://docs.archer2.ac.uk/faq/index.html#archer-work-data
Add year (2021) to date that ARCHER /work was decommissioned - DONE (CB)

https://docs.archer2.ac.uk/user-guide/connecting/#logging-in
Order of password and ssh key passphrase being reversed - DONE (ART)

https://docs.archer2.ac.uk/user-guide/data#work-file-systems - DONE (ART)
Update size to full /work

https://docs.archer2.ac.uk/user-guide/sw-environment - DONE (ART)
@aturner-epcc to look at this. See: #301

https://docs.archer2.ac.uk/user-guide/scheduler/#quality-of-service-qos - DONE (ART)

https://docs.archer2.ac.uk/user-guide/scheduler/#using-modules-in-the-batch-system-the-epcc-job-env-module
Need to review whether epcc-job-env-module will continue
This may break every user submit script if changed! - DONE (ART)
@aturner-epcc to look at this

https://docs.archer2.ac.uk/user-guide/scheduler/#bolt-job-submission-script-creation-tool
*** Julien check if bolt works - DONE (ART)

https://docs.archer2.ac.uk/user-guide/dev-environment/
@aturner-epcc to look at this #302 - DONE (ART)

Add information on resources on ARCHER2

Add information to docs on:

What a CU is and how it corresponds to time use on ARCHER2
How charging works: based on used time rather than requested time
You are charged for the nodes assigned to the job even if you do not use them all. For example, if you request 4 nodes and only use 2, you are charged for all 4 nodes, as they are not available to other users while assigned to your job
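The charging rules above could be illustrated in the docs with a small worked calculation. The sketch below assumes that 1 CU corresponds to one node for one hour, which is not stated in this issue and should be confirmed before publishing. cu_charge is a hypothetical helper.

```shell
#!/bin/bash
# cu_charge NODES ELAPSED_SECONDS
# Assumption for illustration: 1 CU = one node for one hour.
# The charge depends on the nodes allocated and the time actually used,
# not on the requested walltime or on how many nodes did useful work.
cu_charge() {
  local nodes=$1 seconds=$2
  LC_ALL=C awk -v n="$nodes" -v s="$seconds" 'BEGIN { printf "%.2f\n", n * s / 3600 }'
}

# A job that was allocated 4 nodes, requested 2 hours, finished after
# 90 minutes and only ran work on 2 of the nodes:
cu_charge 4 5400
```

Under the stated assumption this prints 6.00: four allocated nodes for 1.5 hours, regardless of the 2-hour request or the two idle nodes.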

Update "Quickstart for developers" section based on TDS access

Add info on ownership of data in subgroup directories

New data created in subgroup directories has the correct ownership because of the setgid bit. However, data copied or moved from elsewhere on /work (e.g. main project directories) keeps its current ownership, and because it also has the setgid bit set, new data created within those directories inherits that original ownership. We should document this issue, and the use of the chown command to fix ownership, as it does trip users up.
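A possible shape for the documented fix is sketched below. It runs in a temporary scratch directory so that it can be tried anywhere. On the real system the target would be a subgroup directory on /work and the group would be the subgroup's Unix group; here the user's primary group is used as a stand-in, so the chown step changes nothing and only shows the form of the command.

```shell
#!/bin/bash
# Sketch in a scratch directory; paths and group are stand-ins (assumptions).
scratch=$(mktemp -d)
group=$(id -gn)                 # stand-in for the subgroup's Unix group

mkdir "$scratch/subgroup"
chgrp "$group" "$scratch/subgroup"
chmod g+s "$scratch/subgroup"   # new files inherit the directory's group

# Data moved in from elsewhere keeps whatever group it already had...
mkdir "$scratch/elsewhere"
touch "$scratch/elsewhere/results.dat"
mv "$scratch/elsewhere" "$scratch/subgroup/"

# ...so reset the group recursively and make sure every directory carries setgid.
chown -R ":$group" "$scratch/subgroup"
find "$scratch/subgroup" -type d -exec chmod g+s {} +

ls -ld "$scratch/subgroup" "$scratch/subgroup/elsewhere"
```

A user can only change the group of files they own, and only to a group they belong to, so in practice each user may need to run the fix on their own data.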

Remove ARCHER to ARCHER2 part of docs

ARCHER is no more so some of this material is no longer relevant. Some of the information may still be of use so should be moved to other sections as required.

Add reservations info

Complete the following section with further guidance about reservations:
https://docs.archer2.ac.uk/user-guide/scheduler/#reservations

Add Quickstart for Package Users

The ARCHER2 Quickstart for package users needs to be added. Overview of the content required can be found at:

https://docs.archer2.ac.uk/quick-start/overview.html

Add note on how to query CPU usage from Slurm on running jobs

Update "Running jobs on ARCHER2" section based on TDS access

Initial version of data analysis section

This section is going to be difficult until we see what is available via the collaboration platform.

Could include information on using the cray-R environment here.

Update "Containers" section based on TDS access

Add research-software templates

I intend to add subdirectories with scraped template content under

research-software

The following are relevant with most recent existing source

Cirrus -> CASTEP
ARCHER -> Code_Saturne
ARCHER -> PyChemShell/ChemShell
Cirrus -> CP2K
ARCHER -> ELK
ARCHER -> FEniCS
Cirrus -> GROMACS
Cirrus -> LAMMPS
New!!! -> Met Office Unified Model
New!!! -> MITgcm
Cirrus -> NAMD
New!!! -> Nektar++
New!!! -> NEMO
ARCHER -> NWChem
ARCHER -> ONETEP
Cirrus -> OpenFOAM
Cirrus -> Quantum Espresso
Cirrus -> VASP

Initial version of Python chapter

Before we can update the Python chapter, we need to decide on the approach to Python on ARCHER2. Initial proposal is:

For compute node, high-performance Python: use the cray-python environment. Need to document how you use this and how you install further Python modules on top
For data analysis, serial Python: probably provide an Anaconda distribution. Should this be provided as a module or a container environment?
For self-installed Python: need to recommend a solution. Could be miniconda or could advise to pull containers from the DockerHub

Populate Essential Skills section

The essential skills section needs to be updated to point to useful material such as Software Carpentry shell-novice.

Update "Software environment" section based on TDS access

Options for hybrid MPI/OpenMP jobs lead to incorrect thread placement

The options currently specified in the User Guide for hybrid MPI/OpenMP lead to multiple threads being placed on the same core. We need to investigate and find the correct options in Slurm to get the placement working.

Add information on multiple `srun` commands in batch scripts

Show how multiple srun commands can be used in job scripts - including to place multiple calculations on a single node.
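A minimal sketch of the pattern is given below, assuming a job that has been allocated one node with at least 128 cores. The #SBATCH header is omitted, and ./calc and its input files are hypothetical. The exact srun options needed to share a node cleanly (for example for memory) depend on the Slurm version and configuration and would need checking on the system. A dry-run fallback is included so the sketch can be tried on a machine without Slurm.

```shell
#!/bin/bash
# Sketch of the body of a job script that has been allocated one node.
# The #SBATCH header is omitted; ./calc and its inputs are hypothetical.

# Dry-run fallback so the sketch can be tried on a machine without Slurm.
if ! command -v srun > /dev/null 2>&1; then
  srun() { echo "[dry-run] srun $*"; }
fi

# Launch four independent calculations in the background,
# each as its own job step on a share of the node.
for i in 1 2 3 4; do
  srun --nodes=1 --ntasks=32 --exact ./calc "input_${i}.dat" > "calc_${i}.log" 2>&1 &
done

# Without this the batch script would exit, and the job would end,
# before the background job steps have finished.
wait
```

The two essential parts are the trailing & on each srun line, which lets the job steps run concurrently, and the final wait.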

Add information on shared directories and their use

Should cover:

Shared directories on /home and /work
Sharing with subgroup, project, others - different directory hierarchies and unix permissions
Impact on quotas
What happens to data in shared directories when user accounts are removed

Update "Application development environment" section based on TDS access

Add information on creating and using Singularity containers with MPI

The Containers section does not currently have information on how to create and use containers with MPI on ARCHER2. This information does exist, see:

https://epcced.github.io/2020-12-08-Containers-Online/12-singularity-mpi/index.html

and the example ARCHER2 job submission script at:

https://github.com/EPCCed/2020-12-08-Containers-Online/blob/gh-pages/files/osu_latency.slurm.template

Update "Quickstart for users" section based on TDS access

Performance tuning and best practice for MPI

We need a section on getting the most out of MPI: both generic (e.g. top ten tips for MPI, pointing to further documentation) and specific to ARCHER2 and Slingshot (will need at least the TDS for this). It should also cover what functionality is and is not available in CrayMPI, and any limits that users should know about (maximum tag counts, eager message defaults, etc.).

Update "Using Python" section based on TDS access

Complete initial version of Application Developer Environment

Template material has been copied across and needs to be modified to match the Cray environment.

Add information on connecting from Windows using command line rather than point and click GUI

At the moment we only cover connecting to ARCHER2 from Windows using MobaXTerm:

https://docs.archer2.ac.uk/user-guide/connecting/#logging-in-from-windows-using-mobaxterm

Now that Windows PowerShell supports the SSH command line more consistently, we should update the docs to cover connecting using that mechanism from Windows too.

Update library modules requiring new versions

New

ARPACK 3.8.0

Deferred

ADIOS 2.6.0

Other

Confirm status of cray-ga or remove

sstat --format=JobID,JobName,averss,maxrss,maxrsstask,avevms,maxvms,maxvmsize -j 12345

Or, to get memory use of a completed job:

sacct --format=JobID,JobName,averss,maxrss,maxrsstask,avevms,maxvms,maxvmsize -j 12345