olcf-user-docs's Issues

Add info on sshfs

We frequently see OLCF users get shut out from repeated network connection attempts when sshfs attempts auto-reconnects (that fail due to 2 factor, of course).

Don't run sshfs with reconnect options.

other options for transferring data?

Right now, the "Transferring Data" page only describes Globus. Should we have instructions about using command-line utilities with the DTNs, and/or other methods?

remove cross references in summit guide

as the "connecting" and "data and transfers" sections will likely link back to stand-alone pages on those topics, we shouldn't have cross-ref destinations pointing to those sections; wherever these references are needed, they should point to the authoritative pages.

Identify and migrate relevant "Software Pages"

With this migration of OLCF docs into version control, we will be dropping the "Software Pages".

However, there is a small set of software packages that have important documentation living in these pages. This issue is to identify which software packages should have their content migrated into this repository, and to make it happen.

Most likely, packages of interest are those that are vendor-provided and have vendor-provided site-specific accompanying documentation.

Add info on NVMe NFS caching to Summit User Guide

Starting with the Sept. 24 outage, 100GB of each compute node's NVMe device will be dedicated to an NFS cache that holds default libs. This should improve launch times. It also limits user-writable space on the burst buffer to 1400GB (down from 1500GB).

Possibility of webhooks to push to production

Currently, the OLCF internal K8 cluster polls every few minutes to check for changes.

I suspect it's possible to set up a webhook or some other CI/DevOps process to trigger a rebuild on the K8 cluster when a PR is merged into the master branch. We should look into those options and what sort of security and credential management we'd need to do that with a public GitHub repo and a closed-access internal K8 cluster.
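One piece of the credential question can be sketched in code. GitHub signs each webhook delivery with a shared secret and sends the result in the X-Hub-Signature-256 header as "sha256=<hex digest>". The sketch below assumes a small receiver sits in front of the K8 cluster; the function names and the rebuild policy are hypothetical, not part of any existing OLCF setup.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, payload: bytes, header: str) -> bool:
    """Return True if `header` matches the HMAC-SHA256 of the raw request body.

    `header` is the value of the X-Hub-Signature-256 request header.
    """
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information about the match.
    return hmac.compare_digest(expected.encode(), header.encode())


def should_rebuild(event: str, payload: dict, branch: str = "master") -> bool:
    """Hypothetical policy: rebuild only on pushes to the given branch.

    A merged PR shows up as a "push" event whose "ref" names the target branch.
    """
    return event == "push" and payload.get("ref") == f"refs/heads/{branch}"
```

A receiver would reject any delivery for which signature_is_valid returns False before looking at the payload, so the only credential shared with the public repo is the webhook secret.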

Preparing Data Section

I am preparing the data section: removing Lustre, organizing some images, and redoing some of the videos.

use note/warning directives

Instead of formatted blocks with bold "Note" or "Warning", we can use .. note:: a note and .. warning:: a warning to get colored blocks. We should update throughout.
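A rough sketch of how the bulk update might be automated. It assumes the legacy blocks are single lines that start with bold "Note" or "Warning" (for example "**Note:** text"); the real pages may use other forms, and multi-line blocks would still need manual indentation. The function name is made up for illustration.

```python
import re

# Assumed legacy form: a line starting with bold "Note" or "Warning",
# with the colon either inside or outside the bold markers.
_BOLD_ADMONITION = re.compile(
    r"^(?P<indent>\s*)\*\*(?P<kind>Note|Warning):?\*\*:?\s*(?P<text>.*)$"
)


def to_directive(line: str) -> str:
    """Rewrite one bold Note/Warning line as an RST directive line.

    Lines that do not match the assumed legacy form are returned unchanged.
    """
    m = _BOLD_ADMONITION.match(line)
    if m is None:
        return line
    kind = m["kind"].lower()
    return f"{m['indent']}.. {kind}:: {m['text']}".rstrip()
```

Applied line by line to an .rst file, this leaves everything else untouched, so the resulting diff stays easy to review.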

Tables not visible at low width

Body content is responsive for smaller screens / browser windows. Tables are somewhat responsive, but a table's responsiveness is limited by the amount of content in its cells. When tables cannot be dynamically made smaller, the rightmost columns become inaccessible. Horizontal scrolling is an option, but is not working.

Tested and confirmed a problem in:

  • Chrome
  • Firefox

re-sync Rhea user guide

The wordpress-based Rhea User Guide at https://www.olcf.ornl.gov/for-users/system-user-guides/rhea/ is frozen as of 06 September 2019.

This means that the content in these new pages needs to be re-synced. Here is the procedure:

howto

  1. put your name by a section below to claim it
  2. update the content of that section:
    • make sure that the topics covered by the previous wordpress page are covered here
    • check that all hyperlinks (both external and internal) work. Actually click on them; don't just read them
    • check visual elements: images scale correctly, tables are shaped correctly, etc.
    • fix any other problems you see, or at least submit a separate issue for it to remind us to fix it in the future
    • think about what could make the section better, and submit issue(s) with suggestions
  3. Be sure to mention this issue via something like "Partially addresses #14" in your PR description.
  4. check off your section below once your PR addressing it has been merged.
  5. return to "1" above

todo

*these sections should just refer back to the general pages on the topic, possibly with exceptions mentioned for Rhea

remove volume mounts from container

The openshift build needs to be updated to remove the volume mounts and perform binary builds to pull in the necessary changes to the webserver configuration.

We also need to verify that the space is being managed correctly regarding access logs, etc.

Suggestions for repo settings

We should probably add to this repo:

  • Set up collaborators and team members on the repo; I think multiple people in UA should be maintainers on the repo. This would also be useful for assigning issues to specific people.
  • Protected branches: never mind, it seems all branches are already protected. I like it.
  • Private issues; I don't think we have a github enterprise account, but we should look into that for a way to have issues and discussions that don't require hopping over into MM that can be private and not revealed to the public. Especially for working on docs for systems still under NDA. A potential option would be to use Public GitLab (which allows for free public repos with private issues) instead of GitHub?
  • Set up an MM integration to alert the MM channel to changes.
  • Define a set of issue labels for binning common issues (maybe per system, NDA, infrastructure, deprecated information, etc...)
  • Policy for timely PR approvals: who can and should approve changes and when? One of the issues we've had with the software CI deployments is that we require MR approvals but not everyone who can approve voluntarily reviews changes without prompting.

Update docs to reflect Titan/Eos decommissions

References to Titan and Eos should be removed following the decommissioning of these systems.

There are a handful of references to these individual systems in our Policies, as well as in the System User Guides.

fix relative links from wordpress

todo

Here is a preliminary list of offenders found with a simple grep (grep -r '</for-users/' *), but there are other forms of brokenness as well:

  • getting_started/index.rst
  • olcf_policy_guide.rst
  • documents_and_forms.rst
  • getting_started.rst
  • frequently_asked_questions.rst
  • summit_user_guide.rst (to be done during re-sync)
  • rhea_user_guide.rst (to be done during re-sync)

link types

For in-page relative links, something like this is broken:

`NVIDIA Volta V100 </for-users/system-user-guides/summit/nvidia-v100-gpus/>`__

while something like this is fine:

`hardware threads <#hardware-threads>`__

The most robust way to link within the documentation (especially to other pages) is to provide cross-references, which should survive page restructuring and section renaming: https://www.sphinx-doc.org/en/master/usage/restructuredtext/roles.html#ref-role

full URLs are okay:

`NVIDIA Volta Architecture White Paper <http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf>`_
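The grep and the link forms listed above could be combined into a small scanner that reports only the broken WordPress-style targets and skips in-page anchors and full URLs. This is a sketch under assumptions: the function name is hypothetical, only the '/for-users/' prefix from the grep is checked, and links that wrap across lines are not detected.

```python
import re

# Matches RST hyperlinks whose target is a site-relative WordPress path,
# i.e. `link text </for-users/...>`_ or `link text </for-users/...>`__
_WORDPRESS_LINK = re.compile(
    r"`(?P<text>[^`<]+?)\s*<(?P<target>/for-users/[^>]*)>`__?"
)


def find_wordpress_links(rst_text: str):
    """Yield (line_number, link_text, target) for each offending link."""
    for number, line in enumerate(rst_text.splitlines(), start=1):
        for m in _WORDPRESS_LINK.finditer(line):
            yield number, m["text"], m["target"]
```

Running it over each .rst file gives a line-numbered list to work through, which is easier to tick off than raw grep output.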

Run a RST linter

It would be nice to catch syntax errors as early as possible.
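As one example of the kind of error worth catching early: a directive written with a single colon (".. warning: text") is not a directive at all. RST treats it as a comment and silently drops it from the rendered page. The sketch below is a hand-rolled check for just that case, not a substitute for a real linter; the function name is hypothetical, and genuine comments of the form ".. word: text" will show up as false positives.

```python
import re

# A directive line is ".. name:: ..."; flag ".. name: ..." with one colon.
# Names must start with a letter, so targets (".. _label: url"),
# footnotes (".. [1] ...") and substitutions (".. |x| ...") are skipped.
_SINGLE_COLON = re.compile(r"^\s*\.\. (?P<name>[A-Za-z][\w-]*):(?!:)")


def find_suspect_directives(rst_text: str):
    """Yield (line_number, name) for directive-like lines with one colon."""
    for number, line in enumerate(rst_text.splitlines(), start=1):
        m = _SINGLE_COLON.match(line)
        if m:
            yield number, m["name"]
```

A check like this could run in CI on every PR and fail the build when it yields anything.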

re-sync Summit user guide

The wordpress-based Summit User Guide at https://www.olcf.ornl.gov/for-users/system-user-guides/summit/summit-user-guide/#system-overview is frozen as of 06 September 2019.

This means that the content in these new pages needs to be re-synced. Here is the procedure:

howto

  1. put your name by a section below to claim it
  2. update the content of that section:
    • make sure that the topics covered by the previous wordpress page are covered here
    • check that all hyperlinks (both external and internal) work. Actually click on them; don't just read them
    • check visual elements: images scale correctly, tables are shaped correctly, etc.
    • fix any other problems you see, or at least submit a separate issue for it to remind us to fix it in the future
    • think about what could make the section better, and submit issue(s) with suggestions
  3. Be sure to mention this issue via something like "Partially addresses #13" in your PR description.
  4. check off your section below once your PR addressing it has been merged.
  5. return to "1" above

todo

*these sections should just refer back to the general pages on the topic, possibly with exceptions mentioned for Rhea

Preserve image aspect ratios on window resize

In some areas of the user guides, say Summit > Running Jobs, images can become distorted when resizing the browser window.

If possible, enforce preservation of aspect ratios site-wide. If not possible, this should be corrected on existing images.

known issue: cuda hooks / pami

When running simple gpu codes with jsrun that do not also have MPI support, sometimes one might run into a warning such as:

CUDA Hook Library: Failed to find symbol mem_find_dreg_entries, ./a.out: undefined symbol: __PAMI_Invalidate_region

This can be solved in a few ways:

  • use jsrun -E LD_PRELOAD=/opt/ibm/spectrum_mpi/lib/pami_451/libpami.so ...
  • use jsrun --smpiargs="off" ...
  • use jsrun --smpiargs="-disable_gpu_hooks" ...

Some more discussion can be found at kokkos/kokkos#1985; reports there say it has been reported to IBM.

Move and cleanup the cross-submission stuff from the Rhea users guide

Right now, the Rhea Users Guide is the only place where Cross-submission is discussed.

I think it needs to be moved to a higher level and then referenced in the Summit guide and the Rhea guide (similar to "Connecting" #55).

(It also needs Slurm and LSF updates, and we need to verify that it all still works.)

Mention ORNL-distributed RSA fobs

Add note to Connecting for the First Time for ORNL employees. If using an ORNL-distributed RSA fob, there's no need to set a new PIN.

address fixmes in contributing page

It would be nice to have some useful screenshots, especially for submitting pull requests and submitting issues.
More detail elsewhere might be needed.

Remove the default compiler versions from Summit User Guide

Per discussion on #57. Version information for any software should be found by querying the system itself. The Compiling section of the user guide lists available compiler suites, options, and feature support, but should not be responsible for capturing default versions.
