
sarchive's Introduction

SArchive


Archival tool for scheduler job scripts and accompanying files.

Note that the master branch here may be running ahead of the latest release on crates.io. During development, we sometimes rely on dependencies that have not yet released a version with the features we use.

Minimum supported rustc

1.70.0

CI tests run against the following Rust versions:

  • stable
  • nightly

If you do not have Rust, please see Rustup for installation instructions.

Usage

sarchive requires that the path to the scheduler's main spool directory is specified. It also requires a cluster (name) to be set.

sarchive supports multiple schedulers; the one to use must be specified on the command line. Right now, there is support for Slurm and Torque.

For Slurm, the directory to watch is defined as the StateSaveLocation in the slurm config.
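For reference, the relevant slurm.conf entry looks like this (the path itself is site-specific):

StateSaveLocation=/var/spool/slurm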

Furthermore, sarchive offers various backends. The basic file backend writes a copy of the job scripts and associated files to a directory on a mounted filesystem. There is also limited support for sending job information to Elasticsearch or producing to a Kafka topic. We briefly discuss these backends below.

File archival

Activated using the file subcommand. Note that using multiple subcommands (i.e., backends) at the same time is currently not supported.

For file archival, sarchive requires the path to the archive's top directory, i.e., where you want to store the backup scripts and accompanying files.

The archive can be further divided into subdirectories per

  • year: YYYY, by providing --period=yearly
  • month: YYYYMM, by providing --period=monthly
  • day: YYYYMMDD, by providing --period=daily

Each of these directories is created upon file archival if it does not yet exist. This allows for easily tarring old(er) directories you still wish to keep around, but probably no longer immediately need for user support.

For example,

sarchive --cluster huppel -s /var/spool/slurm file --archive /var/backups/slurm/job-archive
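With --period=daily and the flat job.<jobid>_script naming scheme, the resulting layout would resemble the following (date and job ID hypothetical):

/var/backups/slurm/job-archive/20240115/job.1234_script
/var/backups/slurm/job-archive/20240115/job.1234_environment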

Elasticsearch archival (removed)

The Elasticsearch backend will be revamped, as using the elastic crate is subject to a vulnerability through its hyper dependency (https://rustsec.org/advisories/RUSTSEC-2021-0078).

This will be added again once we can move to the official Elastic.co crate.

Kafka archival

You can ship the job scripts as messages to Kafka.

For example,

./sarchive --cluster huppel -l /var/log/sarchive.log -s /var/spool/slurm/ kafka --brokers mykafka.mydomain:9092 --topic slurm-job-archival

Support for SSL and SASL is available, through the --ssl and --sasl options. Both of these expect a comma-separated list of options to pass to the underlying kafka library.
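For example, a SASL-over-SSL setup could be configured as follows; the option keys shown are assumed librdkafka-style settings, so consult the documentation of the underlying kafka library for the exact names:

./sarchive --cluster huppel -s /var/spool/slurm/ kafka --brokers mykafka.mydomain:9092 --topic slurm-job-archival --sasl security.protocol=SASL_SSL,sasl.username=archiver,sasl.password=secret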

Features

  • Multithreaded, watching one dir per thread, so no need for hierarchical watching.
  • Separate processing thread to ensure swift draining of the inotify event queues.
  • Clean log rotation when SIGHUP is received.
  • Experimental support for clean termination on receipt of SIGTERM or SIGINT, where job events that have already been seen are processed, to minimise potential loss when restarting the service.
  • Output to a file in a hierarchical directory structure.
  • Output to Elasticsearch (currently removed, see above).
  • Output to Kafka.

RPMs

We provide a build script to generate an RPM using the cargo-rpm tool. You may tailor the spec file (listed under the .rpm directory) to fit your needs. The RPM includes a unit file so sarchive can be started as a service by systemd. This file should also be changed to fit your requirements and local configuration.
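As a rough sketch, a unit file along these lines can serve as a starting point; the binary path and options here are assumptions, not the unit shipped in the RPM:

[Unit]
Description=sarchive scheduler job script archival
After=network.target

[Service]
ExecStart=/usr/bin/sarchive --cluster huppel -s /var/spool/slurm file --archive /var/backups/slurm/job-archive
Restart=on-failure

[Install]
WantedBy=multi-user.target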

sarchive's People

Contributors

dependabot[bot], itkovian

sarchive's Issues

Torque support

It would be really great to also have this for Torque: job files and environment.

Support for log rotation

After log rotation, the process will be sent a SIGHUP, so this needs to be handled to allow logging to the new file instead of to the rotated version.

Addressed by #11.
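For illustration, a logrotate rule could deliver that SIGHUP itself; this is a sketch assuming sarchive runs as a systemd service named sarchive.service:

/var/log/sarchive.log {
    daily
    rotate 7
    postrotate
        /usr/bin/systemctl kill -s HUP sarchive.service
    endscript
}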

Torque .JB file can be empty in the file archival

[root@master03 ~]# ls -alh /var/spool/torque/job-archive/20200203/
total 12K
drwxr-xr-x 2 root root 4.0K Feb  3 11:46 .
drwxr-xr-x 3 root root 4.0K Feb  3 11:46 ..
-rw-r--r-- 1 root root    0 Feb  3 11:46 7.master03.manticore.brussel.vsc.JB
-rw-r--r-- 1 root root  226 Feb  3 11:46 7.master03.manticore.brussel.vsc.SC

Handle other signals

sarchive may run as a service under, e.g., systemd. This means it needs to respond adequately to SIGTERM. Ideally, SIGINT is also handled properly.

It is not totally clear what the best course of action would be when receiving such a signal:

  1. quit immediately (possibly losing the paths that are in the queue)
  2. stop accepting new entries into the queue but process those that already have been seen
  3. keep accepting new entries, but bail once the queue is empty.

Each of these obviously has advantages and disadvantages.

option  advantage                                                   disadvantage
1       immediately quits; no potential wait time                   loses all the events in the queue
2       guaranteed to finish; processes all entries seen so far     misses new entries that will not be seen after a restart
3       no entry loss                                                might not stop

Allow for a better archive hierarchy

We currently have a very flat format, i.e., job.<jobid>_script and job.<jobid>_environment. While this suffices for finding job scripts, it has several drawbacks.

  • there can be many jobs in the archive, meaning the number of entries in the single archival directory will become quite large.
  • users may not always recall the exact job ID (there might be several), and searching by time might help pin down the problematic job.

A better archive could be organised by

  • user
  • cluster
  • timestamps (e.g., yearly, monthly, daily, ...)
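Combining these, an archived script could for example end up at <archive>/<cluster>/<user>/20200203/job.<jobid>_script (layout hypothetical).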

Watching hangs when a large number of files are created in a very short time.

This issue occurs in the torque branch, where there is often a single directory to watch and the scheduler may create a large number of files when a job array is submitted.

To reproduce:

  • start sarchive in torque mode, watching a single directory
  • create a large number of files in this directory, e.g. 10K

Increasing the queue size through echo 65535 > /proc/sys/fs/inotify/max_queued_events does not always help. After creating 50K files with a 64K event queue size, execution stopped at

[2019-06-13T22:08:58.955897586+02:00][sarchive::lib][INFO] copied 11 bytes from "/tmp/torque/17354.SC" to "/tmp/archive/job.17354_SC"
[2019-06-13T22:08:58.956109575+02:00][sarchive::lib][INFO] copied 11 bytes from "/tmp/torque/17355.SC" to "/tmp/archive/job.17355_SC"

Issue occurs with

  • kernel 3.10.0-957.12.2.el7.ug.x86_64. (custom build based on the version number)
  • glibc-2.17-260.el7_6.5.x86_64
  • notify crate main branch with commit #56aac12
  • running on a kvm VM with storage in a Ceph backend

Does not compile using rustc on Ubuntu 20.04

How to compile and install on Ubuntu 20.04?
root@server1:/opt/archive/sarchive-production/src# rustc main.rs
error[E0463]: can't find crate for chrono
--> monitor.rs:22:1
|
22 | extern crate chrono;
| ^^^^^^^^^^^^^^^^^^^^ can't find crate

error[E0463]: can't find crate for crossbeam_channel
--> monitor.rs:23:1
|
23 | extern crate crossbeam_channel;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can't find crate

error[E0463]: can't find crate for crossbeam_utils
--> monitor.rs:24:1
|
24 | extern crate crossbeam_utils;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can't find crate

error: environment variable CARGO_PKG_VERSION not defined
--> main.rs:48:23
|
48 | const VERSION: &str = env!("CARGO_PKG_VERSION");
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: this error originates in the macro env (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0432]: unresolved import clap
--> main.rs:23:5
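These errors stem from invoking rustc directly: rustc alone cannot resolve the crate dependencies or the cargo-provided environment variables. Building through cargo, which fetches the dependencies declared in Cargo.toml and sets CARGO_PKG_VERSION, avoids all of them:

cargo build --release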

If a watched directory disappears, the watchers should be renewed.

Watchers are registered with the kernel, but if the directory vanishes, the watcher stops getting notifications for the FD it subscribed to. The FD for the new directory is obviously different, meaning that we are no longer watching that directory.

A fix could be to watch for a removal event on the watched directory. For this, we probably need an extra watcher that signals the threads to cease operations and restarts them once the new directories become available.

If this happens frequently, it might induce a race condition, where we ask to watch locations that have since disappeared again. It should be OK to quit with an error in this case.
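A minimal sketch of the re-watch step, assuming the notify crate's v5 API (the actual watcher wiring in sarchive may differ):

use std::{path::Path, thread, time::Duration};
use notify::{RecursiveMode, Watcher};

// Drop the stale watch and re-register once the directory reappears.
fn rewatch(watcher: &mut impl Watcher, dir: &Path) -> notify::Result<()> {
    // The kernel may already have removed the watch with the directory.
    let _ = watcher.unwatch(dir);
    while !dir.exists() {
        thread::sleep(Duration::from_millis(100));
    }
    watcher.watch(dir, RecursiveMode::NonRecursive)
}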

Archival failure: invalid UTF-8

[2020-01-07T20:44:18.820585657+01:00][sarchive][ERROR] processing failed: Custom { kind: InvalidData, error: "stream did not contain valid UTF-8" }

Slurm script files are considered binary files by grep

  • The final character seems to be \0
  • This is the case in the files written by Slurm (not sure if it is a feature to have a \0-terminated string when reading the data)
# hexdump script
<snip>
0000130 2070 203c 616c 6d6d 7370 692e 0a6e 0000
000013f
  • We can pinch off that final character to obtain an ASCII file that can be saved; see the sketch below.
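A minimal sketch of that fix (not sarchive's actual code):

use std::{fs, io};

// Copy a job script, dropping a single trailing NUL byte if present.
fn archive_script(src: &str, dst: &str) -> io::Result<()> {
    let mut data = fs::read(src)?;
    if data.last() == Some(&0) {
        data.pop();
    }
    fs::write(dst, &data)
}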

Allow pushing data to Elasticsearch

Needed:

  • A config file to describe the output choices and settings for ES

    • Should be extensible, i.e., support for other output channels baked in from the start
    • Should be in a simple format (TOML?)
  • The script file can be sent as is

  • The environment file should probably be parsed and sent as a JSON object with an entry per item in the file? A sketch of such a parse follows below.
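A sketch of that parse, assuming the environment file holds KEY=VALUE lines and using serde_json (an assumption, not necessarily the crate the backend would use):

use serde_json::{Map, Value};

// Turn an environment dump of KEY=VALUE lines into a flat JSON object.
fn env_to_json(contents: &str) -> Value {
    let mut map = Map::new();
    for line in contents.lines() {
        if let Some((key, value)) = line.split_once('=') {
            map.insert(key.to_string(), Value::String(value.to_string()));
        }
    }
    Value::Object(map)
}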
