nlp-corpora-backend

This repository contains the infrastructure that crawls /projects/nlp-corpora/ and reports a live status of its contents.

Features

Checks the contents of every corpus directory and subdirectory.

Per-corpus:

  • owner
  • group
  • permissions
  • configurable access restrictions (groups and permissions)
  • corpus structure adherence
  • readme existence
  • readme project description
  • readme documentation (of processed variants)
  • size
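
As a rough illustration of the owner, group, and permission checks listed above, here is a minimal sketch using only the standard library. It is not taken from check.py: the names `check_path`, `owner_name`, and `group_name`, and the `required_mode` parameter, are hypothetical, and the real tool may structure these checks differently.

```python
import grp
import os
import pwd
import stat


def owner_name(uid):
    """Map a uid to a user name, falling back to the number if unknown."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def group_name(gid):
    """Map a gid to a group name, falling back to the number if unknown."""
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def check_path(path, ok_owners, ok_group, required_mode):
    """Return a list of human-readable problems for one path (hypothetical helper)."""
    st = os.stat(path)
    problems = []

    owner = owner_name(st.st_uid)
    if owner not in ok_owners:
        problems.append(f"{path}: owner {owner!r} is not an allowed owner")

    group = group_name(st.st_gid)
    if group != ok_group:
        problems.append(f"{path}: group {group!r} should be {ok_group!r}")

    # Compare only the permission bits, ignoring the file-type bits.
    mode = stat.S_IMODE(st.st_mode)
    if (mode & required_mode) != required_mode:
        problems.append(
            f"{path}: mode {oct(mode)} lacks required bits {oct(required_mode)}"
        )
    return problems
```

Calling `check_path` on every file under a corpus directory and collecting the returned lists would give the kind of per-corpus error breakdown described in the log.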

Overall:

  • can fix permission errors automatically when run with the --fix-perms flag
  • total size check (flags usage above a configurable drive limit)
  • log containing detailed status breakdown and all errors
  • report generation
    • copies readme into browsable index
    • concise summary per corpus (name, readme link, description, size, access, status)
    • donut chart of overall size usage
  • configured cron usage:
    • self-updates backend and runs daily
    • pushes updated report to frontend
    • emails full error log on failures (configurable verbosity)
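
The total size check above can be pictured roughly as follows. This is a sketch under assumptions, not the code in check.py: `dir_size_bytes` and `over_limit` are hypothetical names, and the sketch sums apparent file sizes, which can differ from the on-disk usage a tool like `du` reports.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def over_limit(sizes, limit_bytes):
    """Return True if the combined corpus sizes exceed the drive limit.

    sizes maps corpus name -> size in bytes (hypothetical structure).
    """
    return sum(sizes.values()) > limit_bytes
```

The same per-corpus size mapping could also feed the size column of the report and the disk-usage plot.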

Installation

# Create a fresh virtualenv (I use pyenv, but any tool works).
# Use Python >= 3.6.5. Then:
pip install -r requirements.txt

Running

Example usage:

# Also prints the log to stderr if any checks failed. (This is so that cron
# automatically emails you if anything fails, but not if everything passes.)
python check.py \
    --directory /projects/nlp-corpora/ \
    --out-file ~/repos/nlp-corpora/README.md \
    --log-file ~/repos/nlp-corpora/BUILD.txt \
    --doc-dir ~/repos/nlp-corpora/doc \
    --plot-dest ~/repos/nlp-corpora/disk-usage.svg

# The script can attempt to fix permission errors it finds. This isn't normally
# run in the cron job (though it could be). It can be enabled with a flag:
python check.py --fix-perms

# To run on the test directories (sorry Nelson, no automated tests yet), I run
# this to ignore the output markdown and see only the log.
python check.py \
    --directory test/test-nlp-corpora/ \
    --ok-owners max \
    --group-config test/test-groups.json \
    --out-file /dev/null
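
The stderr behavior mentioned in the first example relies on cron mailing any output a job produces: if the script stays silent on success and writes the log to stderr only on failure, mail is sent only when something is wrong. A minimal sketch of that pattern is below. The function `report` and its parameters are hypothetical and are not the actual check.py implementation.

```python
import sys


def report(log_lines, n_failed, log_file=None, stderr=sys.stderr):
    """Write the log to log_file if given; echo it to stderr only on failure.

    Returns True if anything was written to stderr (hypothetical helper).
    """
    text = "\n".join(log_lines) + "\n"
    if log_file is not None:
        with open(log_file, "w") as f:
            f.write(text)
    if n_failed > 0:
        # Any output here is what makes cron send an email.
        stderr.write(text)
        return True
    return False
```

With this shape, a fully passing run produces no output at all, so a cron job wrapping it stays quiet.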

Full options:

python check.py --help
usage: check.py [-h] [--directory DIRECTORY] [--ok-owners OK_OWNERS]
                [--group-config GROUP_CONFIG] [--fix-perms] [--verbose]
                [--out-file OUT_FILE] [--log-file LOG_FILE]
                [--doc-dir DOC_DIR] [--plot-dest PLOT_DEST]

Tool to check nlp-corpora directory and output documentation.

optional arguments:
  -h, --help            show this help message and exit
  --directory DIRECTORY
                        path to top-level corpus directory (default:
                        /projects/nlp-corpora/)
  --ok-owners OK_OWNERS
                        comma-separated list of allowed owners (default:
                        mbforbes)
  --group-config GROUP_CONFIG
                        json file containing group information (default:
                        groups.json)
  --fix-perms           whether this should attempt to fix permission errors
                        it finds (default: False)
  --verbose             whether to log error messages for every problematic
                        file (default: False)
  --out-file OUT_FILE   path to write output file. If not provided, writes to
                        stdout. (default: None)
  --log-file LOG_FILE   if provided, writes log to this path. If not 100% of
                        checks pass, always writes log to stderr. (default:
                        None)
  --doc-dir DOC_DIR     if provided, DESTROYS this dir if it exists, creates
                        it fresh, and then writes directories and readmes for
                        all corpora under it. (default: None)
  --plot-dest PLOT_DEST
                        if provided, writes a donut plot of corpora disk space
                        usage to this location. (default: None)
