This project has been archived. See https://github.com/vbrik/cta-ingest instead.

CTA Data Relay

This application relays data from local disk to a GridFTP endpoint via an S3 bucket. CTA Data Relay is designed for an environment with very specific networking restrictions.

"CTA" in CTA Data Relay refers to the Cherenkov Telescope Array Observatory, the research project for which this application was written to assist with certain data transfers.

Overview

CTA Data Relay is like a highly specialized rsync.

It operates roughly as follows:

On the source host:

  1. Build a list of files in a directory
  2. Retrieve a list of files that have been transferred previously from metadata stored in an S3 bucket
  3. Compress and upload to the S3 bucket files that haven't been uploaded previously
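
The selection logic in the source-host steps above can be sketched as a set difference between the local listing and the names recorded in the bucket metadata. This is a minimal illustration, not the project's actual code: the function names `list_files` and `files_to_upload` are hypothetical, the listing is assumed to be non-recursive, and the S3 and compression calls are left out.

```python
import os
from typing import Iterable, List


def list_files(directory: str) -> List[str]:
    """Return the names of regular files directly inside `directory`, sorted.

    Assumption: the listing is non-recursive; the real tool may differ.
    """
    return sorted(
        entry.name for entry in os.scandir(directory) if entry.is_file()
    )


def files_to_upload(
    local_files: Iterable[str], already_uploaded: Iterable[str]
) -> List[str]:
    """Return the local files that are absent from the previously uploaded set."""
    seen = set(already_uploaded)
    return [name for name in local_files if name not in seen]
```

Each name returned by `files_to_upload` would then be compressed and uploaded, and the bucket metadata updated so that the next run skips it.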

On the relay host:

  1. Download unprocessed files from the bucket, decompress them, and upload them to the GridFTP endpoint
  2. Update the metadata in the S3 bucket to indicate that the file has been processed
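
The relay-host steps above amount to a loop over unprocessed objects. The sketch below shows one possible shape for that loop under stated assumptions: `relay_once` is a hypothetical name, the metadata is modeled as a plain dict mapping object name to a "processed" flag, and the S3 download, decompression, and GridFTP upload are passed in as callables rather than implemented.

```python
from typing import Callable, Dict, List


def relay_once(
    metadata: Dict[str, bool],
    download: Callable[[str], bytes],
    decompress: Callable[[bytes], bytes],
    upload: Callable[[str, bytes], None],
) -> List[str]:
    """Relay every object whose flag is False; return the names processed.

    `metadata` is a stand-in for the bucket metadata (assumption).
    """
    processed = []
    for name in sorted(metadata):
        if metadata[name]:
            continue  # already relayed on an earlier run
        data = decompress(download(name))
        upload(name, data)
        # Mark as processed only after the upload returned without raising,
        # so a failed transfer is retried on the next run.
        metadata[name] = True
        processed.append(name)
    return processed
```

Updating the flag last means an interrupted run may repeat a transfer but should not silently skip one.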

CTA Data Relay can also perform various metadata operations, such as setting the S3 bucket metadata to reflect which files are already in the GridFTP location (so that they are not re-uploaded).

Installation

CTA Data Relay requires Python 3 and zstd (e.g. yum install -y zstd).

git clone https://github.com/vbrik/cta-data-relay.git
cd cta-data-relay
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Sub-commands that need to connect to a GridFTP server require the gfal2 library and its GridFTP plug-in. While not strictly necessary, additional packages are usually needed to use GridFTP in practice. See Dockerfile for GridFTP-related dependencies.

AWS Authentication

If AWS credentials are not supplied as command-line arguments, CTA Data Relay will rely on boto3 to determine them. At the time of writing, this meant the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or the file ~/.aws/config, which could look like this:

[default]
aws_access_key_id = XXX
aws_secret_access_key = YYY
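
To make the lookup order concrete, here is a simplified sketch of the fallback described above: explicit values first, then environment variables, then the config file. It is only an illustration written with the standard library. The function `resolve_credentials` is hypothetical, and boto3's real credential chain consults more sources than the ones modeled here.

```python
import configparser
import os
from typing import Mapping, Optional, Tuple


def resolve_credentials(
    cli_key_id: Optional[str],
    cli_secret: Optional[str],
    env: Mapping[str, str],
    config_path: str,
    profile: str = "default",
) -> Optional[Tuple[str, str]]:
    """Return (key id, secret) from the first source that provides both."""
    # 1. Values given explicitly, e.g. as command-line arguments.
    if cli_key_id and cli_secret:
        return cli_key_id, cli_secret
    # 2. Environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    # 3. INI-style config file; a missing file is silently ignored by read().
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser(config_path))
    if parser.has_section(profile):
        section = parser[profile]
        if "aws_access_key_id" in section and "aws_secret_access_key" in section:
            return section["aws_access_key_id"], section["aws_secret_access_key"]
    return None
```

For example, `resolve_credentials(None, None, os.environ, "~/.aws/config")` would return the environment values when both variables are set and otherwise fall back to the `[default]` section of the file.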

GridFTP Authentication

For commands involving GridFTP, x509 authentication is assumed (e.g. using voms-proxy-init or similar tools).

Usage

CTA Data Relay has three modes of operation.

"Local-to-S3" mode copies data from local files to an S3 bucket.

"S3-to-GridFTP" mode moves data from the S3 bucket to a GridFTP endpoint.

"Metadata" mode allows examining and manipulating the metadata that is stored in S3.

CTA Data Relay is a Python module that can be run as python -m cta_data_relay ....

Run the application with the --help flag for more information.
