This project has been archived. See https://github.com/vbrik/cta-ingest instead.

CTA Data Relay

This application relays data from local disk to a GridFTP endpoint via an S3 bucket. CTA Data Relay is designed for an environment with very specific networking restrictions.

"CTA" in CTA Data Relay refers to the Cherenkov Telescope Array Observatory, the research project for which this application was written to assist with certain data transfers.

Overview

CTA Data Relay is like a highly specialized rsync.

It operates roughly as follows:

On the source host:

Build a list of files in a directory
Retrieve a list of files that have been transferred previously from metadata stored in an S3 bucket
Compress and upload to the S3 bucket files that haven't been uploaded previously
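
The selection step on the source host can be sketched as a set difference. This is a minimal illustration, not the project's actual code: the function name files_to_upload is hypothetical, the directory is assumed to be scanned non-recursively, and files are assumed to be identified by name alone.

```python
from pathlib import Path


def files_to_upload(directory, already_uploaded):
    """Return the names of regular files in `directory` that are not yet uploaded.

    `already_uploaded` is any iterable of file names, standing in for the
    list recovered from the metadata stored in the S3 bucket (assumption).
    """
    # Build the list of local files, skipping sub-directories.
    local = {p.name for p in Path(directory).iterdir() if p.is_file()}
    # Keep only the files that have not been transferred previously.
    return sorted(local - set(already_uploaded))
```

Each returned name would then be compressed and uploaded to the bucket.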

On the relay host:

Download unprocessed files from the bucket, decompress them, and upload them to the GridFTP endpoint
Update metadata in the S3 bucket to indicate that the file has been processed
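
The relay-host loop can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the bucket is a plain dict of compressed bytes, the metadata is a dict with a hypothetical "processed" flag, upload is any callable standing in for the GridFTP transfer, and zlib replaces zstd so that the example runs with the standard library only.

```python
import zlib


def relay_pending(bucket, metadata, upload):
    """Relay every unprocessed object and return the names that were relayed.

    bucket:   dict mapping object name -> compressed bytes (stand-in for S3)
    metadata: dict mapping object name -> {"processed": bool} (hypothetical layout)
    upload:   callable(name, data), stand-in for the GridFTP upload
    """
    relayed = []
    for name in sorted(bucket):
        if metadata.get(name, {}).get("processed"):
            continue  # already relayed on an earlier run
        data = zlib.decompress(bucket[name])  # the real tool uses zstd
        upload(name, data)
        # Mark the object only after the upload has returned without error.
        metadata.setdefault(name, {})["processed"] = True
        relayed.append(name)
    return relayed
```

Updating the metadata after the upload, rather than before, means that a failed transfer leaves the object unmarked, so it is picked up again on the next run.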

CTA Data Relay can also perform various metadata operations, such as setting the S3 bucket metadata to reflect which files are already in the GridFTP location (so that they are not re-uploaded).

Installation

CTA Data Relay requires Python 3 and zstd (e.g. yum install -y zstd).

git clone https://github.com/vbrik/cta-data-relay.git
cd cta-data-relay
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Sub-commands that need to connect to a GridFTP server require the gfal2 library and its GridFTP plug-in. While not strictly necessary, additional packages are usually needed to use GridFTP in practice. See Dockerfile for GridFTP-related dependencies.

AWS Authentication

If AWS credentials are not supplied as command-line arguments, CTA Data Relay will rely on boto3 to determine them. At the time of writing, this meant the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or the file ~/.aws/config, which could look like this:

[default]
aws_access_key_id = XXX
aws_secret_access_key = YYY
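
A file in this format can be sanity-checked before running a transfer. The helper below is a hypothetical sketch using Python's standard configparser module; it is not part of CTA Data Relay and does not reproduce how boto3 resolves credentials, it only reports which of the two keys shown above are missing from a profile.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(config_text, profile="default"):
    """Return the required keys that are absent from `profile` in `config_text`."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(profile):
        # No such profile at all: every key is missing.
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]
```

An empty list means both keys are present; the function takes the file contents as a string, so the caller decides which file to read.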

GridFTP Authentication

For commands involving GridFTP, x509 authentication is assumed (e.g. using voms-proxy-init or similar tools).

Usage

CTA Data Relay has three modes of operation.

"Local-to-S3" mode copies data from local files to an S3 bucket.

"S3-to-GridFTP" mode moves data from the S3 bucket to a GridFTP endpoint.

"Metadata" mode allows examining and manipulating the metadata that is stored in S3.

CTA Data Relay is a Python module that can be run as python -m cta_data_relay ....

Run the application with the --help flag for more information.
