
SDA-sync

The sda-sync is an integration created to solve the data syncing requirement in BigPicture. In this project, every data submission that takes place in one country's node must be mirrored in the other country's node. This means that both the submitted data files and their corresponding database identifiers (e.g. accession and dataset IDs) in one country's node should be replicated in the other country's node. This is achieved by syncing the data between the two nodes and then providing the relevant identifiers for use during ingestion.

Specifically, the files (and datasets) that are uploaded in one node are synced/backed up to the other node, while making sure that the ingestion process run on both sides results in accession and dataset IDs that are the same in both nodes. To achieve this, the sync integration utilizes the sensitive-data-archive sync and sync-api services.

In brief, the syncing will be triggered whenever a local dataset submission, i.e. a dataset that is created on the local node, is detected. The detection is based on the dataset prefix, which is unique to each country's node. In such a case, the sync service copies the dataset files to the inbox of the other node after these have been re-encrypted with the recipient node's crypt4gh public key. It also posts JSON messages to the receiving node's sync-api service, which creates and sends the RabbitMQ messages needed to orchestrate the ingestion cycle on the receiving side of the syncing.

The sync-api service is triggered whenever it receives a message from the sync service of the other node. It then creates the RabbitMQ messages needed to trigger the ingestion cycle on the receiving node.
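As an illustration of this handover, the message posted to the receiving sync-api could look roughly like the curl call below. The endpoint path, port, credentials, and JSON field names are assumptions for this sketch; check them against the sensitive-data-archive sync-api documentation:

curl -X POST http://sync-api.example.org:8080/dataset \
  -u "$SYNC_API_USER:$SYNC_API_PASSWORD" \
  -H 'Content-Type: application/json' \
  -d '{"dataset_id": "<DATASET-ID>", "dataset_files": [{"filepath": "file.test.c4gh", "file_id": "<ACCESSION-ID>"}]}'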

The example integration in the docker-compose.yml assumes that both nodes run S3-based inboxes, but POSIX and SFTP inbox types are also supported and can be configured by uncommenting the noted code segments of the compose file.

How to run the integration

The services for the two nodes can be started in parallel.

Start the services in Sweden:

docker compose --profile sda-sweden up

Start the services in Finland:

docker compose --profile sda-finland up
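The two stacks share one compose file and are separated by compose profiles; to list the profiles defined in the file, you can run:

docker compose config --profiles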

Note: the version of the sensitive-data-archive services for the whole stack can be changed by editing the .env file located in the root of the repo.
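For instance, you can inspect the pinned version before changing it; the variable name mentioned below is illustrative, so check the file for the real one:

cat .env   # look for the image tag variable, e.g. a line such as SDA_VERSION=v0.1.0 (illustrative name)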

Encrypt and upload files

There is a script under the dev_utils folder that cleans up the files folder and exports the keys for encryption. Running the script downloads the crypt4gh keys that allow encrypting the file.test file included in the dev_utils/tools folder:

./prepare.sh

Note: This script should be run every time the services are re-run.
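To verify that the script worked, check that the keys used later in this guide are in place; the folder layout here is inferred from the upload command below:

ls keys   # run from the dev_utils folder; expect crypt4gh key files such as repo.pub.pem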

The next step is to encrypt and upload the file to the s3Inbox of the Swedish node using the sda-cli tool, which can be downloaded from the NBISweden/sda-cli releases on GitHub. For example, to get the Linux release, run the following commands from the dev_utils/tools folder:

wget -q https://github.com/NBISweden/sda-cli/releases/download/v0.1.0/sda-cli_.0.1.0_Linux_x86_64.tar.gz
tar -xf sda-cli_.0.1.0_Linux_x86_64.tar.gz sda-cli && rm sda-cli_.0.1.0_Linux_x86_64.tar.gz

From the dev_utils/tools folder, run:

./sda-cli upload --config ../s3cfg -encrypt-with-key ../keys/repo.pub.pem file.test

Ingest the file

Now that the file is uploaded to the S3 backend (this can be checked by logging into MinIO in the browser at localhost:9000 and making sure the file is in the inbox bucket), the ingestion process needs to be initiated.
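If you prefer the command line to the MinIO web console, one hedged alternative is listing the bucket with the AWS CLI against the local MinIO endpoint; the credentials come from the s3cfg file, and the exact path layout inside the bucket may differ:

aws --endpoint-url http://localhost:9000 s3 ls s3://inbox --recursive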

The ingestion can be initiated using the sda-admin tool located in dev_utils/tools. The script has detailed documentation; the main commands needed to ingest this specific file are listed below. First, ingest the file by running:

./sda-admin --mq-queue-prefix sda --user test_dummy.org ingest file.test.c4gh

where test_dummy.org is the <USER-ELIXIR-ID> taken from the s3cfg file.
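If you are unsure which user to pass, it can usually be read from the same s3cfg file; a hedged example (the key name may differ between setups):

grep access_key ../s3cfg   # in SDA dev setups the access key typically doubles as the user ID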

To check that the file has been ingested, run

./sda-admin --mq-queue-prefix sda --user test_dummy.org accession

You should be able to see the file in the list, similar to:

file.test.c4gh

To give an accession id to this file, run the following command, replacing the <ACCESSION-ID>:

./sda-admin --mq-queue-prefix sda --user test_dummy.org accession <ACCESSION-ID> file.test.c4gh
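For example, with an illustrative accession value:

./sda-admin --mq-queue-prefix sda --user test_dummy.org accession EGA-ACC-0001 file.test.c4gh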

Finally, to create a dataset including this file, run the following command, replacing <DATASET-ID>. NOTE: The <DATASET-ID> must start with the centerPrefix defined in config.yaml and must be at least 11 characters long:

./sda-admin --mq-queue-prefix sda --user test_dummy.org dataset <DATASET-ID> file.test.c4gh

For example, if the centerPrefix value is EGA, the <DATASET-ID> should be of the format EGA-<SOME-ID>.
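Putting the constraints together, a complete illustrative command could be:

./sda-admin --mq-queue-prefix sda --user test_dummy.org dataset EGA-DATASET-001 file.test.c4gh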

Making sure everything worked

Once the dataset is created, the Swedish sync-api service should send the required messages to the Finnish endpoint, which should start the ingestion on that side.

To make sure that everything worked, apart from the docker logs, you can check the bucket on the Finnish (receiving) side and make sure that the file exists in the archive. Also check that the file and the dataset exist in the database on the Finnish side.
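As a sketch of the database check, assuming the Finnish stack runs the standard SDA PostgreSQL container; the container name, credentials, database name, and table names below are assumptions, so adjust them to your compose setup:

docker exec -it db-finland psql -U postgres -d sda -c "SELECT stable_id FROM sda.files;"
docker exec -it db-finland psql -U postgres -d sda -c "SELECT stable_id FROM sda.datasets;"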

Contributors

aaperis, dbampalikis

Issues

Move the flow of this repo into an integration test in sensitive-data-archive

Create a docker-compose.yml that will be moved to the integration tests of sensitive-data-archive. This compose file does not need to be triggered automatically as an action, but it should be easy to run manually so that we can test end-to-end syncing locally.

A/C:

  • The compose runs as in the sda-sync repo.
  • Add this to the Makefile and update documentation about how to run the test

Demo of the sda-sync proposal in Finland

Prepare the work that has been done on the sync/backup system for a demo at the meeting in Finland on the 14-15th.

This should demo (AC):

  • sync of uploaded file data
  • sync of metadata
  • explain design decisions

Sprint Review demo:

  • Quick review of the demo in Finland, together with the feedback from the meeting

Have sda-admin pulled with a script

When we decide where to keep the latest version of sda-admin, add a script that pulls it together with sda-cli and stores them in the dev_utils/tools folder, as suggested here.
