
sda-pipeline's People

Contributors

aaperis, blankdots, dbampalikis, dependabot-preview[bot], dependabot[bot], dtitov, gitter-badger, jbygdell, jonandernovella, kjellp, kostas-kou, kusalananda, lilachic, nanjiangshu, norling, pahatz, pontus, sstli, viklund

sda-pipeline's Issues

Fix broken test

8_bad_json.sh and 10_trigger_failures.sh are broken and need investigating.

Rename `Sync` to `backup`

There is a service called sync in the pipeline. This should more aptly be named backup (since it does backups). This service needs to be renamed, documentation updated, and references updated.

Deployment documentation

Update and/or add documentation on how it is supposed to be set up and configured.

Figure out how to cross-reference documentation we have in the project.

Extend sda-sync service to also copy header

For the big picture we need to sync files between Sweden and Finland. To do this we need to be able to sync the header as well.

Extend the s3sync service so it can get the header from the database and reencrypt it using another public key.

Question about the protocol: should the header and body be concatenated, or should they be sent separately?

  • The simplest solution would be to fetch the header, re-encrypt it, and pass it along in the MQ message (see the sketch below)

  • re-encrypt with the Finnish public key

  • standalone deployment template
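
A minimal sketch of the simplest option, assuming the header is stored in a local_ega.files table keyed by inbox path and that the exchange and routing key are called sda/sync; all of these names, as well as the connection strings, are placeholders rather than the actual configuration. The re-encryption step with the receiving public key is only marked with a comment, since it would use the crypt4gh header functions.

// Sketch: fetch the stored header for a file, re-encrypt it for the
// receiving party, and pass it along in the MQ message (header and body
// kept separate). Table/column names, exchange and routing key are
// assumptions for illustration only.
package main

import (
	"database/sql"
	"encoding/base64"
	"encoding/json"
	"log"

	_ "github.com/lib/pq"
	"github.com/streadway/amqp"
)

type syncMessage struct {
	FilePath string `json:"filepath"`
	// Header re-encrypted with the receiving (e.g. Finnish) public key,
	// base64 encoded so it can travel inside the JSON message.
	Header string `json:"header"`
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/sda?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed schema: local_ega.files(header, inbox_path).
	var header []byte
	if err := db.QueryRow(
		"SELECT header FROM local_ega.files WHERE inbox_path = $1",
		"dummy_data.c4gh").Scan(&header); err != nil {
		log.Fatal(err)
	}

	// Re-encryption with the receiving public key would happen here,
	// using the crypt4gh header functions (omitted in this sketch).
	reencrypted := header

	body, _ := json.Marshal(syncMessage{
		FilePath: "dummy_data.c4gh",
		Header:   base64.StdEncoding.EncodeToString(reencrypted),
	})

	conn, err := amqp.Dial("amqp://user:pass@localhost:5672/sda")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, _ := conn.Channel()
	// Exchange and routing key names are placeholders.
	if err := ch.Publish("sda", "sync", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	}); err != nil {
		log.Fatal(err)
	}
}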

Sync reads whole file in memory

The sync service currently reads the whole file into memory, causing it to crash as file sizes grow. We should instead stream the file with a fixed buffer size (see the sketch below).
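
A minimal sketch of the streaming approach, using io.CopyBuffer so memory use stays constant regardless of file size; the reader and writer would come from the existing posix or S3 backends.

// Sketch: stream the archived file to the backup destination with a
// fixed-size buffer instead of reading it all into memory first.
package backup

import "io"

// copyWithBuffer streams src to dst using a caller-defined buffer size
// (e.g. a few MiB), so memory use stays constant regardless of file size.
func copyWithBuffer(dst io.Writer, src io.Reader, bufSize int) (int64, error) {
	buf := make([]byte, bufSize)
	return io.CopyBuffer(dst, src, buf)
}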

FEGA Accession IDs problematic if same file uploaded several times

Accession IDs from FEGA are based on the checksum of the (probably decrypted) file, so if the same file is uploaded several times there will be multiple entries in the database with the same accession ID.
This makes it difficult to figure out which row belongs to which dataset. This matters because, for example, the filename needs to be correct for the private metadata that is uploaded to the FEGA node. Filenames should not contain private metadata, but this could happen, and such metadata must not leak between datasets by mistake.

DoD
We need to be able to better distinguish between files in the database and when we ingest them.
This might involve some db changes to https://github.com/neicnordic/sda-db if we want to modify the db schema

Add docker-compose without TLS

Currently dev_utils has TLS enabled by default. We need a docker-compose setup for local development that starts the services, sda-db, and sda-mq without TLS.

Add `sync/backup` service to test suite

DoD: Run an ingestion test where the sync/backup service backs up the file from the archive before finalize receives the message to start its work. sync/backup listens to accessionIDs and publishes with a routing key such that the message ends up in the queue that finalize listens to; finalize cannot listen to accessionIDs in this scenario.

Create integration test

  • ingest on posix + posix followed by verification
  • ingest on posix + S3 followed by verification
  • ingest on S3 + posix followed by verification
  • ingest on S3 + S3 followed by verification

Publish the initial message to the files routing key in the localega exchange to start the ingestion process. This should be possible using curl against RabbitMQ's HTTP API (a Go equivalent is sketched after the example message below).

{
  "user": "test",
  "filepath": "dummy_data.c4gh"
}
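
A hedged Go equivalent of the curl call, using the publish endpoint of the RabbitMQ management HTTP API; the host, credentials, vhost and exchange name are assumptions for illustration only.

// Sketch: publish the initial ingestion message through the RabbitMQ
// management HTTP API (the same endpoint a curl call would use).
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	msg := `{"user": "test", "filepath": "dummy_data.c4gh"}`

	reqBody, _ := json.Marshal(map[string]interface{}{
		"properties":       map[string]interface{}{},
		"routing_key":      "files",
		"payload":          msg,
		"payload_encoding": "string",
	})

	// %2F is the URL-encoded default vhost "/"; host and credentials
	// are placeholders.
	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:15672/api/exchanges/%2F/localega/publish",
		bytes.NewReader(reqBody))
	if err != nil {
		log.Fatal(err)
	}
	req.SetBasicAuth("guest", "guest")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("publish status:", resp.Status)
}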

Refactor db logic

We should encapsulate the db structs in a generic Db struct instead of using the psql struct directly.
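
A sketch of what that could look like, with illustrative names rather than the project's actual API: services depend on a small interface, and a postgres-backed implementation hides *sql.DB behind it.

// Sketch: a generic Db abstraction so services depend on an interface
// instead of the psql struct directly. Names are illustrative only.
package database

import (
	"database/sql"

	_ "github.com/lib/pq"
)

// Database is the generic interface the services would program against.
type Database interface {
	GetHeader(filepath string) ([]byte, error)
	Close() error
}

// postgres is the concrete implementation wrapping *sql.DB.
type postgres struct {
	db *sql.DB
}

// NewDatabase returns the concrete backend as the generic interface.
func NewDatabase(dsn string) (Database, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	return &postgres{db: db}, nil
}

// GetHeader is one example method; table/column names are assumed.
func (p *postgres) GetHeader(filepath string) ([]byte, error) {
	var header []byte
	err := p.db.QueryRow(
		"SELECT header FROM local_ega.files WHERE inbox_path = $1",
		filepath).Scan(&header)
	return header, err
}

func (p *postgres) Close() error { return p.db.Close() }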

Send messages on error

Upon failures that are unlikely to be local/temporary, we need to make sure we send a message to the error queue for communication to the user, as well as nack the message so our queues do not fill up with messages that only lead to failures.
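
A sketch of that handling, assuming the streadway/amqp client and placeholder exchange/routing key names: publish a JSON error report, then Nack without requeue so the delivery is not redelivered to us.

// Sketch: on a failure that is unlikely to be temporary, publish a
// message to the error routing key and Nack the original delivery
// without requeueing it. Exchange and routing key names are assumed.
package broker

import (
	"encoding/json"

	"github.com/streadway/amqp"
)

func handlePermanentFailure(ch *amqp.Channel, d amqp.Delivery, user, filepath, reason string) error {
	// Tell the user (via the error queue) why the file failed.
	body, err := json.Marshal(map[string]string{
		"user":     user,
		"filepath": filepath,
		"reason":   reason,
	})
	if err != nil {
		return err
	}
	if err := ch.Publish("sda", "error", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	}); err != nil {
		return err
	}
	// Nack without requeue: the message is dropped from the work queue
	// instead of being redelivered forever.
	return d.Nack(false, false)
}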

Add mapper

  • Rework mapper to use our shared configuration
  • Include in functionality tests

Use pipes to connect readers and writers

At the moment we are not able to perform batched S3 uploads since we cannot perform append-like operations using the S3 API. The solution might be to put all writes into a pipe and then let the uploader consume the pipe's reader.
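
A sketch of that idea using io.Pipe and the AWS SDK uploader; the bucket, key and the written bytes are placeholders. The uploader consumes the pipe's reader in a goroutine (splitting the stream into multipart parts internally) while the service writes into the pipe's writer.

// Sketch: connect the archive writer to the S3 uploader through an
// io.Pipe, so the upload streams while we write.
package main

import (
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)

	pr, pw := io.Pipe()

	// The uploader consumes the pipe's reader in the background.
	done := make(chan error, 1)
	go func() {
		_, err := uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String("archive"),
			Key:    aws.String("dummy_data.c4gh"),
			Body:   pr,
		})
		done <- err
	}()

	// All writes go into the pipe; here we just write a placeholder.
	// In the service this would be the encrypted file stream.
	if _, err := pw.Write([]byte("encrypted bytes go here")); err != nil {
		log.Fatal(err)
	}
	pw.Close() // closing the writer signals EOF to the uploader

	if err := <-done; err != nil {
		log.Fatal(err)
	}
}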

Messages, s3 and eventual consistency

Since s3 is eventually consistent, it's quite possible that we get a message for a file in the inbox before the file is actually visible when we try to access it. I believe this (or some similar issue) is currently seen in #141.

The fix for that will likely be different, but for ingestion as well as verification we should probably handle the case where the message arrives before the s3 provider has caught up. That is, we should not treat a failure to get a file size or reader as fatal, but rather as a signal to retry (up to a certain limit).

I'm not sure how that is best done through the broker (requeue or send a new message) and what can be done (include a counter? Do we have a timestamp we can look at?) but we will probably need to do something.
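
A minimal bounded-retry helper, as one possible shape for this; the attempt limit and delay are illustrative, and the broker-level question (requeue vs. a new message carrying a counter) is left open.

// Sketch: retry an operation against the s3 inbox (e.g. stat or open)
// a bounded number of times before treating the failure as fatal.
package retry

import (
	"fmt"
	"time"
)

// WithRetry calls fn up to maxAttempts times with a fixed delay between
// attempts, returning the last error if all attempts fail.
func WithRetry(maxAttempts int, delay time.Duration, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// Illustrative use in ingest/verify (backend and method are placeholders):
//   err := WithRetry(5, 10*time.Second, func() error {
//       _, sizeErr := backend.GetFileSize(filePath)
//       return sizeErr
//   })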

Create dataset id that is not dependent on path

Currently the orchestrator is creating dataset ids based on the path of the file. That needs to change, given that files might be uploaded in the same folder.

Ideas:

  • Deploy a service that gets the message from the completed queue and, upon user interaction, creates the dataset id
    The dataset id will be provided externally (possibly by the data stewards, e.g. as a DOI)

Inputs:

  1. File paths as they are defined in the metadata
  2. DOI for the specific dataset
  3. Submitter's username

Output:
Message for mapper service

Could even be a Kubernetes job

Note: The dataset id creation is not an automatic process
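
A sketch of the message such a service could emit for the mapper; the field names are assumptions rather than the actual schema, and resolving the file paths from the metadata into accession IDs via the database is omitted.

// Sketch: build the mapping message from the inputs listed above
// (dataset id / DOI, submitter's username, resolved accession IDs) and
// emit it for the mapper service. Field names are illustrative only.
package main

import (
	"encoding/json"
	"fmt"
)

type datasetMapping struct {
	Type         string   `json:"type"`
	User         string   `json:"user"`
	DatasetID    string   `json:"dataset_id"`
	AccessionIDs []string `json:"accession_ids"`
}

func main() {
	msg := datasetMapping{
		Type:         "mapping",
		User:         "test",
		DatasetID:    "doi:10.1234/example-dataset",
		AccessionIDs: []string{"EGAF00000000001", "EGAF00000000002"},
	}
	body, _ := json.Marshal(msg)
	// In the real service this body would be published to the queue the
	// mapper listens to; here we just print it.
	fmt.Println(string(body))
}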

Standardize error queue messages

The error messages that pour into the error queue should be made more uniform. This could be done, e.g., by using one of the structs that we already have in place for errors.
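
One possible shape, with illustrative field names rather than the project's existing error struct: marshal every failure through a single type before publishing it to the error queue, so all error messages look the same.

// Sketch: a single struct that every service uses for error reports.
// Field names are assumptions for illustration only.
package main

import (
	"encoding/json"
	"fmt"
)

type infoError struct {
	Error           string `json:"error"`
	Reason          string `json:"reason"`
	OriginalMessage string `json:"original-message"`
}

func main() {
	e := infoError{
		Error:           "Failed to open file to ingest",
		Reason:          "file not found in inbox",
		OriginalMessage: `{"user": "test", "filepath": "dummy_data.c4gh"}`,
	}
	body, _ := json.Marshal(e)
	fmt.Println(string(body)) // this body would be published to the error queue
}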
