neicnordic / sda-pipeline
A federated storage for sensitive data, NeIC SDA
Home Page: https://neicnordic.github.io/sda-pipeline/
License: GNU Affero General Public License v3.0
In INFO mode we should also log the correlation ID.
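A minimal sketch of how that could look, assuming the sirupsen/logrus logger used in the pipeline; the field name and message are illustrative:

package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	log.SetLevel(log.InfoLevel)

	// In the services, the correlation ID would come from the AMQP
	// delivery's CorrelationId property; hard-coded here for the sketch.
	correlationID := "example-correlation-id"

	log.WithField("correlation-id", correlationID).Info("processing message")
}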
8_bad_json.sh and 10_trigger_failures.sh are broken and need investigating.
The DB logic should expose its main methods through interfaces.
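A minimal sketch of that idea; the method set is illustrative, not the actual sda-pipeline API:

package database

// Database collects the main DB operations behind an interface so that
// callers (and tests) don't bind to the psql-specific struct.
type Database interface {
	GetHeader(fileID int) ([]byte, error)
	MarkCompleted(fileID int) error
	Close() error
}

// pqDB stands in for the existing psql-backed implementation.
type pqDB struct{}

func (db *pqDB) GetHeader(fileID int) ([]byte, error) { return nil, nil }
func (db *pqDB) MarkCompleted(fileID int) error       { return nil }
func (db *pqDB) Close() error                         { return nil }

// Compile-time check that pqDB implements Database.
var _ Database = (*pqDB)(nil)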
We might eventually need to refactor the DB schema to simplify it.
There is a service called sync in the pipeline. This should more aptly be named backup, since it does backups. The service needs to be renamed, the documentation updated, and all references updated.
Update and/or add documentation on how it's supposed to be set up and configured. Figure out how to cross-reference the documentation we have in the project, as pointed out in #206.
For the big picture we need to sync files between Sweden and Finland, and to do this we need to be able to sync the header as well. Extend the s3sync service so it can get the header from the database and re-encrypt it using another public key (the Finnish public key in this case). Open question about the protocol: should header and body be concatenated, or sent separately? The simplest solution would be to fetch the header, re-encrypt it, and pass it through the MQ message, as sketched below.
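A sketch of that simplest solution, assuming the ReEncryptHeader helper in the neicnordic/crypt4gh module's model/headers package; the message shape and field names are assumptions:

package s3sync

import (
	"encoding/base64"
	"encoding/json"

	"github.com/neicnordic/crypt4gh/model/headers"
)

// syncPayload is a hypothetical MQ message shape carrying the header along.
type syncPayload struct {
	User     string `json:"user"`
	Filepath string `json:"filepath"`
	Header   string `json:"header"` // base64-encoded re-encrypted header
}

// buildSyncMessage re-encrypts the stored header for the receiving node's
// public key and embeds it in the MQ message body.
func buildSyncMessage(user, path string, header []byte,
	ourPrivateKey, remotePublicKey [32]byte) ([]byte, error) {

	newHeader, err := headers.ReEncryptHeader(header, ourPrivateKey,
		[][32]byte{remotePublicKey})
	if err != nil {
		return nil, err
	}
	return json.Marshal(syncPayload{
		User:     user,
		Filepath: path,
		Header:   base64.StdEncoding.EncodeToString(newHeader),
	})
}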
standalone deployment template
The sync service currently reads the whole file into memory, causing it to crash as file sizes grow. We need to allow for a specific buffer size instead.
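A minimal sketch of the streaming alternative: copy through a fixed-size buffer with io.CopyBuffer so memory use stays constant regardless of file size. The paths and the 4 MiB size are illustrative:

package main

import (
	"io"
	"log"
	"os"
)

func main() {
	src, err := os.Open("archive/dummy_data.c4gh")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("backup/dummy_data.c4gh")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// The buffer size would come from configuration in the real service.
	buf := make([]byte, 4*1024*1024)
	if _, err := io.CopyBuffer(dst, src, buf); err != nil {
		log.Fatal(err)
	}
}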
Accession ids from FEGA are based on the checksum of the (probably decrypted) file, so if the same file is uploaded several times there will be two entries in the database with the same accession id.
This makes it difficult to figure out which line belongs to which dataset. This matters because, for example, the filename needs to be correct for the private metadata (that is uploaded to the FEGA node). Also, there should not be private metadata in the filenames, but this could happen, and it should not leak by mistake between datasets.
DoD: We need to be able to better distinguish between files in the database and when we ingest them. This might involve some db changes to https://github.com/neicnordic/sda-db if we want to modify the db schema.
Run on each push to any branch except master.
DoD: we should calculate the un-encrypted file size in verify. The problem is which column to use to store it in the db; inbox_filesize seems unnecessary at the moment and could be used.
All separate functions should have a corresponding test.
Publish a message using the given exchange, with the body as payload.
Currently dev_utils has TLS enabled by default; we need a docker-compose setup for local development that starts the services, sda-db, and sda-mq without TLS.
Make sure we handle everything correctly if there are issues with the connection.
DoD: An ingestion test run with the sync/backup service backing up the file from the archive before finalize receives the message to start its work. sync/backup listens to accessionIDs and publishes with a routing key so that the message ends up in the queue that finalize listens to; finalize cannot listen to accessionIDs in this scenario.
Publish the initial message to the files routing key in the localega exchange to start the ingestion process. This should be doable using curl via RabbitMQ's API.
{
"user": "test",
"filepath": "dummy_data.c4gh"
}
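The same publish can be done from Go against the management API's publish endpoint (POST /api/exchanges/{vhost}/{exchange}/publish); the host, vhost, and guest credentials below are illustrative defaults:

package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	url := "http://localhost:15672/api/exchanges/%2F/localega/publish"

	body := `{
	  "properties": {},
	  "routing_key": "files",
	  "payload": "{\"user\": \"test\", \"filepath\": \"dummy_data.c4gh\"}",
	  "payload_encoding": "string"
	}`

	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.SetBasicAuth("guest", "guest")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect 200 OK with {"routed":true}
}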
sda-pipeline/internal/config/config.go, line 247 in 997cfd6:
Right now only the verify-full mode is supported as optional. We need to support all SSL modes, or at least require.
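A sketch of accepting every libpq sslmode value rather than only verify-full; how the value is wired into the config struct is left out here:

package main

import "fmt"

// validateSSLMode accepts all sslmode values that libpq understands.
func validateSSLMode(mode string) error {
	switch mode {
	case "disable", "allow", "prefer", "require", "verify-ca", "verify-full":
		return nil
	}
	return fmt.Errorf("unsupported sslmode: %s", mode)
}

func main() {
	fmt.Println(validateSSLMode("require"))     // <nil>
	fmt.Println(validateSSLMode("verify-half")) // unsupported sslmode: verify-half
}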
That means go vet, golint, gofmt, and so on.
We should encapsulate the db structs in a generic Db struct instead of using the psql struct directly.
Upon failures that aren't likely to be local/temporary, we need to make sure we send a message to the error queue for communication to the user, as well as nack the message so that our queues do not fill up with messages that only lead to failures.
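A sketch of that behaviour, assuming the rabbitmq/amqp091-go client (the successor of streadway/amqp); the exchange and routing key names are illustrative:

package broker

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// handlePermanentFailure reports the failure to the error queue, then nacks
// without requeue so the message does not cycle through the queue forever.
func handlePermanentFailure(ch *amqp.Channel, d amqp.Delivery, errBody []byte) {
	if err := ch.Publish("sda", "error", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        errBody,
	}); err != nil {
		log.Printf("failed to publish error message: %v", err)
	}
	if err := d.Nack(false, false); err != nil { // multiple=false, requeue=false
		log.Printf("failed to nack message: %v", err)
	}
}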
The current deployment is restricted to the ^EGAF[0-9]{11}$ pattern; however, to use it in standalone mode one would need to enable a custom pattern. The aim is to enable a custom schema depending on the deployment.
We should be able to get rid of the config.ArchiveType check and keep the storage logic contained.
At the moment we are not able to perform batched S3 uploads since we cannot perform append-like operations using the S3 API. The solution might be to put all writes into a pipe and then let the uploader consume the pipe's reader.
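A sketch of the pipe idea, assuming the AWS SDK for Go's s3manager uploader; the bucket, key, and chunks are illustrative:

package main

import (
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)

	pr, pw := io.Pipe()
	go func() {
		defer pw.Close()
		// Each append-like write goes into the pipe; the uploader below
		// consumes the read end as one continuous S3 object.
		for _, chunk := range [][]byte{[]byte("part1"), []byte("part2")} {
			if _, err := pw.Write(chunk); err != nil {
				return
			}
		}
	}()

	if _, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("archive"),
		Key:    aws.String("file.c4gh"),
		Body:   pr,
	}); err != nil {
		log.Fatal(err)
	}
}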
Since S3 is eventually consistent, it's quite possible that we get messages for a file in the inbox before we can actually access it. I believe this (or some similar issue) is currently seen in #141.
The fix for that will likely be different, but for ingestion as well as verification we should probably handle the case where the message arrives before the S3 provider gets its things in order - that is, we should not treat a failure to get a filesize or reader as fatal, but rather as a signal to retry (up to a certain limit).
I'm not sure how that is best done through the broker (requeue or send a new message) and what can be done (include a counter? Do we have a timestamp we can look at?) but we will probably need to do something.
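One possible shape for the retry-with-limit, independent of how the broker side is solved; the helper, the attempt count, and the delays are illustrative:

package ingest

import (
	"fmt"
	"time"
)

// getFileSize stands in for the real storage call that can fail while the
// S3 provider is still catching up (hypothetical helper).
func getFileSize(path string) (int64, error) {
	return 0, fmt.Errorf("not yet visible: %s", path)
}

// fileSizeWithRetry retries a bounded number of times with a growing delay
// before treating the failure as fatal.
func fileSizeWithRetry(path string) (int64, error) {
	var lastErr error
	for attempt := 1; attempt <= 5; attempt++ {
		size, err := getFileSize(path)
		if err == nil {
			return size, nil
		}
		lastErr = err
		time.Sleep(time.Duration(attempt) * 2 * time.Second)
	}
	return 0, fmt.Errorf("giving up after 5 attempts: %w", lastErr)
}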
(Pontus, X)
With inline comments following Go documentation practices.
Currently the orchestrator creates dataset ids based on the path of the file. That needs to change, given that files might be uploaded to the same folder.
Ideas:
Inputs:
Output: message for the mapper service
Could even be a Kubernetes job.
Note: dataset id creation is not an automatic process.
The error messages that pour into the error queue should be made more uniform. This could be done, e.g., by using one of the structs that we already have in place for errors.
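One uniform shape could look like the following sketch; the field names are an assumption, not necessarily the structs already in place:

package broker

// infoError is one possible uniform payload for the error queue.
type infoError struct {
	Error           string `json:"error"`
	Reason          string `json:"reason"`
	OriginalMessage string `json:"original-message"`
}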