
sda-pipeline's People

Contributors

aaperis, blankdots, dbampalikis, dependabot-preview[bot], dependabot[bot], dtitov, gitter-badger, jbygdell, jonandernovella, kjellp, kostas-kou, kusalananda, lilachic, nanjiangshu, norling, pahatz, pontus, sstli, viklund

sda-pipeline's Issues

Fix broken test

8_bad_json.sh and 10_trigger_failures.sh are broken and need investigating.

Rename `Sync` to `backup`

There is a service called sync in the pipeline. This should more aptly be named backup (since it does backups). This service needs to be renamed, documentation updated, and references updated.

Deployment documentation

Update and/or add documentation on how it is supposed to be set up and configured.

Figure out how to cross-reference documentation we have in the project.

Extend sda-sync service to also copy header

For the big picture we need to sync files between Sweden and Finland. To do this we need to be able to sync the header as well.

Extend the s3sync service so it can get the header from the database and reencrypt it using another public key.

Question about the protocol: should the header and body be concatenated, or should they be sent separately?

  • The simplest solution would be to fetch the header, re-encrypt it, and pass it along in the MQ message (see the sketch below)

  • re-encrypt with the Finnish public key

  • standalone deployment template
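
A minimal sketch of the simplest option, assuming the header is stored in a local_ega.files table keyed by inbox path and that the exchange and routing key are called sda/sync; all of these names, as well as the connection strings, are placeholders rather than the actual configuration. The re-encryption step with the receiving public key is only marked with a comment, since it would use the crypt4gh header functions.

// Sketch: fetch the stored header for a file, re-encrypt it for the
// receiving party, and pass it along in the MQ message (header and body
// kept separate). Table/column names, exchange and routing key are
// assumptions for illustration only.
package main

import (
	"database/sql"
	"encoding/base64"
	"encoding/json"
	"log"

	_ "github.com/lib/pq"
	"github.com/streadway/amqp"
)

type syncMessage struct {
	FilePath string `json:"filepath"`
	// Header re-encrypted with the receiving (e.g. Finnish) public key,
	// base64 encoded so it can travel inside the JSON message.
	Header string `json:"header"`
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/sda?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Assumed schema: local_ega.files(header, inbox_path).
	var header []byte
	if err := db.QueryRow(
		"SELECT header FROM local_ega.files WHERE inbox_path = $1",
		"dummy_data.c4gh").Scan(&header); err != nil {
		log.Fatal(err)
	}

	// Re-encryption with the receiving public key would happen here,
	// using the crypt4gh header functions (omitted in this sketch).
	reencrypted := header

	body, _ := json.Marshal(syncMessage{
		FilePath: "dummy_data.c4gh",
		Header:   base64.StdEncoding.EncodeToString(reencrypted),
	})

	conn, err := amqp.Dial("amqp://user:pass@localhost:5672/sda")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, _ := conn.Channel()
	// Exchange and routing key names are placeholders.
	if err := ch.Publish("sda", "sync", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	}); err != nil {
		log.Fatal(err)
	}
}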

Sync reads whole file in memory

The sync service currently reads the whole file into memory, causing it to crash as file sizes grow. We should instead stream the file with a fixed buffer size (see the sketch below).
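
A minimal sketch of the streaming approach, using io.CopyBuffer so memory use stays constant regardless of file size; the reader and writer would come from the existing posix or S3 backends.

// Sketch: stream the archived file to the backup destination with a
// fixed-size buffer instead of reading it all into memory first.
package backup

import "io"

// copyWithBuffer streams src to dst using a caller-defined buffer size
// (e.g. a few MiB), so memory use stays constant regardless of file size.
func copyWithBuffer(dst io.Writer, src io.Reader, bufSize int) (int64, error) {
	buf := make([]byte, bufSize)
	return io.CopyBuffer(dst, src, buf)
}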

FEGA Accession IDs problematic if same file uploaded several times

Accession IDs from FEGA are based on the checksum of the (probably decrypted) file, so if the same file is uploaded several times there will be multiple entries in the database with the same accession ID.
This makes it difficult to figure out which row belongs to which dataset. This matters because, for example, the filename needs to be correct for the private metadata that is uploaded to the FEGA node. Filenames should not contain private metadata, but this could happen, and such metadata must not leak between datasets by mistake.

DoD
We need to be able to better distinguish between files in the database and when we ingest them.
This might involve some db changes to https://github.com/neicnordic/sda-db if we want to modify the db schema

Add docker-compose without TLS

Currently dev_utils has TLS enabled by default. We need a docker-compose setup for local development that starts the services, sda-db, and sda-mq without TLS.

Add `sync/backup` service to test suite

DoD: Run an ingestion test where the sync/backup service backs up the file from the archive before finalize receives the message to start its work. sync/backup listens to accessionIDs and publishes with a routing key such that the message ends up in the queue that finalize listens to; finalize cannot listen to accessionIDs in this scenario.

Create integration test

  • ingest on posix + posix followed by verification
  • ingest on posix + S3 followed by verification
  • ingest on S3 + posix followed by verification
  • ingest on S3 + S3 followed by verification

Publish the initial message to the files routing key in the localega exchange to start the ingestion process. This should be possible using curl against RabbitMQ's HTTP API (a Go equivalent is sketched after the example message below).

{
  "user": "test",
  "filepath": "dummy_data.c4gh"
}
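
A hedged Go equivalent of the curl call, using the publish endpoint of the RabbitMQ management HTTP API; the host, credentials, vhost and exchange name are assumptions for illustration only.

// Sketch: publish the initial ingestion message through the RabbitMQ
// management HTTP API (the same endpoint a curl call would use).
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	msg := `{"user": "test", "filepath": "dummy_data.c4gh"}`

	reqBody, _ := json.Marshal(map[string]interface{}{
		"properties":       map[string]interface{}{},
		"routing_key":      "files",
		"payload":          msg,
		"payload_encoding": "string",
	})

	// %2F is the URL-encoded default vhost "/"; host and credentials
	// are placeholders.
	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:15672/api/exchanges/%2F/localega/publish",
		bytes.NewReader(reqBody))
	if err != nil {
		log.Fatal(err)
	}
	req.SetBasicAuth("guest", "guest")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("publish status:", resp.Status)
}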

Refactor db logic

We should encapsulate the db structs in a generic Db struct instead of using the psql struct directly.
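
A sketch of what that could look like, with illustrative names rather than the project's actual API: services depend on a small interface, and a postgres-backed implementation hides *sql.DB behind it.

// Sketch: a generic Db abstraction so services depend on an interface
// instead of the psql struct directly. Names are illustrative only.
package database

import (
	"database/sql"

	_ "github.com/lib/pq"
)

// Database is the generic interface the services would program against.
type Database interface {
	GetHeader(filepath string) ([]byte, error)
	Close() error
}

// postgres is the concrete implementation wrapping *sql.DB.
type postgres struct {
	db *sql.DB
}

// NewDatabase returns the concrete backend as the generic interface.
func NewDatabase(dsn string) (Database, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	return &postgres{db: db}, nil
}

// GetHeader is one example method; table/column names are assumed.
func (p *postgres) GetHeader(filepath string) ([]byte, error) {
	var header []byte
	err := p.db.QueryRow(
		"SELECT header FROM local_ega.files WHERE inbox_path = $1",
		filepath).Scan(&header)
	return header, err
}

func (p *postgres) Close() error { return p.db.Close() }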

Send messages on error

Upon failures that are unlikely to be local/temporary, we need to make sure we send a message to the error queue for communication to the user, as well as nack the message so our queues do not fill up with messages that only lead to failures.
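
A sketch of that handling, assuming the streadway/amqp client and placeholder exchange/routing key names: publish a JSON error report, then Nack without requeue so the delivery is not redelivered to us.

// Sketch: on a failure that is unlikely to be temporary, publish a
// message to the error routing key and Nack the original delivery
// without requeueing it. Exchange and routing key names are assumed.
package broker

import (
	"encoding/json"

	"github.com/streadway/amqp"
)

func handlePermanentFailure(ch *amqp.Channel, d amqp.Delivery, user, filepath, reason string) error {
	// Tell the user (via the error queue) why the file failed.
	body, err := json.Marshal(map[string]string{
		"user":     user,
		"filepath": filepath,
		"reason":   reason,
	})
	if err != nil {
		return err
	}
	if err := ch.Publish("sda", "error", false, false, amqp.Publishing{
		ContentType: "application/json",
		Body:        body,
	}); err != nil {
		return err
	}
	// Nack without requeue: the message is dropped from the work queue
	// instead of being redelivered forever.
	return d.Nack(false, false)
}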

Add mapper

  • Rework mapper to use our shared configuration
  • Include in functionality tests

Use pipes to connect readers and writers

At the moment we are not able to perform batched S3 uploads since we cannot perform append-like operations using the S3 API. The solution might be to put all writes into a pipe and then let the uploader consume the pipe's reader.
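
A sketch of that idea using io.Pipe and the AWS SDK uploader; the bucket, key and the written bytes are placeholders. The uploader consumes the pipe's reader in a goroutine (splitting the stream into multipart parts internally) while the service writes into the pipe's writer.

// Sketch: connect the archive writer to the S3 uploader through an
// io.Pipe, so the upload streams while we write.
package main

import (
	"io"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)

	pr, pw := io.Pipe()

	// The uploader consumes the pipe's reader in the background.
	done := make(chan error, 1)
	go func() {
		_, err := uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String("archive"),
			Key:    aws.String("dummy_data.c4gh"),
			Body:   pr,
		})
		done <- err
	}()

	// All writes go into the pipe; here we just write a placeholder.
	// In the service this would be the encrypted file stream.
	if _, err := pw.Write([]byte("encrypted bytes go here")); err != nil {
		log.Fatal(err)
	}
	pw.Close() // closing the writer signals EOF to the uploader

	if err := <-done; err != nil {
		log.Fatal(err)
	}
}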

Messages, s3 and eventual consistency

Since s3 is eventually consistent, it's quite possible that we get a message for a file in the inbox before the file is actually visible when we try to access it. I believe this (or some similar issue) is currently seen in #141.

The fix for that will likely be different, but for ingestion as well as verification we should probably handle the case where the message arrives before the s3 provider has caught up. That is, we should not treat a failure to get a file size or reader as fatal, but rather as a signal to retry (up to a certain limit).

I'm not sure how that is best done through the broker (requeue or send a new message) and what can be done (include a counter? Do we have a timestamp we can look at?) but we will probably need to do something.
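
A minimal bounded-retry helper, as one possible shape for this; the attempt limit and delay are illustrative, and the broker-level question (requeue vs. a new message carrying a counter) is left open.

// Sketch: retry an operation against the s3 inbox (e.g. stat or open)
// a bounded number of times before treating the failure as fatal.
package retry

import (
	"fmt"
	"time"
)

// WithRetry calls fn up to maxAttempts times with a fixed delay between
// attempts, returning the last error if all attempts fail.
func WithRetry(maxAttempts int, delay time.Duration, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// Illustrative use in ingest/verify (backend and method are placeholders):
//   err := WithRetry(5, 10*time.Second, func() error {
//       _, sizeErr := backend.GetFileSize(filePath)
//       return sizeErr
//   })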

Create dataset id that is not dependent on path

Currently the orchestrator is creating dataset ids based on the path of the file. That needs to change, given that files might be uploaded in the same folder.

Ideas:

  • Deploy a service that gets the message from the completed queue and, upon user interaction, creates the dataset id
    The dataset id will be provided externally (possibly by the data stewards, e.g. as a DOI)

Inputs:

  1. File paths as they are defined in the metadata
  2. DOI for the specific dataset
  3. Submitter's username

Output:
Message for mapper service

Could even be a Kubernetes job

Note: The dataset id creation is not an automatic process
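
A sketch of the message such a service could emit for the mapper; the field names are assumptions rather than the actual schema, and resolving the file paths from the metadata into accession IDs via the database is omitted.

// Sketch: build the mapping message from the inputs listed above
// (dataset id / DOI, submitter's username, resolved accession IDs) and
// emit it for the mapper service. Field names are illustrative only.
package main

import (
	"encoding/json"
	"fmt"
)

type datasetMapping struct {
	Type         string   `json:"type"`
	User         string   `json:"user"`
	DatasetID    string   `json:"dataset_id"`
	AccessionIDs []string `json:"accession_ids"`
}

func main() {
	msg := datasetMapping{
		Type:         "mapping",
		User:         "test",
		DatasetID:    "doi:10.1234/example-dataset",
		AccessionIDs: []string{"EGAF00000000001", "EGAF00000000002"},
	}
	body, _ := json.Marshal(msg)
	// In the real service this body would be published to the queue the
	// mapper listens to; here we just print it.
	fmt.Println(string(body))
}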

Standardize error queue messages

The error messages that pour into the error queue should be made more uniform. This could be done, e.g., by using one of the structs that we already have in place for errors.
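
One possible shape, with illustrative field names rather than the project's existing error struct: marshal every failure through a single type before publishing it to the error queue, so all error messages look the same.

// Sketch: a single struct that every service uses for error reports.
// Field names are assumptions for illustration only.
package main

import (
	"encoding/json"
	"fmt"
)

type infoError struct {
	Error           string `json:"error"`
	Reason          string `json:"reason"`
	OriginalMessage string `json:"original-message"`
}

func main() {
	e := infoError{
		Error:           "Failed to open file to ingest",
		Reason:          "file not found in inbox",
		OriginalMessage: `{"user": "test", "filepath": "dummy_data.c4gh"}`,
	}
	body, _ := json.Marshal(e)
	fmt.Println(string(body)) // this body would be published to the error queue
}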
