sensitive-data-archive's Issues

[GH Actions] investigate why Trivy scan fails for the Java container

FATAL image scan error: scan error: scan failed: failed analysis: analyze error: pipeline error: failed to analyze layer (sha256:042f59a781dde0a48077e7a02e56010d3ce043e60b4b4152c96bf9b7b585d4f1): post analysis error: post analysis error: Unable to initialize the Java DB: Java DB update failed: Java DB update error: Java DB metadata error: unable to decode metadata: EOF

Re-running the task will make it pass

[sda-auth] Provide endpoint for sda-cli login

This ticket covers the changes needed in sda-auth.
On the auth side there should be an /info endpoint that provides the parameters the cli needs to proceed with the login.

In the cli we run:
$ sda-cli login bp.nbis.se

This fetches the info endpoint at bp.nbis.se/api/info, which returns a JSON object with the fields:

{
    ClientID
    OidcURI
    PublicKey
    InboxURI
}

The sda-cli can then use these to log in to the service via LS AAI. The config file for uploading files and the public key should then be available.
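
A minimal sketch of the cli side, assuming the endpoint lives at /api/info and returns exactly the field names above (the struct and helper names are hypothetical):

package cli

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// LoginInfo mirrors the JSON above; field names are taken verbatim from the issue.
type LoginInfo struct {
	ClientID  string `json:"ClientID"`
	OidcURI   string `json:"OidcURI"`
	PublicKey string `json:"PublicKey"`
	InboxURI  string `json:"InboxURI"`
}

// fetchLoginInfo is a hypothetical helper that retrieves the login
// parameters from e.g. https://bp.nbis.se/api/info.
func fetchLoginInfo(host string) (*LoginInfo, error) {
	resp, err := http.Get("https://" + host + "/api/info")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("info endpoint returned %s", resp.Status)
	}
	info := &LoginInfo{}
	if err := json.NewDecoder(resp.Body).Decode(info); err != nil {
		return nil, err
	}
	return info, nil
}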

Description from Miro:
Login to the LocalEGA service for upload using the OIDC "device authentication flow".

Our LifeScience AAI configuration should already be in place.

Upon successful login the configuration parameters should be saved in a .sda-cli-session file (or similar) that can then be used on following usage of sda-cli (such as upload/encrypt and so on). Warn before overwriting.

The LS AAI Service ID needs to be in sda-cli or in some configuration file for this to work, along with an endpoint where it can be reached, so it's easy to detect which system (BigPicture, EGA, or something else) is used. Use sda-auth if only the LS AAI client id is needed, otherwise the api service in sda-pipeline (or the merged repo).

Endpoint should return everything needed for creating the config file and the following:
Client id
URL for s3 inbox
Public key of repository

Fix known vulnerabilities for DOA

openjdk:18-alpine has a number of known vulnerabilities

A/C

  • Update the base image
  • Make sure no CRITICAL/HIGH vulnerabilities are still around

Cancel messages - Ingest

Description
Make a submission, add a file and then cancel it

Current behavior
Ingest currently starts the ingestion process again, without checking if the message with the specific correlation id already exists.

Expected behavior
Ingest should only change the status of the file, instead of creating a new row in the database and ingesting the file again, and then send a message to trigger the verify step.

Merge Golang based apps

Expected structure:

cmd/<APPNAME-1>
cmd/<APPNAME-2>
shared/<COMPONENTNAME-1>
shared/<COMPONENTNAME-2>
Dockerfile
  • auth
  • download
  • re-encryption service

A/C

  • use similar setups as in sda-pipeline
  • ensure all tests keep working

[api] API call that can give information about the status of each file in the ingestion pipeline

As a submitter
I want to have an API to get information about the state of files in the system
So that I can see that things are happening and remind myself what files I've uploaded.

Given a specific submission or user, the API should report, for each file, where it is in the system: in the inbox, whether ingestion has started, whether there are errors, and so forth. It's basically reporting the status and metadata logged in the file_event_log table.

The authentication for this service could/should be similar to the download or upload services.

This is an API for the users
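
A hedged sketch of the query such an endpoint could run; the table comes from the schema mentioned above, but the exact column names here are assumptions:

package api

import (
	"database/sql"
	"time"
)

// FileStatus is a hypothetical response item for the status endpoint.
type FileStatus struct {
	FileID string
	Event  string
	At     time.Time
}

// latestFileEvents returns the most recent logged event per file for a user.
// Column names (file_id, event, started_at, user_id) are assumptions; check
// the actual file_event_log definition in 01_main.sql.
func latestFileEvents(db *sql.DB, user string) ([]FileStatus, error) {
	rows, err := db.Query(`
		SELECT DISTINCT ON (file_id) file_id, event, started_at
		  FROM sda.file_event_log
		 WHERE user_id = $1
		 ORDER BY file_id, started_at DESC`, user)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var statuses []FileStatus
	for rows.Next() {
		var s FileStatus
		if err := rows.Scan(&s.FileID, &s.Event, &s.At); err != nil {
			return nil, err
		}
		statuses = append(statuses, s)
	}
	return statuses, rows.Err()
}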

[sda-download] Handle "personalities" / use cases of on/off platform

For "on platform" use in bigpicture, sda-download should support decrypted downloads (so no crypt4gh should be required on the client side).

For "off platform" (normal use), data should be reencrypted for the users key. Probably un- (crypt4gh)-encrypted downloads should not be permitted (TLS will always be applied).

How do we want to do this? Possible strategies:

  • A simple toggle, with different deployments for on/off platform access
  • Some rule system to behave properly
  • Something else

[sda] Consistent handling of messages

As a developer
I want to have a consistent way of handling the messages and their errors (nacking, rerouting, sending to error queue etc)
In order to be able to trust the system

There is an effort in this document to map all the cases that can go wrong in the sda-pipeline, in order to handle them properly.

Rules:

  • On all database errors, the message should be Nack'ed and requeued.
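
As an illustration of the database-error rule, with the amqp091-go client (a sketch, not existing pipeline code; processMessage is a stand-in):

package consumer

import (
	amqp "github.com/rabbitmq/amqp091-go"
	log "github.com/sirupsen/logrus"
)

// processMessage stands in for the real work, e.g. a database update.
func processMessage(body []byte) error { return nil }

// handle applies the rule above: on a database error the delivery is
// Nack'ed with requeue=true so the broker redelivers it later.
func handle(delivered amqp.Delivery) {
	if err := processMessage(delivered.Body); err != nil {
		log.Errorf("database error, requeueing message: %v", err)
		// multiple=false: only this delivery; requeue=true: back on the queue.
		if err := delivered.Nack(false, true); err != nil {
			log.Errorf("failed to nack message: %v", err)
		}
		return
	}
	if err := delivered.Ack(false); err != nil {
		log.Errorf("failed to ack message: %v", err)
	}
}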

[sda-download] Get public key for receiver reencryption

sda-download currently does not support encrypted downloads, but it should (see e.g. #361)

To be able to reencrypt the requested data for the user, it needs the public key of the receiver. If it can't get it, it should fail the request.

Generally I think we should support accepting the receiver public key in the request, but to stay compatible with "normal" s3 clients we should also support a per-user default key (settable somehow through a web interface; that interface is out of scope for this issue, but it needs to be defined).

[sda-db] Append-only schema for auth logs

There should be a separate database schema for logging JWTs (JSON Web Tokens). There should be a log entry every time a token is generated, and potentially when it is used in a service.

The following data should be logged:

  • timestamp
  • authenticated user
  • submission account
  • JWT

[s3inbox] Add integration tests for key rotation

Test that s3inbox can work with both jwk and local key validation.

The easiest way to do this seems to be to modify oidc.py, adding a secondary token signed by the local key that is created when the compose test environment is deployed.

[sda] Store file errors in the database

As a sda-admin

I want to store file error messages in the database

The record should be stored in the file_event_log table, with the error as a text string

Note: This issue needs to be refined, e.g. how should the error be stored?
Schema can be found at https://github.com/neicnordic/sensitive-data-archive/blob/main/postgresql/initdb.d/01_main.sql#L111-L144

  1. Add an option to put data in the error field in the Go function that updates the status of files (see the sketch after the lists below).
  2. Go through the sda-pipeline and add a call to this function everywhere something goes wrong with files.

Store:

  • information that should be relayed back to user. File errors

Don't store:

  • Database connection errors
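
A hedged sketch of item 1 above; the function signature and SQL are assumptions about how the status-update helper might be extended, not the actual sda-common code:

package database

import "database/sql"

// SDAdb is a stand-in for the project's database wrapper.
type SDAdb struct {
	DB *sql.DB
}

// UpdateFileEventLog logs a file event with an optional error text.
// Only file errors meant to be relayed back to the user should be passed
// in errText; database connection errors should not be stored.
func (dbs *SDAdb) UpdateFileEventLog(fileID, event, corrID, user, errText string) error {
	const query = `
		INSERT INTO sda.file_event_log(file_id, event, correlation_id, user_id, error)
		VALUES ($1, $2, $3, $4, NULLIF($5, ''))`
	_, err := dbs.DB.Exec(query, fileID, event, corrID, user, errText)
	return err
}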

[sda-download] Implement byte range handling to support random access

To support random access, sda-download can handle Range requests.

This could include:
[ ] Send Accept-Ranges: bytes header on request
[ ] Change the S3 code to support getting the requested data only, provide as a ReadSeeker
[ ] If serving unencrypted, make sure to provide the s3 as a ReadSeeker to NewCrypt4GHReader and serve the needed region(s) from the Crypt4GHReader
[ ] If serving encrypted serve the needed region(s)
[ ] Update tests

sda-download currently supports getting a segment of the (unencrypted) data via the URL parameters startCoordinate and endCoordinate (https://neic-sda.readthedocs.io/en/latest/dataout/#rest-api-endpoints); these likely also need to be supported going forward.
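
One way to get most of this in Go, sketched under the assumption that the served object can be exposed as an io.ReadSeeker: http.ServeContent parses the Range header and emits Accept-Ranges and Content-Range (and 206 responses) on its own.

package download

import (
	"io"
	"net/http"
	"time"
)

// serveRange hands the content off to http.ServeContent, which handles
// Range requests for any io.ReadSeeker. For unencrypted serving, content
// would be (a ReadSeeker wrapper around) the Crypt4GHReader; for encrypted
// serving, the raw S3 object.
func serveRange(w http.ResponseWriter, r *http.Request, name string, modTime time.Time, content io.ReadSeeker) {
	http.ServeContent(w, r, name, modTime, content)
}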

[Postgres] Remove legacy views from the database

This issue is blocked by #13

This will be a breaking change

A/C

  • remove the local_ega.* schemas
  • remove the legacy users
  • test updating from a v3 database with data in it
  • test updating from a v4 database with data in it
  • switch all apps to work directly with the SDA schema
    • sda-common
    • sda-doa
    • sda-download
    • sda-pipeline
    • sftp-inbox

[sda-auth] login context data is not read properly

How to reproduce:

Start the backend with docker compose up cega oidc, then from sda-auth/dev-server run:

$ ./auth
{"level":"info","msg":"The logs format is set to JSON","time":"2023-10-23T17:26:45+02:00"}
{"level":"info","msg":"Setting log level to 'info'","time":"2023-10-23T17:26:45+02:00"}
{"level":"info","msg":"Serving content using http","time":"2023-10-23T17:26:45+02:00"}
Iris Version: 12.2.7
Build Time: 2023-10-20T11:13:39Z
Build Revision: ccd56f1

Now listening on:
Local: http://localhost:8080
Application started. Press CTRL+C to shut down.
{"authType":"elixir","level":"info","msg":"User was authenticated","time":"2023-10-23T17:27:16+02:00","user":"test"}
{"level":"error","msg":"Failed to view login form: template: elixir.html:23:42: executing "elixir.html" at \u003c.infoUrl\u003e: can't evaluate field infoUrl in type main.ElixirIdentity","time":"2023-10-23T17:27:16+02:00"}```

[sda] Move logging to functions

As a developer
I want to move the log.Errorf log messages of the pipeline services to functions
So that the code is more readable and easier to change.

These functions can be added in the common folder and called with different messages, depending on the case.

This type could be a good first step:

log.Errorf("Failed to open file to ingest "+
	"(corr-id: %s, user: %s, filepath: %s, reason: %v)",
	delivered.CorrelationId,
	message.User,
	message.Filepath,
	err)
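
A sketch of what such a helper could look like (name and placement are suggestions, not existing code):

package common

import log "github.com/sirupsen/logrus"

// LogFileError centralizes the repeated "failed with file context" pattern;
// callers pass a short action such as "open file to ingest".
func LogFileError(action, corrID, user, filepath string, err error) {
	log.Errorf("Failed to %s (corr-id: %s, user: %s, filepath: %s, reason: %v)",
		action, corrID, user, filepath, err)
}

// The example above then becomes:
// LogFileError("open file to ingest", delivered.CorrelationId, message.User, message.Filepath, err)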

A/C:

  • Pull request with refactored log.Errorf messages
  • (After the task is finished) A new card for the next type of log messages that needs refactoring

Rework RabbitMQ

A/C

  • use stream type queues for things we need to keep around
  • simplify setup script (make use of things in the base image)
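
For the first point, a stream-type queue can be declared by passing the x-queue-type argument, e.g. with amqp091-go (a sketch of the idea, not the setup script itself):

package setup

import amqp "github.com/rabbitmq/amqp091-go"

// declareStreamQueue creates a stream-type queue; streams retain messages
// and are read by offset instead of being destructively consumed.
func declareStreamQueue(ch *amqp.Channel, name string) error {
	_, err := ch.QueueDeclare(
		name,
		true,  // durable: required for stream queues
		false, // autoDelete
		false, // exclusive
		false, // noWait
		amqp.Table{"x-queue-type": "stream"},
	)
	return err
}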

[api] API call that can list files uploaded by a specific user/project/submission (any of those is ok)

The API request should use some sort of token from the user and list all files uploaded by that user. This information should be in the new database schema.

The inboxes will create this information once the migration to the new schema is done, but this feature can be implemented before that with some mock database content.

  • Authenticate from bearer token
  • extract (project) name from token
  • list uploaded files (still in inbox??)

[sda-db] Handle stable_id collisions

Background

CEGA generates file stable IDs from the unencrypted file checksum. This means that files with identical content will get the same stable_id. This can cause problems, since it prevents finalize from assigning the stable_id and continuing to mapper.

For CEGA, using the same ID works since they only access files by stable_id, but for Bigpicture there have been requests to download files by the upload filename. So each file needs a correct stable_id and submission_file_path.

Possible Solution

One of the core issues is whether multiple uploads should share the same sda.files entry. Storage deduplication can be solved by pointing to the same archive path regardless of number of items, but there is also a case to be made for having all files containing the same data use the same database entry. One option here is to move the storage information into a separate table, so that multiple "upload-files" could reference a single "storage file".

It is likely required to remove the stable_id field from the sda.files table, and instead use the file_references table to store the ID. This is partly a matter of simplifying the schema so that all potential stable IDs are handled in the same manner. If multiple sda.files-entries are used, this also allows multiple files to have the same ID when needed for FEGA.

To solve the problem of submission_file_path if one sda.files-entry is used, one solution is to add a file_path field to the file_dataset table, so that a file can have a unique path for each dataset it's part of while still only referencing one sda.files entry.

Memory streamlining and concurrency

The AWS Go SDK has some shenanigans and will happily use concurrency (and allocate buffers for it). To streamline memory usage, we should allow setting a lower concurrency than the default of 5.
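
With aws-sdk-go's s3manager, for example, both the concurrency and the per-part buffer size are tunable when the uploader is built (a sketch; the exact SDK version in use may differ):

package upload

import (
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// newUploader builds an uploader with reduced concurrency, so fewer
// part-sized buffers are allocated at any one time.
func newUploader(sess *session.Session) *s3manager.Uploader {
	return s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.Concurrency = 1            // default is 5
		u.PartSize = 5 * 1024 * 1024 // 5 MiB, the S3 minimum part size
	})
}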

[s3proxy] integration test for allowed characters

We should have integration tests to show that we allow unicode characters and disallow special restricted characters. The test does not have to pass to close this card (that is, you do not have to fix any issues that show up).

The tests should be run locally.

Example of allowed filenames:

  • 🍫 ⋆ 🍑 🎀 𝒻𝒶𝓃𝒸𝓎 𝓂𝑒𝒹𝒾𝒸𝒶𝓁 𝒾𝓂𝒶𝑔𝑒 🎀 🍑 ⋆ 🍫.dcm
  • ņọŗmal_meḑĩćąl_ĭmagȩ.dcm

Examples of disallowed filenames: (disallowed characters are \/:*?"<>| + control characters)

  • image\file.dcm
  • genomic:file.dcm

Cancel messages - verify

When re-ingesting a file, if the fileID and checksum already exist in the checksum table, we shouldn't exit with an error; just update the file status and send a message as usual.

[s3-inbox] Rework handling of public keys

When importing public keys from a JWK endpoint we only read the first key. https://github.com/neicnordic/sda-s3proxy/blob/a3f4d9b3ec9c906692bb604c5317b7dccfa9c978/userauth.go#L171
This means we have no way of handling key rotations.
Fixing this means we can no longer refer to a key by the name of the issuer of a token.

A/C

  • Keys are imported as a list in memory (this can be separated into one list for RSA and one for EC to minimize errors)
  • Multiple keys from the same issuer can be present, both RSA and EC for example
  • Instead of looking for a key by name, we try to validate the token against all keys in our list until we find one that matches.
  • We deny access only after we have tested all present keys of a given type.
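
A sketch of the try-all-keys validation with golang-jwt (key import is elided; which JWT library the proxy actually uses is an assumption):

package auth

import (
	"fmt"

	jwt "github.com/golang-jwt/jwt/v4"
)

// validateToken tries the token against every imported public key
// (RSA and EC alike) and denies access only after all keys have failed.
func validateToken(tokenString string, keys []interface{}) (*jwt.Token, error) {
	var lastErr error
	for _, key := range keys {
		key := key // capture for the closure
		token, err := jwt.Parse(tokenString, func(*jwt.Token) (interface{}, error) {
			return key, nil
		})
		if err == nil && token.Valid {
			return token, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("token did not validate against any key: %w", lastErr)
}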

[sda-auth] Add htsget to s3config

As an sda-user
I want to have the htsget endpoint in the configuration file after logging in
In order to be able to download files from the htsget endpoint

The point is to have the URL of the htsget endpoint in the S3 configuration that comes from sda-auth.

This should be included in the sda-auth and in the helm charts for deploying the auth service (as optional configuration)

[api] Add authentication to API service

As an sda-user
I want to be able to authenticate with the sda-pipeline API
in order to be able to get information about my submission

The authentication should follow the sda-download approach, where we use a JWK endpoint for fetching the public key and validating the token.

There is a discussion in this PR about this issue

[sda-download] Implement header switch/re-encryption/decryption

If a user sends a public key, re-encrypt the data
related params:

        - name: Public-Key
          in: header
          description: Public Encryption key
          required: false
          schema:
            type: string
        - name: destinationFormat
          in: query
          description: destinationFormat
          required: false
          schema:
            type: string
            default: plain | crypt4gh

Also take a look at https://github.com/neicnordic/sda-download/blob/0aa5af3ce690a610c5c2b369dd9d0ef8d58468d7/internal/database/database.go#L173-L178

[api] As a technical submitter (user), I want an API that I can interact with so that I can control my submission.

Describe the user story

There are several things that a technical person would like to ask the service, for example: what files are uploaded, how is the submission going, are there any ingestion errors, how many files are left to be ingested, are all files backed up, and so forth.

This issue is a parent issue to a number of other tasks that we need to implement.

Tasks:

Testing

Unit testing for the API calls. Maybe create mock database data that has a few files in each state. This can be done per task above.

[SDA] support for multiple s3 buckets

Please describe the feature
As a service operator, I want the entire S3 path stored in the database so that I don't manually have to point the app to the correct backend during startup.

If the storage hostname is expected to be static, the path can be stored as: s3://<BUCKETNAME>/<OBJECTPATH>

If support for multiple storage hosts is needed, the path can be stored in the AWS standard form <BUCKETNAME>.<HOSTNAME>/<OBJECTPATH>
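
Splitting the stored full path back into bucket and object could then look like this (a sketch for the s3://<BUCKETNAME>/<OBJECTPATH> form):

package storage

import (
	"fmt"
	"net/url"
	"strings"
)

// splitS3Path splits "s3://bucket/path/to/object" into bucket and object key.
func splitS3Path(stored string) (bucket, key string, err error) {
	u, err := url.Parse(stored)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "s3" || u.Host == "" {
		return "", "", fmt.Errorf("not an s3 path: %q", stored)
	}
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}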

Acceptance criteria

  • Full path is stored in submission_file_path
  • Full path is stored in archive_file_path
  • Full path is stored in backup_path
  • bucket is removed from the S3 config as it will not be needed
  • Tests verifying the changes are added/updated

Additional context
Affected components:

  • postgresql
  • sda
    • cmd/ingest
    • cmd/finalize
    • cmd/mapper
    • cmd/s3inbox
    • cmd/verify
    • internal/config
    • internal/storage
  • sda-download

Estimation of size: big

Estimation of priority: medium

Cancel messages - Finalize

Description
Make a submission, add a file and then cancel it

Case 1: The file does not have a stable id - Seems to work correctly
Expected behavior
Finalize should do the same job as per usual, as long as the file status is not set to disabled.

Case 2: The file has been given a stable id
Current behavior
Finalize tries to add the stable id, but since it already exists, the database returns an error and the correct message to CEGA is never sent (an error message is sent instead).

Expected behavior
Finalize should only update the file status to ready and send the outgoing message to CEGA

Document go based orchestrator according to the standard

All of the SDA services are documented according to a standard set by the SDA handbook. The orchestrator currently lacks this documentation, so it needs to be added.

An example of the documentation wanted can be found here (or at any other pipeline service).

The documentation should be placed in a file called orchestrator.md alongside the code.

Mapper container cannot initialize for posix storage type

When running the application in federated mode with the inbox storage type set to posix, the mapper container does not start.

Log error:
{"level": "fatal", "msg": "stat /inbox/: no such file or directory", "time": "2023-09-13T09:56:45Z"}

The error is generated by the following changes to the mapper:

inbox, err := storage.NewBackend(conf.Inbox)
if err != nil {
	log.Fatal(err)
}

Update database image to use Postgresql 14 or higher

This will be a breaking change - automatic upgrade will not be possible

A/C

  • base image uses Postgresql 14 or higher

This makes it easier to handle the certificates since the key doesn't need to be owned by the process UID anymore.

Make charts up to date

  • Include latest changes from sda-helm
  • Add local changes to sda-db
  • Add local changes to sda-pipeline
  • Add local changes to sda-orch
