sensitive-data-archive's Issues

[GH Actions] investigate why Trivy scan fails for the Java container

FATAL image scan error: scan error: scan failed: failed analysis: analyze error: pipeline error: failed to analyze layer (sha256:042f59a781dde0a48077e7a02e56010d3ce043e60b4b4152c96bf9b7b585d4f1): post analysis error: post analysis error: Unable to initialize the Java DB: Java DB update failed: Java DB update error: Java DB metadata error: unable to decode metadata: EOF

Re-running the task will make it pass

[sda-auth] Provide endpoint for sda-cli login

This ticket covers the changes needed in sda-auth.
On the auth side there should be an /info endpoint that provides the parameters the cli needs to proceed with the login.

In the cli we run:
$ sda-cli login bp.nbis.se

This fetches the info endpoint at bp.nbis.se/api/info, which returns a JSON object with the fields:

{
    ClientID
    OidcURI
    PublicKey
    InboxURI
}

The sda-cli can then use these to log in to the service via LS AAI. The config file for uploading files and the public key should then be available.
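
A minimal sketch of the cli side, assuming the endpoint lives at /api/info and returns exactly the field names above (the struct and helper names are hypothetical):

package cli

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// LoginInfo mirrors the JSON above; field names are taken verbatim from the issue.
type LoginInfo struct {
	ClientID  string `json:"ClientID"`
	OidcURI   string `json:"OidcURI"`
	PublicKey string `json:"PublicKey"`
	InboxURI  string `json:"InboxURI"`
}

// fetchLoginInfo is a hypothetical helper that retrieves the login
// parameters from e.g. https://bp.nbis.se/api/info.
func fetchLoginInfo(host string) (*LoginInfo, error) {
	resp, err := http.Get("https://" + host + "/api/info")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("info endpoint returned %s", resp.Status)
	}
	info := &LoginInfo{}
	if err := json.NewDecoder(resp.Body).Decode(info); err != nil {
		return nil, err
	}
	return info, nil
}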

Description from Miro:
Login to the LocalEGA service for upload using the OIDC "device authentication flow".

Our LifeScience AAI configuration should already be in place.

Upon successful login the configuration parameters should be saved in a .sda-cli-session file (or similar) that can then be used on following usage of sda-cli (such as upload/encrypt and so on). Warn before overwriting.

The LS AAI Service ID needs to be in sda-cli or in some configuration file for this to work, along with an endpoint where it can be reached, so it's easy to detect which system (BigPicture, EGA, or something else) is used. Use sda-auth if only the LS AAI client id is needed, otherwise the api service in sda-pipeline (or the merged repo).

Endpoint should return everything needed for creating the config file and the following:
Client id
URL for s3 inbox
Public key of repository

Fix known vulnerabilities for DOA

openjdk:18-alpine has a number of known vulnerabilities

A/C

  • Update the base image
  • Make sure no CRITICAL/HIGH vulnerabilities are still around

Cancel messages - Ingest

Description
Make a submission, add a file and then cancel it

Current behavior
Ingest currently starts the ingestion process again, without checking if the message with the specific correlation id already exists.

Expected behavior
Ingest should only change the status of the file, instead of creating a new row in the database and ingesting the file again, and then send a message to trigger the verify step.

Merge Golang based apps

Expected structure:

cmd/<APPNAME-1>
cmd/<APPNAME-2>
shared/<COMPONENTNAME-1>
shared/<COMPONENTNAME-2>
Dockerfile
  • auth
  • download
  • re-encryption service

A/C

  • use similar setups as in sda-pipeline
  • ensure all tests keep working

[api] API call that can give information about the status of each file in the ingestion pipeline

As a submitter
I want to have an API to get information about the state of files in the system
So that I can see that things are happening and remind myself what files I've uploaded.

Given a specific submission or user, the API should report, for each file, where it is in the system: in the inbox, whether ingestion has started, whether there are errors, and so forth. It's basically reporting the status and metadata logged in the file_event_log table.

The authentication for this service could/should be similar to the download or upload services.

This is an API for the users
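
A hedged sketch of the query such an endpoint could run; the table comes from the schema mentioned above, but the exact column names here are assumptions:

package api

import (
	"database/sql"
	"time"
)

// FileStatus is a hypothetical response item for the status endpoint.
type FileStatus struct {
	FileID string
	Event  string
	At     time.Time
}

// latestFileEvents returns the most recent logged event per file for a user.
// Column names (file_id, event, started_at, user_id) are assumptions; check
// the actual file_event_log definition in 01_main.sql.
func latestFileEvents(db *sql.DB, user string) ([]FileStatus, error) {
	rows, err := db.Query(`
		SELECT DISTINCT ON (file_id) file_id, event, started_at
		  FROM sda.file_event_log
		 WHERE user_id = $1
		 ORDER BY file_id, started_at DESC`, user)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var statuses []FileStatus
	for rows.Next() {
		var s FileStatus
		if err := rows.Scan(&s.FileID, &s.Event, &s.At); err != nil {
			return nil, err
		}
		statuses = append(statuses, s)
	}
	return statuses, rows.Err()
}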

[sda-download] Handle "personalities" / use cases of on/off platform

For "on platform" use in bigpicture, sda-download should support decrypted downloads (so no crypt4gh should be required on the client side).

For "off platform" (normal use), data should be reencrypted for the users key. Probably un- (crypt4gh)-encrypted downloads should not be permitted (TLS will always be applied).

How do we want to do this? Possible strategies:

  • A simple toggle, with different deployments for on/off platform access
  • Some rule system to behave properly
  • Something else

[sda] Consistent handling of messages

As a developer
I want to have a consistent way of handling the messages and their errors (nacking, rerouting, sending to error queue etc)
In order to be able to trust the system

There is an effort in this document to map all the cases that can go wrong in the sda-pipeline, in order to handle them properly.

Rules:

  • On all database errors, the message should be Nack'ed and requeued.
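
As an illustration of the database-error rule, with the amqp091-go client (a sketch, not existing pipeline code; processMessage is a stand-in):

package consumer

import (
	amqp "github.com/rabbitmq/amqp091-go"
	log "github.com/sirupsen/logrus"
)

// processMessage stands in for the real work, e.g. a database update.
func processMessage(body []byte) error { return nil }

// handle applies the rule above: on a database error the delivery is
// Nack'ed with requeue=true so the broker redelivers it later.
func handle(delivered amqp.Delivery) {
	if err := processMessage(delivered.Body); err != nil {
		log.Errorf("database error, requeueing message: %v", err)
		// multiple=false: only this delivery; requeue=true: back on the queue.
		if err := delivered.Nack(false, true); err != nil {
			log.Errorf("failed to nack message: %v", err)
		}
		return
	}
	if err := delivered.Ack(false); err != nil {
		log.Errorf("failed to ack message: %v", err)
	}
}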

[sda-download] Get public key for receiver reencryption

sda-download currently does not support encrypted downloads, but it should (see e.g. #361)

To be able to reencrypt the requested data for the user, it needs the public key of the receiver. If it can't get it, it should fail the request.

Generally I think we should support accepting the receiver public key in the request, but to stay compatible with "normal" s3 clients we should also support a per-user default key (settable somehow through a web interface; that interface is out of scope for this issue, but it needs to be defined).

[sda-db] Append-only schema for auth logs

There should be a separate database schema for logging JWTs (JSON Web Tokens). There should be a log entry every time a token is generated, and potentially when it is used in a service.

The following data should be logged:

  • timestamp
  • authenticated user
  • submission account
  • JWT

[s3inbox] Add integration tests for key rotation

Test that s3inbox can work with both jwk and local key validation.

The easiest way to do this seems to be to modify oidc.py, adding a secondary token signed by the local key that is created when the compose test environment is deployed.

[sda] Store file errors in the database

As a sda-admin

I want to store file error messages in the database

The record should be stored in the file_event_log table, with the error as a text string

Note: This issue needs to be refined, e.g. how should the error be stored?
Schema can be found at https://github.com/neicnordic/sensitive-data-archive/blob/main/postgresql/initdb.d/01_main.sql#L111-L144

  1. Add an option to put data in the error field in the Go function that updates the status of files (see the sketch after the lists below).
  2. Go through the sda-pipeline and add a call to this function everywhere something goes wrong with files.

Store:

  • information that should be relayed back to user. File errors

Don't store:

  • Database connection errors
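
A hedged sketch of item 1 above; the function signature and SQL are assumptions about how the status-update helper might be extended, not the actual sda-common code:

package database

import "database/sql"

// SDAdb is a stand-in for the project's database wrapper.
type SDAdb struct {
	DB *sql.DB
}

// UpdateFileEventLog logs a file event with an optional error text.
// Only file errors meant to be relayed back to the user should be passed
// in errText; database connection errors should not be stored.
func (dbs *SDAdb) UpdateFileEventLog(fileID, event, corrID, user, errText string) error {
	const query = `
		INSERT INTO sda.file_event_log(file_id, event, correlation_id, user_id, error)
		VALUES ($1, $2, $3, $4, NULLIF($5, ''))`
	_, err := dbs.DB.Exec(query, fileID, event, corrID, user, errText)
	return err
}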

[sda-download] Implement byte range handling to support random access

To support random access, sda-download can handle Range requests.

This could include:
[ ] Send Accept-Ranges: bytes header on request
[ ] Change the S3 code to support getting the requested data only, provide as a ReadSeeker
[ ] If serving unencrypted, make sure to provide the s3 as a ReadSeeker to NewCrypt4GHReader and serve the needed region(s) from the Crypt4GHReader
[ ] If serving encrypted serve the needed region(s)
[ ] Update tests

sda-download currently supports getting a segment of the (unencrypted) data via the URL parameters startCoordinate and endCoordinate (https://neic-sda.readthedocs.io/en/latest/dataout/#rest-api-endpoints); these likely also need to be supported going forward.
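
One way to get most of this in Go, sketched under the assumption that the served object can be exposed as an io.ReadSeeker: http.ServeContent parses the Range header and emits Accept-Ranges and Content-Range (and 206 responses) on its own.

package download

import (
	"io"
	"net/http"
	"time"
)

// serveRange hands the content off to http.ServeContent, which handles
// Range requests for any io.ReadSeeker. For unencrypted serving, content
// would be (a ReadSeeker wrapper around) the Crypt4GHReader; for encrypted
// serving, the raw S3 object.
func serveRange(w http.ResponseWriter, r *http.Request, name string, modTime time.Time, content io.ReadSeeker) {
	http.ServeContent(w, r, name, modTime, content)
}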

[Postgres] Remove legacy views from the database

This issue is blocked by #13

This will be a breaking change

A/C

  • remove the local_ega.* schemas
  • remove the legacy users
  • test updating from a v3 database with data in it
  • test updating from a v4 database with data in it
  • switch all apps to work directly with the SDA schema
    • sda-common
    • sda-doa
    • sda-download
    • sda-pipeline
    • sftp-inbox

[sda-auth] login context data is not read properly

How to reproduce:

Start the backend with docker compose up cega oidc, then from sda-auth/dev-server run:

$ ./auth
{"level":"info","msg":"The logs format is set to JSON","time":"2023-10-23T17:26:45+02:00"}
{"level":"info","msg":"Setting log level to 'info'","time":"2023-10-23T17:26:45+02:00"}
{"level":"info","msg":"Serving content using http","time":"2023-10-23T17:26:45+02:00"}
Iris Version: 12.2.7
Build Time: 2023-10-20T11:13:39Z
Build Revision: ccd56f1

Now listening on:
Local: http://localhost:8080
Application started. Press CTRL+C to shut down.
{"authType":"elixir","level":"info","msg":"User was authenticated","time":"2023-10-23T17:27:16+02:00","user":"test"}
{"level":"error","msg":"Failed to view login form: template: elixir.html:23:42: executing "elixir.html" at \u003c.infoUrl\u003e: can't evaluate field infoUrl in type main.ElixirIdentity","time":"2023-10-23T17:27:16+02:00"}```

[sda] Move logging to functions

As a developer
I want to move the log.Errorf log messages of the pipeline services to functions
So that the code is more readable and easier to change.

These functions can be added in the common folder and called with different messages, depending on the case.

This type could be a good first step:

log.Errorf("Failed to open file to ingest "+
	"(corr-id: %s, user: %s, filepath: %s, reason: %v)",
	delivered.CorrelationId,
	message.User,
	message.Filepath,
	err)
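
A sketch of what such a helper could look like (name and placement are suggestions, not existing code):

package common

import log "github.com/sirupsen/logrus"

// LogFileError centralizes the repeated "failed with file context" pattern;
// callers pass a short action such as "open file to ingest".
func LogFileError(action, corrID, user, filepath string, err error) {
	log.Errorf("Failed to %s (corr-id: %s, user: %s, filepath: %s, reason: %v)",
		action, corrID, user, filepath, err)
}

// The example above then becomes:
// LogFileError("open file to ingest", delivered.CorrelationId, message.User, message.Filepath, err)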

A/C:

  • Pull request with refactored log.Errorf messages
  • (After the task is finished) A new card for the next type of log messages that needs refactoring

Rework RabbitMQ

A/C

  • use stream type queues for things we need to keep around
  • simplify setup script (make use of things in the base image)
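
For the first point, a stream-type queue can be declared by passing the x-queue-type argument, e.g. with amqp091-go (a sketch of the idea, not the setup script itself):

package setup

import amqp "github.com/rabbitmq/amqp091-go"

// declareStreamQueue creates a stream-type queue; streams retain messages
// and are read by offset instead of being destructively consumed.
func declareStreamQueue(ch *amqp.Channel, name string) error {
	_, err := ch.QueueDeclare(
		name,
		true,  // durable: required for stream queues
		false, // autoDelete
		false, // exclusive
		false, // noWait
		amqp.Table{"x-queue-type": "stream"},
	)
	return err
}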

[api] API call that can list files uploaded by a specific user/project/submission (any of those is ok)

The API request should use some sort of token from the user and list all files uploaded by that user. This information should be in the new database schema.

The inboxes will create this information once the migration to the new schema is done, but this feature can be implemented before that with some mock database content.

  • Authenticate from bearer token
  • extract (project) name from token
  • list uploaded files (still in inbox??)

[sda-db] Handle stable_id collisions

Background

CEGA generates file stable IDs from the unencrypted file checksum. This means that files with identical content will get the same stable_id. This can cause problems, since it prevents finalize from assigning the stable_id and continuing to mapper.

For CEGA, using the same ID works since they only access files by stable_id, but for Bigpicture there have been requests to download files by the upload filename. So each file needs a correct stable_id and submission_file_path.

Possible Solution

One of the core issues is whether multiple uploads should share the same sda.files entry. Storage deduplication can be solved by pointing to the same archive path regardless of number of items, but there is also a case to be made for having all files containing the same data use the same database entry. One option here is to move the storage information into a separate table, so that multiple "upload-files" could reference a single "storage file".

It is likely required to remove the stable_id field from the sda.files table, and instead use the file_references table to store the ID. This is partly a matter of simplifying the schema so that all potential stable IDs are handled in the same manner. If multiple sda.files-entries are used, this also allows multiple files to have the same ID when needed for FEGA.

To solve the problem of submission_file_path if one sda.files-entry is used, one solution is to add a file_path field to the file_dataset table, so that a file can have a unique path for each dataset it's part of while still only referencing one sda.files entry.

Memory streamlining and concurrency

The AWS Go SDK has some shenanigans and will happily use concurrency (and allocate buffers for it). To streamline memory usage, we should allow setting a lower concurrency than the default of 5.
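
With aws-sdk-go's s3manager, for example, both the concurrency and the per-part buffer size are tunable when the uploader is built (a sketch; the exact SDK version in use may differ):

package upload

import (
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// newUploader builds an uploader with reduced concurrency, so fewer
// part-sized buffers are allocated at any one time.
func newUploader(sess *session.Session) *s3manager.Uploader {
	return s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.Concurrency = 1            // default is 5
		u.PartSize = 5 * 1024 * 1024 // 5 MiB, the S3 minimum part size
	})
}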

[s3proxy] integration test for allowed characters

We should have integration tests to show that we allow unicode characters and disallow special restricted characters. The test does not have to pass to close this card (that is, you do not have to fix any issues that show up).

The tests should be run locally.

Example of allowed filenames:

  • 🍫 ⋆ 🍑 🎀 𝒻𝒶𝓃𝒸𝓎 𝓂𝑒𝒹𝒾𝒸𝒶𝓁 𝒾𝓂𝒶𝑔𝑒 🎀 🍑 ⋆ 🍫.dcm
  • ņọŗmal_meḑĩćąl_ĭmagȩ.dcm

Examples of disallowed filenames: (disallowed characters are \/:*?"<>| + control characters)

  • image\file.dcm
  • genomic:file.dcm

Cancel messages - verify

When re-ingesting a file, if the fileID and checksum already exist in the checksum table, we shouldn't exit with an error; just update the file status and send a message as usual.

[s3-inbox] Rework handling of public keys

When importing public keys from a JWK endpoint we only read the first key. https://github.com/neicnordic/sda-s3proxy/blob/a3f4d9b3ec9c906692bb604c5317b7dccfa9c978/userauth.go#L171
This means we have no way of handling key rotations.
Fixing this means we can no longer refer to a key by the name of the issuer of a token.

A/C

  • Keys are imported as a list in memory (this can be separated into one list for RSA and one for EC to minimize errors)
  • Multiple keys from the same issuer can be present, both RSA and EC for example
  • Instead of looking for a key by name, we try to validate the token against all keys in our list until we find one that matches.
  • We deny access only after we have tested all present keys of a given type.
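
A sketch of the try-all-keys validation with golang-jwt (key import is elided; which JWT library the proxy actually uses is an assumption):

package auth

import (
	"fmt"

	jwt "github.com/golang-jwt/jwt/v4"
)

// validateToken tries the token against every imported public key
// (RSA and EC alike) and denies access only after all keys have failed.
func validateToken(tokenString string, keys []interface{}) (*jwt.Token, error) {
	var lastErr error
	for _, key := range keys {
		key := key // capture for the closure
		token, err := jwt.Parse(tokenString, func(*jwt.Token) (interface{}, error) {
			return key, nil
		})
		if err == nil && token.Valid {
			return token, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("token did not validate against any key: %w", lastErr)
}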

[sda-auth] Add htsget to s3config

As an sda-user
I want to have the htsget endpoint in the configuration file after logging in
In order to be able to download files from the htsget endpoint

The point is to have the URL of the htsget endpoint in the S3 configuration that comes from sda-auth.

This should be included in the sda-auth and in the helm charts for deploying the auth service (as optional configuration)

[api] Add authentication to API service

As an sda-user
I want to be able to authenticate with the sda-pipeline API
in order to be able to get information about my submission

The authentication should follow the sda-download approach, where we use a JWK endpoint for fetching the public key and validating the token.

There is a discussion in this PR about this issue

[sda-download] Implement header switch/re-encryption/decryption

If a user sends a public key, re-encrypt the data
related params:

        - name: Public-Key
          in: header
          description: Public Encryption key
          required: false
          schema:
            type: string
        - name: destinationFormat
          in: query
          description: destinationFormat
          required: false
          schema:
            type: string
            default: plain | crypt4gh

Also take a look at https://github.com/neicnordic/sda-download/blob/0aa5af3ce690a610c5c2b369dd9d0ef8d58468d7/internal/database/database.go#L173-L178

[api] As a technical submitter (user), I want an API that I can interact with so that I can control my submission.

Describe the user story

There are several things that a technical person would like to ask the service, for example: what files are uploaded, how is the submission going, are there any ingestion errors, how many files are left to be ingested, are all files backed up, and so forth.

This issue is a parent issue to a number of other tasks that we need to implement.

Tasks:

Testing

Unit testing for the API calls. Maybe create mock database data that has a few files in each state. This can be done per task above.

[SDA] support for multiple s3 buckets

Please describe the feature
As a service operator, I want the entire S3 path stored in the database so that I don't manually have to point the app to the correct backend during startup.

If the storage hostname is expected to be static, the path can be stored as: s3://<BUCKETNAME>/<OBJECTPATH>

If support for multiple storage hosts is needed, the path can be stored in the AWS standard form <BUCKETNAME>.<HOSTNAME>/<OBJECTPATH>
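
Splitting the stored full path back into bucket and object could then look like this (a sketch for the s3://<BUCKETNAME>/<OBJECTPATH> form):

package storage

import (
	"fmt"
	"net/url"
	"strings"
)

// splitS3Path splits "s3://bucket/path/to/object" into bucket and object key.
func splitS3Path(stored string) (bucket, key string, err error) {
	u, err := url.Parse(stored)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "s3" || u.Host == "" {
		return "", "", fmt.Errorf("not an s3 path: %q", stored)
	}
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}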

Acceptance criteria

  • Full path is stored in submission_file_path
  • Full path is stored in archive_file_path
  • Full path is stored in backup_path
  • bucket is removed from the S3 config as it will not be needed
  • Tests verifying the changes are added/updated

Additional context
Affected components:

  • postgresql
  • sda
    • cmd/ingest
    • cmd/finalize
    • cmd/mapper
    • cmd/s3inbox
    • cmd/verify
    • internal/config
    • internal/storage
  • sda-download

Estimation of size: big

Estimation of priority: medium

Cancel messages - Finalize

Description
Make a submission, add a file and then cancel it

Case 1: The file does not have a stable id - Seems to work correctly
Expected behavior
Finalize should do the same job as per usual, as long as the file status is not set to disabled.

Case 2: The file has been given a stable id
Current behavior
Finalize tries to add the stable id, but since it already exists, the database returns an error and the correct message to CEGA is never sent (an error message is sent instead).

Expected behavior
Finalize should only update the file status to ready and send the outgoing message to CEGA

Document go based orchestrator according to the standard

All of the SDA services are documented according to a standard set by the SDA handbook. The orchestrator currently lacks this documentation, so it needs to be added.

An example of the documentation wanted can be found here (or at any other pipeline service).

The documentation should be placed in a file called orchestrator.md alongside the code.

Mapper container cannot initialize for posix storage type

When running the application in federated mode with the inbox storage type set to posix, the mapper container does not start.

Log error:
{"level": "fatal", "msg": "stat /inbox/: no such file or directory", "time": "2023-09-13T09:56:45Z"}

The error is generated by the following changes to the mapper:

inbox, err := storage.NewBackend(conf.Inbox)
if err != nil {
	log.Fatal(err)
}

Update database image to use Postgresql 14 or higher

This will be a breaking change - automatic upgrade will not be possible

A/C

  • base image uses Postgresql 14 or higher

This makes it easier to handle the certificates since the key doesn't need to be owned by the process UID anymore.

Make charts up to date

  • Include latest changes from sda-helm
  • Add local changes to sda-db
  • Add local changes to sda-pipeline
  • Add local changes to sda-orch
