
data-appliance-gx's Introduction

Project

The Data Appliance GX project is intended as a proving ground for GAIA-X and data transfer technologies.

Getting Started

The project requires JDK 11+. To get started:

git clone https://github.com/microsoft/Data-Appliance-GX

cd Data-Appliance-GX

./gradlew clean shadowJar

To launch the runtime and client from the root clone directory, respectively:

java -jar runtime/build/libs/dagx-runtime.jar

java -jar client/build/libs/dagx-client.jar

Build Profiles

The runtime can be configured with custom modules by enabling various build profiles.

By default, no vault is configured. To build with the file system vault, enable the security profile:

./gradlew -Dsecurity.type=fs clean shadowJar

The runtime can then be started from the root clone directory using:

java -Ddagx.vault=secrets/dagx-vault.properties -Ddagx.keystore=secrets/dagx-test-keystore.jks -Ddagx.keystore.password=test123 -jar runtime/build/libs/dagx-runtime.jar

Note that the secrets directory referenced above is configured to be ignored by Git. A test keystore and vault must be added (or the launch command modified to point to different locations), and the keystore password set accordingly.

A word on distributions

The code base is organized in many different modules, some of which are grouped together using so-called "feature bundles". For example, accessing IDS requires a total of 4 modules, which are grouped together in the ids feature bundle. So developers wanting to use that feature only need to reference ids instead of all 4 modules individually. This allows for a flexible and easy composition of the runtime. We'll call those compositions "distributions".

A distribution is essentially a Gradle module that, in its simplest form, consists only of a build.gradle.kts file which declares its dependencies and how the distribution is assembled, e.g. as a *.jar file, a native binary, a docker image, etc. It may also contain further assets like configuration files. An example of this is shown in the distributions/demo folder.

Building and running with Docker

We suggest that all docker interaction be done with a Gradle plugin, because it makes it very easy to encapsulate complex docker commands. An example of its usage can be seen in distributions/demo/build.gradle.kts.

The docker image is built with

./gradlew clean buildDemo

which will assemble a JAR file that contains all required modules for the "Demo" configuration (i.e. file-based config and vaults). It will also generate a Dockerfile in build/docker and build an image based upon it.

The container can then be built and started with

./gradlew startDemo

which will launch a docker container based on the previously built image.

Setup Azure resources

A working connector instance will use several resources on Azure, all of which can be easily deployed using a so-called "genesis script" located at ./scripts/genesis.sh. Most Azure resources are grouped together in "resource groups", which makes management quite easy. The Genesis Script takes care of provisioning the most essential resources and puts them in a resource group:

  • KeyVault
  • Blob Store Account
  • an AKS cluster
  • two app registrations / service principals

App registrations are used to authenticate apps against Azure AD and are secured with certificates, so before provisioning any resources, the Genesis Script will generate a certificate (interactively) and upload it to the app registrations.

The script requires the Azure CLI (az) and jq (a JSON processor), and you need to be logged in to the Azure CLI. Once that's done, simply cd scripts and invoke

./genesis.sh <optional-prefix>

The <optional-prefix> is not required, but if specified it should be a string without special characters, as it is used as a suffix in the names of the Azure resources. If omitted, the current POSIX timestamp is used.

After completing its work, which can take more than 10 minutes, the script can automatically clean up the resources; observe the CLI output.

Note that the Genesis Script does not deploy any applications such as Nifi; this is handled in a second stage!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

data-appliance-gx's People

Contributors

ivan-shaporov, jimmarino, microsoftopensource, paullatzelsperger, scogromsft


data-appliance-gx's Issues

Create a "Schema" for data addresses

Description

Atlas and NiFi should share a common understanding of how DataAddress objects translate to Nifi TransferEndpoints.
With that, the following things become possible:

  • Atlas can create custom type definitions based on the properties, their types and whether they are required or not
  • The Atlas Data Seeder extension can generate example entities and validate against the schema
  • NiFi can validate incoming DataAddress objects against the schema
  • Nifi can have one generic NifiTransferEndpoint and an associated ...Converter that transforms a DataAddress into a NifiTransferEndpoint

Global schema properties

  • every schema must have a keyName and a type field
  • the schema should be registered in a SchemaRegistry, which will likely get instantiated in a separate SchemaExtension (see the sketch below)
  • JSON Schema should be used as the underlying concept
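
The list above can be illustrated with a minimal, hypothetical sketch; a real SchemaRegistry would likely hold proper JSON Schema documents rather than plain lists of required property names, and all method names here are assumptions.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a schema is identified by its "type" and lists its required properties.
// A real implementation would validate against a JSON Schema document instead.
public class SchemaRegistry {

    private final Map<String, List<String>> requiredPropertiesByType = new ConcurrentHashMap<>();

    public void register(String type, List<String> requiredProperties) {
        requiredPropertiesByType.put(type, List.copyOf(requiredProperties));
    }

    // Returns true if a DataAddress-style property map satisfies the schema registered for the type.
    public boolean isValid(String type, Map<String, String> properties) {
        var required = requiredPropertiesByType.get(type);
        return required != null && properties.keySet().containsAll(required);
    }
}

A SchemaExtension could register e.g. an AzureBlobStore schema at startup, and NiFi-facing code could call isValid before converting a DataAddress into a NifiTransferEndpoint.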

Upload Nifi flow templates via REST API after Terraform deployment

After everything is deployed to Azure, we need to deploy the default TwoClouds.xml flow template and start the root process group. A provider for Terraform can be found here.

Currently the main problem is how to authenticate against Nifi with OpenID - that requires user interaction.

We probably need input from MSFT's experts on Nifi on how to best authenticate a machine client to Nifi.

Note: since we're invoking terraform from a command line where we need to be logged in to az cli, we might be able to salvage those credentials?

Create custom S3-Processor

In order to be able to deal with temporary credentials and STS tokens, a custom processor must be created for Nifi.
We must also devise a way of either automatically supplying that new processor to Nifi (e.g. in I-tests) or modifying the existing official Nifi S3 Processors upstream.

Investigate Polymorphic Data Model in Atlas for Storing Providers

GaiaX Types in Atlas currently have a flat schema for storing information about cloud file locations. The properties will be different among Azure, AWS, and GCP, so investigate whether we can have a polymorphic structure so that we only store the properties that make sense for the cloud provider type (i.e. avoid needing attributes for every possible storage provider as optional properties on each entity, like AzureBlobAccount, AwsS3BlobAccount, GcpBlobAccount, etc., for better enforcement of mandatory properties).

`AtlasApiImpl`: check if custom type exists, then create or update

In the current implementation the AtlasApiImpl will always create a custom type and fail with an AtlasServiceException when the custom type already exists, which makes it a bit cumbersome in tests etc.

It would be better to check whether a custom type with the given name already exists and then update or create it accordingly.
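
A sketch of that check-then-create-or-update flow; the AtlasTypeApi interface and its method names below are hypothetical stand-ins for whatever AtlasApiImpl actually calls on the Atlas client.

// Hypothetical sketch only; the real Atlas client calls and type-definition classes will differ.
public final class CreateOrUpdateTypeDef {

    interface AtlasTypeApi {
        boolean typeDefExists(String typeName);
        void createTypeDef(String typeName, String definitionJson);
        void updateTypeDef(String typeName, String definitionJson);
    }

    // Check first, then update the existing type or create a new one, instead of always
    // creating and failing with an AtlasServiceException when the type already exists.
    static void createOrUpdate(AtlasTypeApi api, String typeName, String definitionJson) {
        if (api.typeDefExists(typeName)) {
            api.updateTypeDef(typeName, definitionJson);
        } else {
            api.createTypeDef(typeName, definitionJson);
        }
    }
}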

Terraform: Add ServicePrincipals for two connectors

In order for the fully-fledged end-to-end test to work, there must be two Azure AD service principals, one for the requesting connector (="client") and one for the providing connector (="provider").

The SPs must be configured such that the client can authenticate with the provider using OAuth2.

The creation of these SPs should be done in Terraform.

Create a better way to run integration tests only on CI

Currently we have integration tests that are annotated with @EnabledIfEnvironmentVariable(named = "CI", matches = "true"), and then we need to evaluate the CI env var again in the @BeforeAll method because that one would still get called:

@BeforeAll
public static void setup() {
    // this is necessary because the @EnabledIf... annotation does not prevent @BeforeAll from being called
    var isCi = propOrEnv("CI", "false");
    if (!Boolean.parseBoolean(isCi)) return;
}

Maybe this can be achieved with custom composed annotations (cf. the JUnit docs).
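
One possible shape for such a composed annotation, assuming a hypothetical @IntegrationTest name; it combines the environment guard with a JUnit 5 tag so that CI-only tests can also be selected by tag.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable;

// Hypothetical composed annotation: applying it to a test class enables the class only on CI
// and tags it as an integration test.
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("integration")
@EnabledIfEnvironmentVariable(named = "CI", matches = "true")
public @interface IntegrationTest {
}

Test classes would then carry @IntegrationTest instead of repeating the @EnabledIfEnvironmentVariable condition; whether the @BeforeAll guard can be dropped entirely would still need to be verified.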

Publish code base to Maven

Our code should be available as a Maven/Gradle dependency. As a first step we'll use GitHub Packages for that, but it's not unlikely that we'll eventually deploy to Maven Central as well.

Centralize instantiation of `OkHttpClient`

In order to be able to control several features like timeouts and potentially retry policies at a central location, the instantiation of the OkHttpClient should be centralized.
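
A minimal sketch of such a central factory; the class name and the timeout values are illustrative, and retry policies or interceptors could later be added in the same place.

import java.util.concurrent.TimeUnit;

import okhttp3.OkHttpClient;

// Hypothetical factory: the single place where timeouts (and later retry/interceptor policies)
// are configured for all OkHttpClient instances used by the runtime.
public final class HttpClientFactory {

    private HttpClientFactory() {
    }

    public static OkHttpClient create() {
        return new OkHttpClient.Builder()
                .connectTimeout(30, TimeUnit.SECONDS)   // illustrative values
                .readTimeout(30, TimeUnit.SECONDS)
                .writeTimeout(30, TimeUnit.SECONDS)
                .build();
    }
}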

Dockerize & Setup Build System

The DA-GX Runtime should be containerized with Docker. For that, all command-line parameters for the runtime should be accessible through Docker environment variables.

The build pipeline should perform the following tasks:

  • compile code
  • run tests
  • build Docker image, tag with build number (e.g. microsoft/dagx-runtime:0.0.1-12345)
  • upload docker image to Azure Container Registry

Note: for now we'll use GitHub Actions for the CI/CD

Poll for transfer completion

As specified in issue #61, a special file is created after a transfer process has completed. Consequently, the client (i.e. the requesting connector) should poll for that file. Once it is there, it can free any resources (e.g. deprovision the bucket, issue #63).
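
A rough sketch of that polling loop on the consumer side; the markerExists callback is a hypothetical stand-in for the actual blob/S3 lookup of the {DataRequest#id}.complete file.

import java.time.Duration;
import java.util.function.BooleanSupplier;

// Sketch only: polls until the completion marker appears at the destination or the timeout elapses.
public final class CompletionPoller {

    public static boolean waitForCompletion(BooleanSupplier markerExists, Duration timeout, Duration interval)
            throws InterruptedException {
        var deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (markerExists.getAsBoolean()) {
                return true; // transfer finished; the caller can now deprovision resources
            }
            Thread.sleep(interval.toMillis());
        }
        return false; // timed out without seeing the marker
    }
}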

Let the `Monitor` intercept `log4j`

The Apache Atlas client library comes with log4j (unfortunately), so we need to write a logging interceptor for it. The same thing has already been done for SLF4J (cf. MonitorProvider).

Implement Azure Key Vault

  • Research how to connect to Azure Key Vault and what information is necessary to connect.
  • Implement a new VaultExtension that connects to Azure Key Vault (see the sketch below).
  • Devise a strategy for how that information is supplied to the newly created extension.
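
A minimal sketch of how the new extension could obtain a SecretClient, using the azure-security-keyvault-secrets and azure-identity SDKs; the factory class is hypothetical, and the vault URL and credentials would come from whatever configuration strategy is chosen (a client certificate, as used by the Genesis Script, would work equally well).

import com.azure.identity.ClientSecretCredentialBuilder;
import com.azure.security.keyvault.secrets.SecretClient;
import com.azure.security.keyvault.secrets.SecretClientBuilder;

// Hypothetical helper; the real VaultExtension would wrap the SecretClient behind the Vault interface.
public final class AzureVaultClientFactory {

    public static SecretClient createSecretClient(String vaultUrl, String tenantId,
                                                  String clientId, String clientSecret) {
        var credential = new ClientSecretCredentialBuilder()
                .tenantId(tenantId)
                .clientId(clientId)
                .clientSecret(clientSecret)
                .build();
        return new SecretClientBuilder()
                .vaultUrl(vaultUrl)
                .credential(credential)
                .buildClient();
    }
}

Secrets could then be resolved with secretClient.getSecret(name).getValue().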

Terraform: let the KeyVault manage the storage account

The StorageAccount in Azure should be configured such that it is managed by the KeyVault and thus the KeyVault handles:

  • Key regeneration
  • obtaining Keys
  • obtaining SAS tokens

Ideally Terraform has providers for this; otherwise we must execute a shell script.

Benefit: we do not need to store StorageAccount credentials in code anymore; we can simply ask the KeyVault to generate a SAS token on demand.

Use this link as starting point.
Or this discussion: https://docs.microsoft.com/en-us/answers/questions/144549/generating-sastokens-for-files-inside-my-blob-stor.html

Add Installation Support for Single Cluster Deployment

The deployment docs currently have each component deployed to its own K8S cluster. We want to support the scenario of all three components deployed to a single cluster.

This issue tracks the work to update the docs and add any artifacts (Helm charts?) needed to get that done, e.g. setting up K8S services.

Write an extension for the `Monitor`

Currently the Monitor is instantiated in a hard-coded fashion in DagxRuntime.java. This should be changed so that it is provided by its own ServiceExtension utilizing the service locator pattern and the ServiceLoader mechanism; a sketch of the selection logic follows the requirements below.

Requirements:

  • If there is no monitor configured, use the default ConsoleMonitor
  • if there is one configured, use that
  • if several implementations of Monitor are found, use a MultiplexingMonitor that wraps all monitors and distributes log messages.
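
A sketch of that selection logic; the Monitor, ConsoleMonitor and MultiplexingMonitor types below are simplified stand-ins for the project's real classes, and the actual extension would plug this into the ServiceExtension lifecycle.

import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public final class MonitorLoader {

    // Simplified stand-in for the project's Monitor interface.
    public interface Monitor {
        void info(String message);
    }

    public static class ConsoleMonitor implements Monitor {
        public void info(String message) {
            System.out.println(message);
        }
    }

    // Wraps several monitors and distributes log messages to all of them.
    public static class MultiplexingMonitor implements Monitor {
        private final List<Monitor> delegates;

        public MultiplexingMonitor(List<Monitor> delegates) {
            this.delegates = delegates;
        }

        public void info(String message) {
            delegates.forEach(m -> m.info(message));
        }
    }

    // No monitor configured -> ConsoleMonitor; exactly one -> use it; several -> multiplex.
    public static Monitor loadMonitor() {
        List<Monitor> monitors = new ArrayList<>();
        ServiceLoader.load(Monitor.class).forEach(monitors::add);
        if (monitors.isEmpty()) {
            return new ConsoleMonitor();
        }
        if (monitors.size() == 1) {
            return monitors.get(0);
        }
        return new MultiplexingMonitor(monitors);
    }
}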

Restructure the build

The following modifications should be made:

  • provide "BOM"s (bill-of-material) for various modules, such as IDS, so that customers only need a dependency onto the BOM and not all its sub-modules
  • create distributions, that essentially are combinations of various BOMs, for example the "Demo" distribution consists of the IDS-BOM, the FS-BOM (vault, config, etc), whereas the "Azure" distribution might consist of different BOMs
  • We'll use a Docker Gradle Plugin (https://github.com/bmuschko/gradle-docker-plugin) to build and publish docker images for the distributions we ship, which ATM are Demo and Azure

Add Capability to Generate SAS from Account Keys in Data Appliance Code

Atlas currently stores a SAS used for accessing a cloud file. This should be changed to be a KeyVault reference to an account key. Once that change is made, the Data Appliance code needs to generate a short-lived SAS (5mins?) to pass to Nifi for executing the transfer while providing a small attack surface, in the event that the SAS is compromised.
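
A sketch of generating such a short-lived, read-only SAS with the azure-storage-blob 12.x SDK, assuming the account key has already been resolved via its KeyVault reference; the class and parameter names are illustrative.

import java.time.OffsetDateTime;

import com.azure.storage.blob.BlobClientBuilder;
import com.azure.storage.blob.sas.BlobSasPermission;
import com.azure.storage.blob.sas.BlobServiceSasSignatureValues;
import com.azure.storage.common.StorageSharedKeyCredential;

// Illustrative sketch: creates a read-only SAS for a single blob, valid for five minutes.
public final class ShortLivedSasGenerator {

    public static String generateReadSas(String accountName, String accountKey,
                                         String containerName, String blobName) {
        var blobClient = new BlobClientBuilder()
                .endpoint("https://" + accountName + ".blob.core.windows.net")
                .credential(new StorageSharedKeyCredential(accountName, accountKey))
                .containerName(containerName)
                .blobName(blobName)
                .buildClient();

        var permissions = new BlobSasPermission().setReadPermission(true);
        var sasValues = new BlobServiceSasSignatureValues(OffsetDateTime.now().plusMinutes(5), permissions);
        return blobClient.generateSas(sasValues); // returned token is appended to the blob URL for Nifi
    }
}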

Nifi: add completion marker after the transfer has completed

A process step should be added to the Nifi template that puts a special file next to the data destination to indicate that the copy process has finished.
The name of the file should be {DataRequest#id}.complete.

An integration test should also be added.

Provide NiFi in a docker container for testing

In order to accelerate integration testing we should provide a docker image that spins up Apache Nifi with a sample flow already deployed, so we can test e.g. the NifiDataFlowController against it.
The idea would be to spin up the docker container before the test, run test cases against it, and destroy it afterwards.

End2End Data Seeding

This issue assumes that data is going to be copied from Azure Blob Store to an S3 Bucket.

In order for one full end-to-end data transfer to work, the following components must receive seed data:

  • Apache Atlas Type Defs: needs custom type defs for AzureBlobStore and S3Bucket as well as a Policy and a relation type
  • Apache Atlas Entities: needs an Entity for AzureBlobStore and S3Bucket as well as an example policy
  • Nifi: the (updated, #61) flow template must be uploaded (Warning: that means disabling OIDC!)
  • Example asset: an example file should be put into Azure Blob Store, ideally using Terraform
  • Azure Vault: Access Keys or SAS tokens for the data source, i.e. Azure Blob Store, must be stored ahead of time.

Create In-Memory Data Catalog

An in-memory catalog should be implemented to be used in lieu of Apache Atlas. We can either re-use the GenericDataCatalog or explicitly create a new one.
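
A minimal sketch of what such a catalog could look like if created from scratch; entries are simplified to plain property maps here, whereas a real implementation would mirror the existing catalog interface.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative in-memory stand-in for the Atlas-backed catalog.
public class InMemoryDataCatalog {

    private final Map<String, Map<String, String>> entries = new ConcurrentHashMap<>();

    public void register(String assetId, Map<String, String> properties) {
        entries.put(assetId, Map.copyOf(properties));
    }

    public Optional<Map<String, String>> resolve(String assetId) {
        return Optional.ofNullable(entries.get(assetId));
    }
}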

Fork or Contribute to NiFi Helm Chart

We have a lot of customization in the Gaia-X deployment of Nifi that is derived from an existing GitHub project. We should either contribute our changes to the GH project or fork that repo to require less customization (which could break if the underlying GH repo changes).

Expand docker build for different configurations

Currently the docker build process is using the fs security profile by default, but there already are (and will be) others, so the docker build needs to be parameterized.
Options include:

  • with no vault
  • with fs vault
  • with remote debugging enabled
  • with Azure Key Vault (depends on #2)

The easiest way to do it is probably to introduce a shell script.

As a connector implementor I want a "Genesis Script" to provision basic Azure infrastructure

The script should do the following:

  • Create a service principal (="primary identity")

    • create a certificate
    • provision an Azure AD App Registration
    • upload the certificate's public key to that App Registration
    • pull down TenantId and ClientId
  • provision a KeyVault

    • create a key vault in Azure with the RBAC permission model
    • assign the Key Vault Secrets Officer role to the "primary identity" SP
  • provision other resources

    • AKS: for Nifi and Atlas (Note: actually deploying Nifi and Atlas comes later!)
    • another service principal. This is needed by Nifi; refer to Nifi's installation guide for details.
    • a Blob Storage account (e.g. for file transfer)
    • a CosmosDB (for relational data)
  • Populate the KeyVault: the credentials of all services that require authentication, such as the Storage Account key, should be stored in the KeyVault

  • The script should output the ClientId, TenantId and file path of the certificate

Open Questions:

  • Is it okay to implement this as a Bash script (i.e. *nix only)?

Nota bene:

  • This could be a bash script using bare Azure CLI or ARM templates.
  • No applications get deployed, so no helm install or anything of this sort. This script should serve as a simple bootstrapper for Azure services.
  • Deploying Nifi, Atlas, etc. will be done by a separate script
  • Providing seed data (such as e.g. Atlas datatypes) will be done by a separate script

Add Pagination to Atlas Queries

If we have large catalogs, we should use a pagination scheme so that Atlas does not return thousands of datasets to the data appliance in a single response.

Implement deprovisioning of resources

After a data transfer is complete, there will be a completion event (cf. issue #62) at which point we need to free resources, e.g. delete Amazon S3 buckets or Azure Blob Storage containers, remove temporary roles and policies, etc.
For that, the concept of a "provisioning pipeline" should be re-used, so there will likely be an S3DeprovisioningPipeline and an AzureDeprovisioningPipeline.

Most likely there will be resources that are cleaned up by the providing connector ("producer"), and some that are cleaned up by the requesting connector ("consumer").

There will be an external signal necessary to actually trigger the deprovisioning.
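
A rough sketch of how that external signal could be dispatched to provider-specific pipelines; apart from the two pipeline names mentioned above, every name below is hypothetical, and the real pipelines would call the AWS and Azure SDKs to delete buckets, containers, roles, and policies.

import java.util.Map;

public final class DeprovisioningDispatcher {

    // Stand-in for S3DeprovisioningPipeline / AzureDeprovisioningPipeline implementations.
    public interface DeprovisioningPipeline {
        void deprovision(String transferProcessId);
    }

    private final Map<String, DeprovisioningPipeline> pipelinesByDestinationType;

    public DeprovisioningDispatcher(Map<String, DeprovisioningPipeline> pipelinesByDestinationType) {
        this.pipelinesByDestinationType = pipelinesByDestinationType;
    }

    // Invoked by the external completion signal (cf. issue #62).
    public void onTransferCompleted(String transferProcessId, String destinationType) {
        var pipeline = pipelinesByDestinationType.get(destinationType);
        if (pipeline != null) {
            pipeline.deprovision(transferProcessId);
        }
    }
}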
