
data-appliance-gx's Introduction

Project

The Data Appliance GX project is intended as a proving ground for GAIA-X and data transfer technologies.

Getting Started

The project requires JDK 11+. To get started:

git clone https://github.com/microsoft/Data-Appliance-GX

cd Data-Appliance-GX

./gradlew clean shadowJar

To launch the runtime and client from the root clone directory, respectively:

java -jar runtime/build/libs/dagx-runtime.jar

java -jar client/build/libs/dagx-client.jar

Build Profiles

The runtime can be configured with custom modules by enabling various build profiles.

By default, no vault is configured. To build with the file system vault, enable the security profile:

./gradlew -Dsecurity.type=fs clean shadowJar

The runtime can then be started from the root clone directory using:

java -Ddagx.vault=secrets/dagx-vault.properties -Ddagx.keystore=secrets/dagx-test-keystore.jks -Ddagx.keystore.password=test123 -jar runtime/build/libs/dagx-runtime.jar

Note that the secrets directory referenced above is configured to be ignored by Git. A test keystore and vault must be added (or the launch command modified to point to different locations), and the keystore password set accordingly.

A word on distributions

The code base is organized in many different modules, some of which are grouped together using so-called "feature bundles". For example, accessing IDS requires a total of 4 modules, which are grouped together in the ids feature bundle. So developers wanting to use that feature only need to reference ids instead of all 4 modules individually. This allows for a flexible and easy composition of the runtime. We'll call those compositions "distributions".

A distribution is essentially a Gradle module that, in its simplest form, consists only of a build.gradle.kts file which declares its dependencies and how the distribution is assembled, e.g. as a *.jar file, a native binary, a docker image, etc. It may also contain further assets like configuration files. An example of this is shown in the distributions/demo folder.

Building and running with Docker

We suggest that all docker interaction be done with a Gradle plugin, because it makes it very easy to encapsulate complex docker commands. An example of its usage can be seen in distributions/demo/build.gradle.kts.

The docker image is built with

./gradlew clean buildDemo

which will assemble a JAR file that contains all required modules for the "Demo" configuration (i.e. file-based config and vaults). It will also generate a Dockerfile in build/docker and build an image based upon it.

The container can then be built and started with

./gradlew startDemo

which will launch a docker container based on the previously built image.

Setup Azure resources

A working connector instance will use several resources on Azure, all of which can be easily deployed using a so-called "genesis script" located at ./scripts/genesis.sh. Most Azure resources are grouped together in "resource groups", which makes management quite easy. The Genesis Script takes care of provisioning the most essential resources and puts them in a resource group:

  • KeyVault
  • Blob Store Account
  • an AKS cluster
  • two app registrations / service principals

App registrations are used to authenticate apps against Azure AD and are secured with certificates, so before provisioning any resources, the Genesis Script will generate a certificate (interactively) and upload it to the app registrations.

The script requires the Azure CLI (az) and jq (a JSON processor), and you need to be logged in to the Azure CLI. Once that's done, simply cd scripts and invoke

./genesis.sh <optional-prefix>

The <optional-prefix> is not required, but if specified it should be a string without special characters, as it is used as a suffix in the names of the Azure resources. If omitted, the current POSIX timestamp is used.

After completing its work, which can take more than 10 minutes, the script can automatically clean up the resources; observe the CLI output.

Note that the Genesis Script does not deploy any applications such as Nifi; this is handled in a second stage!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

data-appliance-gx's People

Contributors

ivan-shaporov, jimmarino, microsoftopensource, paullatzelsperger, scogromsft


data-appliance-gx's Issues

Create a "Schema" for data addresses

Description

Atlas and NiFi should share a common understanding of how DataAddress objects translate to Nifi TransferEndpoints.
With that, the following things become possible:

  • Atlas can create custom type definitions based on the properties, their types and whether they are required or not
  • The Atlas Data Seeder extension can generate example entities and validate against the schema
  • NiFi can validate incoming DataAddress objects against the schema
  • Nifi can have one generic NifiTransferEndpoint and an associated ...Converter that transforms a DataAddress into a NifiTransferEndpoint

Global schema properties

  • every schema must have a keyName and a type field
  • the schema should be registered in a SchemaRegistry, which will likely get instantiated in a separate SchemaExtension (see the sketch below)
  • JSON Schema should be used as the underlying concept
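
The list above can be illustrated with a minimal, hypothetical sketch; a real SchemaRegistry would likely hold proper JSON Schema documents rather than plain lists of required property names, and all method names here are assumptions.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a schema is identified by its "type" and lists its required properties.
// A real implementation would validate against a JSON Schema document instead.
public class SchemaRegistry {

    private final Map<String, List<String>> requiredPropertiesByType = new ConcurrentHashMap<>();

    public void register(String type, List<String> requiredProperties) {
        requiredPropertiesByType.put(type, List.copyOf(requiredProperties));
    }

    // Returns true if a DataAddress-style property map satisfies the schema registered for the type.
    public boolean isValid(String type, Map<String, String> properties) {
        var required = requiredPropertiesByType.get(type);
        return required != null && properties.keySet().containsAll(required);
    }
}

A SchemaExtension could register e.g. an AzureBlobStore schema at startup, and NiFi-facing code could call isValid before converting a DataAddress into a NifiTransferEndpoint.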

Upload Nifi flow templates via REST API after Terraform deployment

After everything is deployed to Azure, we need to deploy the default TwoClouds.xml flow template and start the root process group. A provider for Terraform can be found here.

Currently the main problem is how to authenticate against Nifi with OpenID - that requires user interaction.

We probably need input from MSFT's experts on Nifi on how to best authenticate a machine client to Nifi.

Note: since we're invoking terraform from a command line where we need to be logged in to az cli, we might be able to salvage those credentials?

Create custom S3-Processor

In order to be able to deal with temporary credentials and STS tokens, a custom processor must be created for Nifi.
We must also devise a way of either automatically supplying that new processor to Nifi (e.g. in I-tests) or modifying the existing official Nifi S3 Processors upstream.

Investigate Polymorphic Data Model in Atlas for Storing Providers

GaiaX Types in Atlas currently have a flat schema for storing information about cloud file locations. The properties will be different among Azure, AWS, and GCP, so investigate whether we can have a polymorphic structure so that we only store the properties that make sense for the cloud provider type (i.e. avoid needing attributes for every possible storage provider as optional properties on each entity, like AzureBlobAccount, AwsS3BlobAccount, GcpBlobAccount, etc., for better enforcement of mandatory properties).

`AtlasApiImpl`: check if custom type exists, then create or update

In the current implementation the AtlasApiImpl will always create a custom type and fail with an AtlasServiceException when the custom type already exists, which makes it a bit cumbersome in tests etc.

It would be better to check whether a custom type with the given name already exists and then update or create it accordingly.
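
A sketch of that check-then-create-or-update flow; the AtlasTypeApi interface and its method names below are hypothetical stand-ins for whatever AtlasApiImpl actually calls on the Atlas client.

// Hypothetical sketch only; the real Atlas client calls and type-definition classes will differ.
public final class CreateOrUpdateTypeDef {

    interface AtlasTypeApi {
        boolean typeDefExists(String typeName);
        void createTypeDef(String typeName, String definitionJson);
        void updateTypeDef(String typeName, String definitionJson);
    }

    // Check first, then update the existing type or create a new one, instead of always
    // creating and failing with an AtlasServiceException when the type already exists.
    static void createOrUpdate(AtlasTypeApi api, String typeName, String definitionJson) {
        if (api.typeDefExists(typeName)) {
            api.updateTypeDef(typeName, definitionJson);
        } else {
            api.createTypeDef(typeName, definitionJson);
        }
    }
}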

Terraform: Add ServicePrincipals for two connectors

In order for the fully-fledged end-to-end test to work, there must be two Azure AD service principals, one for the requesting connector (="client") and one for the providing connector (="provider").

The SPs must be configured such that the client can authenticate with the provider using OAuth2.

The creation of these SPs should be done in Terraform.

Create a better way to run integration tests only on CI

Currently we have integration tests that are annotated with @EnabledIfEnvironmentVariable(named = "CI", matches = "true"), and then we need to evaluate the CI env var again in the @BeforeAll method because that one would still get called:

@BeforeAll
public static void setup() {
    // this is necessary because the @EnabledIf... annotation does not prevent @BeforeAll from being called
    var isCi = propOrEnv("CI", "false");
    if (!Boolean.parseBoolean(isCi)) return;
}

Maybe this can be achieved with custom composed annotations (cf. the JUnit docs).
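
One possible shape for such a composed annotation, assuming a hypothetical @IntegrationTest name; it combines the environment guard with a JUnit 5 tag so that CI-only tests can also be selected by tag.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable;

// Hypothetical composed annotation: applying it to a test class enables the class only on CI
// and tags it as an integration test.
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("integration")
@EnabledIfEnvironmentVariable(named = "CI", matches = "true")
public @interface IntegrationTest {
}

Test classes would then carry @IntegrationTest instead of repeating the @EnabledIfEnvironmentVariable condition; whether the @BeforeAll guard can be dropped entirely would still need to be verified.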

Publish code base to Maven

Our code should be available as a Maven/Gradle dependency. As a first step we'll use GitHub Packages for that, but it's not unlikely that we'll eventually deploy to Maven Central as well.

Centralize instantiation of `OkHttpClient`

In order to be able to control several features like timeouts and potentially retry policies at a central location, the instantiation of the OkHttpClient should be centralized.
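
A minimal sketch of such a central factory; the class name and the timeout values are illustrative, and retry policies or interceptors could later be added in the same place.

import java.util.concurrent.TimeUnit;

import okhttp3.OkHttpClient;

// Hypothetical factory: the single place where timeouts (and later retry/interceptor policies)
// are configured for all OkHttpClient instances used by the runtime.
public final class HttpClientFactory {

    private HttpClientFactory() {
    }

    public static OkHttpClient create() {
        return new OkHttpClient.Builder()
                .connectTimeout(30, TimeUnit.SECONDS)   // illustrative values
                .readTimeout(30, TimeUnit.SECONDS)
                .writeTimeout(30, TimeUnit.SECONDS)
                .build();
    }
}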

Dockerize & Setup Build System

The DA-GX Runtime should be containerized with Docker. For that, all command-line parameters for the runtime should be accessible through Docker environment variables.

The build pipeline should perform the following tasks:

  • compile code
  • run tests
  • build Docker image, tag with build number (e.g. microsoft/dagx-runtime:0.0.1-12345)
  • upload docker image to Azure Container Registry

Note: for now we'll use GitHub Actions for the CI/CD

Poll for transfer completion

As specified in issue #61, a special file is created after a transfer process has completed. Consequently, the client (i.e. the requesting connector) should poll for that file. Once it is there, it can free any resources (e.g. deprovision the bucket, issue #63).
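
A rough sketch of that polling loop on the consumer side; the markerExists callback is a hypothetical stand-in for the actual blob/S3 lookup of the {DataRequest#id}.complete file.

import java.time.Duration;
import java.util.function.BooleanSupplier;

// Sketch only: polls until the completion marker appears at the destination or the timeout elapses.
public final class CompletionPoller {

    public static boolean waitForCompletion(BooleanSupplier markerExists, Duration timeout, Duration interval)
            throws InterruptedException {
        var deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (markerExists.getAsBoolean()) {
                return true; // transfer finished; the caller can now deprovision resources
            }
            Thread.sleep(interval.toMillis());
        }
        return false; // timed out without seeing the marker
    }
}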

Let the `Monitor` intercept `log4j`

The Apache Atlas client library comes with log4j (unfortunately), so we need to write a logging interceptor for it. The same thing has already been done for SLF4J (cf. MonitorProvider).

Implement Azure Key Vault

  • Research how to connect to Azure Key Vault and what information is necessary to connect.
  • Implement a new VaultExtension that connects to Azure Key Vault (see the sketch below).
  • Devise a strategy for how that information is supplied to the newly created extension.
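
A minimal sketch of how the new extension could obtain a SecretClient, using the azure-security-keyvault-secrets and azure-identity SDKs; the factory class is hypothetical, and the vault URL and credentials would come from whatever configuration strategy is chosen (a client certificate, as used by the Genesis Script, would work equally well).

import com.azure.identity.ClientSecretCredentialBuilder;
import com.azure.security.keyvault.secrets.SecretClient;
import com.azure.security.keyvault.secrets.SecretClientBuilder;

// Hypothetical helper; the real VaultExtension would wrap the SecretClient behind the Vault interface.
public final class AzureVaultClientFactory {

    public static SecretClient createSecretClient(String vaultUrl, String tenantId,
                                                  String clientId, String clientSecret) {
        var credential = new ClientSecretCredentialBuilder()
                .tenantId(tenantId)
                .clientId(clientId)
                .clientSecret(clientSecret)
                .build();
        return new SecretClientBuilder()
                .vaultUrl(vaultUrl)
                .credential(credential)
                .buildClient();
    }
}

Secrets could then be resolved with secretClient.getSecret(name).getValue().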

Terraform: let the KeyVault manage the storage account

The StorageAccount in Azure should be configured such that it is managed by the KeyVault and thus the KeyVault handles:

  • Key regeneration
  • obtaining Keys
  • obtaining SAS tokens

Ideally Terraform has providers for this; otherwise we must execute a shell script.

Benefit: we do not need to store StorageAccount credentials in code anymore; we can simply ask the KeyVault to generate a SAS token on demand.

Use this link as starting point.
Or this discussion: https://docs.microsoft.com/en-us/answers/questions/144549/generating-sastokens-for-files-inside-my-blob-stor.html

Add Installation Support for Single Cluster Deployment

The deployment docs currently have each component deployed to its own K8S cluster. We want to support the scenario of all three components deployed to a single cluster.

This issue tracks the work to update the docs and add any artifacts (Helm charts?) needed to get that done, e.g. setting up K8S services.

Write an extension for the `Monitor`

Currently the Monitor is instantiated in a hard-coded fashion in DagxRuntime.java. This should be changed so that it is provided by its own ServiceExtension utilizing the service locator pattern and the ServiceLoader mechanism; a sketch of the selection logic follows the requirements below.

Requirements:

  • If there is no monitor configured, use the default ConsoleMonitor
  • if there is one configured, use that
  • if several implementations of Monitor are found, use a MultiplexingMonitor that wraps all monitors and distributes log messages.
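
A sketch of that selection logic; the Monitor, ConsoleMonitor and MultiplexingMonitor types below are simplified stand-ins for the project's real classes, and the actual extension would plug this into the ServiceExtension lifecycle.

import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public final class MonitorLoader {

    // Simplified stand-in for the project's Monitor interface.
    public interface Monitor {
        void info(String message);
    }

    public static class ConsoleMonitor implements Monitor {
        public void info(String message) {
            System.out.println(message);
        }
    }

    // Wraps several monitors and distributes log messages to all of them.
    public static class MultiplexingMonitor implements Monitor {
        private final List<Monitor> delegates;

        public MultiplexingMonitor(List<Monitor> delegates) {
            this.delegates = delegates;
        }

        public void info(String message) {
            delegates.forEach(m -> m.info(message));
        }
    }

    // No monitor configured -> ConsoleMonitor; exactly one -> use it; several -> multiplex.
    public static Monitor loadMonitor() {
        List<Monitor> monitors = new ArrayList<>();
        ServiceLoader.load(Monitor.class).forEach(monitors::add);
        if (monitors.isEmpty()) {
            return new ConsoleMonitor();
        }
        if (monitors.size() == 1) {
            return monitors.get(0);
        }
        return new MultiplexingMonitor(monitors);
    }
}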

Restructure the build

The following modifications should be made:

  • provide "BOM"s (bill-of-material) for various modules, such as IDS, so that customers only need a dependency onto the BOM and not all its sub-modules
  • create distributions, that essentially are combinations of various BOMs, for example the "Demo" distribution consists of the IDS-BOM, the FS-BOM (vault, config, etc), whereas the "Azure" distribution might consist of different BOMs
  • We'll use a Docker Gradle Plugin (https://github.com/bmuschko/gradle-docker-plugin) to build and publish docker images for the distributions we ship, which ATM are Demo and Azure

Add Capability to Generate SAS from Account Keys in Data Appliance Code

Atlas currently stores a SAS used for accessing a cloud file. This should be changed to be a KeyVault reference to an account key. Once that change is made, the Data Appliance code needs to generate a short-lived SAS (5mins?) to pass to Nifi for executing the transfer while providing a small attack surface, in the event that the SAS is compromised.
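
A sketch of generating such a short-lived, read-only SAS with the azure-storage-blob 12.x SDK, assuming the account key has already been resolved via its KeyVault reference; the class and parameter names are illustrative.

import java.time.OffsetDateTime;

import com.azure.storage.blob.BlobClientBuilder;
import com.azure.storage.blob.sas.BlobSasPermission;
import com.azure.storage.blob.sas.BlobServiceSasSignatureValues;
import com.azure.storage.common.StorageSharedKeyCredential;

// Illustrative sketch: creates a read-only SAS for a single blob, valid for five minutes.
public final class ShortLivedSasGenerator {

    public static String generateReadSas(String accountName, String accountKey,
                                         String containerName, String blobName) {
        var blobClient = new BlobClientBuilder()
                .endpoint("https://" + accountName + ".blob.core.windows.net")
                .credential(new StorageSharedKeyCredential(accountName, accountKey))
                .containerName(containerName)
                .blobName(blobName)
                .buildClient();

        var permissions = new BlobSasPermission().setReadPermission(true);
        var sasValues = new BlobServiceSasSignatureValues(OffsetDateTime.now().plusMinutes(5), permissions);
        return blobClient.generateSas(sasValues); // returned token is appended to the blob URL for Nifi
    }
}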

Nifi: add completion marker after the transfer has completed

A process step should be added to the Nifi template that puts a special file next to the data destination to indicate that the copy process has finished.
The name of the file should be {DataRequest#id}.complete.

An integration test should also be added.

Provide NiFi in a docker container for testing

In order to accelerate integration testing we should provide a docker image that spins up Apache Nifi with a sample flow already deployed, so we can test e.g. the NifiDataFlowController against it.
The idea would be to spin up the docker container before the test, run test cases against it, and destroy it afterwards.

End2End Data Seeding

This issue assumes that data is going to be copied from Azure Blob Store to an S3 Bucket.

In order for one full end-to-end data transfer to work, the following components must receive seed data:

  • Apache Atlas Type Defs: needs custom type defs for AzureBlobStore and S3Bucket as well as a Policy and a relation type
  • Apache Atlas Entities: needs an Entity for AzureBlobStore and S3Bucket as well as an example policy
  • Nifi: the (updated, #61) flow template must be uploaded (Warning: that means disabling OIDC!)
  • Example asset: an example file should be put into Azure Blob Store, ideally using Terraform
  • Azure Vault: Access Keys or SAS tokens for the data source, i.e. Azure Blob Store, must be stored ahead of time.

Create In-Memory Data Catalog

An in-memory catalog should be implemented to be used in lieu of Apache Atlas. We can either re-use the GenericDataCatalog or explicitly create a new one.
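
A minimal sketch of what such a catalog could look like if created from scratch; entries are simplified to plain property maps here, whereas a real implementation would mirror the existing catalog interface.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative in-memory stand-in for the Atlas-backed catalog.
public class InMemoryDataCatalog {

    private final Map<String, Map<String, String>> entries = new ConcurrentHashMap<>();

    public void register(String assetId, Map<String, String> properties) {
        entries.put(assetId, Map.copyOf(properties));
    }

    public Optional<Map<String, String>> resolve(String assetId) {
        return Optional.ofNullable(entries.get(assetId));
    }
}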

Fork or Contribute to NiFi Helm Chart

We have a lot of customization in the Gaia-X deployment of Nifi that is derived from an existing GitHub project. We should either contribute our changes to the GH project or fork that repo to require less customization (which could break if the underlying GH repo changes).

Expand docker build for different configurations

Currently the docker build process is using the fs security profile by default, but there already are (and will be) others, so the docker build needs to be parameterized.
Options include:

  • with no vault
  • with fs vault
  • with remote debugging enabled
  • with Azure Key Vault (depends on #2)

The easiest way to do it is probably to introduce a shell script.

As a connector implementor I want a "Genesis Script" to provision basic Azure infrastructure

The script should do the following:

  • Create a service principal (="primary identity")

    • create a certificate
    • provision an Azure AD App Registration
    • upload the certificate's public key to that App Registration
    • pull down TenantId and ClientId
  • provision a KeyVault

    • create a key vault in Azure with the RBAC permission model
    • assign the Key Vault Secrets Officer role to the "primary identity" SP
  • provision other resources

    • AKS: for Nifi and Atlas (Note: actually deploying Nifi and Atlas comes later!)
    • another service principal. This is needed by Nifi; refer to Nifi's installation guide for details.
    • a Blob Storage account (e.g. for file transfer)
    • a CosmosDB (for relational data)
  • Populate the KeyVault: the credentials of all services that require authentication, such as the Storage Account key, should be stored in the KeyVault

  • The script should output the ClientId, TenantId and file path of the certificate

Open Questions:

  • Is it okay to implement this as a Bash script (i.e. *nix only)?

Nota bene:

  • This could be a bash script using bare Azure CLI or ARM templates.
  • No applications get deployed, so no helm install or anything of this sort. This script should serve as a simple bootstrapper for Azure services.
  • Deploying Nifi, Atlas, etc. will be done by a separate script
  • Providing seed data (such as e.g. Atlas datatypes) will be done by a separate script

Add Pagination to Atlas Queries

If we have large catalogs, we should use a pagination scheme so that Atlas does not return thousands of datasets to the data appliance in a single response.

Implement deprovisioning of resources

After a data transfer is complete, there will be a completion event (cf. issue #62) at which point we need to free resources, e.g. delete Amazon S3 buckets or Azure Blob Storage containers, remove temporary roles and policies, etc.
For that, the concept of a "provisioning pipeline" should be re-used, so there will likely be an S3DeprovisioningPipeline and an AzureDeprovisioningPipeline.

Most likely there will be resources that are cleaned up by the providing connector ("producer"), and some that are cleaned up by the requesting connector ("consumer").

There will be an external signal necessary to actually trigger the deprovisioning.
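
A rough sketch of how that external signal could be dispatched to provider-specific pipelines; apart from the two pipeline names mentioned above, every name below is hypothetical, and the real pipelines would call the AWS and Azure SDKs to delete buckets, containers, roles, and policies.

import java.util.Map;

public final class DeprovisioningDispatcher {

    // Stand-in for S3DeprovisioningPipeline / AzureDeprovisioningPipeline implementations.
    public interface DeprovisioningPipeline {
        void deprovision(String transferProcessId);
    }

    private final Map<String, DeprovisioningPipeline> pipelinesByDestinationType;

    public DeprovisioningDispatcher(Map<String, DeprovisioningPipeline> pipelinesByDestinationType) {
        this.pipelinesByDestinationType = pipelinesByDestinationType;
    }

    // Invoked by the external completion signal (cf. issue #62).
    public void onTransferCompleted(String transferProcessId, String destinationType) {
        var pipeline = pipelinesByDestinationType.get(destinationType);
        if (pipeline != null) {
            pipeline.deprovision(transferProcessId);
        }
    }
}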
