ravendb-datahub-source

RavenDB ingestion source for DataHub

How to use this package

Clone this package to your machine.

Install the pip package in your working environment (DataHub recommends using Python virtual environments). Make sure the installation happens in the same environment where you run the DataHub CLI to ingest metadata.

For the pip installation, use the following command from the parent directory:

pip install ./ravendb-datahub-source

You can verify the installation with the following command:

pip show ravendb_datahub_source     
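
To additionally confirm that the source class itself resolves (a minimal, optional sanity check; the module path is the one used in the recipes below), you can try importing it:

python -c "from ravendb_datahub_source.ravendb_source import RavenDBSource"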

You can now reference the ingestion source class as the type in a YAML recipe by using its fully qualified package name:

source:
  type: ravendb_datahub_source.ravendb_source.RavenDBSource
  config:
    # place for your custom config defined in the configModel

How to configure a RavenDB Source

The RavenDB source can be configured within the Ingestion Job deployed in the Business Units, in a programmatic pipeline, or in a YAML configuration file (see Ingestion Job).

The source recipe accepts the following attributes:

  • connect_uri: RavenDB connection URI.
  • database_pattern: Regex patterns for databases to filter in ingestion (AllowDenyPattern, see below). Default: Allow all databases to be read.
  • collection_pattern: Regex patterns for collections to filter in ingestion. Specify a regex that matches the entire collection name in allowed databases. Default: Allow all collections to be read.
  • enable_schema_inference: Whether to infer a schema from the schemaless document database. Default: True.
  • schema_sampling_size (Optional): Number of documents to sample when inferring the schema. If set to 0, all documents will be scanned. Default: 1000
  • remove_metadata_from_schema: Whether to remove the @metadata field from the schema. Default: True
  • max_schema_size (Optional): Maximum number of fields to include in the schema. If the schema is downsampled, a report warning will appear and the "schema.downsampled" dataset property will be set to "True". Default: 300
  • env: Environment to use in the namespace when constructing URNs. Default: "PROD"
  • certificate_file_path (Optional): Path to the RavenDB client certificate file (see the secured-connection sketch after this list)
  • trust_store_file_path (Optional): Path to the trust store file
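
For a secured RavenDB server, the certificate options are passed alongside an https connection URI. A minimal sketch in Python; the paths ./certs/client.pem and ./certs/ca.pem are hypothetical placeholders, not files shipped with this package:

source_config = {
    "type": "ravendb_datahub_source.ravendb_source.RavenDBSource",
    "config": {
        # secured servers are reached over https
        "connect_uri": "https://localhost:8080",
        # hypothetical paths; point these at your actual certificate files
        "certificate_file_path": "./certs/client.pem",
        "trust_store_file_path": "./certs/ca.pem",
    },
}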

The AllowDenyPattern has the following structure: {'allow': ['.*'], 'deny': [], 'ignoreCase': True}

For RavenDB, it would probably make sense to ignore the @hilo collection, as well as the @empty collection. For this, you can simply apply the regex pattern as found in the sample programmatic pipeline and YAML below.
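
As a quick illustration of such a pattern (a minimal sketch, assuming DataHub's standard AllowDenyPattern class from datahub.configuration.common), a deny entry of "@.*" filters out the internal collections while everything else passes:

from datahub.configuration.common import AllowDenyPattern

# allow every collection except those whose names start with "@"
pattern = AllowDenyPattern(allow=[".*"], deny=["@.*"], ignoreCase=True)

print(pattern.allowed("Orders"))  # True
print(pattern.allowed("@hilo"))   # False
print(pattern.allowed("@empty"))  # False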

Here is an example of a programmatic pipeline in Python:

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "run_id": "ravendb-test",
        "source": {
            "type": "ravendb_datahub_source.ravendb_source.RavenDBSource",
            "config": {
                "connect_uri": "http://localhost:8080",
                # ignore internal collections such as @hilo and @empty
                "collection_pattern": {
                    "allow": [".*"],
                    "deny": ["@.*"],
                    "ignoreCase": True,
                },
                "schema_sampling_size": 200,
            },
        },
        # your sink configuration
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://datahub-datahub-gms.datahub.svc.cluster.local:8080"
            },
        },
    }
)

# run the pipeline
pipeline.run()
# print the source report
pipeline.pretty_print_summary()

The corresponding starter recipe in YAML format would look like this:

source:
  type: ravendb_datahub_source.ravendb_source.RavenDBSource
  config:
    connect_uri: http://localhost:8080
    collection_pattern:
      allow: 
        - ".*"
      deny: 
        - "@.*"
      ignoreCase: True
    schema_sampling_size: 200

sink:
  # your sink configuration
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms.datahub.svc.cluster.local:8080

Running this recipe is as simple as:

datahub ingest -c ravendb_recipe.yaml

Attention:

The default port of the DataHub GMS (Generalized Metadata Service) is 8080, the same as the default RavenDB port. If you're running both on the same machine, this can cause conflicts. To avoid that, either run one of the two services on a different port or host them on separate machines.
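
For example (a sketch assuming the official ravendb/ravendb Docker image; add whatever setup flags your RavenDB version requires), you can map RavenDB to a different host port and point the recipe at it:

docker run -d -p 8081:8080 ravendb/ravendb

Then set connect_uri: http://localhost:8081 in your recipe.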

How to run the tests

You have to build the Docker image from RavendbTest.Dockerfile and run it from the parent directory using the following commands:

docker build -f ./ravendb-datahub-source/RavendbTest.Dockerfile . -t test-ravendb
docker run -it -v /var/run/docker.sock:/var/run/docker.sock test-ravendb
