
cicd-templates's Introduction

[DEPRECATED] Databricks Labs CI/CD Templates

This repository provides a template for automated Databricks CI/CD pipeline creation and deployment.

NOTE: This repository is deprecated and provided for maintenance purposes only. Please use the dbx init functionality instead.

Sample project structure (with GitHub Actions)

.
├── .dbx
│   └── project.json
├── .github
│   └── workflows
│       ├── onpush.yml
│       └── onrelease.yml
├── .gitignore
├── README.md
├── conf
│   ├── deployment.json
│   └── test
│       └── sample.json
├── pytest.ini
├── sample_project
│   ├── __init__.py
│   ├── common.py
│   └── jobs
│       ├── __init__.py
│       └── sample
│           ├── __init__.py
│           └── entrypoint.py
├── setup.py
├── tests
│   ├── integration
│   │   └── sample_test.py
│   └── unit
│       └── sample_test.py
└── unit-requirements.txt

Some explanations regarding structure:

  • .dbx - an auxiliary folder where metadata about environments and execution context is stored.
  • sample_project - Python package with your code (the directory name will follow your project name)
  • tests - directory with your package tests
  • conf/deployment.json - deployment configuration file. Please read the following section for a full reference.
  • .github/workflows/ - workflow definitions for GitHub Actions

Sample project structure (with Azure DevOps)

.
├── .dbx
│   └── project.json
├── .gitignore
├── README.md
├── azure-pipelines.yml
├── conf
│   ├── deployment.json
│   └── test
│       └── sample.json
├── pytest.ini
├── sample_project_azure_dev_ops
│   ├── __init__.py
│   ├── common.py
│   └── jobs
│       ├── __init__.py
│       └── sample
│           ├── __init__.py
│           └── entrypoint.py
├── setup.py
├── tests
│   ├── integration
│   │   └── sample_test.py
│   └── unit
│       └── sample_test.py
└── unit-requirements.txt

Some explanations regarding structure:

  • .dbx - an auxiliary folder where metadata about environments and execution context is stored.
  • sample_project_azure_dev_ops - Python package with your code (the directory name will follow your project name)
  • tests - directory with your package tests
  • conf/deployment.json - deployment configuration file. Please read the following section for a full reference.
  • azure-pipelines.yml - Azure DevOps Pipelines workflow definition

Sample project structure (with GitLab)

.
├── .dbx
│   └── project.json
├── .gitignore
├── README.md
├── .gitlab-ci.yml
├── conf
│   ├── deployment.json
│   └── test
│       └── sample.json
├── pytest.ini
├── sample_project_gitlab
│   ├── __init__.py
│   ├── common.py
│   └── jobs
│       ├── __init__.py
│       └── sample
│           ├── __init__.py
│           └── entrypoint.py
├── setup.py
├── tests
│   ├── integration
│   │   └── sample_test.py
│   └── unit
│       └── sample_test.py
└── unit-requirements.txt

Some explanations regarding structure:

  • .dbx - an auxiliary folder where metadata about environments and execution context is stored.
  • sample_project_gitlab - Python package with your code (the directory name will follow your project name)
  • tests - directory with your package tests
  • conf/deployment.json - deployment configuration file. Please read the following section for a full reference.
  • .gitlab-ci.yml - GitLab CI/CD workflow definition

Note on dbx

NOTE:
dbx is a CLI tool for advanced Databricks jobs management. It can be used separately from cicd-templates, and if you would like to preserve your project structure, please refer to the dbx documentation on how to use it with a customized project structure.

Quickstart

NOTE:
As a prerequisite, you need to install databricks-cli with a configured profile. This guide is based on Databricks Runtime 7.3 LTS ML. Even if you don't need ML libraries, we still recommend the ML-based runtime because of its %pip magic support.

Local steps

Perform the following actions in your development environment:

  • Create new conda environment and activate it:
conda create -n <your-environment-name> python=3.7.5
conda activate <your-environment-name>
  • If you would like to run local unit tests, you'll need a JDK. If you don't have one, it can be installed via:
conda install -c anaconda "openjdk=8.0.152"
  • Install cookiecutter and path:
pip install cookiecutter path
  • Create new project using cookiecutter template:
cookiecutter https://github.com/databrickslabs/cicd-templates
  • Install development dependencies:
pip install -r unit-requirements.txt
  • Install generated package in development mode:
pip install -e .
  • In the generated directory you'll have a sample job with testing and launch configurations around it.
  • Launch and debug your code on an interactive cluster via the following command. The job name can be found in conf/deployment.json:
dbx execute --cluster-name=<my-cluster> --job=<job-name>
  • Make your first deployment from the local machine:
dbx deploy
  • Launch your first pipeline as a new separate job, and trace the job status. The job name can be found in conf/deployment.json:
dbx launch --job <your-job-name> --trace
  • For in-depth local development and unit testing guidance, please refer to the generated README.md in the root of the project.

Setting up CI/CD pipeline on GitHub Actions

  • Create a new repository on GitHub
  • Configure DATABRICKS_HOST and DATABRICKS_TOKEN secrets for your project in the GitHub UI (a hedged sketch of how these secrets are consumed in a workflow follows this list)
  • Add a remote origin to the local repo
  • Push the code
  • Open the GitHub Actions for your project to verify the state of the deployment pipeline
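
For orientation, below is a minimal sketch of how those secrets are typically wired into a workflow. This is not the generated onpush.yml — the job name, runner image, step names, and the <project-name> placeholder are illustrative assumptions, so compare against the file the template actually generates:

# Hypothetical workflow excerpt; the generated .github/workflows/onpush.yml may differ.
name: CI pipeline

on:
  push:

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  ci-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.7'
      - name: Install dependencies
        run: |
          pip install -r unit-requirements.txt
          pip install -e .
      - name: Deploy integration test
        run: dbx deploy --jobs=<project-name>-sample-integration-test --files-only
      - name: Run integration test
        run: dbx launch --job=<project-name>-sample-integration-test --as-run-submit --trace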

Setting up CI/CD pipeline on Azure DevOps

  • Create a new repository on GitHub
  • Connect the repository to Azure DevOps
  • Configure DATABRICKS_HOST and DATABRICKS_TOKEN secrets for your project in Azure DevOps. Note that secret variables must be mapped to env as mentioned here, using the env: syntax, for example:
variables:
- group: Databricks-environment
stages:
...
...
    - script: |
        dbx deploy
      env:
        DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
  • Add a remote origin to the local repo
  • Push the code
  • Open the Azure DevOps UI to check the deployment status

Setting up CI/CD pipeline on GitLab

  • Create a new repository on GitLab
  • Configure DATABRICKS_HOST and DATABRICKS_TOKEN as CI/CD variables for your project in the GitLab UI (a hedged sketch of how they are consumed in the pipeline follows this list)
  • Add a remote origin to the local repo
  • Push the code
  • Open the GitLab CI/CD UI to check the deployment status
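
For orientation, below is a minimal sketch of how those variables are typically consumed. This is not the generated .gitlab-ci.yml — the stage name, job name, and the <project-name> placeholder are illustrative assumptions; variables configured in the GitLab UI are exposed to jobs as environment variables, which dbx picks up:

# Hypothetical pipeline excerpt; the generated .gitlab-ci.yml may differ.
image: python:3.7

stages:
  - onpush

integration-test:
  stage: onpush
  script:
    # DATABRICKS_HOST and DATABRICKS_TOKEN come from the CI/CD variables configured in the GitLab UI
    - pip install -r unit-requirements.txt
    - pip install -e .
    - dbx deploy --jobs=<project-name>-sample-integration-test --files-only
    - dbx launch --job=<project-name>-sample-integration-test --as-run-submit --trace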

Deployment file structure

A sample deployment file can be found in a generated project.

The general file structure looks like this:

{
    "<environment-name>": {
        "jobs": [
            {
                "name": "sample_project-sample",
                "existing_cluster_id": "some-cluster-id", 
                "libraries": [],
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "sample_project/jobs/sample/entrypoint.py",
                    "parameters": [
                        "--conf-file",
                        "conf/test/sample.json"
                    ]
                }
            }
        ]
    }
}

For each environment you can describe any number of jobs. The job description should follow the Databricks Jobs API.

However, dbx deploy adds some advanced behaviour on top of it.

When you run dbx deploy with a given deployment file (by default it takes the deployment file from conf/deployment.json), the following actions will be performed:

  • Find the deployment configuration in --deployment-file (default: conf/deployment.json)
  • Build a .whl package in the given project directory (can be disabled via the --no-rebuild option)
  • Add this .whl package to the job definition
  • Add all requirements from --requirements-file (default: requirements.txt). This step is skipped if the requirements file doesn't exist.
  • Create a new job, or adjust the existing job if the given job name already exists. Jobs are matched by name.

An important point about referencing: you can also reference arbitrary local files. This is very handy for the python_file section. In the example above, the entrypoint file and the job configuration will be added to the job definition and uploaded to DBFS automatically; no explicit file upload is needed.
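
To make the effect of this rewriting more concrete, here is a hedged illustration of how the job definition above might look after dbx deploy has uploaded the referenced files. The DBFS paths are invented placeholders — the real paths depend on your artifact location and deployment id:

{
    "name": "sample_project-sample",
    "existing_cluster_id": "some-cluster-id",
    "libraries": [
        {"whl": "dbfs:/<artifact-location>/<deployment-id>/artifacts/dist/sample_project-0.0.1-py3-none-any.whl"}
    ],
    "max_retries": 0,
    "spark_python_task": {
        "python_file": "dbfs:/<artifact-location>/<deployment-id>/artifacts/sample_project/jobs/sample/entrypoint.py",
        "parameters": [
            "--conf-file",
            "dbfs:/<artifact-location>/<deployment-id>/artifacts/conf/test/sample.json"
        ]
    }
}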

Different deployment types

Databricks Jobs API provides two methods for launching a particular workload:

  • Run Submit API
  • Run Now API

The main logical difference between these methods is that the Run Submit API allows you to submit a workload directly without creating a job. Therefore, we have two deployment types - one for the Run Submit API, and one for the Run Now API.

Deployment for Run Submit API

To deploy only the files without overriding the job definitions, do the following:

dbx deploy --files-only

To launch the file-based deployment:

dbx launch --as-run-submit --trace

This type of deployment is handy for working in different branches, and it won't affect the job definition.

Deployment for Run Now API

To deploy files and update the job definitions:

dbx deploy

To launch the deployment:

dbx launch --job=<job-name>

This type of deployment is mainly intended for automated use during a new release. dbx deploy will change the job definition (unless the --files-only option is provided).

Troubleshooting

Q: When running dbx deploy I'm getting the following exception json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) and stack trace:

...
  File ".../lib/python3.7/site-packages/dbx/utils/common.py", line 215, in prepare_environment
    experiment = mlflow.get_experiment_by_name(environment_data["workspace_dir"])
...

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

What could be causing it and what is the potential fix?

A:
We've seen this exception when the profile uses the host=https://{domain}/?o={orgid} format for Azure. That format is valid for the databricks CLI, but not for the API. If that's the cause, removing the "?o={orgid}" suffix should make the problem go away.
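
For illustration, the relevant part of ~/.databrickscfg could look roughly like this (the profile name, workspace domain, and token are placeholders):

[my-azure-profile]
# Works: plain workspace URL without the ?o={orgid} suffix
host = https://adb-1234567890123456.7.azuredatabricks.net
# Would break dbx API calls: host = https://adb-1234567890123456.7.azuredatabricks.net/?o=1234567890123456
token = <personal-access-token>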

FAQ

Q: I'm using poetry for package management. Is it possible to use poetry together with this template?

A:
Yes, it's also possible, but library management during cluster execution should be performed via the libraries section of the job description. You might also need to disable the automatic rebuild for dbx deploy and dbx execute via the --no-rebuild option. Finally, the built package should be in wheel format and located in the /dist/ directory.
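
A possible command sequence, assuming a standard poetry project layout (the build flag is poetry's own, but verify against your poetry version; the dbx flag is described above):

# Build the wheel into ./dist/ with poetry instead of setup.py
poetry build --format wheel

# Deploy without triggering the template's own rebuild step
dbx deploy --no-rebuild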

Q: How can I add my Databricks Notebook to the deployment.json, so I can create a job out of it?

A:
Please follow this documentation section and add a notebook task definition into the deployment file.

Q: Is it possible to use dbx for non-Python based projects, for example Scala-based projects?

A:
Yes, it's possible, but the interactive mode (dbx execute) is not yet supported. However, you can take the dbx wheel into your Scala-based project and reference your jar files in the deployment file, so that the dbx deploy and dbx launch commands are available to you.

Q: I have a lot of interdependent jobs, and using solely JSON seems like a giant code duplication. What could solve this problem?

A:
You can implement any configuration logic and simply write the output into a custom deployment.json file, then pass it via the --deployment-file option. As an example, you can generate your configuration using a Python script or Jsonnet.
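
As a hedged illustration of the Python route (the job names, cluster settings, entrypoint paths, and conf files below are invented for the example):

#!/usr/bin/env python
"""Hypothetical generator for conf/deployment.json to avoid copy-pasting similar job blocks."""
import json

BASE_CLUSTER = {
    "spark_version": "7.3.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
}


def make_job(name, entrypoint, conf_file):
    # Every job shares the same cluster spec and only differs by entrypoint and config file
    return {
        "name": name,
        "new_cluster": BASE_CLUSTER,
        "libraries": [],
        "max_retries": 0,
        "spark_python_task": {
            "python_file": entrypoint,
            "parameters": ["--conf-file", conf_file],
        },
    }


deployment = {
    "default": {
        "jobs": [
            make_job("sample_project-ingest", "sample_project/jobs/ingest/entrypoint.py", "conf/test/ingest.json"),
            make_job("sample_project-transform", "sample_project/jobs/transform/entrypoint.py", "conf/test/transform.json"),
        ]
    }
}

with open("conf/deployment.json", "w") as f:
    json.dump(deployment, f, indent=4)

Running such a script before dbx deploy regenerates conf/deployment.json (assuming the conf directory exists), which dbx then picks up by default or via --deployment-file.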

Q: How can I secure the project environment?

A:
From the state serialization perspective, your code and deployments are stored in two separate locations (a sketch of where these are defined in .dbx/project.json follows this list):

  • workspace directory - this directory lives in your workspace, is described per-environment, and is defined in .dbx/project.json in the workspace_dir field. To control access to this directory, please use Workspace ACLs.
  • artifact location - this location lives in DBFS, is described per-environment, and is defined in .dbx/project.json in the artifact_location field. To control access to this location, please use credential passthrough (docs for ADLS and for S3).
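
For orientation, a .dbx/project.json carrying these two fields might look roughly like the sketch below; the exact schema (including the surrounding keys) is owned by dbx, so treat this as an illustration with placeholder values rather than a reference:

{
    "environments": {
        "default": {
            "profile": "DEFAULT",
            "workspace_dir": "/Shared/dbx/sample_project",
            "artifact_location": "dbfs:/Shared/dbx/projects/sample_project"
        }
    }
}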

Q: I would like to use self-hosted (private) pypi repository. How can I configure my deployment and CI/CD pipeline?

A:
To set up this scenario, a few settings need to be applied:

  • The Databricks driver should have network access to your PyPI repository
  • An additional step to deploy your package to the PyPI repo should be configured in the CI/CD pipeline
  • Package rebuild and generation should be disabled via the --no-rebuild --no-package arguments (see the dbx deploy sample below)
  • The package reference should be configured in the job description

Here is a sample dbx deploy command:

dbx deploy --no-rebuild --no-package

A sample entry for the libraries configuration:

{
    "pypi": {"package": "my-package-name==1.0.0", "repo": "my-repo.com"}
}

Q: What is the purpose of init_adapter method in SampleJob?

A: This method should primarily be used to adapt the configuration for dbx execute-based runs. By using this method, you can provide an initial configuration in case the --conf-file option is not provided.
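
A hedged sketch of such an override, assuming the Job base class and the conf attribute generated by this template (the module path and default values are placeholders; verify the names against your generated common.py):

from sample_project.common import Job


class SampleJob(Job):
    def init_adapter(self):
        # Fallback configuration used when `dbx execute` is run without --conf-file
        if not self.conf:
            self.conf = {
                "output_format": "delta",
                "output_path": "dbfs:/tmp/sample_project/data",
            }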

Q: I don't like the idea of storing the host and token variables in the ~/.databrickscfg file inside the CI pipeline. How can I make this setup more secure?

A:
dbx now supports environment variables provided via DATABRICKS_HOST and DATABRICKS_TOKEN. If these variables are defined in the environment, no ~/.databrickscfg file is needed.

Legal Information

This software is provided as-is and is not officially supported by Databricks through customer technical support channels. Support, questions, and feature requests can be communicated through the Issues page of this repo. Please see the legal agreement and understand that issues with the use of this code will not be answered or investigated by Databricks Support.

Feedback

Issues with the template? Found a bug? Have a great idea for an addition? Feel free to file an issue.

Contributing

Have a great idea that you want to add? Fork the repo and submit a PR!

Kudos

cicd-templates's People

Contributors

alexott, dexianta, koernigo, miguelperalvo, mshtelma, renardeinside, shasidhar, steve148, tabdulghani, tonyqiu2020


cicd-templates's Issues

Error when using dbutils.secrets in Integration tests

Hello,
I would like to write integration tests that need to use one/many Databricks secrets. Unfortunately, for the tests that use dbutils.secrets.get(), the following error message is thrown:

(screenshot of the error message omitted)

This happens when I execute the test when using dbx directly from the command line, as well as when I deploy it as a job on Databricks and launch it from there.

Is there a way that secrets can be accessed when running the tests?

Cannot read the python file dbfs:/Shared/dbx/projects/databricks_pipelines/../artifacts/tests/integration/sample_test.py

I am following the template with Databricks hosted on AWS and running with GitHub Actions, but bumped into the error below.

What could cause an error like this? Cannot read the python file...

Appreciate any help. Thank you!

Run dbx launch --job=databricks-pipelines-sample-integration-test --as-run-submit --trace
  dbx launch --job=databricks-pipelines-sample-integration-test --as-run-submit --trace
  shell: /usr/bin/bash -e {0}
  env:
    DATABRICKS_HOST: ***
    DATABRICKS_TOKEN: ***
    pythonLocation: /opt/hostedtoolcache/Python/3.7.5/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.7.5/x64/lib
[dbx][2021-05-15 09:24:53.764] Launching job databricks-pipelines-sample-integration-test on environment default
[dbx][2021-05-15 09:24:53.765] Using configuration from the environment variables
[dbx][2021-05-15 09:24:55.251] No additional tags provided
[dbx][2021-05-15 09:24:55.254] Successfully found deployment per given job name
[dbx][2021-05-15 09:24:56.436] Launching job via run submit API
[dbx][2021-05-15 09:24:56.943] Run URL: ***#job/2703892/run/1
[dbx][2021-05-15 09:24:56.943] Tracing run with id 3032139
[dbx][2021-05-15 09:25:02.037] [Run Id: 3032139] Current run status info - result state: None, lifecycle state: PENDING, state message: Installing libraries
[dbx][2021-05-15 09:25:07.129] [Run Id: 3032139] Current run status info - result state: None, lifecycle state: PENDING, state message: Installing libraries
[dbx][2021-05-15 09:25:12.227] [Run Id: 3032139] Current run status info - result state: None, lifecycle state: RUNNING, state message: In run
[dbx][2021-05-15 09:25:17.325] [Run Id: 3032139] Current run status info - result state: None, lifecycle state: RUNNING, state message: In run
[dbx][2021-05-15 09:25:22.417] [Run Id: 3032139] Current run status info - result state: None, lifecycle state: RUNNING, state message: In run
[dbx][2021-05-15 09:25:27.509] [Run Id: 3032139] Current run status info - result state: FAILED, lifecycle state: INTERNAL_ERROR, state message: Cannot read the python file dbfs:/Shared/dbx/projects/databricks_pipelines/2dc5616b50a943dc96e014e06174abda/artifacts/tests/integration/sample_test.py. Please check driver logs for more details.
[dbx][2021-05-15 09:25:27.510] Finished tracing run with id 3032139

Where is the source code of the bundled dbx tool?

Hello,

We would like to play with the bundled dbx tool; is there any source available?

For instance, how would you get the "deployed full path in databricks aka dbfs://Shared/Projects/xxxxx/yourlibname" out of the "dbx deploy" command?

Is there a way to pass / export a custom path to the dbx deploy command?

Thank you!

Best regards,
Andrej

Pipeline stuck at "Installing Library" because the library has been installed before on an existing cluster

Environment:

  • Databricks on Azure
  • CICD pipeline on Azure DevOps

I followed the cookiecutter template and successfully ran it the first time. However, on the second CI/CD run, it's stuck at the step Installing libraries.

For this, I'm using an existing cluster id instead of a new cluster for each CI/CD run. When I checked the cluster, I saw that the library was installed (during the 1st run) and now it's trying to install the same library again.

How should we avoid this situation?

[dbx][2021-05-18 16:49:23.342] [Run Id: 12] Current run status info - result state: None, lifecycle state: PENDING, state message: Installing libraries
[dbx][2021-05-18 16:49:28.564] [Run Id: 12] Current run status info - result state: None, lifecycle state: PENDING, state message: Installing libraries
[dbx][2021-05-18 16:49:33.787] [Run Id: 12] Current run status info - result state: None, lifecycle state: PENDING, state message: Installing libraries
[dbx][2021-05-18 16:49:39.012] [Run Id: 12] Current run status info - result state: None, lifecycle state: PENDING, state message: Installing libraries
[dbx][2021-05-18 16:49:44.276] [Run Id: 12] Current run status info - result state: None, lifecycle state: PENDING, state message: Installing libraries

dbx deploy not working for single node clusters

When trying to run dbx deploy on the following job, it fails.

{
                "name": "anabricks_cd-sample",
                "new_cluster": {
                    "spark_version": "7.3.x-scala2.12",
                    "spark_conf": {
                        "spark.databricks.delta.preview.enabled": "true",
                        "spark.master": "local[*]",
                        "spark.databricks.cluster.profile": "singleNode"
                    },
                    "node_type_id": "Standard_DS3_v2",
                    "custom_tags": {
                        "ResourceClass": "SingleNode",
                        "job": "anabricks_sample"
                    },
                    "enable_elastic_disk": true,
                    "init_scripts": [
                        {
                            "dbfs": {
                                "destination": "dbfs:/monitoring/datadog_install_driver_only.sh"
                            }
                        }
                    ],
                    "azure_attributes": {
                        "availability": "ON_DEMAND_AZURE"
                    },
                    "num_workers": 0
                },
                "libraries": [],
                "email_notifications": {
                    "on_start": [],
                    "on_success": [],
                    "on_failure": []
                },
                "max_retries": 0,
                "spark_python_task": {
                    "python_file": "anabricks_cd/jobs/sample/entrypoint.py",
                    "parameters": [
                        "--conf-file",
                        "conf/test/sample.json"
                    ]
                }
            }

The error is

Traceback (most recent call last):
  File "c:\tools\miniconda3\envs\risbricks\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\tools\miniconda3\envs\risbricks\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\tools\miniconda3\envs\risbricks\Scripts\dbx.exe\__main__.py", line 7, in <module>
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\click\core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 94, in deploy
    requirements_payload, package_requirement, _file_uploader)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 191, in _adjust_job_definitions
    _walk_content(adjustment_callback, job)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 244, in _walk_content
    _walk_content(func, item, content, key)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 244, in _walk_content
    _walk_content(func, item, content, key)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 244, in _walk_content
    _walk_content(func, item, content, key)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 249, in _walk_content
    parent[index] = func(content)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 188, in <lambda>
    adjustment_callback = lambda p: _adjust_path(p, artifact_base_uri, file_uploader)
  File "c:\tools\miniconda3\envs\risbricks\lib\site-packages\dbx\commands\deploy.py", line 256, in _adjust_path
    elif pathlib.Path(candidate).exists():
  File "c:\tools\miniconda3\envs\risbricks\lib\pathlib.py", line 1336, in exists
    self.stat()
  File "c:\tools\miniconda3\envs\risbricks\lib\pathlib.py", line 1158, in stat
    return self._accessor.stat(self)
  File "c:\tools\miniconda3\envs\risbricks\lib\pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'local[*]'

The _adjust_path function needs to be changed to skip spark_conf
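
A hedged sketch of the kind of guard meant here, based only on the traceback above (this is not the actual upstream fix, just the shape of it):

import pathlib


def _path_exists_safely(candidate: str) -> bool:
    # spark_conf values such as "local[*]" are not valid Windows paths and make
    # pathlib.Path(...).exists() raise OSError; treat them as non-files instead of crashing.
    try:
        return pathlib.Path(candidate).exists()
    except OSError:
        return False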

If I remove "spark.master": "local[*]", the job will deploy however it won't start since driver will keep throwing this warning

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

File not found when running integration tests on GitHub Actions

By default, in the onpush workflow, integration tests are first deployed as a job on Databricks and then launched as a job.

- name: Deploy integration test
  run: |
    dbx deploy --jobs={project_name}-sample-integration-test
- name: Run integration test
  run: |
    dbx launch --job={project_name}-sample-integration-test --trace

What I would like to do here is to directly execute the integration test on a specific cluster without deploying the test as a job on Databricks and executing the job.

To achieve this, I have removed:

      - name: Deploy integration test
        run: |
          dbx deploy --jobs=<job_name>

      - name: Run integration test
        run: |
          dbx launch --job=<job_name> --trace

and added into the workflow file (onpush.yml):

      - name: Run integration test
        run: |
          dbx execute --cluster-id=<id> --job=<job_name> --requirements-file=unit-requirements.txt

When the GitHub action workflow is executed, it breaks at the Run integration test job, with the error:

FileNotFoundError: [Errno 2] No such file or directory: '.dbx/lock.json'

I understand that this file contains the execution context and hence is in the .gitignore.

As a result, is there a way that an integration test can be run without deploying it as a job on Databricks?

Is `Prepare profile` really a necessary step in the `onpush` workflow?

In the onpush workflow we have the following step:

- name: Prepare profile
  run: |
    echo "[dbx-dev-azure]" >> ~/.databrickscfg
    echo "host = $DBX_AZURE_HOST" >> ~/.databrickscfg
    echo "token = $DBX_AZURE_TOKEN" >> ~/.databrickscfg

If I understand correctly, it is required for the dbx tool. Is it possible to add support for environment variables to dbx, or to pass HOST and TOKEN as arguments?

I think it is important for two reasons:

  1. If dbx can read the environment, then we can remove this step, and it also means we remove one more place where we are working with sensitive data. As far as I know, databricks-cli works correctly with the environment.
  2. Keeping sensitive data in a custom local file could lead to a security issue. Yes, it should be cleaned up after our workflow run, but it looks scary anyway. GitHub recommends passing secrets through Settings or an environment file.

GitLab CI/CD Pipeline fails due to incorrect job name

When using GitLab as the CI/CD tool, the GitLab pipeline fails when deploying Databricks jobs. I believe this is because .gitlab-ci.yml uses {project_name} instead of {{cookiecutter.project_name}} in the deploy and launch commands.

dbx launch --trace gets stuck for an hour

We are trying the Azure DevOps flavour of the CI/CD pipeline.

When we get to this stage to run the integration tests, it hangs for over an hour.

- script: |
    dbx launch --job=cicd_demo_1-sample-integration-test --trace
  displayName: 'Launch integration on test'

Actual behaviour:
We get the error message below repeated for over an hour, until the Azure pipeline times out (the default is 1 hour).

Skipping this run because the limit of 1 maximum concurrent runs has been reached.

Expected behaviour:
If there is a concurrent run of the integration test job, then dbx launch --trace should either fail immediately or have a configurable retry window (retry every minute for 5 minutes) and otherwise exit with an error.

Create and manage persistent clusters

Hello,

I was wondering if it is possible to define persistent clusters within the deployment.json file, side by side with the jobs definition.
With something like the following syntax:

{
    "default": {
        "clusters": [{
            "cluster_name": "test",
            ...
        }],
        "jobs": [{
            "name": "test",
            "existing_cluster_id": "test",
            ...
        }]
    }
}

[DISCUSSION] How should this ideally interact with Databricks "Repos for Git integration"?

Databricks Repos for Git Integration seems like a relatively new way to manage / deploy code in a Databricks environment. I understand that cicd-templates uses dbx which (this part may be wrong) packages up jobs into .whl files to be installed in Databricks environments.

From the "Repos for Git Integration" docs, one part says:

If you are using %run commands to make Python or R functions defined in a notebook available to another notebook, or are installing custom .whl files on a cluster, consider including those custom modules in a Databricks repo.

This seems like a different approach vs. the dbx approach. Just wondering if there are any guidelines here on what's appropriate when, how these approaches overlap / complement each other, and if the future of this project / dbx involves a deeper integration with the "Repos" approach.

How to run jobs on an existing databricks cluster?

In the demo workflow, it seems a new Databricks cluster gets created. How can I make the job run on an existing Databricks cluster? Should I change dev_cicd_pipeline.py in order to specify the cluster id?

Btw, it seems the source code of dev_cicd_pipeline.py is not available, although it is listed in README.md.


Deployment
├── __init__.py
├── deployment.py
├── dev_cicd_pipeline.py
└── release_cicd_pipeline.py

Describe security model for deployment objects

Security is always important, so the following explanations should be given:

  • How to secure the workspace_dir with the underlying mlflow experiment
  • How to secure the artifact_location on the underlying FS (ADLS/S3).

Acceptance criteria:

  • Meaning and location of workspace_dir and artifact_location are described
  • Workspace directory security measures are described
  • Artifact location security measures are described

dbx execute doesn't work on ml cluster

If you attempt to run dbx execute, it won't work on ML clusters.

Since

%pip install -U -r {path}

is only supported on Databricks Runtime and not on Databricks ML clusters.

move documentation to github pages

The documentation stored in README.md has become too big to manage manually.
A better approach would be to use GitHub Pages as the main storage for documentation.

AC:

  • README.md left only for reference purposes and links to the documentation.
  • Documentation is well-formatted and presented on GitHub Pages.

Is it possible to package Structured Streaming Applications via this CI/CD Framework

Hello -

I've recently stumbled upon this framework and so far am really enjoying the workflow put forth by the team. Thank you!

I'm working on a PySpark application that relies on structured streaming to get data out of Kafka, and I'm running into issues getting the pipeline to complete after running a deployment to our release cluster. I have versions of this code working in Databricks notebooks, so I'm trying to decipher what I'm missing in the context of this CI/CD framework.

The dbx deployments via Azure Pipelines and the sample jobs execute as expected - so I can rule out a deployment issue.

I'm aware of the limitation that databricks-connect has regarding structured streaming support, but am wondering if that's getting passed along to the job cluster I'm spinning up for the job I've crafted.

Any wisdom you could impart would be greatly appreciated. Below is the full stack-trace of the standard error from the spark job.

Thanks in advance,

RS

/databricks/python/lib/python3.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated since IPython 4.0. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/databricks/python/lib/python3.7/site-packages/IPython/nbconvert.py:13: ShimWarning: The `IPython.nbconvert` package has been deprecated since IPython 4.0. You should import from nbconvert instead.
  "You should import from nbconvert instead.", ShimWarning)
Thu Feb 25 00:55:55 2021 py4j imported
Thu Feb 25 00:55:55 2021 Python shell started with PID  2490  and guid  1c65e51ed47b4b95a033bef96013d3f8
Thu Feb 25 00:55:55 2021 Initialized gateway on port 46845
Thu Feb 25 00:55:56 2021 py4j imported
Thu Feb 25 00:55:56 2021 Python shell executor start
Dropped logging in PythonShell:

b'/local_disk0/tmp/1614214540907-0/PythonShell.py:1084: DeprecationWarning: The `use_readline` parameter is deprecated and ignored since IPython 6.0.\n  parent=self,\n'
/databricks/python/lib/python3.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated since IPython 4.0. You should import from traitlets.config instead.
  "You should import from traitlets.config instead.", ShimWarning)
/databricks/python/lib/python3.7/site-packages/IPython/nbconvert.py:13: ShimWarning: The `IPython.nbconvert` package has been deprecated since IPython 4.0. You should import from nbconvert instead.
  "You should import from nbconvert instead.", ShimWarning)
Thu Feb 25 00:55:59 2021 py4j imported
Thu Feb 25 00:55:59 2021 Python shell started with PID  2514  and guid  5d2f02b085044e0e81d7961dfacf6ab7
Thu Feb 25 00:55:59 2021 Initialized gateway on port 36749
Thu Feb 25 00:56:00 2021 py4j imported
Thu Feb 25 00:56:00 2021 Python shell executor start
Dropped logging in PythonShell:

b'/local_disk0/tmp/1614214540907-0/PythonShell.py:1084: DeprecationWarning: The `use_readline` parameter is deprecated and ignored since IPython 6.0.\n  parent=self,\n'
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command--1> in <module>
     12 
     13 with open(filename, "rb") as f:
---> 14   exec(f.read())
     15 

<string> in <module>

<string> in launch(self)

<string> in run_topic_ingestion(self, topic, topic_idx)

/databricks/spark/python/pyspark/sql/streaming.py in load(self, path, format, schema, **options)
    418             return self._df(self._jreader.load(path))
    419         else:
--> 420             return self._df(self._jreader.load())
    421 
    422     @since(2.0)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    125     def deco(*a, **kw):
    126         try:
--> 127             return f(*a, **kw)
    128         except py4j.protocol.Py4JJavaError as e:
    129             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o399.load.
: java.lang.NullPointerException
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateGeneralOptions(KafkaSourceProvider.scala:243)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:331)
	at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:242)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:122)
	at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:122)
	at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:35)
	at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:221)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)


Deploying a script relying on multiple files

Greetings,

I am starting to play a little with what this repository offers and there is a use case I cannot seem to make work.

Basically, I have a single entrypoint that imports classes from sibling files located in the same directory, like this:

|--projectname/
|----jobs/
|------jobsfolder/
|--------jobA.py
|--------jobB.py
|--------entrypoint.py

With entrypoint.py being this:

from jobA import jobA
from jobB import jobB

if __name__=="__main__":
    jobA().launch()
    jobB().launch()

The relevant part of the job definition is the following:

    "spark_python_task": {
        "python_file": "projectname/jobs/jobsfolder/entrypoint.py"
    }

When trying to deploy or execute this job, I get a ModuleNotFoundError on jobA. It seems only logical, since only entrypoint.py was uploaded to MLflow.

Dipping my toes into the code I unwheeled from dbx, it looks like the intended behavior, but I feel like this is a fairly frequent use case when trying to keep the code readable and well documented.

Am I missing something about this use case, or is it just not supported as of now?

Anyways thank you for your work on this project, it has been very useful and a lot of fun to use for now !
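
One hedged workaround sketch (not an official answer to this issue): because dbx builds and installs the whole project package as a wheel on the cluster, importing the sibling modules through the package's absolute path, instead of as bare top-level modules, makes them resolvable even though only entrypoint.py is referenced in spark_python_task. The projectname path below mirrors the hypothetical layout from the question:

# entrypoint.py, assuming jobA.py and jobB.py are part of the installed `projectname` package
from projectname.jobs.jobsfolder.jobA import jobA
from projectname.jobs.jobsfolder.jobB import jobB

if __name__ == "__main__":
    jobA().launch()
    jobB().launch()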

Problem executing my jobs on the cluster: dbx deploy --jobs=project_name fails

I keep getting this error when using this project.

response = verify_rest_response(response, endpoint)
File "/Users/charalampossouris/miniforge3/lib/python3.9/site-packages/mlflow/utils/rest_utils.py", line 137, in verify_rest_response
raise MlflowException("%s. Response body: '%s'" % (base_msg, response.text))
mlflow.exceptions.MlflowException: API request to endpoint was successful but the response body was not in a valid JSON format. Response body: '<!doctype html><title>Databricks - Sign In</title>

<script src="/login/login.01615f82.js"></script>'

I am almost confident that my DATABRICKS-HOST and DATABRICKS-TOKEN are correct and placed in the right places.

I will continue debugging it but I was hoping for some hints :)

When executing "databricks workspace list" I can see the workspaces, which indicates to me that my credentials are correctly placed.

com.databricks.backend.daemon.data.common.InvalidMountException

When Databricks runs the job on the cluster, we get the following error:

Message: Library installation failed for library due to infra fault for whl: "dbfs:/databricks/mlflow-tracking/x/x/artifacts/dist/pipeline-0.1.0-py3-none-any.whl" . Error messages: java.lang.RuntimeException: ManagedLibraryInstallFailed: com.google.common.util.concurrent.UncheckedExecutionException: com.databricks.backend.daemon.data.common.InvalidMountException: Error while using path /databricks/mlflow-tracking/x/x/artifacts/dist/pipeline-0.1.0-py3-none-any.whl for resolving path '/x/x/artifacts/dist/pipeline-0.1.0-py3-none-any.whl' within mount at '/databricks/mlflow-tracking'. for library:PythonWhlId(dbfs:/databricks/mlflow-tracking/x/x/artifacts/dist/pipeline-0.1.0-py3-none-any.whl,,NONE),isSharedLibrary=false

Does this have anything to do with the below?

https://kb.databricks.com/machine-learning/mlflow-artifacts-no-client-error.html#invalid-mount-exception

Is this commit related to the issue? We pulled it, but it was still failing:

4536390

Generated README sample using wrong dbx command

I was playing with the project today and noticed a small issue in the docs that get generated after creating the project via cookiecutter. Specifically, the snippet in the README at https://github.com/databrickslabs/cicd-templates/blob/master/%7B%7Bcookiecutter.project_slug%7D%7D/README.md#testing reads as follows.


For a test on a automated job cluster, use launch instead of execute:

dbx execute --job=cicd-sample-project-sample-integration-test

Based on my understanding of the project and its commands, I think the snippet should be the following

dbx launch --job=cicd-sample-project-sample-integration-test

If you guys are taking pull requests, I'll fork and put the fix up for review to help out. I've been enjoying going through this project, so kudos to you.

Add integration testing pipelines

To make sure that new functionality is working properly, we need to add integration testing pipelines at least for basic use-cases.

Notebook task job spec

I need to write pipeline tests as Scala Notebooks. Will the databrickslabs_cicdtemplates Python module support more job specs?

for example:

  "notebook_task": {
    "notebook_path": "will get filled automatically"
  }

Problem running dbx execute with example

I'm not able to run dbx execute --cluster-name my-cluster --job job-sample. As far as I understand, in dbx execute the wheel is built, copied to DBFS, and jobs/sample/entrypoint.py is serialized to run directly on the interactive cluster, without providing --conf-file. However, the Job class defined in project_name/common.py requires --conf-file, so an exception was thrown:
(screenshot of the exception omitted)

========================================
So far, it seems to me that:

  • dbx execute is suitable for launching scripts without external config files, since it uses the 1.2 API command execution.
  • if I want to run the local code on an interactive cluster (for fast dev iteration), I'll need a separate job JSON with "existing_cluster_id", then run launch (which is /job/run-now underneath)

Run testing coverage on integration tests

Hello,
I'm trying to check my tests' code coverage using coverage.py. It's easy to do for the unit tests because they are launched using pytest (pytest --cov=standard_datatamers_lib tests/unit --cov-report=xml:cobertura.xml), but I'm struggling to find a way to do it for the integration tests, because they are launched using dbx (dbx launch --job=standard_datatamers_lib-sample-integration-test-feature --trace).
Do you have a way to do that? If not, are you planning to create something in this regard?
Thanks in advance :)

Introduce self-hosted pypi repo scenario

Some users might use a self-hosted PyPI repository.

API changes

Introduce a new parameter for dbx deploy, which will disable the automated editing of job spec with package location:

dbx deploy --no-rebuild --no-package

This change is backward-compatible.

Doc changes

We need to add the documentation example for the self-hosted PyPI repo.

Acceptance criteria:

  • API change is introduced
  • Doc change is introduced

Windows issue with replacing workspace_dir file forward slash with backward slashes

Running on Windows, the project.json values (+ local file locations?) seem to cause an error. Python seems to be trying to replace forward slashes with backslashes (see the attached windows_file_error.txt).

The documentation notes that the dbx commands are executed in bash. Is Windows not supported?

(dbconnect_aws) C:\Users\user1\sample_aws>dbx execute --cluster-name=test-cluster --job=sample_aws-sample-integration-test
Traceback (most recent call last):
File "c:\local\miniconda3\envs\dbconnect_aws\lib\site-packages\databricks_cli\sdk\api_client.py", line 131, in perform_query
resp.raise_for_status()
File "c:\local\miniconda3\envs\dbconnect_aws\lib\site-packages\requests\models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://x.databricks.com/api/2.0/workspace/mkdirs

...

Response from server:
{ 'error_code': 'INVALID_PARAMETER_VALUE',
'message': "Path (\Shared\dbx\projects) doesn't start with '/'"}

Giving Access to Private pypi Repo Hosted in Azure to Databricks

Hi,
I am running into what are likely authentication issues when running dbx deploy --jobs=<job> --files-only --no-package. I tried the solutions below:

  1. In deployments.json:

     "libraries": [
         {
             "pypi": {"package": "mypckg", "repo": "pckglink"}
         }
     ],

  2. Excluding '--no-package' creates the deployment-result.json file under artifacts -> dbx in my Databricks workspace. It fails since the pypi package reference is incorrectly set up. It runs 'pip install --index-url repourl', since it splits the name off as another package, and this of course fails. The URL has the token, so this should work if we had this instead: "pypi": { "package": "package_name --index-url repourl" }. Any thoughts on:

     "libraries": [
         {
             "whl": "whllocation.whl"
         },
         {
             "pypi": {
                 "package": "--index-url repourl"
             }
         },
         {
             "pypi": {
                 "package": "package_name"
             }
         }
     ]

DevOps pipeline doesn't fail even though testing failed in the cluster job.

Hello,
when I'm running some integration tests and I get an error, I can see something like this in the cluster's logs:
======================================================================
FAIL: test_datawarehouse (main.DatawarehouseIntegrationTest)

Traceback (most recent call last):
File "", line 116, in test_datawarehouse
File "", line 247, in test_mergePk
AssertionError


Ran 2 tests in 210.850s

FAILED (failures=1)

I then expect to see a failure in the related Azure DevOps pipeline, but that's not the case. How could I make the pipeline fail if there's an error in the job?

On Azure DevOps I see the following:
[dbx][2021-01-11 15:17:10.918] Job run is not yet finished, current status message: In run
[dbx][2021-01-11 15:17:16.011] Job run finished successfully
[dbx][2021-01-11 15:17:16.011] Launch command finished

Build artifact task error: invalid command 'bdist_wheel'

I'm running this pipeline in Azure DevOps. The Build Artifact task exits with code '1' and the error details

Script contents:
python setup.py bdist_wheel
========================== Starting Command Output ===========================
/bin/bash --noprofile --norc /home/vsts/work/_temp/9ca428f0-653f-47d8-8a33-6ef61518c30c.sh
usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
   or: setup.py --help [cmd1 cmd2 ...]
   or: setup.py --help-commands
   or: setup.py cmd --help

error: invalid command 'bdist_wheel'

##[error]Bash exited with code '1'.

The earlier task Install Python Dependencies was successful and appears to have installed wheel.

Collecting wheel
  Downloading wheel-0.34.2-py2.py3-none-any.whl (26 kB)

No module named 'path'

$ cookiecutter https://github.com/databrickslabs/cicd-templates.git

You've downloaded /home/zhen/.cookiecutters/cicd-templates before. Is it okay to delete and re-download it? [yes]: 
project_name [cicd-sample-project]: 
version [0.0.1]: 
description [Databricks Labs CICD Templates Sample Project]: 
author []: 
Select cloud:
1 - AWS
2 - Azure
3 - Google Cloud
Choose from 1, 2, 3 (1, 2, 3) [1]: 
Select cicd_tool:
1 - GitHub Actions
2 - Azure DevOps
3 - GitLab
Choose from 1, 2, 3 (1, 2, 3) [1]: 3
project_slug [cicd_sample_project]: 
workspace_dir [/Shared/dbx/cicd_sample_project]: 
artifact_location [dbfs:/Shared/dbx/projects/cicd_sample_project]: 
profile [DEFAULT]: 
Traceback (most recent call last):
  File "/tmp/tmp5scvfog5.py", line 5, in <module>
    from path import Path
ModuleNotFoundError: No module named 'path'
ERROR: Stopping generation because post_gen_project hook script didn't exit successfully
Hook script failed (exit status: 1)
