databricks / dbt-databricks Goto Github PK

A dbt adapter for Databricks.

License: Apache License 2.0

Python 99.95% Shell 0.05%

dbt-databricks's Introduction

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

The Databricks Lakehouse provides one simple platform to unify all your data, analytics and AI workloads.

dbt-databricks

The dbt-databricks adapter contains all of the code enabling dbt to work with Databricks. This adapter is based off the amazing work done in dbt-spark. Some key features include:

Easy setup. No need to install an ODBC driver as the adapter uses pure Python APIs.
Open by default. For example, it uses the the open and performant Delta table format by default. This has many benefits, including letting you use MERGE as the the default incremental materialization strategy.
Support for Unity Catalog. dbt-databricks>=1.1.1 supports the 3-level namespace of Unity Catalog (catalog / schema / relations) so you can organize and secure your data the way you like.
Performance. The adapter generates SQL expressions that are automatically accelerated by the native, vectorized Photon execution engine.

Choosing between dbt-databricks and dbt-spark

If you are developing a dbt project on Databricks, we recommend using dbt-databricks for the reasons noted above.

dbt-spark is an actively developed adapter which works with Databricks as well as Apache Spark anywhere it is hosted e.g. on AWS EMR.

Getting started

Installation

Install using pip:

pip install dbt-databricks

Upgrade to the latest version

pip install --upgrade dbt-databricks

Profile Setup

your_profile_name:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: [optional catalog name, if you are using Unity Catalog, only available in dbt-databricks>=1.1.1]
      schema: [database/schema name]
      host: [your.databrickshost.com]
      http_path: [/sql/your/http/path]
      token: [dapiXXXXXXXXXXXXXXXXXXXXXXX]

Quick Starts

These following quick starts will get you up and running with the dbt-databricks adapter:

Developing your first dbt project
Using dbt Cloud with Databricks (Azure | AWS)
Running dbt production jobs on Databricks Workflows
Using Unity Catalog with dbt-databricks
Using GitHub Actions for dbt CI/CD on Databricks
Loading data from S3 into Delta using the databricks_copy_into macro
Contribute to this repository

Compatibility

The dbt-databricks adapter has been tested:

with Python 3.7 or above.
against Databricks SQL and Databricks runtime releases 9.1 LTS and later.

Tips and Tricks

Choosing compute for a Python model

You can override the compute used for a specific Python model by setting the http_path property in model configuration. This can be useful if, for example, you want to run a Python model on an All Purpose cluster, while running SQL models on a SQL Warehouse. Note that this capability is only available for Python models.

def model(dbt, session):
    dbt.config(
      http_path="sql/protocolv1/..."
    )

dbt-databricks's People

Contributors

Stargazers

Watchers

Forkers

ueshin superdupershant ksree momataj allisonwang-db troyel mcannamela sathya-reddy-m okayhooni varunsh-coder tovganesh feng-tao danthelion nvinhphuc james-hadoop adelantefinancialholdings binhnefits smhood92 koningjasper chefaref zhengruifeng pramoddoodala lwbayes dave-connors-3 jcalvesoliveira vsamoilov hitesh-goel anogues dbeatty10 chenyulinx shubhm13 lucasouza98 andrefurlan-db luisleon90 edejong-dbc tadtenacious jackyhu-db oynek leedabee brooklyn-data gawalimangesh007 anikolaienko susodapop ballacharan andrefurlan nihil0 harlixxy baocaih pppjain isabella232 pietern seagen dataders rcypher-databricks josephberni arpitjain799 pavs23 sheilaquan nakicam maayan-s govindarajan-d cezarypukownik casperdamen123 benc-db project-echo buxert leo-schick robert-altmiller amihaygil rlsalcido24 ammarchalifah nodejsmith andreas-gjensidige jeffrey-harrison colin-rogers-dbt kdazzle tailorcare davidharting yunbodeng-db mikealfare srggrs mrk-its case-k-git nwilliams71 jacobus-herman lennartkats-db hammondguerin rlh1994 peterallenwebb benltaief thijs-nijhuis dustinvannoy-db khitem dwreeves johnsequeira-paradigm stevenayers arulnidhii gaoshihang gthesheep mi-volodin

dbt-databricks's Issues

Support ZORDER on models as model configuration

Describe the feature

ZORDER is a useful way to get natural colocation for data. It can only be run as part of the OPTIMIZE command. I would like to be able to set it as model configuration. In the implementation, we would run the OPTIMIZE command, which would use the model metadata to figure out the right ZORDER columns

Who will this benefit?

Analytics engineers who want to keep model metadata in one location - the model itself.

support user incremental predicates on merge

Describe the feature

Users should be able to supply incremental models with arbitrary predicates during the merge step to prevent unnecessary table scan for massive data sets.

Describe alternatives you've considered

None

Additional context

See:

Who will this benefit?

users with extremely large datasets who want to limit the scan of their merge operations on incremental models!

Are you interested in contributing this feature?

Yes

'DatabricksSQLConnectionWrapper' object has no attribute 'cancel'

Describe the bug

when running dbt build, got the following error msg

06:26:51  Encountered an error:
'DatabricksSQLConnectionWrapper' object has no attribute 'cancel'
06:26:51  Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/dbt/task/runnable.py", line 379, in execute_nodes
    self.run_queue(pool)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/runnable.py", line 293, in run_queue
    self._raise_set_error()
  File "/usr/local/lib/python3.9/site-packages/dbt/task/runnable.py", line 274, in _raise_set_error
    raise self._raise_next_tick
dbt.exceptions.FailFastException: FailFast Error in model ******(models/******.sql)
  Failing early due to test failure or runtime error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/dbt/main.py", line 129, in main
    results, succeeded = handle_and_check(args)
  File "/usr/local/lib/python3.9/site-packages/dbt/main.py", line 191, in handle_and_check
    task, res = run_from_args(parsed)
  File "/usr/local/lib/python3.9/site-packages/dbt/main.py", line 238, in run_from_args
    results = task.run()
  File "/usr/local/lib/python3.9/site-packages/dbt/task/runnable.py", line 470, in run
    result = self.execute_with_hooks(selected_uids)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/runnable.py", line 433, in execute_with_hooks
    res = self.execute_nodes()
  File "/usr/local/lib/python3.9/site-packages/dbt/task/runnable.py", line 382, in execute_nodes
    self._cancel_connections(pool)
  File "/usr/local/lib/python3.9/site-packages/dbt/task/runnable.py", line 357, in _cancel_connections
    for conn_name in adapter.cancel_open_connections():
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/base/impl.py", line 1032, in cancel_open_connections
    return self.connections.cancel_open()
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/sql/connections.py", line 43, in cancel_open
    self.cancel(connection)
  File "/usr/local/lib/python3.9/site-packages/dbt/adapters/spark/connections.py", line 296, in cancel
    connection.handle.cancel()
AttributeError: 'DatabricksSQLConnectionWrapper' object has no attribute 'cancel'

Steps To Reproduce

using the following version:

dbt-databricks==1.2.2
dbt-core==1.2.1

Expected behavior

No error is expected.

Screenshots and log output

as provided above

System information

dbt-databricks==1.2.2
dbt-core==1.2.1

The output of python --version:

Additional context

Might be related to this PR: https://github.com/databricks/dbt-databricks/pull/163/files

SQL with Japanese/Chinese characters not applied correctly in Databricks

NB

A redirect of this issue. I've copied the whole thing, hope you don't mind.

Current Behavior

Have a model like this (reduced the division list for brevity):

{{ config(materialized='view') }}

{% set divisions = [
    ('JP-23', 'Aichi', 'Aichi', '愛知県', 'Kanjii', 'Chūbu', 'JP'),
    ('JP-05', 'Akita', 'Akita', '秋田県', 'Kanjii', 'Tōhoku', 'JP'),
]
%}

{% for state_iso_code, state_name, state_name_2, state_name_local, state_name_local_type, region, country_code in divisions %}
select
    '{{ state_iso_code }}'        as state_iso_code
  , '{{ state_name }}'            as state_name
  , '{{ state_name_2 }}'          as state_name_2
  , '{{ state_name_local }}'      as state_name_local
  , '{{ state_name_local_type }}' as state_name_local_type
  , '{{ region }}'                as region
  , '{{ country_code }}'          as country_code
union all
{% endfor %}
select
    '-1'      as state_iso_code
  , '(Blank)' as state_name
  , '(Blank)' as state_name_2
  , '(Blank)' as state_name_local
  , '(Blank)' as state_name_local_type
  , '(Blank)' as region
  , '-1'      as country_code

When dbt compile is run, everything seems fine:

select
    'JP-23'        as state_iso_code
  , 'Aichi'            as state_name
  , 'Aichi'          as state_name_2
  , '愛知県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Chūbu'                as region
  , 'JP'          as country_code
union all

select
    'JP-05'        as state_iso_code
  , 'Akita'            as state_name
  , 'Akita'          as state_name_2
  , '秋田県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Tōhoku'                as region
  , 'JP'          as country_code

union all

select
    '-1'      as state_iso_code
  , '(Blank)' as state_name
  , '(Blank)' as state_name_2
  , '(Blank)' as state_name_local
  , '(Blank)' as state_name_local_type
  , '(Blank)' as region
  , '-1'      as country_code

Now when dbt run --model administrative_divisions is executed against Databricks profile (type: databricks), the resuling view is this:

CREATE VIEW `auto_replenishment_george_test`.`administrative_divisions` (
  `state_iso_code`,
  `state_name`,
  `state_name_2`,
  `state_name_local`,
  `state_name_local_type`,
  `region`,
  `country_code`)
TBLPROPERTIES (
  'transient_lastDdlTime' = '1645023802')
AS select
    'JP-23'        as state_iso_code
  , 'Aichi'            as state_name
  , 'Aichi'          as state_name_2
  , '???'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Chubu'                as region
  , 'JP'          as country_code
union all

select
    'JP-05'        as state_iso_code
  , 'Akita'            as state_name
  , 'Akita'          as state_name_2
  , '???'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Tohoku'                as region
  , 'JP'          as country_code

union all

select
    '-1'      as state_iso_code
  , '(Blank)' as state_name
  , '(Blank)' as state_name_2
  , '(Blank)' as state_name_local
  , '(Blank)' as state_name_local_type
  , '(Blank)' as region
  , '-1'      as country_code

The ??? is not what we expected to see :)

Expected Behavior

Japanese/Chinese characters are sent correctly to the actual database, without being replaced with questionmarks.

Steps To Reproduce

Described in Current Behaviour

Relevant log output

Log seems to be fine:

15:03:22.101320 [debug] [Thread-1  ]: On model.auto_replenishment.administrative_divisions: /* {"app": "dbt", "dbt_version": "1.0.1", "profile_name": "auto_replenishment", "target_name": "dev", "node_id": "model.auto_replenishment.administrative_divisions"} */
create or replace view auto_replenishment_george_test.administrative_divisions
  
  as
    




select
    'JP-23'        as state_iso_code
  , 'Aichi'            as state_name
  , 'Aichi'          as state_name_2
  , '愛知県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Chūbu'                as region
  , 'JP'          as country_code
union all

select
    'JP-05'        as state_iso_code
  , 'Akita'            as state_name
  , 'Akita'          as state_name_2
  , '秋田県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Tōhoku'                as region
  , 'JP'          as country_code
union all
...



### Environment

```markdown
- OS: Ubuntu 20.04
- Python: 3.8.10
- dbt: 1.0.1

What database are you using dbt with?

other (mention it in "Additional Context")

Additional Context

installed version: 1.0.1
latest version: 1.0.1

Up to date!

Plugins:

databricks: 1.0.1
spark: 1.0.0

Materialization strategy `table` for seeds produces broken SQL

Describe the bug

Materialization strategy table for seeds produces broken SQL.

/* {"app": "dbt", "dbt_version": "1.1.0", "profile_name": "z", "target_name": "dev", "node_id": "seed.z.channel_mapping"} */
    create or replace table desimic_zms.channel_mapping
    using delta
    location 's3://bucket/path/channel_mapping'
    as
      None
------^^^

Steps To Reproduce

dbt_project.yml:

seeds:
  +location_root: "s3://bucket/path"
  +file_format: delta
  +materialization: table

I'm not sure what the default materialization for seed is (dbt documentation is not clear on this) , but I put table because by default with a combination of external tables, it appends records to the table, so I'm assuming it's incremental by default (surprisingly).

Expected behavior

This is debatable, but I expect seeds to behave like models with the materialization strategy table.

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.1.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark:      1.1.0 - Up to date!
  - databricks: 1.1.0 - Up to date!

The operating system you're using: OS X Big Sur

The output of python --version: Python 3.8.10

Reintegrate functional adapter tests into GitHub Action CI

I might be lost in tox here, but as far as I can tell, there are tox jobs setup to run the dbt-core's integration test suite, but I don't see them being run in a PR's CI, only the unit tests (which are incredible and very comprehensive btw!)

dbt-databricks/tox.ini

Lines 30 to 64 in e921754

 [testenv:integration-databricks-cluster] 

 basepython = python3 

 commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_cluster -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 /bin/bash -c '{envpython} -m pytest -v -m profile_databricks_cluster -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 passenv = DBT_* PYTEST_ADDOPTS 

 deps = 

 -r{toxinidir}/dev-requirements.txt 

 -r{toxinidir}/requirements.txt 

 [testenv:integration-databricks-uc-cluster] 

 basepython = python3 

 commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_uc_cluster -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 /bin/bash -c '{envpython} -m pytest -v -m profile_databricks_uc_cluster -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 passenv = DBT_* PYTEST_ADDOPTS 

 deps = 

 -r{toxinidir}/dev-requirements.txt 

 -r{toxinidir}/requirements.txt 

 [testenv:integration-databricks-sql-endpoint] 

 basepython = python3 

 commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_sql_endpoint -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 /bin/bash -c '{envpython} -m pytest -v -m profile_databricks_sql_endpoint -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 passenv = DBT_* PYTEST_ADDOPTS 

 deps = 

 -r{toxinidir}/dev-requirements.txt 

 -r{toxinidir}/requirements.txt 

 [testenv:integration-databricks-uc-sql-endpoint] 

 basepython = python3 

 commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_uc_sql_endpoint -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 /bin/bash -c '{envpython} -m pytest -v -m profile_databricks_uc_sql_endpoint -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 || exit $ret' 

 passenv = DBT_* PYTEST_ADDOPTS 

 deps = 

 -r{toxinidir}/dev-requirements.txt 

 -r{toxinidir}/requirements.txt

Provide Databricks-specific spark-utils

Describe the feature

Some of the functionality in dbt-utils can be simplified if we had a Databricks-specific implementation. For example, dateadd can be much simpler and more performant.

Delta Live Tables

Describe the feature

Create and orchestrate Delta Live Tables using dbt-databricks

Describe alternatives you've considered

Delta Live Tables can be set up using the Databricks UI, but this misses all the excellent features of dbt.

Additional context

I do not know whether this idea is feasible, since Delta Live Tables function only in tandem with Databricks pipelines, which goes beyond CREATE TABLE.

Who will this benefit?

Anyone who likes both dbt and Delta Live Tables, in particular teams with existing dbt-databricks projects who want to migrate to Delta Live Tables.

Are you interested in contributing this feature?

Potentially, but I am a complete newbie to this repo.

Enable pandas and pandas-on-Spark DataFrames for Databricks

Describe the feature

When a dbt python model returns a Pandas DataFrame, materialize it as a table in Databricks.

Currently, an error will be raised for Pandas DataFrames (whereas PySpark DataFrames work fine).

Same as dbt-labs/dbt-bigquery#312 and dbt-labs/dbt-spark#468

Are you interested in contributing this feature?

I will submit a pull request for this 👍

Can't get Python models running on `1.3.0b0`

Describe the bug

I am trying to run Python models using the dbt-dtabricks adapter but get some errors.

Steps To Reproduce

Create some models in your dbt project.
With only SQL models, they run correctly.
When adding a Python model, the run fails with the error:

Unhandled error while executing model.my_project.my_model
Python model doesn't support SQL Warehouses

My profiles.yml is configured with the http_path config.

If I add a cluster in addition to http_path, I still get the error.
If I keep the cluster and remove http_path I get the error:

dbt.exceptions.DbtProfileError: Runtime Error
`http_path` must set when using the dbsql method to connect to Databricks

Expected behavior

The Python model runs correctly

System information

The output of dbt --version:

Core:
  - installed: 1.3.0-b2
  - latest:    1.2.1    - Ahead of latest version!

Plugins:
  - databricks: 1.3.0b0 - Ahead of latest version!
  - snowflake:  1.3.0b2 - Ahead of latest version!
  - spark:      1.3.0b2 - Ahead of latest version!

Additional context

Is there specific config that needs to filled-in in profiles.yml to work with Python models?

Databricks Workflows GA

Hi,

I build a data vault model for a customer using dbt and want to use the Databricks workflow integration as mentioned here: https://github.com/databricks/dbt-databricks/blob/main/docs/databricks-workflows.md

Is there a way to apply for the private preview, or can you estimate, when the feature becomes public preview or GA?

Best,
Nicolai

Multiple sql statements in a prehook string fails to parse

Describe the bug

Having multiple sql statements in a dbt prehook fails.

Steps To Reproduce

Run an example model such as:

{{
  config(
    pre_hook="select 'a';
              select 'b'; "
  )
}}

select * from some_schema.some_table

using dbt run and get the error:

Runtime Error in model atty (models/simple_demo/atty.sql)
16:58:51    Query execution failed.
16:58:51    Error message: org.apache.spark.sql.catalyst.parser.ParseException: 
16:58:51    extraneous input 'select' expecting {<EOF>, ';'}(line 4, pos 14)
16:58:51    
16:58:51    == SQL ==
16:58:51    /* {"app": "dbt", "dbt_version": "1.0.3", "profile_name": "tdss_dbt_test", "target_name": "dev", "node_id": "model.tdss_dbt_test.atty"} */
16:58:51    
16:58:51            select 'a';
16:58:51                  select 'b'
16:58:51    --------------^^^

Expected behavior

If you run the same prehook sql (select 'a'; select 'b';) directly in a Databricks notebook against the same cluster, it will execute successfully.
Furthermore, multiple sql statements in a prehook are supported in the dbt documentation https://docs.getdbt.com/reference/resource-configs/pre-hook-post-hook

System information

The output of dbt --version:

dbt --version
installed version: 1.0.3
   latest version: 1.0.4

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.0.1 - Up to date!
  - spark: 1.0.0 - Up to date!

The operating system you're using:
MacOS 12.2.1 (Monterey)

The output of python --version:
Python 3.8.12

Additional context

The cluster I am connected to uses DBR 9.1.
I have not tried this with post hooks yet, but I assume it will have the same behavior.
My particular instance is using a PVC release.

Generic DBT_ACCESS_TOKEN in dbt Workflow does not have access to catalogs

Context

AWS Cloud
Enterprise Plan
We are using Unity Catalog
An authenticated user without a permission group has access to no catalogs
Only authenticated users with a permission group can access respective catalogs and databases
I am a consultant engaged with a client
This client has regulatory requirements that prevent the feasibility of dbt cloud

Triage Steps

Following the instructions outlined for running dbt-core as a dbt Task Workflow here:
https://github.com/databricks/dbt-databricks/blob/main/docs/databricks-workflows.md

✅ Local Dev + SQL Warehouses works

I can get this to work locally when I specify the http_path to be a SQL Warehouse endpoint and I use my PAT injected into the profiles.yml
This works because my PAT and the SQL Warehouses are assigned to the correct permissions groups in our workspace

😭 dbt Workflows does not work when...

This is because the cluster generates credentials generically for:

DBT_HOST,
DBT_ORG_ID,
DBT_CLUSTER_ID,
DBT_ACCESS_TOKEN

It seems that access token does not have the correct permission group associated and there is no way to add the association.

I tried hard coding a PAT that I know works but instead of using the /sql/1.0/endpoints/<endpoint_id> I tried it against the cluster endpoint "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}".

This also did not work.

I could go back to having the workflow use the SQL Warehouse Endpoint and a PAT but that defeats the purpose of spinning up a spark cluster for a job only to send the work to another (more expensive) spark cluster right?

Triage Summary

Here is a summary table of test cases

token type	endpoint type	outcome
PAT	SQL Warehouse	✅
Generic Access Token	Cluster Endpoint	😭
PAT	Cluster Endpoint	😭

Definitions:

Generic Access Token --> DBT_ACCESS_TOKEN
SQL Warehouse --> /sql/1.0/endpoints/<endpoint_id>
Cluster Endpoint --> "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"

Further digging:

It seems the dbt-databricks team have left a TODO with a similar conundrum so we are all on the same page it seems.

https://github.com/databricks/dbt-databricks/blob/main/dbt/adapters/databricks/connections.py#L408-L420

                # TODO: what is the error when a user specifies a catalog they don't have access to
                conn: DatabricksSQLConnection = dbsql.connect(
                    server_hostname=creds.host,

Expected Behaviour

As a user I would expect documentation around how the generic DBT_ACCESS_TOKEN is generated.
As a user I would expect the dbt Workflow task to have some way to grant it permissions
As a user I would expect some documentation around the cluster SQL endpoints as well as a bread crumb trail of links from the related user guides and dbt task documentation

Extra

I do realise that my expectations lie with the Databricks team and not with this library package. However it also makes sense to track this issue in public so that others can follow along as it will undoubtedly link to some code changes too.

I have an active use case I am piloting with a client so very keen to act as a beta tester in our dev environment so there is low risk and we can provide real world feedback.

Cannot run dbt docs generate with JSON logs

Describe the bug

When running dbt docs generate with JSON logs enabled I receive an error: Encountered an error while generating catalog: Object of type DatabricksRelation is not JSON serializable.
This occurs when using dbt-databricks 1.1.0 on all of locally (Windows), Docker (Linux) and the preview Databricks dbt task type.
It does not occur in earlier versions.
It does not occur with the default log format.

Steps To Reproduce

Run dbt --log-format json docs generate

Expected behavior

Doc site generates correctly, with JSON logs.

Screenshots and log output

{"code": "E044", "data": {}, "invocation_id": "4240ec3d-ef2f-4772-8540-d4d522a8c717", "level": "info", "log_version": 2, "msg": "Building catalog", "pid": 1628, "thread_name": "MainThread", "ts": "2022-05-25T14:22:49.779873Z", "type": "log_line"}
{"code": "Z046", "data": {"log_fmt": null, "msg": "Encountered an error while generating catalog: Object of type DatabricksRelation is not JSON serializable"}, "invocation_id": "4240ec3d-ef2f-4772-8540-d4d522a8c717", "level": "warn", "log_version": 2, "msg": "Encountered an error while generating catalog: Object of type DatabricksRelation is not JSON serializable", "pid": 1628, "thread_name": "MainThread", "ts": "2022-05-25T14:22:49.783320Z", "type": "log_line"}
{"code": "E041", "data": {"num_exceptions": 1}, "invocation_id": "4240ec3d-ef2f-4772-8540-d4d522a8c717", "level": "error", "log_version": 2, "msg": "dbt encountered 1 failure while writing the catalog", "pid": 1628, "thread_name": "MainThread", "ts": "2022-05-25T14:22:49.792754Z", "type": "log_line"}

System information

Windows:

Core:
  - installed: 1.1.0
  - latest:    1.1.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.1.0 - Up to date!
  - spark:      1.1.0 - Up to date!

The operating system you're using:
Windows/Linux/Databricks

The output of python --version:
Windows:
Python 3.9.0

Additional context

Unable to install databricks adapter through packages and dbt deps

Describe the bug

We are trying to setup the dbt-adapter in our dbt environment. We've been running the spark-adapter beforehand. We install packages using dbt deps and specifying packages. When we add the dbt-databricks adapter to the packages list we get the following error:

source % dbt --debug  deps --project-dir=/Users/ipenev/repos/repo3/source/data_ops/dbt --profiles-dir=/Users/ipenev/repos/repo3/source/data_ops/dbt --target=dev
2022-10-06 17:23:16.939813 (MainThread): Running with dbt=0.20.2
2022-10-06 17:23:16.997800 (MainThread): running dbt with arguments Namespace(cls=<class 'dbt.task.deps.DepsTask'>, debug=True, defer=None, log_cache_events=False, log_format='default', partial_parse=None, profile=None, profiles_dir='/Users/ipenev/repos/repo3/source/data_ops/dbt', project_dir='/Users/ipenev/repos/repo3/source/data_ops/dbt', record_timing_info=None, rpc_method='deps', single_threaded=False, state=None, strict=False, target='dev', test_new_parser=False, use_cache=True, use_colors=None, use_experimental_parser=False, vars='{}', warn_error=False, which='deps', write_json=True)
2022-10-06 17:23:16.998395 (MainThread): Tracking: tracking
2022-10-06 17:23:17.011898 (MainThread): Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x12505a910>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1261f23d0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1261f20a0>]}
2022-10-06 17:23:17.013316 (MainThread): Set downloads directory='/var/folders/g7/6hvwgqks4kl69bpl428pdl5h0000gn/T/dbt-downloads-cvlom8wt'
2022-10-06 17:23:17.013788 (MainThread): Making package registry request: GET https://hub.getdbt.com/api/v1/index.json
2022-10-06 17:23:17.098672 (MainThread): Response from registry: GET https://hub.getdbt.com/api/v1/index.json 200
2022-10-06 17:23:17.098890 (MainThread): Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json
2022-10-06 17:23:17.180818 (MainThread): Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils.json 200
2022-10-06 17:23:17.184042 (MainThread): Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils/0.7.6.json
2022-10-06 17:23:17.280394 (MainThread): Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/dbt_utils/0.7.6.json 200
2022-10-06 17:23:17.280695 (MainThread): Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/spark_utils.json
2022-10-06 17:23:17.376761 (MainThread): Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/spark_utils.json 200
2022-10-06 17:23:17.378420 (MainThread): Making package registry request: GET https://hub.getdbt.com/api/v1/dbt-labs/spark_utils/0.2.4.json
2022-10-06 17:23:17.476903 (MainThread): Response from registry: GET https://hub.getdbt.com/api/v1/dbt-labs/spark_utils/0.2.4.json 200
2022-10-06 17:23:17.477542 (MainThread): Executing "git clone --depth 1 https://github.com/databricks/dbt-databricks.git c4cd59c6b0d9803e4aa7bb7d61ccf434"
2022-10-06 17:23:18.531971 (MainThread): STDOUT: "b''"
2022-10-06 17:23:18.532513 (MainThread): STDERR: "b"Cloning into 'c4cd59c6b0d9803e4aa7bb7d61ccf434'...\n""
2022-10-06 17:23:18.533404 (MainThread): Pulling new dependency c4cd59c6b0d9803e4aa7bb7d61ccf434.
2022-10-06 17:23:18.533499 (MainThread): Executing "git rev-parse HEAD"
2022-10-06 17:23:18.547641 (MainThread): STDOUT: "b'e92175426dfcef8f536822e1b8b5e626bcee30c3\n'"
2022-10-06 17:23:18.548122 (MainThread): STDERR: "b''"
2022-10-06 17:23:18.548275 (MainThread):   Checking out revision HEAD.
2022-10-06 17:23:18.548353 (MainThread): Executing "git remote set-branches origin HEAD"
2022-10-06 17:23:18.556476 (MainThread): STDOUT: "b''"
2022-10-06 17:23:18.556807 (MainThread): STDERR: "b''"
2022-10-06 17:23:18.556917 (MainThread): Executing "git fetch origin --depth 1 --tags HEAD"
2022-10-06 17:23:19.652205 (MainThread): STDOUT: "b''"
2022-10-06 17:23:19.652629 (MainThread): STDERR: "b'From https://github.com/databricks/dbt-databricks\n * branch            HEAD       -> FETCH_HEAD\n * [new tag]         v0.13.0    -> v0.13.0\n * [new tag]         v0.14.3    -> v0.14.3\n * [new tag]         v0.15.3    -> v0.15.3\n * [new tag]         v0.16.0    -> v0.16.0\n * [new tag]         v0.16.1    -> v0.16.1\n * [new tag]         v0.17.0    -> v0.17.0\n * [new tag]         v0.17.1    -> v0.17.1\n * [new tag]         v0.17.2    -> v0.17.2\n * [new tag]         v0.18.0    -> v0.18.0\n * [new tag]         v0.18.1    -> v0.18.1\n * [new tag]         v0.18.1.1  -> v0.18.1.1\n * [new tag]         v0.18.2    -> v0.18.2\n * [new tag]         v0.19.0    -> v0.19.0\n * [new tag]         v0.19.0.1  -> v0.19.0.1\n * [new tag]         v0.19.0rc1 -> v0.19.0rc1\n * [new tag]         v0.19.1    -> v0.19.1\n * [new tag]         v0.19.1b2  -> v0.19.1b2\n * [new tag]         v0.19.1rc1 -> v0.19.1rc1\n * [new tag]         v0.19.2    -> v0.19.2\n * [new tag]         v0.19.2rc2 -> v0.19.2rc2\n * [new tag]         v0.20.0    -> v0.20.0\n * [new tag]         v0.20.0rc1 -> v0.20.0rc1\n * [new tag]         v0.20.0rc2 -> v0.20.0rc2\n * [new tag]         v0.20.1    -> v0.20.1\n * [new tag]         v0.20.1rc1 -> v0.20.1rc1\n * [new tag]         v0.20.2    -> v0.20.2\n * [new tag]         v0.20.2rc1 -> v0.20.2rc1\n * [new tag]         v0.20.2rc2 -> v0.20.2rc2\n * [new tag]         v0.21.0    -> v0.21.0\n * [new tag]         v0.21.0b1  -> v0.21.0b1\n * [new tag]         v0.21.0b2  -> v0.21.0b2\n * [new tag]         v0.21.0rc1 -> v0.21.0rc1\n * [new tag]         v0.21.0rc2 -> v0.21.0rc2\n * [new tag]         v0.21.1    -> v0.21.1\n * [new tag]         v1.0.0     -> v1.0.0\n * [new tag]         v1.0.1     -> v1.0.1\n * [new tag]         v1.0.2     -> v1.0.2\n * [new tag]         v1.0.3     -> v1.0.3\n * [new tag]         v1.1.0     -> v1.1.0\n * [new tag]         v1.1.1     -> v1.1.1\n * [new tag]         v1.1.2     -> v1.1.2\n * [new tag]         v1.1.3     -> v1.1.3\n * [new tag]         v1.1.4     -> v1.1.4\n * [new tag]         v1.1.5     -> v1.1.5\n * [new tag]         v1.2.0     -> v1.2.0\n * [new tag]         v1.2.1     -> v1.2.1\n * [new tag]         v1.2.2     -> v1.2.2\n * [new tag]         v1.2.3     -> v1.2.3\n'"
2022-10-06 17:23:19.652863 (MainThread): Executing "git tag --list"
2022-10-06 17:23:19.661725 (MainThread): STDOUT: "b'v0.13.0\nv0.14.3\nv0.15.3\nv0.16.0\nv0.16.1\nv0.17.0\nv0.17.1\nv0.17.2\nv0.18.0\nv0.18.1\nv0.18.1.1\nv0.18.2\nv0.19.0\nv0.19.0.1\nv0.19.0rc1\nv0.19.1\nv0.19.1b2\nv0.19.1rc1\nv0.19.2\nv0.19.2rc2\nv0.20.0\nv0.20.0rc1\nv0.20.0rc2\nv0.20.1\nv0.20.1rc1\nv0.20.2\nv0.20.2rc1\nv0.20.2rc2\nv0.21.0\nv0.21.0b1\nv0.21.0b2\nv0.21.0rc1\nv0.21.0rc2\nv0.21.1\nv1.0.0\nv1.0.1\nv1.0.2\nv1.0.3\nv1.1.0\nv1.1.1\nv1.1.2\nv1.1.3\nv1.1.4\nv1.1.5\nv1.2.0\nv1.2.1\nv1.2.2\nv1.2.3\n'"
2022-10-06 17:23:19.662133 (MainThread): STDERR: "b''"
2022-10-06 17:23:19.662268 (MainThread): Executing "git reset --hard origin/HEAD"
2022-10-06 17:23:19.679297 (MainThread): STDOUT: "b'HEAD is now at e921754 Add a test for "consolidate timestamp macros". (#200)\n'"
2022-10-06 17:23:19.679689 (MainThread): STDERR: "b''"
2022-10-06 17:23:19.679794 (MainThread): Executing "git rev-parse HEAD"
2022-10-06 17:23:19.686468 (MainThread): STDOUT: "b'e92175426dfcef8f536822e1b8b5e626bcee30c3\n'"
2022-10-06 17:23:19.686774 (MainThread): STDERR: "b''"
2022-10-06 17:23:19.686863 (MainThread):   Checked out at e921754.
2022-10-06 17:23:19.686990 (MainThread): WARNING: The git package "https://github.com/databricks/dbt-databricks.git" 
	is not pinned, using HEAD (default branch).
	This can introduce breaking changes into your project without warning!

See https://docs.getdbt.com/docs/package-management#section-specifying-package-versions
2022-10-06 17:23:19.687561 (MainThread): Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x12505ac70>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x126026fa0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1261f22b0>]}
2022-10-06 17:23:19.688023 (MainThread): Flushing usage events
2022-10-06 17:23:20.129044 (MainThread): Encountered an error:
2022-10-06 17:23:20.129536 (MainThread): Runtime Error
  no dbt_project.yml found at expected path /var/folders/g7/6hvwgqks4kl69bpl428pdl5h0000gn/T/dbt-downloads-cvlom8wt/c4cd59c6b0d9803e4aa7bb7d61ccf434/dbt_project.yml
2022-10-06 17:23:20.134408 (MainThread): Traceback (most recent call last):
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/main.py", line 125, in main
    results, succeeded = handle_and_check(args)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/main.py", line 203, in handle_and_check
    task, res = run_from_args(parsed)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/main.py", line 256, in run_from_args
    results = task.run()
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/task/deps.py", line 53, in run
    final_deps = resolve_packages(packages, self.config)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/deps/resolver.py", line 137, in resolve_packages
    target = final[package].resolved().fetch_metadata(config, renderer)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/deps/base.py", line 85, in fetch_metadata
    self._cached_metadata = self._fetch_metadata(project, renderer)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/deps/git.py", line 102, in _fetch_metadata
    loaded = Project.from_project_root(path, renderer)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/config/project.py", line 642, in from_project_root
    partial = cls.partial_load(project_root, verify_version=verify_version)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/config/project.py", line 609, in partial_load
    return PartialProject.from_project_root(
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/config/project.py", line 455, in from_project_root
    project_dict = _raw_project_from(project_root)
  File "/Users/ipenev/repos/repo3/source/.venv/lib/python3.8/site-packages/dbt/config/project.py", line 153, in _raw_project_from
    raise DbtProjectError(
dbt.exceptions.DbtProjectError: Runtime Error
  no dbt_project.yml found at expected path /var/folders/g7/6hvwgqks4kl69bpl428pdl5h0000gn/T/dbt-downloads-cvlom8wt/c4cd59c6b0d9803e4aa7bb7d61ccf434/dbt_project.yml

Steps To Reproduce

This is our packages file:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.7.6
  - package: dbt-labs/spark_utils
    version: 0.2.4
  - git: "https://github.com/databricks/dbt-databricks.git"
    revision: "main"
    warn-unpinned: false

dbt --debug deps --project-dir=/Users/ipenev/repos/repo3/source/data_ops/dbt --profiles-dir=/Users/ipenev/repos/repo3/source/data_ops/dbt --target=dev is the command I ran, although the target is irrelevant imo. The project-dir and profiles-dir just point to the directory where we keep the dbt_project.yml, packages.yml, and profiles.yml files.

Expected behavior

We'd expect that the package successfully installs so that we can run our dbt models on the databricks adapter.

System information

The output of dbt --version:

ipenev@RHV4MQM9CN source % dbt --version
installed version: 0.20.2
   latest version: 1.0.0

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - spark: 0.20.2

The operating system you're using:
macOS Monterey

The output of python --version:

ipenev@RHV4MQM9CN source % python --version
Python 3.8.14

UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown 3

Describe the bug

The instance is killed with the reason OOMKilled.

Seems there is a memory leakage.

Steps To Reproduce

When run dbt build with multiple threads enabled, in a memory constrained environment (i.e. k8s, with limited memory)
The instance will be killed with the reason OOMKilled

Expected behavior

dbt build should not require too much memory

Screenshots and log output

System information

The output of dbt --version:

The operating system you're using:

The output of python --version:

Additional context

Add any other context about the problem here.

Persist docs fails if Delta Column Mapping is enabled and column name contains spaces

Describe the bug

Persist docs fails if Delta Column Mapping is enabled and column name contains spaces

Steps To Reproduce

Create a model with Delta Column Mapping and persist docs enabled:

{{ config(
    tblproperties={'delta.columnMapping.mode':'name'},
    persist_docs={"relation": true, "columns": true}
)}}
... <truncated> ...

In the model, add a space to a column name e.g.

    select
        ...
        customers.first_name as `first name`,
        ...

Run the model, and you will see the ALTER TABLE fail because it's not escaping the column name correctly:

07:52:19    
07:52:19    Renaming column is not supported in Hive-style ALTER COLUMN, please run RENAME COLUMN instead.(line 4, pos 8)
07:52:19    
07:52:19    == SQL ==
07:52:19    /* {"app": "dbt", "dbt_version": "1.1.1", "profile_name": "jaffle_shop", "target_name": "dev", "node_id": "model.jaffle_shop.customers"} */
07:52:19    
07:52:19        
07:52:19            alter table bilal.customers change column
07:52:19    --------^^^
07:52:19                first name
07:52:19                comment 'Customer\'s first name. PII.'
07:52:19    
07:52:19  
07:52:19  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

Expected behavior

I can use column names with spaces

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

The output of dbt --version:

<output goes here>

The operating system you're using:

The output of python --version:

Additional context

Add any other context about the problem here.

Connection test: [ERROR] - dbt-databricks behind proxy

Describe the bug

A clear and concise description of what the bug is. What command did you run? What happened?
dbt debug gives error
Connection test: [ERROR]

1 check failed:
dbt was unable to connect to the specified database.
The database returned the following error:

Runtime Error
Database Error
failed to connect

ENV set
HTTP_PROXY
HTTPS_PROXY

Does not seemed that proxy environment are being used
curl to host/http_path is OK

Steps To Reproduce

In as much detail as possible, please provide steps to reproduce the issue. Sample data that triggers the issue, example model code, etc is all very helpful here.

dbt debug

Expected behavior

A clear and concise description of what you expected to happen.
connection test OK

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

The output of dbt --version:
Core:

installed: 1.1.0
latest: 1.1.0 - Up to date!

Plugins:

databricks: 1.1.0 - Up to date!
spark: 1.1.0 - Up to date!

The operating system you're using:
ubuntu
The output of python --version:
Python 3.8.10

Additional context

Add any other context about the problem here.

Setting table properties is really slow

Describe the feature

Setting many table properties is slow (e.g. when you use them for column documentation) because when you set one table property, Hive Metastore rewrites all of them.

Proposal

Enhance dbt-databricks to fetch all the comments, do a diff with what's in the dbt project and only issue updates if the diffs don't match.

Who will this benefit?

Everyone! Faster table properties would be nice!

Input from dbt Labs's core adapters team

as part of our new adapter verification process, the dbt core adapters engineering team and I met today to do a high-level review of the dbt-databricks codebase.

good news, your adapter passes the sniff test! That means that while the engineers haven’t yet had the time to review every single line of code (though they plan to in the next few sprints), we did not see anything that should stop the verification process.

An interesting output is that the team found opportunities where they might be better able to support, both of which are not of an urgent priority to fix:

the code introduced in #98 was difficult to grok at first (TIL . is this to support both UC (that have database/catalog) and no-UC dbt-databricks (that do not have database/catalog) projects? I imagine that conditionality can pretty tricky, and the core team wanted to meet with y’all to discuss potential simpler solutions

this parsing of our exception message (see below) raised our hackles, in that now we know that things will break if we decide to standardize our Exception handling messages as we have already done with our logging. That said, we recognize the need for it, but wanted to tell you that we do plan to overhaul our Exception handling so that you won't even need this logic moving forward. stay tuned.

dbt-databricks/dbt/adapters/databricks/impl.py

Lines 114 to 125 in dbd58fb

 try: 

 # The catalog for `show table extended` needs to match the current catalog. 

 with self._catalog(schema_relation.database): 

 results = self.execute_macro(LIST_RELATIONS_MACRO_NAME, kwargs=kwargs) 

 except dbt.exceptions.RuntimeException as e: 

 errmsg = getattr(e, "msg", "") 

 if f"Database '{schema_relation}' not found" in errmsg: 

 return [] 

 else: 

 description = "Error while retrieving information about" 

 logger.debug(f"{description} {schema_relation}: {e.msg}") 

 return []

lastly, are there any particular areas you’d like us to review more closely? Are there any ways that the dbt-core adapters team can better support dbt-databricks now and in the future? let us know!

Randomly getting 'at least one column must be specified for the table'

Describe the bug

When deploying materializations I will randomly get:

Runtime Error in <redacted>
20:40:59    Error running query: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: at least one column must be specified for the table

This causes that step to fail.

Steps To Reproduce

I wish I knew. It doesn't happen consistently for any particular materialization. The crazy thing is that it happens sometimes for static materializations, like seeds.

Expected behavior

Not to randomly fail.

Screenshots and log output

See above

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.2.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.1.1 - Up to date!
  - spark:      1.1.0 - Update available!

The operating system you're using:
macOS Big Sur 11.6.7

The output of python --version:
Python 3.9.10

Ability to set table/view privileges as part of model metadata

Describe the feature

This feature implements the ability to set privileges on tables and views using model metadata.

Describe alternatives you've considered

An alternative is to run SQL e.g. using the on-run-end hook

Who will this benefit?

Anyone who wants to manage privileges as part of the model definition.

Why not use `dbt-spark`?

Why is this repo needed? Would like to hear the motivation for adding a separate dbt-databricks.

Persist Column level comments when creating views

Describe the feature

The persist_docs config works for tables by materializing a model as a table and then running ALTER TABLE statements to update the comment on each column. This method does not work for views because ALTER VIEW doesn't support adding comments. Instead, comments need to be defined when running the CREATE OR REPLACE VIEW statement. There should be an option to handle column level comments for models materialized as views

Describe alternatives you've considered

I tried creating a macro that adds the necessary comment clause and put it at the top of my model but the SQL gets inserted after the AS in CREATE OR REPLACE VIEW AS

Who will this benefit?

This will benefit anyone using column level comments on views in databricks

Are you interested in contributing this feature?

Yes! I overwrote the adapters.sql macro to do this. I'll create and link to a PR!

Support insert_by_period materialization using MERGE

Describe the feature

dbt has an insert_by_period materialization which lets users process 'chunks' during an incremental insert. This is pretty useful when the initial run can be problematic. Our implementation does not currently support MERGE as a strategy, it should.

Describe alternatives you've considered

None, really.

Who will this benefit?

Users who are doing large initial incremental loads.

Remove submit_python_job from Jinja context

Describe the feature

With the latest change in dbt-core, we should consider remove this line so that user won't have access to adapter.submit_python_job in Jinja macros directly. This function is used by a context member directly in dbt-core.

Also something related to this area, we recently refactored the function here a little bit(dbt-core, dbt-spark). Maybe we want to follow the style there so any updates are automatically included

Additional context

This would help dbt keep being opinionated about when certain operation should happen

Support for COPY INTO

Describe the feature

Databricks support the ability to load data into tables using COPY INTO.

Describe alternatives you've considered

I can write my own macro

Who will this benefit?

A typical use case is to load files from cloud storage into a 'bronze' Delta table.

Are you interested in contributing this feature?

sure!

I'm not able to run code developed using dbt directly in Databricks.

Describe the bug

I'm not able to run code developed using dbt directly in Databricks.

I'm trying to configure a CI/CD process using Databricks jobs to automatically run the SQL codes. We usually do this using Airflow and this is the first time we are configuring it directly in Databricks. However, when trying to connect, DBT (or Python, I don't know) gives an error that it doesn't have the logs function.

Steps To Reproduce

Create a venv with Python 3.8.10
py -m venv venv
source venv/Scripts/activate

In this case, I'm on Windows 10, but I don't think it will impact the next steps for reprodute.

Install and upgrade dbt-databricks

pip install dbt-databricks
pip install --upgrade dbt-databricks

Generate a dummy project

dbt init jaffle_shop

Check if connection is working

dbt debug

Commit all changes in a repository test
Add repository in Repos Databricks and give a pull refresh
Create a notebook in same diretory for test
Install requirements and debug dbt
In this case, we have only dbt-databricks to install

%pip install dbt-databricks
%sh dbt --version

%sh dbt debug

Expected behavior

The expected result of dbt debug is the connection indicator working.
Like this:

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

The output of dbt --version:

Additional context

We tried to install another version of dbt-databricks, but every time we try, the installed version always comes 1.0.4, as if it didn't obey the command directed.

As happen in this case:

dbt-databricks adapter plugin inherits from the dbt-spark plugin

Describe the feature

Teams using dbt-spark and dbt-labs/spark-utils intending to migrate to using dbt-databricks cant do so because adapter name is now databricks instead of spark, so macros that are prefixed with spark__ won't automatically be picked up for use.

Describe alternatives you've considered

The dbt-databricks adapter plugin depends on / inherits from the dbt-spark plugin. This would have the effect of telling dbt, "For adapter-dispatched macros, look for databricks__, then spark__, then default__

Who will this benefit?

What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.

teams wanting to use dbt-databricks with dbt-labs/spark-utils macros

Data is duplicated on reloading seeds that are using an external table

Describe the bug

When seeds are configured as an external table, the data is getting duplicated on reload.

Steps To Reproduce

I create a repo with a demo of the issue: https://github.com/dejan/dbt-demo-inconsistent-seeds but here is a short instruction on how to reproduce:

Have seeds/cities.yml:

id,name
1,berlin
2,paris

Have seeds/countries.yml:

id,name
1,germany
2,france

Configure one seed to use managed table (by default) and another one to use external (by setting location_root).

seeds:
  foo:
    cities:
    countries:
      location_root: "{{ env_var('LOCATION') }}"

Run dbt seed twice:

dbt seed && dbt seed

Observe the content in both tables.

select * from foo.cities

id	name
1	berlin
2	paris

select * from foo.countries

id	name
1	germany
1	germany
2	france
2	france

Expected behavior

The behavior should be consistent regardless of the table type. The data should be reloaded - ie there should be no duplicates.

System information

The output of dbt --version:

Core:
  - installed: 1.1.0
  - latest:    1.1.1 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark:      1.1.0 - Up to date!
  - databricks: 1.1.0 - Up to date!

The operating system you're using: OS X Big Sur

The output of python --version: Python 3.8.10

unable to connect to Databricks 9.1 LTS

Running dbt 1.0.1 with databricks 1.0.1 and spark 1.0.0 plugins.

When running "dbt debug" connection fails giving the following errors

Runtime Error
Database Error
failed to connect

profile.yml contains

projectname:
outputs:
dev:
type: databricks
schema: curated
host: https://adb-XXXXXXX.azuredatabricks/net/
http_path: /sql/protocolv1/o/XXXXXXXXXXXXXXXXXXXXX/XXXXXXXXXXX
token: dapiXXXXXXXXXXXXXXXXXXXXXXX
target: dev

Able to connect to databricks via databricks-connect with the same credentials

The following is from the dbt.log

19:00:40.648080 [debug] [MainThread]: Acquiring new databricks connection "debug"
19:00:40.648903 [debug] [MainThread]: Using databricks connection "debug"
19:00:40.649092 [debug] [MainThread]: On debug: select 1 as id
19:00:40.649255 [debug] [MainThread]: Opening a new connection, currently in state init
19:00:52.823291 [debug] [MainThread]: Databricks adapter: Error while running:
select 1 as id
19:00:52.824351 [debug] [MainThread]: Databricks adapter: Database Error
failed to connect
19:00:52.826960 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f88da4bddc0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f88da4d54c0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f88da4d5bb0>]}

Steps To Reproduce

From command line run "dbt debug" within the project directory. Profile and Project yml files pass. Then get error message on failure to connect to Azure Databricks.

Expected behavior

Connection to Azure Databricks

System information

installed version: 1.0.1
latest version: 1.0.1

Up to date!

Plugins:

databricks: 1.0.1
spark: 1.0.0

The operating system you're using:
Windows 10 Build 19042.1526 Version 20H2 with WSL 1 running Ubuntu 20.04.

The output of python --version:
Python 3.8.10

Support for azure authentication mechanisms.

Describe the feature

Beyond simple PAT tokens supporting some of Azure AAD based authentication would great.

Additional context

dbt-sqlserver is a good example of how to get vaild auth tokens, and databricks-sql-connector already supports taking auth_token arguments.

Who will this benefit?

Users trying to use AAD based SSO or other features.

Can't generate dbt docs

Describe the bug

We use the dbt-databricks adapter and can successfully read our models from dbt and also write them. However, when we call dbt docs generate we get the following error: Expected only one database in get_catalog, found.

To be on the safe side, I took another look at the source.yml. Unfortunately, only the schema is set there and I don't know why dbt docs generate shouldn't work.

Steps To Reproduce

Try to generate the dbt docs for a databricks destination / source

Expected behavior

Docs should be generated

Screenshots and log output

System information

The output of dbt --version:

╰─$ dbt --version    
installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - databricks: 1.0.0
  - spark: 1.0.0

The operating system you're using:

The output of python --version:

╰─$ python --version              
Python 3.8.12

'append_new_columns' flag causes failures for 'append' incremental models

Describe the bug

When using the append_new_columns flag for incremental models with an append incremental strategy, the first dbt run adding new columns will fail. However, the next run and all runs after will succeed.
The reason for this appears to be that new columns are successfully added, but the first insert statement after the table has been altered still uses the old columns (i.e., 3 instead of 4 columns).

Note that this does not happen for the merge incremental strategy.

Steps To Reproduce

1.) Configure an incremental model to with the append incremental_strategy and append_new_columns for on_schema_change, like so:

{{
    config(
        materialized = 'incremental'
        , incremental_strategy = 'append'
        , on_schema_change = 'append_new_columns'
    )
}}

2.) Execute an initial (full refresh) build of the model
3.) Add a new column
4.) Execute an incremental build of the model (it should fail)
5.) Execute an incremental build of the model again (it should succeed)

Expected behavior

This should succeed the first run after a schema change (new column), rather than fail the first time.

Screenshots and log output

Here's an example of what the error looks like:
Cannot write to '[REDACTED].incremental_test', not enough data columns; target table has 4 column(s) but the inserted data has 3 column(s)

Example of databricks query log showing the insert failure after the alter table, then succeeding on the next insert statement:

System information

The output of dbt --version:

dbtenv info:  Using dbt-databricks==1.1.0 (set by dbt_project.yml).
Core:
  - installed: 1.1.1
  - latest:    1.2.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.1.0 - Update available!
  - spark:      1.1.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

(I realize this isn't the most up-to-date version, but I didn't see any closed issues related to this)

The operating system you're using:
macOS Monterey v12.0.1

The output of python --version:
Python 3.8.13

Add support for identity columns

Describe the feature

Now that identity columns are GA, it would be great to be able to specify the identity column as part of the model configuration.

Describe alternatives you've considered

monotonically_increasing_id()
row_number()
rank OVER
hash()
md5()

Who will this benefit?

Every dbt-databricks user who wants to generate surrogate keys.

Are you interested in contributing this feature?

Why not 😊

Support for Databricks CATALOG as a DATABASE in DBT compilations

Describe the feature

Unity Catalog for Databricks now supports a three-level namespace via CATALOG in analogy to how most SQL dialects support DATABASES: https://docs.databricks.com/data-governance/unity-catalog/queries.html#three-level-namespace-notation

It would be excellent if DBT was able to treat DATABASE as CATALOG when compiling SQL for databricks. This is done in analogy to custom databases in DBT. In fact, the BigQuery connector seems to already support a similar configuration via project configuration: https://docs.getdbt.com/docs/building-a-dbt-project/building-models/using-custom-databases

Describe alternatives you've considered

We would have to not use DATABASE or custom databases at all with Databricks, or we would have to do so using a pre-build hook that issues a USE CATALOG statement.

Additional context

PRs #89 and #94 are already working on this, I added this in support of this work!

Who will this benefit?

Anyone who wants to use custom "databases" feature of DBT in databricks.

Are you interested in contributing this feature?

Would love to help any way I can.

Support for partition overwrite with Delta

Describe the feature

This was reported in dbt-labs/dbt-spark#155 but I think you might be more interested in resolving the issue

Currently, the insert_overwrite strategy throws an error if file format is set to delta because it doesn't support dynamic partition overwrite

Delta already supports partitions overwrite but it seems that dbt adapter implementation is not making use of it.

Describe alternatives you've considered

I could not find a way to atomically overwrite a partition.

Who will this benefit?

Everyone using dbt and Delta.

Ability to set TBLPROPERTIES from model metadata

Describe the feature

TBLPROPERTIES are key-value metadata on tables in Databricks. This feature adds the ability to set these properties from model metadata.

Describe alternatives you've considered

An alternative is run SQL to do this.

Who will this benefit?

Anyone who likes to track metadata e.g. owners, teams, etc. on tables

Are you interested in contributing this feature?

Sure!

dbt snapshot doesn't work for a query

Describe the bug

when you define dbt snapshot using query. it produces the error as "Create or Replace Table DB_name.snapshot_name" at Table section. Please find attached screenshot for it.

Steps To Reproduce

Define any simple snapshot using query.
then execute command : dbt snapshot --select SNAPSHOT_NAME

Expected behavior

we expect that SNAPSHOT should have been created in DBKS Delta table

Screenshots and log output

System information

The output of dbt --version:
dbt version 1..0.4

dbt-databricks 1.0.1

```

The operating system you're using:
Windows

The output of python --version:
Python 3.9.11

Additional context

Add any other context about the problem here.

Support to set tblproperties when creating tables and views.

Describe the feature

In the models config, set tblpropeties as follows.

{{ config(
  materialized='incremental', 
  tblproperties={
    'delta.autoOptimize.optimizeWrite' : 'true',
    'delta.autoOptimize.autoCompact' : 'true'
  }
) }}

There is a similar issue in spark-dbt issues, but it doesn't seem to be implemented.

dbt-labs/dbt-spark#190

Describe alternatives you've considered

There is the way to execute Alter Table statements with dbt post-hooks

Who will this benefit?

All users who create tables or views in Databricks.

As described in the following Databricks document, tblproperties settings such as delta.autoOptimize.autoCompact are often set at table creation time.

https://docs.databricks.com/delta/optimizations/auto-optimize.html

ResourceWarning: Unclosed SSLSocket

Describe the bug

Finishing commands like dbt run result in a message with an unclosed socket

Steps To Reproduce

Run a command like dbt run

Expected behavior

Close the socket?

Screenshots and log output

ResourceWarning: unclosed <ssl.SSLSocket fd=9, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.1.0.237', 47582), raddr=('40.74.30.80', 443)>

System information

The output of dbt --version:

installed version: 1.0.0
   latest version: 1.0.0

Up to date!

Plugins:
  - databricks: 1.0.0
  - postgres: 1.0.0
  - redshift: 1.0.0
  - bigquery: 1.0.0
  - snowflake: 1.0.0

The operating system you're using:
Ubuntu 21.10 in WSL2 (Windows 10)

The output of python --version:
Python 3.9.7

Additional context

Add any other context about the problem here.

[bug] dbt-databricks v1.0.0 did not update dba-spark ('str' object has no attribute '_message')

Describe the bug

After upgrading to v1.0.0 of dbt, I re-installed dbt-databricks. Now I get this error:

Runtime Error
'str' object has no attribute '_message'

Steps To Reproduce

dbt debug -t http

Expected behavior

[OK]

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

 % dbt debug -t http
01:07:05  Running with dbt=1.0.0
dbt version: 1.0.0
python version: 3.9.9
python path: /Users/nauto/Developer/dbt/dbt-env/bin/python3
os info: macOS-12.0.1-x86_64-i386-64bit
Using profiles.yml file at /Users/nauto/.dbt/profiles.yml
Using dbt_project.yml file at /Users/nauto/Developer/dbt/dbt_project.yml

Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
 - git [OK found]

Connection:
  host: nauto-biz-prod-us.cloud.databricks.com
  port: 443
  cluster: 1004-233546-lager607
  endpoint: None
  schema: default
  organization: 0
  Connection test: [ERROR]

1 check failed:
dbt was unable to connect to the specified database.
The database returned the following error:

  >Runtime Error
  'str' object has no attribute '_message'

Check your database credentials and try again. For more information, visit:
https://docs.getdbt.com/docs/configure-your-profile

System information

The output of dbt --version:

dbt --version
installed version: 1.0.0
   latest version: 1.0.0

Up to date!

Plugins:
  - databricks: 1.0.0
  - spark: 0.21.0

AHA. That was it. I uninstalled dbt-spark and re-installed it back with v1, and it works.

dot-core complained about outdated modules but dbt-databricks did not

QUALIFY clause does not work with incremental models

Describe the bug

When using QUALIFY in an incremental model, the view dbt creates does not exist. This prevents dbt from doing the merge statement.

Steps To Reproduce

test.sql

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'id',
) }}

with test as (
  select 1 as id
)
select *
from test
qualify row_number() over (order BY id) = 1

log results:

    create temporary view test__dbt_tmp as
    

with test as (
  select 1 as id
)
select *
from test
qualify row_number() over (order BY id) = 1

  
17:50:20.156423 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
17:50:21.114806 [debug] [Thread-1  ]: SQL status: OK in 0.96 seconds
17:50:22.030437 [debug] [Thread-1  ]: Writing runtime SQL for node "model.lakehouse.test"
17:50:22.031221 [debug] [Thread-1  ]: Spark adapter: NotImplemented: add_begin_query
17:50:22.031552 [debug] [Thread-1  ]: Using databricks connection "model.lakehouse.test"
17:50:22.031835 [debug] [Thread-1  ]: On model.lakehouse.test: /* {"app": "dbt", "dbt_version": "1.0.4", "profile_name": "limebi", "target_name": "stg", "node_id": "model.lakehouse.test"} */

    
  
  
  
  
    merge into silver_limeade_platform.test as DBT_INTERNAL_DEST
      using test__dbt_tmp as DBT_INTERNAL_SOURCE
      
    
        
            
            
        

        on 
                DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
            
    
  
      when matched then update set *
      when not matched then insert *

17:50:22.578607 [debug] [Thread-1  ]: Databricks adapter: Error while running:
/* {"app": "dbt", "dbt_version": "1.0.4", "profile_name": "limebi", "target_name": "stg", "node_id": "model.lakehouse.test"} */

    
  
  
  
  
    merge into silver_limeade_platform.test as DBT_INTERNAL_DEST
      using test__dbt_tmp as DBT_INTERNAL_SOURCE
      
    
        
            
            
        

        on 
                DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
            
    
  
      when matched then update set *
      when not matched then insert *

17:50:22.579077 [debug] [Thread-1  ]: Databricks adapter: Query execution failed. State: ERROR_STATE; Error code: 0; SQLSTATE: None; Error message: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: test__dbt_tmp; line 9 pos 12
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:47)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:418)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:245)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties(ThriftLocalProperties.scala:123)
	at org.apache.spark.sql.hive.thriftserver.ThriftLocalProperties.withLocalProperties$(ThriftLocalProperties.scala:48)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:52)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:223)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:208)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:257)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.AnalysisException: Table or view not found: test__dbt_tmp; line 9 pos 12
	at org.apache.spark.sql.AnalysisException.copy(AnalysisException.scala:71)
	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:410)
	... 16 more

Expected behavior

Expect view to be created and queryable to execute merge statement.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

The output of dbt --version:
1.04
The operating system you're using:
WSL
The output of python --version:
Python 3.9.11

Additional context

Add any other context about the problem here.

`dbt run` fails when the `location_root` is updated

Describe the bug

dbt run fails for models materialized as tables when the location_root is updated.

Steps To Reproduce

Define model_a as materialized as a table
Define location_root for model (i.e s3://my-bucket/location1)
Run model
Update model's location_root to a different path (i.e s3://my-bucket/location2)
Run model

Error is thrown

The location of the existing table hive_metastore.db.model_a is s3://my-bucket/location1/modela. It doesn't match the specified location s3://my-bucket/location2/modela.

Expected behavior

The storage location is updated without an error. This is in-fact the behaviour when the model is defined with an incremental materialization and run with the full-refresh flag.
The incremental materialization is working because a drop table if exists command runs before the create or replace table as, which is not the case for the table materialization

System information

Core:
  - installed: 1.1.1
  - latest:    1.1.1 - Up to date!

Plugins:
  - databricks: 1.1.0 - Update available!
  - spark:      1.1.0 - Up to date!

The operating system you're using: macOS Monterrey - 12.4

The output of python --version: Python 3.9.12

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the communities progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

migrate necessary cross-db macros into adapter and ensure they're tested accordingly
remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
add new basic tests BaseDocsGenerate and BaseDocsGenReferences
consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case, you'll also might want to:

implement method and tests for connection retry logic

upgrade to support dbt-core v1.3.0

Background

The latest release cut for 1.3.0, dbt-core==1.3.0rc2 was published on October 3, 2022 (PyPI | Github). We are targeting releasing the official cut of 1.3.0 in time for the week of October 16 (in time for Coalesce conference).

We're trying to establish a following precedent w.r.t. minor versions:
Partner adapter maintainers release their adapter's minor version within four weeks of the initial RC being released. Given the delay on our side in notifying you, we'd like to set a target date of November 7 (four weeks from today) for maintainers to release their minor version

Timeframe	Date (intended)	Date (Actual)	Event
D - 3 weeks	Sep 21	Oct 10	dbt Labs informs maintainers of upcoming minor release
D - 2 weeks	Sep 28	Sep 28	core 1.3 RC is released
Day D	October 12	Oct 12	core 1.3 official is published
D + 2 weeks	October 26	Nov 7	dbt-adapter 1.3 is published

How to upgrade

dbt-labs/dbt-core#6011 is an open discussion with more detailed information, and dbt-labs/dbt-core#6040 is for keeping track of the community's progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

Python Models (if applicable)
Incremental Materialization: cleanup and standardization
More functional adapter tests to inherit

Would be nice if documentation from schema.yml propagated to Databricks

Describe the feature

When I provide a column description in schema.yml, it would be nice if that populated the column comment in Date Explorer.

Describe alternatives you've considered

Copy-pasting? I'm new enough to both products that I don't know if there is some other way I should be getting documentation to show up in Data Explorer.

Who will this benefit?

Will increase the likelihood that tables are actually documented when viewed in Data Explorer (again, assuming I'm not missing some easy path outside of this library to accomplish it)

Are you interested in contributing this feature?

Sure, if someone can point me down the path.

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the communities progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

migrate necessary cross-db macros into adapter and ensure they're tested accordingly
remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
add new basic tests BaseDocsGenerate and BaseDocsGenReferences
consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case, you'll also might want to:

implement method and tests for connection retry logic

upgrade to support dbt-core v1.2.0

We've just published the release cut of dbt-core 1.2.0, dbt-core 1.2.0rc1 (PyPI | GitHub release notes).

dbt-labs/dbt-core#5468 is an open discussion with more detailed information, and dbt-labs/dbt-core#5474 is for keeping track of the communities progress on releasing 1.2.0

Below is a checklist of work that would enable a successful 1.2.0 release of your adapter.

migrate necessary cross-db macros into adapter and ensure they're tested accordingly
remove any copy-and-pasted materialization (if your adapter inherits from another adapter)
add new basic tests BaseDocsGenerate and BaseDocsGenReferences
consider checking and testing support for Python 3.10

dbt-labs/dbt-core#5432 might make it into the second release cut in the next week, in which case, you'll also might want to:

implement method and tests for connection retry logic

Select SQL endpoint at runtime via model configuration

Describe the feature

Databricks SQL has endpoints that are t-shirt sized, similar to Snowflake warehouses. Models with a lot of rows need larger endpoints, but these would be overkill for smaller models. When using the Snowflake adapter, it is easy to right-size the warehouse via configuration; however in the Databricks adapter, the endpoint is selected in the profile.

This feature would bring the Databricks adapter to parity with Snowflake, allowing the endpoint to be set via configuration and override what is in the profile.

Describe alternatives you've considered

The workaround is to use env vars in the profile and make multiple invocations of dbt. Large models can be tagged as such and selected via normal model selection mechanisms. But this could get quite sticky depending on the topology of the DAG, and it may not be possible to know a priori how many invocations you would need to cover the whole DAG.

Who will this benefit?

This will benefit anybody running Databricks SQL who has sufficient number and diversity of models such that some are much larger than others and would be more efficiently run on a larger endpoint.

Are you interested in contributing this feature?

Yes, I would likely just need some advice on approach to get started and occasional help if I get stuck.

	[testenv:integration-databricks-cluster]
	basepython = python3
	commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_cluster -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	/bin/bash -c '{envpython} -m pytest -v -m profile_databricks_cluster -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	passenv = DBT_* PYTEST_ADDOPTS
	deps =
	-r{toxinidir}/dev-requirements.txt
	-r{toxinidir}/requirements.txt

	[testenv:integration-databricks-uc-cluster]
	basepython = python3
	commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_uc_cluster -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	/bin/bash -c '{envpython} -m pytest -v -m profile_databricks_uc_cluster -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	passenv = DBT_* PYTEST_ADDOPTS
	deps =
	-r{toxinidir}/dev-requirements.txt
	-r{toxinidir}/requirements.txt

	[testenv:integration-databricks-sql-endpoint]
	basepython = python3
	commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_sql_endpoint -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	/bin/bash -c '{envpython} -m pytest -v -m profile_databricks_sql_endpoint -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	passenv = DBT_* PYTEST_ADDOPTS
	deps =
	-r{toxinidir}/dev-requirements.txt
	-r{toxinidir}/requirements.txt

	[testenv:integration-databricks-uc-sql-endpoint]
	basepython = python3
	commands = /bin/bash -c '{envpython} -m pytest -v --profile databricks_uc_sql_endpoint -n4 tests/functional/adapter/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	/bin/bash -c '{envpython} -m pytest -v -m profile_databricks_uc_sql_endpoint -n4 tests/integration/* {posargs}; ret=$?; [ $ret = 5 ] && exit 0 \|\| exit $ret'
	passenv = DBT_* PYTEST_ADDOPTS
	deps =
	-r{toxinidir}/dev-requirements.txt
	-r{toxinidir}/requirements.txt

	try:
	# The catalog for `show table extended` needs to match the current catalog.
	with self._catalog(schema_relation.database):
	results = self.execute_macro(LIST_RELATIONS_MACRO_NAME, kwargs=kwargs)
	except dbt.exceptions.RuntimeException as e:
	errmsg = getattr(e, "msg", "")
	if f"Database '{schema_relation}' not found" in errmsg:
	return []
	else:
	description = "Error while retrieving information about"
	logger.debug(f"{description} {schema_relation}: {e.msg}")
	return []