dbt-athena / dbt-athena Goto Github PK

View Code? Open in Web Editor NEW

193.0 9.0 84.0 1.08 MB

The athena adapter plugin for dbt (https://getdbt.com)

Home Page: https://dbt-athena.github.io

License: Apache License 2.0

Python 99.62% Makefile 0.38%

athena dbt dbt-athena dbt-athena-community glue-catalog s3 iceberg

dbt-athena's Issues

Vacuum with iceberg not supported

When running a post-hook with vacuum in iceberg table, I'm getting this error:

botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 3:9: mismatched input 'vacuum'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMENT', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DENY', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'MERGE', 'PREPARE', 'REFRESH', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'TRUNCATE', 'UNLOAD', 'UPDATE', 'USE', 'USING', <query>

~Having a look at the error seems that the issue is on boto3 side~~
Running this

cursor = connect(
    work_group='athena_v3',
    s3_staging_dir="s3://my_bucket/py_athena/",).cursor()
data = cursor.execute("vacuum silver.iceberg_increment")
print(cursor)

works from the same project of dbt-athena, so there is some extra validation, that is preventing us to use vacuuming.

Invalid incremental strategy provided: None

I noticed this in a couple of my projects already, after switching to dbt 1.3.0 and 1.3.1

Given a simple model with this config block:

{{
    config(materialized='incremental')
}}

When I deploy this model using dbt, it throws:

23:09:13.423010 [error] [MainThread]: Compilation Error in model event_login (models/silver/content/XXX.sql)
23:09:13.423138 [error] [MainThread]:   Invalid incremental strategy provided: None
23:09:13.423268 [error] [MainThread]:       Expected one of: 'append', 'insert_overwrite'
23:09:13.423403 [error] [MainThread]:   
23:09:13.423529 [error] [MainThread]:   > in macro validate_get_incremental_strategy (macros/materializations/models/incremental/helpers.sql)
23:09:13.423653 [error] [MainThread]:   > called by macro materialization_incremental_athena (macros/materializations/models/incremental/incremental.sql)
23:09:13.423777 [error] [MainThread]:   > called by model event_login (models/silver/content/XXX.sql)

I can reproduce in both the Tomme athena adapter and this new dbt-athena adapter, when I'm using dbt 1.3.1.
Therefore, it seems like something that was introduced with dbt 1.3? When I switch to dbt-core==1.2.1 it works fine. However, I can't find something related in the dbt 1.3.0 or 1.3.1 release changelog...

It's weird because we set insert_overwrite as a default here:

dbt-athena/dbt/include/athena/macros/materializations/models/incremental/incremental.sql

Lines 11 to 12 in 91f793f

 {% set raw_strategy = config.get('incremental_strategy', default='insert_overwrite') %} 

 {% set strategy = validate_get_incremental_strategy(raw_strategy) %}

And then we call validate_get_incremental_strategy which throws the error.

When I debug the incremental.sql file:

{% set raw_strategy = config.get('incremental_strategy', default='insert_overwrite') %}
{{ log("RAW STRATEGY: " ~ raw_strategy, True) }}
{% set strategy = validate_get_incremental_strategy(raw_strategy) %}

Then the console prints RAW STRATEGY: None when I run dbt. Somehow the default is not being applied since dbt 1.3?!

The quick fix is to add an explicit incremental_strategy='insert_overwrite' in the model config block. However, I'd like to highlight this issue because the default stopped working in this macro... 🙏

Bump to dbt-core 1.3.0

Dynamic `config.external_location` setting does not play well with `state:modified`

Summary

The external_location config setting is respected during state comparison when running dbt with --state ... -s state:modified option.
This is troublesome in CI scenarios where we compare a development branch to the manifest.json of the production target.
As a consequence, dbt considers all models modified which make use of the external_location setting in a target-specific manner.

Expected behavior

The model should not be considered as being changed, if only the dynamically set external_location differs.

We use many other target-specific settings, like, e.g. custom schemas or databases and all of that play nicely with state:modified. So, the observed behavior is unexpected in my eyes.

How to reproduce

Have at least 2 targets defined, in this example we use dev and prod.
Set your model's config to something like

{{
    config(
        materialized='table'
        , external_location='s3://whatever/' ~ target.name
    )
}}

Compile the project for one of the two targets: dbt compile --target prod
Check which models dbt considers changed when using the other target: dbt ls --state target -s state:modified --target dev

Then this model will be shown as modified.

Use case

For our PII data we make use of the external_location setting in order to ensure that this data is written to a different S3 bucket, so we can make use of a bucket-specific access control policy.

A typical model config then looks as follows:

{{
    config(
        materialized='table'
        , external_location=get_pii_external_location_path()
        , tags=['daily', 'pii']
    )
}}

And the corresponding get_pii_external_location_path() is defined as:

{%- macro get_pii_external_location_path() -%}
    {#
        This macro is used for overriding the `external_location` config setting
        of PII models.
        Each model in the PII directory/schema must make use of it, in order
        to ensure that the model's data will get written to the PII S3 bucket.
    -#}

    {{ 's3://aws-athena-tables-data-pii/dbt/' ~ target.name ~ '/' ~ env_var('DBT_SCHEMA_NAME', this.schema) ~ '/' ~ this.name ~ '/' ~ run_started_at.isoformat() ~ '/' ~ invocation_id }}

{%- endmacro -%}

So we always use the same designated PII bucket but tweak the paths depending on the target.name and an environment variable DBT_SCHEMA_NAME, so concurrent CI runs don't clash.

If we now compile the project for --target prod and then compare what would be run in --target dev, we get all the PII models detected as changed:

dbt ls --state ./target -s state:modified --target dev

Publish new version on tag/release

As maintainer I want to trigger the publishing on tag creation.

Notes

the version in the setup.py needs to be aligned with the tag

Compatibility with Athena Engine v3

👋🏻 Hello dbt-athena squad

For now the adapter uses the version 2 of Athena Engine, according to README. On 2022.10.13, Athena release the V3 of their Engine, reducing the gap between Athena and Trino features.

I don't know what we want to do about this :

Does the adapter only support the engine v2 ?
Since we are responsible of the SQL we are creating and we can configure the workgroup, is the engine version really limited in the adapter ?

FYI there are the breaking changes

Bump dbt-core to 1.4.0

Use latest version of dbt-core 1.4.0.
This change imply to bump the minor version of the adapter too.

Add Pull request template

What

Add Pull request template, the template can follow this:

Description

Brief description on what you want to fix or tackle with the PR

Models used to test - Optional

Small section including the models that you use to test.

Checklist

You followed our contributing section
You tested your changes
....

Issue on consecutive run on incremental model

The first run works, when I run for the second time the model get stuck. No error raised.

{{ config(
    materialized='incremental',
    format='parquet',
    incremental_strategy='insert_overwrite',
    partitioned_by=['report_date']
) }}

SELECT
    999999 AS cost,
    cast(t.ts AS date) as report_date
FROM unnest(sequence(cast('2022-10-01' AS date), cast('2022-12-04' AS date), INTERVAL '1' Day)) AS t(ts)

Getting random "mismatched input 'cascade'." during dbt test

First of all: Thank you for your amazing contribution and bringing Athena engine v3 features to the DBT community!

I am currently tracking a bug that seems to occur non-deterministically. Essentially, during dbt test, I get syntax errors like these:

botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:100: mismatched input 'cascade'. Expecting: '.', <EOF

It happens during the initial cleanup phase when DBT tries to delete existing test tables. To me, it looks like sometimes, the macro drop_relation from this repo is not found and DBT falls back to using default__drop_relation, which does a

drop {{ relation.type }} if exists {{ relation }} cascade

Below is the relevant log output of such an error:

-- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "prelive", "node_id": "test.data_importer.not_null_bestaende_cleaned_lagerbestand.d2103837db"} */
12:21:35.254829 [debug] [Thread-7 (]: Opening a new connection, currently in state closed
Failed to execute query.
Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pyathena/common.py", line 494, in _execute
    query_id = retry_api_call(
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/pyathena/util.py", line 68, in retry_api_call
    return retry(func, *args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 406, in __call__
    do = self.iter(retry_state=retry_state)
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 351, in iter
    return fut.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 409, in __call__
    result = fn(*args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:102: mismatched input 'cascade'. Expecting: '.', <EOF>
12:21:35.631219 [debug] [Thread-8 (]: Athena adapter: Error running SQL: drop table if exists dbt_test__audit.dbt_expectations_expect_column_93d8b305ae3a3af2a182dc0d9423d314 cascade
-- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "prelive", "node_id": "test.data_importer.dbt_expectations_expect_column_value_lengths_to_equal_rueres_wm_ek__wm_ek_IS_NOT_NULL__5.bc16ec6d3f"} */
12:21:35.631863 [debug] [Thread-8 (]: Athena adapter: Error running SQL: macro drop_relation
12:21:35.632307 [debug] [Thread-8 (]: finished collecting timing info
12:21:35.632645 [debug] [Thread-8 (]: On test.data_importer.dbt_expectations_expect_column_value_lengths_to_equal_rueres_wm_ek__wm_ek_IS_NOT_NULL__5.bc16ec6d3f: Close
12:21:35.633406 [debug] [Thread-8 (]: Runtime Error in test dbt_expectations_expect_column_value_lengths_to_equal_rueres_wm_ek__wm_ek_IS_NOT_NULL__5 (models/host/schema.yml)

I had some success on my local machine by adding a simple Jinja comment in front of this line.

Is it possible that there is a race condition when registering the macro that leads to spurious errors when the macro is not yet registered when test tables are being deleted? I am not very familiar with the underlying process of how DBT adapters work together with DBT core, just guessing here.

dbt_valid_to date should be set to udpated_at column date value

The dbt_valid_to date should be set to udpated_at column date value not the current timestamp.

Per the documentation: https://docs.getdbt.com/docs/build/snapshots#snapshot-meta-fields

dbt-athena/dbt/include/athena/macros/materializations/snapshots/snapshot.sql

Line 213 in 5598286

THEN {{ current_timestamp() }}

WHEN dbt_valid_to=CAST('9999-01-01' as timestamp) AND is_current_record=True THEN {{ current_timestamp() }}

This is causing overlaps in our valid date ranges.
Here are the meta data columns from one id:
id updated_at dbt_valid_from dbt_valid_to is_current_record
47a555a1-1455-481a-8937-d3cc03f02362 2021-09-02 13:58:09.961000 2021-09-02 13:58:09.961000 2023-02-14 20:05:44.433000 false
47a555a1-1455-481a-8937-d3cc03f02362 2021-09-03 13:57:59.941000 2021-09-03 13:57:59.941000 9999-01-01 00:00:00.000000 true

This was implemented with the following pull request:
#111

Final location uses backslash in windows os

Hi,

After the model generated by DBT Athena,the table location was looks like this

s3://my-test-bucket/dbt/resident_communication\ticket_last_comment\6975c131-96c7-4b48-baf3-3b18acb6a431

and i think it should be like this

s3://my-test-bucket/dbt/resident_communication/ticket_last_comment/6975c131-96c7-4b48-baf3-3b18acb6a431

I think this is cuased by Windows OS path behavior, below is my profile.
s3_staging_dir: s3://my-test-bucket/dbt/
s3_data_dir: s3://my-test-bucket/dbt/
s3_data_naming: schema_table_unique

Is there a wrong setting with me? thank you very much.

Support Athena Spark

AWS recently released Athena Spark.
Being able to use it would be an awesome addition

Issues with tmp tables and incremental load

tmp table is not cleaned when using v1.3.3 and incremental mode. After model run, the table is dropped but the s3 path is not pruned.

Iceberg table support

Hi,

I'm looking at the Iceberg tables support plan:

As iceberg doesn't support CTA, the implementation do the following:

create tmp table as parquet
drop the old table, if exist
create iceberg table based on tmp table definition, only metadata
insert into from tmp
drop tmp table

I think that the tmp table creation is going to kill performance (I have 10s of billions of rows).

How about this approach:

create the tmp table with the model's SELECT with "LIMIT 0" appended
create iceberg table based on tmp table definition, only metadata
insert into the iceberg table with the model's SQL (don't append the LIMIT this time)
drop the tmp table

This would save one copying of all rows (hours of processing in my case).

Regards,

Support on schema change for iceberg incremental

What

Add support on_schema_change for iceberg incremental.
Append mode seems supported, we need just to cover the merge incremental behavior.

Partition Limitations

Quick question, wasn't sure where else to post this. The readme mentions that the plugin only supports the creation of 100 partitions. Is that 100 partition values (i.e. partitioning by date, you can only create 100 days worth of partitions) or partition fields (i.e. I can partition by 100 fields, field1 -> field100)?

iceberg merge: configure if nulls should be updated

Would come handy to be able to customize further the behaviour of the merge approach in order to be able to decide whether nulls should be updated during the merge step

NOTE: This behaviour is custom and diverges from the standard dbt macro so it might not really fit here.

As a footprint I will anyway link a reference implementation of such a feature:

https://github.com/ringier-data/dbt-athena/tree/feature/iceberg_merge_update_nulls

dbt seed raise error mismatched input 's' when there is 's in csv file

Hi team,

I initially raised this issue to dbt-core and it seems fine with dbt-snowflake.

Would you please take a look? the dbt seed will fail if there is a ' character in the seed file

details:

dbt-labs/dbt-core#6306

Thanks

prehook sql throw exception

I wanted to exceute a pre_hook query like this
{{ config(materialized='view',pre_hook="drop table if exists my_first_dbt_model") }}
But I found that in athena , the sql become
-- /* {"app": "dbt", "dbt_version": "1.2.1", "profile_name": "local", "target_name": "dev", "node_id": "model.dbtest.my_first_dbt_model"} */ drop table if exists my_first_dbt_model
since all the sql stay in one line , the sql had been miss as a comment , It failed

Lakeformation integration

Add lakeformation integration to the adapter https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-athena-support-lake-formation-fine-grained-access-control/

Maybe as an idea, if lakeformation=true, the adapter take care of enabling lakeformation, and register the buckets used by the adapter on it.

Add dependabot

Add dependabot to the repository:

e.g. https://github.com/aws-samples/dbt-glue/blob/main/.github/dependabot.yml

PRs name should follow a convention and be enforced

PR names should follow conventions and be enforced by the ci.

The community should agree on a convention and implement the necessary steps to enforce it.

Personally, I like conventionalcommits format.

See aws-cdk for reference.

What do you think?

upgrade to support dbt-core v1.4.0

Background

The latest version of dbt Core,dbt-core==1.4.0, was published on January 25, 2023 (PyPI | Github). In fact, a patch, dbt-core==1.4.1 (PyPI | Github), was also released on the same day.

How to upgrade

dbt-labs/dbt-core#6624 is an open discussion with more detailed information. If you have questions, please put them there! dbt-labs/dbt-core#6849 is for keeping track of the community's progress on releasing 1.4.0

The above linked guide has more information, but below is a high-level checklist of work that would enable a successful 1.4.0 release of your adapter.

support Python 3.11 (only if your adapter's dependencies allow)
Consolidate timestamp functions & macros
Replace deprecated exception functions
Add support for more tests

the next minor release: 1.5.0

FYI, dbt-core==1.5.0 is expected to be released at the end of April. Please plan on allocating a more effort to upgrade support compared to previous minor versions. Expect to hear more in the middle of April.

At a high-level expect much greater adapter test coverage (a very good thing!), and some likely heaving renaming and restructuring as the API-ification of dbt-core is now well underway. See https://github.com/dbt-labs/dbt-core/milestone/82 for more information.

Issues when using post_hooks with incremental materialization

When using post_hooks to an incremental model seems that we are having some issues.

{{ config(
    materialized='incremental',
    format='iceberg',
    incremental_strategy='merge',
    unique_key='user_id',
    partitioned_by=['bucket(5, user_id)', 'status'],
    schema='silver',
    table_properties={
    	'optimize_rewrite_delete_file_threshold': '2'
    },
    post_hook = [
        "ALTER TABLE {{this}} SET TBLPROPERTIES ('vacuum_max_snapshot_age_seconds'='86400');"
    ]
) }}

with data as (
SELECT
	'a' AS user_id,
	'pi_updated' AS name,
	'inactive' AS status,
	17.89 AS cost,
	1 AS quantity,
	100000000 AS quantity_big,
	current_date AS my_date,
	cast(REPLACE(cast(current_timestamp as varchar), ' UTC', '') as timestamp(3)) as now
UNION ALL
SELECT
	'b' AS user_id,
	'beta_updated_yet_another one' AS name,
	'inactive' AS status,
	3 AS cost,
	50 AS quantity,
	100000000 AS quantity_big,
	current_date AS my_date,
	cast(REPLACE(cast(current_timestamp as varchar), ' UTC', '') as timestamp(3)) as now
)

select *
from data

produce this error

Completed with 1 error and 0 warnings:
08:07:30  
08:07:30  Runtime Error in model iceberg_increment (models/iceberg/iceberg_increment.sql)
08:07:30    line 1:243: no viable alternative at input '<EOF>'

looking at the query produced I see this

-- /* {"app": "dbt", "dbt_version": "1.3.0", "profile_name": "athena", "target_name": "dev", "node_id": "model.lakehouse.iceberg_increment"} */ ALTER TABLE silver.iceberg_increment SET TBLPROPERTIES ('vacuum_max_snapshot_age_seconds'='86400')

somehow we have double comments.

[bug] v1.3.4: Invalid bucket name ""

Description

On dbt-athena-community==1.3.4, when running something like dbt run -s +my_model, the run fails with error messages of this kind:

11:57:16  Parameter validation failed:
11:57:16  Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

So it seems that the bucket location is set to an empty string.

When switching to dbt-athena-community==1.3.3 and keeping everything else as is, the same command succeeds without any errors.

This issue can be observed on both Athena v2 and v3.

This is my profiles.yml:

athena:
  outputs:
    dev:
      type: athena
      s3_staging_dir: s3://aws-athena-query-results-eu-central-1-9999999999999/dbt/
      s3_data_dir: s3://aws-athena-tables-data/dbt/bumblebee/dev
      s3_data_naming: schema_table_unique
      region_name: eu-central-1
      database: awsdatacatalog
      schema: dbt_dev_philipp
      work_group: whatever
      num_retries: 1
      threads: 8

Seeds containing `'` fail

Tomme/dbt-athena#109

Use glue APIs instead of information schema

Based on #4 (comment)

Implementation should be covered here northvolt/dbt-athena@49c7111

Using Glue API is better as faster and should allow multi-threading.

TODO

Use glue api instead of information schema
Use delete table from glue api - consider to do for Iceberg as well, as timeout lead to not having full deleted data. Before dropping pick the location from glue catalog.

Different table creation strategies for Iceberg

Athena engine v3 supports CTAS for Iceberg.

Since engine v2 does not support that, what about adding a table_strategy in the table materialization with the following alternatives : tmp_table and ctas.

ctas is by default for all non iceberg tables and the only possibility.
tmp_table must be set for engine v2 for Iceberg tables.
ctas or tmp_table can be chosen for engine v3 for Iceberg tables.

What do you think ?

Override `dbt-core` datatype macros with Athena-specific implementations

dbt-core provides a number of macros to support cross-database compatibility for dbt packages: https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros

A few of the datatype macros have defaults which are incorrect for Athena, notably:

float
numeric
int

Governance

First I'd like to open with that it's great that someone has taken the initiative to fork and revive this adapter under an organization. Hopefully this can sustain active maintenance and all of us avoid multiple abandoned forks.

@nicor88 requested me to submit my quoting support PR from the old repo here which I have done, this got me thinking about the governance of this project. As it stands there are no public members of this organization, there is no clear contribution guidelines or any code of conduct.

If this is to become a healthy project that people will adopt instead of their own forks I think there are some things that would be good to address:

Who owns this org?
Who are part of the org? Are all from the same company, is there a risk that this will be abandoned if you stop using Athena?
Can one become a member?
How are decisions about the future of the project made?
What is expected of a contributor? I think that clear contribution guidelines and a code of conduct would be good to have.

Separate table type from table format

👋🏻 Hello

Right now, when we use our adapter we can set format to parquet, iceberg, json ... The issue with that is that with the new implementation of create_table_as we are mixing table_type and table_format and we enforce that iceberg tables are in parquet format. This is ok right now because parquet is the best format to put with Iceberg but it would be great to have 2 parameters when defining models :

table_type : format or iceberg
format : parquet, avro, json, text

What do you think ?

User agent string grows with each call of pyathena

The root cause of the issue below is caused by pyathena.

After building a bunch of models the header gets to big so that s3 listObjects api call fails. It is caused by an user-agent header that grows with every invocation of Pyathena.

So far I just hotfixed it by pinning the version of pyathena to <2.8.0

Where the config is created in dbt-athena and pyathena

Example request after some models are built:
2022-12-27 15:18:27,965 botocore.endpoint [DEBUG] Sending http request: <AWSPreparedRequest stream_output=False, method=GET, url=https://*****.s3.eu-central-1.amazonaws.com/?prefix=datalakesensitive%2Foutput%2Fqueryresults%2Ftables%2Fbc32f394-cfc2-4e9e-b3dd-f67e09599921%2F&encoding-type=url, headers={'User-Agent': b'Boto3/1.26.37 Python/3.8.15 Linux/5.10.102.1-microsoft-standard-WSL2 Botocore/1.29.37 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 dbt-athena-community/1.3.3', 'X-Amz-Date': b'20221227T141827Z', 'X-Amz-Security-Token': b'****', 'X-Amz-Content-SHA256': b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': b'****', 'amz-sdk-invocation-id': b'****', 'amz-sdk-request': b'attempt=1'}>

#############################
Initial problem:
#############################

It is caused by a boto3 call in the cleanup_table function of the adapter.

In our project with about 300 dbt models it failed when we build the full project. When we build it partially it is OK for some selectors and for others it failed.

The boto3 call:
s3_bucket.objects.filter(Prefix=prefix).delete()

Here is the actual exception:
botocore.exceptions.ClientError: An error occurred (RequestHeaderSectionTooLarge) when calling the ListObjects operation: Your request header section exceeds the maximum allowed size.

Here the full stack trace:

[08:52:16.828249 [debug] [Thread-9  ]: Began running node model.climatepartner.stg_fpm__base_project_reservation_account
[08:52:16.839619 [info ] [Thread-9  ]: 392 of 573 START sql table model dev_datalakesensitive_fabi.stg_fpm__base_project_reservation_account  [RUN]
[08:52:16.854983 [debug] [Thread-9  ]: Acquiring new athena connection "model.climatepartner.stg_fpm__base_project_reservation_account"
[08:52:16.875331 [debug] [Thread-9  ]: Began compiling node model.climatepartner.stg_fpm__base_project_reservation_account
[08:52:16.884906 [debug] [Thread-9  ]: Compiling model.climatepartner.stg_fpm__base_project_reservation_account
[08:52:16.916737 [debug] [Thread-9  ]: Writing injected SQL for node "model.climatepartner.stg_fpm__base_project_reservation_account"
[08:52:33.177364 [debug] [Thread-9  ]: Athena adapter: Deleting table data from 's3://bucket-name/datalakesensitive/output/tables/dev_datalakesensitive_fabi/stg_fpm__base_project_reservation_account/ee9fab99-03d0-44eb-964a-47e71e0aa5d1/'
[08:52:34.331420 [debug] [Thread-9  ]: finished collecting timing info
[08:52:34.331870 [debug] [Thread-9  ]: On model.climatepartner.stg_fpm__base_project_reservation_account: Close
[08:52:34.332278 [error] [Thread-9  ]: Unhandled error while executing model.climatepartner.stg_fpm__base_project_reservation_account
An error occurred (RequestHeaderSectionTooLarge) when calling the ListObjects operation: Your request header section exceeds the maximum allowed size.
[08:52:34.332610 [debug] [Thread-9  ]: 
Traceback (most recent call last):
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/base.py", line 385, in safe_run
    result = self.compile_and_execute(manifest, ctx)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/base.py", line 338, in compile_and_execute
    result = self.run(ctx.node, manifest)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/base.py", line 429, in run
    return self.execute(compiled_node, manifest)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/run.py", line 281, in execute
    result = MacroGenerator(
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 326, in __call__
    return self.call_macro(*args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 253, in call_macro
    return macro(*args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 49, in macro
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 326, in __call__
    return self.call_macro(*args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 253, in call_macro
    return macro(*args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 22, in macro
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/adapters/athena/impl.py", line 163, in clean_up_table
    s3_bucket.objects.filter(Prefix=prefix).delete()
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/boto3/resources/collection.py", line 561, in batch_action
    return action(self, *args, **kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/boto3/resources/action.py", line 134, in __call__
    for page in parent.pages():
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/boto3/resources/collection.py", line 171, in pages
    for page in pages:
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (RequestHeaderSectionTooLarge) when calling the ListObjects operation: Your request header section exceeds the maximum allowed size.

[08:52:34.422051 [error] [Thread-9  ]: 392 of 573 ERROR creating sql table model dev_datalakesensitive_fabi.stg_fpm__base_project_reservation_account  [ERROR in 17.50s]

Add unit tests

What

We want to add some basic unit tests that run on the CI.
In case we need to mock requests, we can use moto. Some inspiration can be find at: https://github.com/laughingman7743/PyAthena/tree/master/tests

Zero Downtime options for table materialization

See this PR: Tomme/dbt-athena#95

Also using iceberg should be quit easy:

ALTER TABLE target RENAME TO target_bkp;
ALTER TABLE target_tmp RENAME TO target;

DROP target_bkp; -- only if all is fine

It's not 100% downtime, but almost.

Regarding not iceberg table we can do something like this:

if the table do not exist we use a normal cta in the right location
if the table exist we:
- crate the table via cta in a location that is unique
- call an glue api update table to use the new location and make a new table version
- drop the table create in step 1
  So pretty much we leverage a table update to add a total new data schema and a new location. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.update_table

Support for the faster incremental materialization with Iceberg

Thanks for implementing the faster Iceberg table materialization. Can you please implement the same feature for the incremental materialization as well?

we first use parquet with limit 0
do the schema evolution
use the merge from the original SQL

Please see the issue 11 for context and more details.

Support incremental materialization for iceberg

What

Support merge materialisation incremental for iceberg.

Multi-threading issues on information_schema queries

I'm switching a few projects to this dbt-athena community adapter and I notice a regression bug 🐛

Locally, I use threads: 4. I have a few databases defined in my dbt_project.yml:

models:
  test_jesse:
    silver:
      test_a:
        +schema: silver_test_a
      test_b:
        +schema: silver_test_b
      test_c:
        +schema: silver_test_c

When I run dbt --debug run I see it starts by making 3 parallel queries to INFORMATION_SCHEMA (corresponding to the 3 custom schemas)

Output of the failing `dbt --debug run`:

23:22:00.639968 [info ] [MainThread]: Found 79 models, 9 tests, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 64 sources, 0 exposures, 0 metrics
23:22:00.642584 [info ] [MainThread]:
23:22:00.642973 [debug] [MainThread]: Acquiring new athena connection "master"
23:22:00.645492 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.652730 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.654241 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.655511 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.655686 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.655861 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.656061 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.656238 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.656446 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.656774 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.660507 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.673268 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.675392 [error] [ThreadPool]: Athena adapter: Got an error when attempting to open a Athena connection due to 'credential_provider'
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/dbt/adapters/athena/connections.py", line 155, in open
    handle = AthenaConnection(
  File "/opt/homebrew/lib/python3.9/site-packages/pyathena/connection.py", line 139, in __init__
    self._client = self._session.client(
  File "/opt/homebrew/lib/python3.9/site-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
  File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 951, in create_client
    credentials = self.get_credentials()
  File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 507, in get_credentials
    self._credentials = self._components.get_component(
  File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 1112, in get_component
    del self._deferred[name]
KeyError: 'credential_provider'
23:22:00.688091 [debug] [ThreadPool]: Athena adapter: Error running SQL: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.688642 [debug] [ThreadPool]: Athena adapter: Error running SQL: macro list_schemas
23:22:00.688916 [debug] [ThreadPool]: On list_awsdatacatalog: No close available on handle
23:22:07.156437 [debug] [ThreadPool]: SQL status: OK -1 in 6.5 seconds
23:22:07.158140 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:22:07.159025 [debug] [ThreadPool]: SQL status: OK -1 in 6.5 seconds
23:22:07.159950 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:22:07.160520 [debug] [MainThread]: Connection 'master' was properly closed.
23:22:07.160681 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.160892 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.161071 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.161847 [debug] [MainThread]: Flushing usage events
23:22:07.162046 [error] [MainThread]: Encountered an error:
Runtime Error
  Runtime Error
    Database Error
      'credential_provider'

When I switch back to Tomme's adapter (dbt-athena-adapter==1.0.1), It makes three queries in parallel successfully.

Output of the failing `dbt --debug run`:

23:23:51.801549 [info ] [MainThread]: Found 79 models, 9 tests, 0 snapshots, 0 analyses, 490 macros, 0 operations, 0 seed files, 64 sources, 0 exposures, 0 metrics
23:23:51.804227 [info ] [MainThread]:
23:23:51.804618 [debug] [MainThread]: Acquiring new athena connection "master"
23:23:51.807446 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.814927 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.816278 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.817486 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817641 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817786 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817940 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:23:51.818153 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:23:51.818459 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:23:51.818715 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:51.828140 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:51.832063 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:58.460654 [debug] [ThreadPool]: SQL status: OK -1 in 6.64 seconds
23:23:58.461749 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:23:58.465697 [debug] [ThreadPool]: SQL status: OK -1 in 6.65 seconds
23:23:58.466534 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:23:58.473755 [debug] [ThreadPool]: SQL status: OK -1 in 6.66 seconds
23:23:58.474907 [debug] [ThreadPool]: On list_awsdatacatalog: Close

When I use threads: 1 or threads: 2 it works and successfully deploys the dbt project, but it starts failing with 3 or more threads.

This might be a regression? I found a similar issue Tomme/dbt-athena#41

Support on_schema_change feature on incremental model #3

Currently, dbt-athena does not support on_schema_change option.

Original work done here https://github.com/dbt-athena/dbt-athena/pull/3/files implemented by @hiro-o918

dbt init returns "No adapters available"

Hi there, I've done a pip install dbt-athena-community and I'm unable to init a new project. I'm getting this output.

23:38:46  Running with dbt=1.3.1
No adapters available. Go to https://docs.getdbt.com/docs/available-adapters`

I've tried on another machine and running the pip install pointed at the repo with no luck. I've verified that installing other adapters successfully get detected by dbt.

num_retries doesn't work with env_var

Hi,

I'm using env_var in my profile.yml file, but when trying to set the num_retires attribute with env_var it fails with the following error:

ERROR: Runtime Error
  Credentials in profile "my_profile", target "dev" invalid: '0' is not valid under any of the given schemas

To set the env var I ran this: export DBT_ATHENA_NUM_RETRIES=0

My profile.yml:

my_profile:
  outputs:
    dev:
      database: "{{ env_var('DBT_ATHENA_DB') }}"
      region_name: "{{ env_var('DBT_REGION_NAME') }}"
      s3_staging_dir: "{{ env_var('DBT_S3_STAGING_DIR') }}"
      schema: "{{ env_var('DBT_ATHENA_SCHEMA') }}"
      type: athena
      num_retries: "{{ env_var('DBT_ATHENA_NUM_RETRIES') }}"
  target: dev

A few notes:

When using the 0 directly it works fine: num_retries: 0
The other attributes are assigned as expected using the env_var

Add iceberg table materialization

Add Iceberg table materialisation, original work was done in Tomme/dbt-athena#135

Implement linting via github actions

I read the call for help in slack and wanted to ask what kind of linting you want?

My suggestion (based on what I usually implemented in python repos and actually nowadays like):

black, isort for code formatting
flake8 + mypy for finding code problems
pre-commit for running all the linter on commits or on demand (for faster cycle times as you can run the same stuff as in CI), including some linters for whitespace/EOF/... thingies, see e.g. the wolt cockiecutter example)
Github actions to run all pre-commit defined linters on all files

I usually add a small makefile to run the linter/formatter locally on demand (instead of using pre-commit as git pre-commit hook) and setup the local dev virtualenv, either via pip from requirements.txt or via poetry/pyproject.toml, which seems to be the new standard in python packaging (my preference would be the latter... requirements.txt can still be autogenerated).

Not sure what to use for tests, usually it's pytest, but I have no idea how dbt adapters are usually tested. -> leave it out for now?

Would that meet your needs and is this what you had in mind?

dbt found two macros named "set_table_classification" in the project "dbt_athena".

I ran into this when installing dbt-athena-community via pip install dbt-athena-community on dbt=1.3.1.

The issue was that the package dbt-athena (https://github.com/Tomme/dbt-athena) has already been installed on the target system, leading to the two macros of the same name being being defined.

The fix was to uninstall one of the dbt-athena* packages -- in my case pip uninstall dbt-athena.

DBT_TEST_ATHENA_AWS_PROFILE_NAME should be optional if using SSO

DBT_TEST_ATHENA_AWS_PROFILE_NAME should be optional for those who use SSO and don't have a profile

dbt-athena/tests/conftest.py

Line 23 in f31a863

"aws_profile_name": os.getenv("DBT_TEST_ATHENA_AWS_PROFILE_NAME"),

Add specific functional testing to check table materialisation

What

Following what described in testing-a-new-adapter)
we need to add some extra functional testings to standardize the minimum functionalities of the adapter:

table materialisation
- hive and parquet format
- iceberg and parquet format
incremental
- update partitions
- merge

Support snapshots

With the introduction of Iceberg, we can start to support snapshots

Move all utlity macros in utils

Move all the macros that are utils in a specific path:
dbt/include/athena/macros/utils/
instead of being in
dbt/include/athena/macros/adapters/

Prune path before using a CTA

What

to avoid: HIVE_PATH_ALREADY_EXISTS, ensure that the path is pruned. Check implementation here, in our case we can build something similar.

Iceberg table should avoid to use that, as the table drop take care of pruning the location.

Implement `athena__get_tables_by_pattern` macro

https://getdbt.slack.com/archives/C013MLFR7BQ/p1670012162896509

Summary of the above Slack discussion: codegen calls dbt-utils macro which doesn't work on Athena (see this issue). We could improve the out-of-the box compatibility of this adapter with codegen and dbt-utils by implementing athena__get_tables_by_pattern().

It might be as easy as copying the implementation from this PR into this adapter.

	{% set raw_strategy = config.get('incremental_strategy', default='insert_overwrite') %}
	{% set strategy = validate_get_incremental_strategy(raw_strategy) %}

dbt-athena / dbt-athena Goto Github PK

dbt-athena's Issues

Summary

Expected behavior

How to reproduce

Use case

Notes

What

Description

Models used to test - Optional

Checklist

What

Background

How to upgrade

the next minor release: 1.5.0

Description

TODO

What

What

What

What

Recommend Projects

Recommend Topics

Recommend Org