dbt-athena / dbt-athena
The athena adapter plugin for dbt (https://getdbt.com)
Home Page: https://dbt-athena.github.io
License: Apache License 2.0
When running a post-hook with vacuum on an Iceberg table, I'm getting this error:
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 3:9: mismatched input 'vacuum'. Expecting: 'ALTER', 'ANALYZE', 'CALL', 'COMMENT', 'COMMIT', 'CREATE', 'DEALLOCATE', 'DELETE', 'DENY', 'DESC', 'DESCRIBE', 'DROP', 'EXECUTE', 'EXPLAIN', 'GRANT', 'INSERT', 'MERGE', 'PREPARE', 'REFRESH', 'RESET', 'REVOKE', 'ROLLBACK', 'SET', 'SHOW', 'START', 'TRUNCATE', 'UNLOAD', 'UPDATE', 'USE', 'USING', <query>
~~Having a look at the error, it seems that the issue is on the boto3 side~~
Running this:
from pyathena import connect

cursor = connect(
    work_group='athena_v3',
    s3_staging_dir="s3://my_bucket/py_athena/",
).cursor()
data = cursor.execute("vacuum silver.iceberg_increment")
print(cursor)
works from the same project as dbt-athena, so there is some extra validation that is preventing us from using vacuum.
I noticed this in a couple of my projects already, after switching to dbt 1.3.0 and 1.3.1.
Given a simple model with this config block:
{{
config(materialized='incremental')
}}
When I deploy this model using dbt, it throws:
23:09:13.423010 [error] [MainThread]: Compilation Error in model event_login (models/silver/content/XXX.sql)
23:09:13.423138 [error] [MainThread]: Invalid incremental strategy provided: None
23:09:13.423268 [error] [MainThread]: Expected one of: 'append', 'insert_overwrite'
23:09:13.423403 [error] [MainThread]:
23:09:13.423529 [error] [MainThread]: > in macro validate_get_incremental_strategy (macros/materializations/models/incremental/helpers.sql)
23:09:13.423653 [error] [MainThread]: > called by macro materialization_incremental_athena (macros/materializations/models/incremental/incremental.sql)
23:09:13.423777 [error] [MainThread]: > called by model event_login (models/silver/content/XXX.sql)
I can reproduce this in both the Tomme athena adapter and this new dbt-athena adapter when I'm using dbt 1.3.1.
Therefore, it seems like something that was introduced with dbt 1.3? When I switch to dbt-core==1.2.1 it works fine. However, I can't find anything related in the dbt 1.3.0 or 1.3.1 release changelogs...
It's weird because we set insert_overwrite as a default here:
And then we call validate_get_incremental_strategy, which throws the error.
When I debug the incremental.sql file:
{% set raw_strategy = config.get('incremental_strategy', default='insert_overwrite') %}
{{ log("RAW STRATEGY: " ~ raw_strategy, True) }}
{% set strategy = validate_get_incremental_strategy(raw_strategy) %}
Then the console prints RAW STRATEGY: None when I run dbt. Somehow the default is not being applied since dbt 1.3?!
The quick fix is to add an explicit incremental_strategy='insert_overwrite' in the model config block. However, I'd like to highlight this issue because the default stopped working in this macro...
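For reference, the workaround described above looks like this in the model config block:
{{
config(
    materialized='incremental',
    incremental_strategy='insert_overwrite'
)
}}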
Bump to dbt-core 1.3.0
The external_location config setting is considered during state comparison when running dbt with the --state ... -s state:modified option.
This is troublesome in CI scenarios where we compare a development branch to the manifest.json of the production target.
As a consequence, dbt considers all models modified which make use of the external_location setting in a target-specific manner.
A model should not be considered changed if only the dynamically set external_location differs.
We use many other target-specific settings, e.g. custom schemas or databases, and all of them play nicely with state:modified. So the observed behavior is unexpected in my eyes.
Given two targets, dev and prod:
{{
config(
materialized='table'
, external_location='s3://whatever/' ~ target.name
)
}}
dbt compile --target prod
dbt ls --state target -s state:modified --target dev
Then this model will be shown as modified.
For our PII data we make use of the external_location setting in order to ensure that this data is written to a different S3 bucket, so we can make use of a bucket-specific access control policy.
A typical model config then looks as follows:
{{
config(
materialized='table'
, external_location=get_pii_external_location_path()
, tags=['daily', 'pii']
)
}}
And the corresponding get_pii_external_location_path() is defined as:
{%- macro get_pii_external_location_path() -%}
{#
This macro is used for overriding the `external_location` config setting
of PII models.
Each model in the PII directory/schema must make use of it, in order
to ensure that the model's data will get written to the PII S3 bucket.
-#}
{{ 's3://aws-athena-tables-data-pii/dbt/' ~ target.name ~ '/' ~ env_var('DBT_SCHEMA_NAME', this.schema) ~ '/' ~ this.name ~ '/' ~ run_started_at.isoformat() ~ '/' ~ invocation_id }}
{%- endmacro -%}
So we always use the same designated PII bucket but tweak the paths depending on the target.name and an environment variable DBT_SCHEMA_NAME, so concurrent CI runs don't clash.
If we now compile the project for --target prod and then compare what would be run in --target dev, we get all the PII models detected as changed:
dbt ls --state ./target -s state:modified --target dev
As a maintainer, I want to trigger publishing on tag creation.
The version in setup.py needs to be aligned with the tag.
Hello dbt-athena squad
For now the adapter uses version 2 of the Athena engine, according to the README. On 2022-10-13, Athena released v3 of its engine, reducing the feature gap between Athena and Trino.
I don't know what we want to do about this:
FYI, here are the breaking changes
Use the latest version of dbt-core, 1.4.0.
This change implies bumping the adapter's minor version too.
Add a pull request template. The template can include:
A brief description of what the PR fixes or tackles.
A small section listing the models used for testing.
The first run works; when I run it a second time the model gets stuck. No error is raised.
{{ config(
materialized='incremental',
format='parquet',
incremental_strategy='insert_overwrite',
partitioned_by=['report_date']
) }}
SELECT
999999 AS cost,
cast(t.ts AS date) as report_date
FROM unnest(sequence(cast('2022-10-01' AS date), cast('2022-12-04' AS date), INTERVAL '1' Day)) AS t(ts)
First of all: thank you for your amazing contribution and for bringing Athena engine v3 features to the dbt community!
I am currently tracking a bug that seems to occur non-deterministically. Essentially, during dbt test, I get syntax errors like these:
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:100: mismatched input 'cascade'. Expecting: '.', <EOF>
It happens during the initial cleanup phase when dbt tries to delete existing test tables. To me, it looks like sometimes the macro drop_relation from this repo is not found and dbt falls back to using default__drop_relation, which does a
drop {{ relation.type }} if exists {{ relation }} cascade
Below is the relevant log output of such an error:
-- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "prelive", "node_id": "test.data_importer.not_null_bestaende_cleaned_lagerbestand.d2103837db"} */
12:21:35.254829 [debug] [Thread-7 (]: Opening a new connection, currently in state closed
Failed to execute query.
Traceback (most recent call last):
File "/opt/pysetup/.venv/lib/python3.10/site-packages/pyathena/common.py", line 494, in _execute
query_id = retry_api_call(
File "/opt/pysetup/.venv/lib/python3.10/site-packages/pyathena/util.py", line 68, in retry_api_call
return retry(func, *args, **kwargs)
File "/opt/pysetup/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 406, in __call__
do = self.iter(retry_state=retry_state)
File "/opt/pysetup/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 351, in iter
return fut.result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/opt/pysetup/.venv/lib/python3.10/site-packages/tenacity/__init__.py", line 409, in __call__
result = fn(*args, **kwargs)
File "/opt/pysetup/.venv/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/opt/pysetup/.venv/lib/python3.10/site-packages/botocore/client.py", line 960, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the StartQueryExecution operation: line 1:102: mismatched input 'cascade'. Expecting: '.', <EOF>
12:21:35.631219 [debug] [Thread-8 (]: Athena adapter: Error running SQL: drop table if exists dbt_test__audit.dbt_expectations_expect_column_93d8b305ae3a3af2a182dc0d9423d314 cascade
-- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "prelive", "node_id": "test.data_importer.dbt_expectations_expect_column_value_lengths_to_equal_rueres_wm_ek__wm_ek_IS_NOT_NULL__5.bc16ec6d3f"} */
12:21:35.631863 [debug] [Thread-8 (]: Athena adapter: Error running SQL: macro drop_relation
12:21:35.632307 [debug] [Thread-8 (]: finished collecting timing info
12:21:35.632645 [debug] [Thread-8 (]: On test.data_importer.dbt_expectations_expect_column_value_lengths_to_equal_rueres_wm_ek__wm_ek_IS_NOT_NULL__5.bc16ec6d3f: Close
12:21:35.633406 [debug] [Thread-8 (]: Runtime Error in test dbt_expectations_expect_column_value_lengths_to_equal_rueres_wm_ek__wm_ek_IS_NOT_NULL__5 (models/host/schema.yml)
I had some success on my local machine by adding a simple Jinja comment in front of this line.
Is it possible that there is a race condition when registering the macro, leading to spurious errors when the macro is not yet registered while test tables are being deleted? I am not very familiar with how dbt adapters work together with dbt-core internally; I'm just guessing here.
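For illustration, an adapter-level drop_relation override that avoids the unsupported cascade keyword could look roughly like this (a minimal sketch only; the real macro in this repo presumably also handles the S3 cleanup):

{% macro athena__drop_relation(relation) -%}
  {%- call statement('drop_relation', auto_begin=False) -%}
    drop {{ relation.type }} if exists {{ relation }}
  {%- endcall %}
{%- endmacro %}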
The dbt_valid_to date should be set to the updated_at column's value, not the current timestamp.
Per the documentation: https://docs.getdbt.com/docs/build/snapshots#snapshot-meta-fields
WHEN dbt_valid_to=CAST('9999-01-01' as timestamp) AND is_current_record=True THEN {{ current_timestamp() }}
This is causing overlaps in our valid date ranges.
Here are the metadata columns for one id:
id updated_at dbt_valid_from dbt_valid_to is_current_record
47a555a1-1455-481a-8937-d3cc03f02362 2021-09-02 13:58:09.961000 2021-09-02 13:58:09.961000 2023-02-14 20:05:44.433000 false
47a555a1-1455-481a-8937-d3cc03f02362 2021-09-03 13:57:59.941000 2021-09-03 13:57:59.941000 9999-01-01 00:00:00.000000 true
This was implemented with the following pull request:
#111
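In other words, the expression quoted above would need to pick up the superseding record's updated_at instead of the run timestamp, along these lines (a sketch; source.updated_at is a placeholder for however the snapshot macro exposes the new record's column):

WHEN dbt_valid_to = CAST('9999-01-01' AS timestamp) AND is_current_record = True THEN source.updated_at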
Hi,
After the model is generated by dbt-athena, the table location looks like this:
s3://my-test-bucket/dbt/resident_communication\ticket_last_comment\6975c131-96c7-4b48-baf3-3b18acb6a431
and I think it should be like this:
s3://my-test-bucket/dbt/resident_communication/ticket_last_comment/6975c131-96c7-4b48-baf3-3b18acb6a431
I think this is caused by Windows OS path behavior; below is my profile.
s3_staging_dir: s3://my-test-bucket/dbt/
s3_data_dir: s3://my-test-bucket/dbt/
s3_data_naming: schema_table_unique
Is there a wrong setting on my side? Thank you very much.
AWS recently released Athena Spark.
Being able to use it would be an awesome addition
The tmp table is not cleaned up when using v1.3.3 in incremental mode. After the model run, the table is dropped but the S3 path is not pruned.
Hi,
I'm looking at the Iceberg tables support plan:
As Iceberg doesn't support CTAS, the implementation does the following:
I think that the tmp table creation is going to kill performance (I have tens of billions of rows).
How about this approach:
This would save one copy of all rows (hours of processing in my case).
Regards,
ZD
Add on_schema_change support for Iceberg incremental models.
Append mode seems supported; we just need to cover the merge incremental behavior.
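For example, a merge model could then declare one of the standard dbt option values (adapter support for this option is what's being requested here):
{{
config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='id',
    on_schema_change='append_new_columns'
)
}}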
Quick question, wasn't sure where else to post this. The readme mentions that the plugin only supports the creation of 100 partitions. Is that 100 partition values (i.e. partitioning by date, you can only create 100 days worth of partitions) or partition fields (i.e. I can partition by 100 fields, field1 -> field100)?
It would come in handy to be able to further customize the behaviour of the merge approach, in order to decide whether nulls should be updated during the merge step.
NOTE: This behaviour is custom and diverges from the standard dbt macro, so it might not really fit here.
As a pointer, I will anyway link a reference implementation of such a feature:
https://github.com/ringier-data/dbt-athena/tree/feature/iceberg_merge_update_nulls
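To illustrate the idea, the generated merge could optionally keep the target value whenever the source value is null, e.g. (a sketch of the desired generated SQL, not of the current macro; src/tgt are the merge aliases):

when matched then update set
    col = coalesce(src.col, tgt.col)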
Hi team,
I initially raised this issue with dbt-core, and it seems fine with dbt-snowflake.
Would you please take a look? dbt seed will fail if there is a ' character in the seed file.
Details:
Thanks
I wanted to execute a pre_hook query like this:
{{ config(materialized='view', pre_hook="drop table if exists my_first_dbt_model") }}
But I found that in Athena the SQL becomes:
-- /* {"app": "dbt", "dbt_version": "1.2.1", "profile_name": "local", "target_name": "dev", "node_id": "model.dbtest.my_first_dbt_model"} */ drop table if exists my_first_dbt_model
Since all the SQL stays on one line, the statement is swallowed by the leading -- comment, and it fails.
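To illustrate (the header content is abbreviated):

-- /* header */ drop table if exists my_first_dbt_model   -- the whole line is a comment, nothing runs
/* header */ drop table if exists my_first_dbt_model      -- a bare block comment would leave the statement intact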
Add Lake Formation integration to the adapter: https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-athena-support-lake-formation-fine-grained-access-control/
Maybe as an idea: if lakeformation=true, the adapter takes care of enabling Lake Formation and registers the buckets used by the adapter with it.
Add dependabot to the repository:
e.g. https://github.com/aws-samples/dbt-glue/blob/main/.github/dependabot.yml
PR names should follow conventions and be enforced by the CI.
The community should agree on a convention and implement the necessary steps to enforce it.
Personally, I like the conventionalcommits format.
See aws-cdk for reference.
What do you think?
The latest version of dbt Core, dbt-core==1.4.0, was published on January 25, 2023 (PyPI | GitHub). In fact, a patch, dbt-core==1.4.1 (PyPI | GitHub), was also released on the same day.
dbt-labs/dbt-core#6624 is an open discussion with more detailed information. If you have questions, please put them there! dbt-labs/dbt-core#6849 is for keeping track of the community's progress on releasing 1.4.0
The above linked guide has more information, but below is a high-level checklist of work that would enable a successful 1.4.0 release of your adapter.
FYI, dbt-core==1.5.0 is expected to be released at the end of April. Please plan on allocating more effort to upgrade support compared to previous minor versions. Expect to hear more in the middle of April.
At a high level, expect much greater adapter test coverage (a very good thing!), and some likely heavy renaming and restructuring as the API-ification of dbt-core is now well underway. See https://github.com/dbt-labs/dbt-core/milestone/82 for more information.
When using post_hooks on an incremental model, it seems that we are having some issues.
{{ config(
    materialized='incremental',
    format='iceberg',
    incremental_strategy='merge',
    unique_key='user_id',
    partitioned_by=['bucket(5, user_id)', 'status'],
    schema='silver',
    table_properties={'optimize_rewrite_delete_file_threshold': '2'},
    post_hook=["ALTER TABLE {{this}} SET TBLPROPERTIES ('vacuum_max_snapshot_age_seconds'='86400');"]
) }}

with data as (
    SELECT
        'a' AS user_id,
        'pi_updated' AS name,
        'inactive' AS status,
        17.89 AS cost,
        1 AS quantity,
        100000000 AS quantity_big,
        current_date AS my_date,
        cast(REPLACE(cast(current_timestamp as varchar), ' UTC', '') as timestamp(3)) as now
    UNION ALL
    SELECT
        'b' AS user_id,
        'beta_updated_yet_another one' AS name,
        'inactive' AS status,
        3 AS cost,
        50 AS quantity,
        100000000 AS quantity_big,
        current_date AS my_date,
        cast(REPLACE(cast(current_timestamp as varchar), ' UTC', '') as timestamp(3)) as now
)

select * from data
This produces the following error:
Completed with 1 error and 0 warnings:
08:07:30
08:07:30 Runtime Error in model iceberg_increment (models/iceberg/iceberg_increment.sql)
08:07:30 line 1:243: no viable alternative at input '<EOF>'
Looking at the produced query, I see this:
-- /* {"app": "dbt", "dbt_version": "1.3.0", "profile_name": "athena", "target_name": "dev", "node_id": "model.lakehouse.iceberg_increment"} */ ALTER TABLE silver.iceberg_increment SET TBLPROPERTIES ('vacuum_max_snapshot_age_seconds'='86400')
Somehow we have double comments.
On dbt-athena-community==1.3.4, when running something like dbt run -s +my_model, the run fails with error messages of this kind:
11:57:16 Parameter validation failed:
11:57:16 Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
So it seems that the bucket location is set to an empty string.
When switching to dbt-athena-community==1.3.3 and keeping everything else as is, the same command succeeds without any errors.
This issue can be observed on both Athena v2 and v3.
This is my profiles.yml:
athena:
outputs:
dev:
type: athena
s3_staging_dir: s3://aws-athena-query-results-eu-central-1-9999999999999/dbt/
s3_data_dir: s3://aws-athena-tables-data/dbt/bumblebee/dev
s3_data_naming: schema_table_unique
region_name: eu-central-1
database: awsdatacatalog
schema: dbt_dev_philipp
work_group: whatever
num_retries: 1
threads: 8
Based on #4 (comment)
Implementation should be covered here: northvolt/dbt-athena@49c7111
Using the Glue API is better as it is faster and should allow multi-threading.
Athena engine v3 supports CTAS for Iceberg.
Since engine v2 does not support that, what about adding a table_strategy option to the table materialization, with the alternatives tmp_table and ctas?
ctas is the default for all non-Iceberg tables and the only possibility there.
tmp_table must be used for Iceberg tables on engine v2.
ctas or tmp_table can be chosen for Iceberg tables on engine v3.
What do you think?
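For example (table_strategy is the hypothetical parameter proposed above):
{{
config(
    materialized='table',
    format='iceberg',
    table_strategy='ctas'
)
}}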
dbt-core provides a number of macros to support cross-database compatibility for dbt packages: https://docs.getdbt.com/reference/dbt-jinja-functions/cross-database-macros
A few of the datatype macros have defaults which are incorrect for Athena, notably:
float
numeric
int
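A sketch of what adapter overrides for these could look like, assuming the standard <adapter>__type_* dispatch names and Athena's type names (the decimal precision/scale mirrors dbt's default and is an assumption):

{% macro athena__type_float() %} double {% endmacro %}
{% macro athena__type_numeric() %} decimal(28, 6) {% endmacro %}
{% macro athena__type_int() %} integer {% endmacro %}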
First I'd like to open with that it's great that someone has taken the initiative to fork and revive this adapter under an organization. Hopefully this can sustain active maintenance and all of us avoid multiple abandoned forks.
@nicor88 requested me to submit my quoting support PR from the old repo here, which I have done. This got me thinking about the governance of this project. As it stands there are no public members of this organization, and there are no clear contribution guidelines or any code of conduct.
If this is to become a healthy project that people will adopt instead of their own forks I think there are some things that would be good to address:
Hello
Right now, when we use our adapter we can set format to parquet, iceberg, json... The issue is that with the new implementation of create_table_as we are mixing table_type and table_format, and we enforce that Iceberg tables are in Parquet format. This is OK right now because Parquet is the best format to pair with Iceberg, but it would be great to have two parameters when defining models:
table_type: hive or iceberg
format: parquet, avro, json, text
What do you think?
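A model definition would then look like this (table_type and the decoupled format are the parameters proposed above):
{{
config(
    materialized='table',
    table_type='iceberg',
    format='parquet'
)
}}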
The root cause of the issue below is pyathena.
After building a bunch of models, the user-agent header gets too big and the S3 ListObjects API call fails; the header grows with every invocation of PyAthena.
So far I have just hotfixed it by pinning the version of pyathena to <2.8.0.
Where the config is created in dbt-athena and pyathena
Example request after some models are built:
2022-12-27 15:18:27,965 botocore.endpoint [DEBUG] Sending http request: <AWSPreparedRequest stream_output=False, method=GET, url=https://*****.s3.eu-central-1.amazonaws.com/?prefix=datalakesensitive%2Foutput%2Fqueryresults%2Ftables%2Fbc32f394-cfc2-4e9e-b3dd-f67e09599921%2F&encoding-type=url, headers={'User-Agent': b'Boto3/1.26.37 Python/3.8.15 Linux/5.10.102.1-microsoft-standard-WSL2 Botocore/1.29.37 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 PyAthena/2.18.1 dbt-athena-community/1.3.3', 'X-Amz-Date': b'20221227T141827Z', 'X-Amz-Security-Token': b'****', 'X-Amz-Content-SHA256': b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': b'****', 'amz-sdk-invocation-id': b'****', 'amz-sdk-request': b'attempt=1'}>
#############################
Initial problem:
#############################
It is caused by a boto3 call in the clean_up_table function of the adapter.
In our project with about 300 dbt models, it fails when we build the full project. When we build it partially, it is OK for some selectors and fails for others.
The boto3 call:
s3_bucket.objects.filter(Prefix=prefix).delete()
Here is the actual exception:
botocore.exceptions.ClientError: An error occurred (RequestHeaderSectionTooLarge) when calling the ListObjects operation: Your request header section exceeds the maximum allowed size.
Here is the full stack trace:
[08:52:16.828249 [debug] [Thread-9 ]: Began running node model.climatepartner.stg_fpm__base_project_reservation_account
[08:52:16.839619 [info ] [Thread-9 ]: 392 of 573 START sql table model dev_datalakesensitive_fabi.stg_fpm__base_project_reservation_account [RUN]
[08:52:16.854983 [debug] [Thread-9 ]: Acquiring new athena connection "model.climatepartner.stg_fpm__base_project_reservation_account"
[08:52:16.875331 [debug] [Thread-9 ]: Began compiling node model.climatepartner.stg_fpm__base_project_reservation_account
[08:52:16.884906 [debug] [Thread-9 ]: Compiling model.climatepartner.stg_fpm__base_project_reservation_account
[08:52:16.916737 [debug] [Thread-9 ]: Writing injected SQL for node "model.climatepartner.stg_fpm__base_project_reservation_account"
[08:52:33.177364 [debug] [Thread-9 ]: Athena adapter: Deleting table data from 's3://bucket-name/datalakesensitive/output/tables/dev_datalakesensitive_fabi/stg_fpm__base_project_reservation_account/ee9fab99-03d0-44eb-964a-47e71e0aa5d1/'
[08:52:34.331420 [debug] [Thread-9 ]: finished collecting timing info
[08:52:34.331870 [debug] [Thread-9 ]: On model.climatepartner.stg_fpm__base_project_reservation_account: Close
[08:52:34.332278 [error] [Thread-9 ]: Unhandled error while executing model.climatepartner.stg_fpm__base_project_reservation_account
An error occurred (RequestHeaderSectionTooLarge) when calling the ListObjects operation: Your request header section exceeds the maximum allowed size.
[08:52:34.332610 [debug] [Thread-9 ]:
Traceback (most recent call last):
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/base.py", line 385, in safe_run
result = self.compile_and_execute(manifest, ctx)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/base.py", line 338, in compile_and_execute
result = self.run(ctx.node, manifest)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/base.py", line 429, in run
return self.execute(compiled_node, manifest)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/task/run.py", line 281, in execute
result = MacroGenerator(
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 326, in __call__
return self.call_macro(*args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 253, in call_macro
return macro(*args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 763, in __call__
return self._invoke(arguments, autoescape)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 777, in _invoke
rv = self._func(*arguments)
File "<template>", line 49, in macro
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 393, in call
return __context.call(__obj, *args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 298, in call
return __obj(*args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 326, in __call__
return self.call_macro(*args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 253, in call_macro
return macro(*args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 763, in __call__
return self._invoke(arguments, autoescape)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 777, in _invoke
rv = self._func(*arguments)
File "<template>", line 22, in macro
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 393, in call
return __context.call(__obj, *args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/jinja2/runtime.py", line 298, in call
return __obj(*args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/dbt/adapters/athena/impl.py", line 163, in clean_up_table
s3_bucket.objects.filter(Prefix=prefix).delete()
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/boto3/resources/collection.py", line 561, in batch_action
return action(self, *args, **kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/boto3/resources/action.py", line 134, in __call__
for page in parent.pages():
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/boto3/resources/collection.py", line 171, in pages
for page in pages:
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/paginate.py", line 269, in __iter__
response = self._make_request(current_kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/paginate.py", line 357, in _make_request
return self._method(**current_kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/client.py", line 530, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/fabi/workspace/bi-models/.venv/lib/python3.8/site-packages/botocore/client.py", line 960, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (RequestHeaderSectionTooLarge) when calling the ListObjects operation: Your request header section exceeds the maximum allowed size.
[08:52:34.422051 [error] [Thread-9 ]: 392 of 573 ERROR creating sql table model dev_datalakesensitive_fabi.stg_fpm__base_project_reservation_account [ERROR in 17.50s]
We want to add some basic unit tests that run on the CI.
In case we need to mock requests, we can use moto. Some inspiration can be found at: https://github.com/laughingman7743/PyAthena/tree/master/tests
See this PR: Tomme/dbt-athena#95
Also, with Iceberg it should be quite easy:
ALTER TABLE target RENAME TO target_bkp;
ALTER TABLE target_tmp RENAME TO target;
DROP TABLE target_bkp; -- only if all is fine
It's not 100% zero-downtime, but almost.
Regarding non-Iceberg tables, we can do something like this:
Thanks for implementing the faster Iceberg table materialization. Can you please implement the same feature for the incremental materialization as well?
Please see issue 11 for context and more details.
Support the merge incremental strategy for Iceberg.
I'm switching a few projects to this dbt-athena community adapter and I noticed a regression bug.
Locally, I use threads: 4. I have a few databases defined in my dbt_project.yml:
models:
test_jesse:
silver:
test_a:
+schema: silver_test_a
test_b:
+schema: silver_test_b
test_c:
+schema: silver_test_c
When I run dbt --debug run, I see it starts by making 3 parallel queries to INFORMATION_SCHEMA (corresponding to the 3 custom schemas):
23:22:00.639968 [info ] [MainThread]: Found 79 models, 9 tests, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 64 sources, 0 exposures, 0 metrics
23:22:00.642584 [info ] [MainThread]:
23:22:00.642973 [debug] [MainThread]: Acquiring new athena connection "master"
23:22:00.645492 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.652730 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.654241 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.655511 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.655686 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.655861 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.656061 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */
select
distinct schema_name
from awsdatacatalog.INFORMATION_SCHEMA.schemata
23:22:00.656238 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */
select
distinct schema_name
from awsdatacatalog.INFORMATION_SCHEMA.schemata
23:22:00.656446 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */
select
distinct schema_name
from awsdatacatalog.INFORMATION_SCHEMA.schemata
23:22:00.656774 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.660507 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.673268 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.675392 [error] [ThreadPool]: Athena adapter: Got an error when attempting to open a Athena connection due to 'credential_provider'
Traceback (most recent call last):
File "/opt/homebrew/lib/python3.9/site-packages/dbt/adapters/athena/connections.py", line 155, in open
handle = AthenaConnection(
File "/opt/homebrew/lib/python3.9/site-packages/pyathena/connection.py", line 139, in __init__
self._client = self._session.client(
File "/opt/homebrew/lib/python3.9/site-packages/boto3/session.py", line 299, in client
return self._session.create_client(
File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 951, in create_client
credentials = self.get_credentials()
File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 507, in get_credentials
self._credentials = self._components.get_component(
File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 1112, in get_component
del self._deferred[name]
KeyError: 'credential_provider'
23:22:00.688091 [debug] [ThreadPool]: Athena adapter: Error running SQL: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */
select
distinct schema_name
from awsdatacatalog.INFORMATION_SCHEMA.schemata
23:22:00.688642 [debug] [ThreadPool]: Athena adapter: Error running SQL: macro list_schemas
23:22:00.688916 [debug] [ThreadPool]: On list_awsdatacatalog: No close available on handle
23:22:07.156437 [debug] [ThreadPool]: SQL status: OK -1 in 6.5 seconds
23:22:07.158140 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:22:07.159025 [debug] [ThreadPool]: SQL status: OK -1 in 6.5 seconds
23:22:07.159950 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:22:07.160520 [debug] [MainThread]: Connection 'master' was properly closed.
23:22:07.160681 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.160892 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.161071 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.161847 [debug] [MainThread]: Flushing usage events
23:22:07.162046 [error] [MainThread]: Encountered an error:
Runtime Error
Runtime Error
Database Error
'credential_provider'
When I switch back to Tomme's adapter (dbt-athena-adapter==1.0.1), it makes three queries in parallel successfully.
23:23:51.801549 [info ] [MainThread]: Found 79 models, 9 tests, 0 snapshots, 0 analyses, 490 macros, 0 operations, 0 seed files, 64 sources, 0 exposures, 0 metrics
23:23:51.804227 [info ] [MainThread]:
23:23:51.804618 [debug] [MainThread]: Acquiring new athena connection "master"
23:23:51.807446 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.814927 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.816278 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.817486 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817641 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817786 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817940 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */
select
distinct schema_name
from awsdatacatalog.INFORMATION_SCHEMA.schemata
23:23:51.818153 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */
select
distinct schema_name
from awsdatacatalog.INFORMATION_SCHEMA.schemata
23:23:51.818459 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */
select
distinct schema_name
from awsdatacatalog.INFORMATION_SCHEMA.schemata
23:23:51.818715 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:51.828140 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:51.832063 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:58.460654 [debug] [ThreadPool]: SQL status: OK -1 in 6.64 seconds
23:23:58.461749 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:23:58.465697 [debug] [ThreadPool]: SQL status: OK -1 in 6.65 seconds
23:23:58.466534 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:23:58.473755 [debug] [ThreadPool]: SQL status: OK -1 in 6.66 seconds
23:23:58.474907 [debug] [ThreadPool]: On list_awsdatacatalog: Close
When I use threads: 1 or threads: 2, it works and successfully deploys the dbt project, but it starts failing with 3 or more threads.
This might be a regression? I found a similar issue: Tomme/dbt-athena#41
Currently, dbt-athena does not support the on_schema_change option.
Original work was done here: https://github.com/dbt-athena/dbt-athena/pull/3/files, implemented by @hiro-o918
Hi there, I've done a pip install dbt-athena-community and I'm unable to init a new project. I'm getting this output:
23:38:46 Running with dbt=1.3.1
No adapters available. Go to https://docs.getdbt.com/docs/available-adapters
I've also tried on another machine, and running the pip install pointed at the repo, with no luck. I've verified that other adapters install successfully and get detected by dbt.
Hi,
I'm using env_var in my profiles.yml file, but when trying to set the num_retries attribute with env_var, it fails with the following error:
ERROR: Runtime Error
Credentials in profile "my_profile", target "dev" invalid: '0' is not valid under any of the given schemas
To set the env var I ran this: export DBT_ATHENA_NUM_RETRIES=0
My profiles.yml:
my_profile:
outputs:
dev:
database: "{{ env_var('DBT_ATHENA_DB') }}"
region_name: "{{ env_var('DBT_REGION_NAME') }}"
s3_staging_dir: "{{ env_var('DBT_S3_STAGING_DIR') }}"
schema: "{{ env_var('DBT_ATHENA_SCHEMA') }}"
type: athena
num_retries: "{{ env_var('DBT_ATHENA_NUM_RETRIES') }}"
target: dev
A few notes:
Setting num_retries: 0 directly (without env_var) works; the failure only occurs when the value comes from env_var, which returns a string.
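A possible workaround (a sketch relying on dbt's as_number Jinja filter, which exists for coercing env_var strings in profiles.yml):

num_retries: "{{ env_var('DBT_ATHENA_NUM_RETRIES') | as_number }}"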
Add Iceberg table materialisation, original work was done in Tomme/dbt-athena#135
I read the call for help in Slack and wanted to ask what kind of linting you want.
My suggestion (based on what I usually implement in Python repos and actually like nowadays):
I usually add a small Makefile to run the linter/formatter locally on demand (instead of using pre-commit as a git pre-commit hook) and to set up the local dev virtualenv, either via pip from requirements.txt or via poetry/pyproject.toml, which seems to be the new standard in Python packaging (my preference would be the latter... requirements.txt can still be autogenerated).
Not sure what to use for tests; usually it's pytest, but I have no idea how dbt adapters are usually tested. -> Leave it out for now?
Would that meet your needs, and is this what you had in mind?
I ran into this when installing dbt-athena-community via pip install dbt-athena-community on dbt=1.3.1.
The issue was that the package dbt-athena (https://github.com/Tomme/dbt-athena) had already been installed on the target system, leading to two macros of the same name being defined.
The fix was to uninstall one of the dbt-athena* packages -- in my case, pip uninstall dbt-athena.
DBT_TEST_ATHENA_AWS_PROFILE_NAME should be optional for those who use SSO and don't have a profile.
Line 23 in f31a863
Following what is described in testing-a-new-adapter, we need to add some extra functional tests to standardize the minimum functionality of the adapter:
With the introduction of Iceberg, we can start to support snapshots
Move all the utility macros to a dedicated path:
dbt/include/athena/macros/utils/
instead of
dbt/include/athena/macros/adapters/
To avoid HIVE_PATH_ALREADY_EXISTS, ensure that the path is pruned. Check the implementation here; in our case we can build something similar.
Iceberg tables should avoid using that, as the table drop takes care of pruning the location.
https://getdbt.slack.com/archives/C013MLFR7BQ/p1670012162896509
Summary of the above Slack discussion: codegen calls a dbt-utils macro which doesn't work on Athena (see this issue). We could improve the out-of-the-box compatibility of this adapter with codegen and dbt-utils by implementing athena__get_tables_by_pattern().
It might be as easy as copying the implementation from this PR into this adapter.
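A starting point could be an Athena flavour of the dbt-utils SQL, querying information_schema (a sketch, assuming the dbt-utils dispatch name get_tables_by_pattern_sql):

{% macro athena__get_tables_by_pattern_sql(schema_pattern, table_pattern, exclude='', database=target.database) %}
    select distinct
        table_schema,
        table_name,
        {# information_schema reports 'BASE TABLE' or 'VIEW' #}
        case table_type when 'VIEW' then 'view' else 'table' end as table_type
    from {{ database }}.information_schema.tables
    where table_schema like '{{ schema_pattern | lower }}'
      and table_name like '{{ table_pattern | lower }}'
      and table_name not like '{{ exclude | lower }}'
{% endmacro %}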