Comments (3)
Did you consider building a custom macro to run OPTIMIZE in batches? That's what I do at the moment, and it's purely inspired by what @svdimchenko proposed for handling more than 100 partitions.
Re-running the OPTIMIZE might not work with partitioned datasets; I believe the best way is to run OPTIMIZE with specific WHERE statements.
Here is the code:
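To make the WHERE-per-partition idea concrete, here is a rough Python sketch of how a single partition's predicate could be rendered with type-aware literals (NULLs via `is`, strings quoted, dates/timestamps as typed literals). This is an illustrative helper, not adapter code; the actual implementation is the Jinja macro below.

```python
import datetime

def format_predicate(key: str, value) -> str:
    """Render one partition-key predicate for an Athena WHERE clause.
    Hypothetical helper mirroring the macro's type handling."""
    if value is None:
        return f"{key} is null"            # NULLs need IS, not =
    if isinstance(value, int) and not isinstance(value, bool):
        return f"{key}={value}"
    if isinstance(value, datetime.datetime):  # check before date: datetime subclasses date
        return f"{key}=TIMESTAMP'{value}'"
    if isinstance(value, datetime.date):
        return f"{key}=DATE'{value}'"
    if isinstance(value, str):
        return f"{key}='{value}'"
    raise TypeError(f"Need to add support for column type {type(value).__name__}")

# One partition row -> one AND-joined predicate group
row = {"region": "eu", "day": datetime.date(2024, 5, 1), "shard": None}
pred = " and ".join(format_predicate(k, v) for k, v in row.items())
print(pred)  # region='eu' and day=DATE'2024-05-01' and shard is null
```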
```jinja
{% macro optimize_by_partition(relation) -%}
    {%- set athena_partitions_limit = config.get('athena_optimize_partitions_limit', 50) | int -%}
    {%- set partitioned_by = config.get('partitioned_by') -%}
    {%- set partitioned_keys = adapter.format_partition_keys(partitioned_by) -%}
    {% do log('PARTITIONED KEYS: ' ~ partitioned_keys) %}

    {% call statement('get_partitions', fetch_result=True) %}
        select distinct {{ partitioned_keys }} from {{ relation }} order by {{ partitioned_keys }};
    {% endcall %}

    {%- set results = load_result('get_partitions') -%}
    {%- if results is not none %}
        {%- set table = results.table -%}
        {%- set rows = table.rows -%}
        {%- set partitions_batches = [] -%}
        {%- set partitions = {} -%}
        {% do log('TOTAL PARTITIONS TO PROCESS: ' ~ rows | length) %}

        {%- for row in rows -%}
            {%- set single_partition = [] -%}
            {%- for col in row -%}
                {%- set column_type = adapter.convert_type(table, loop.index0) -%}
                {%- set comp_func = '=' -%}
                {%- if col is none -%}
                    {%- set value = 'null' -%}
                    {%- set comp_func = ' is ' -%}
                {%- elif column_type == 'integer' -%}
                    {%- set value = col | string -%}
                {%- elif column_type == 'string' -%}
                    {%- set value = "'" + col + "'" -%}
                {%- elif column_type == 'date' -%}
                    {%- set value = "DATE'" + col | string + "'" -%}
                {%- elif column_type == 'timestamp' -%}
                    {%- set value = "TIMESTAMP'" + col | string + "'" -%}
                {%- else -%}
                    {%- do exceptions.raise_compiler_error('Need to add support for column type ' + column_type) -%}
                {%- endif -%}
                {%- set partition_key = adapter.format_one_partition_key(partitioned_by[loop.index0]) -%}
                {%- do single_partition.append(partition_key + comp_func + value) -%}
            {%- endfor -%}

            {%- set single_partition_expression = single_partition | join(' and ') -%}

            {# group partition predicates into batches of at most athena_partitions_limit #}
            {%- set batch_number = (loop.index0 / athena_partitions_limit) | int -%}
            {% if batch_number not in partitions %}
                {% do partitions.update({batch_number: []}) %}
            {% endif %}
            {%- do partitions[batch_number].append('(' + single_partition_expression + ')') -%}
            {%- if partitions[batch_number] | length == athena_partitions_limit or loop.last -%}
                {%- do partitions_batches.append(partitions[batch_number] | join(' or ')) -%}
            {%- endif -%}
        {%- endfor -%}

        {%- for batch in partitions_batches -%}
            {%- do log('BATCH PROCESSING: ' ~ loop.index ~ ' OF ' ~ partitions_batches | length) -%}
            {%- set optimize_partitions -%}
                optimize {{ relation.render_pure() }} rewrite data using bin_pack
                where {{ batch }}
            {%- endset -%}
            {%- do run_query(optimize_partitions) -%}
        {%- endfor -%}
    {% endif %}
{%- endmacro %}
```
Then in my post-hook I simply call: `{{ optimize_by_partition(this) }}`
If you like this approach, we could consider making this macro part of the adapter, reusing/refactoring code already available in the adapter itself.
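The batching step of the macro above can be sketched outside Jinja. This is a minimal Python sketch under my own assumptions (a plain list of already-rendered predicates and a hypothetical table name), not adapter code:

```python
def batch_predicates(predicates, limit=50):
    """Group per-partition predicates into OR-joined batches of at most
    `limit` partitions, as the macro does, so each OPTIMIZE statement
    stays under Athena's 100-partition limit."""
    batches = []
    for i in range(0, len(predicates), limit):
        chunk = predicates[i : i + limit]
        batches.append(" or ".join(f"({p})" for p in chunk))
    return batches

# 7 partitions with a batch limit of 3 -> 3 OPTIMIZE statements (3 + 3 + 1)
preds = [f"day=DATE'2024-01-{d:02d}'" for d in range(1, 8)]
for where in batch_predicates(preds, limit=3):
    # my_table is a placeholder for the rendered relation
    print(f"optimize my_table rewrite data using bin_pack where {where}")
```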
from dbt-athena.
Thanks a lot for the code, I will test it 🥳
Re-running the OPTIMIZE does work at some point; I tried it in the AWS console. For instance, if my table has 1000 partitions, the first 10 OPTIMIZE runs raise the error I showed, and the 11th one passes.
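That "run it until it passes" behavior could be wrapped in a simple retry loop; a minimal sketch, assuming `run_optimize` is a hypothetical callable that executes one OPTIMIZE statement and raises on the partition-limit error (how that error surfaces is an assumption here):

```python
def optimize_until_done(run_optimize, max_runs=20):
    """Re-run OPTIMIZE until it stops raising the partition-limit error.
    `run_optimize` is a hypothetical single-statement runner; returns
    the attempt number that finally succeeded."""
    for attempt in range(1, max_runs + 1):
        try:
            run_optimize()
            return attempt  # succeeded on this run
        except RuntimeError:
            continue  # more partitions left to compact; run again
    raise RuntimeError(f"OPTIMIZE still failing after {max_runs} runs")
```

With the 1000-partition example above, a runner that fails 10 times would return 11.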
Oh, I was not aware of that. Maybe it would then be worth adding what you describe: a built-in macro that "optimizes until it succeeds".
Related Issues (20)
- [Bug] `truncate()` partition transformation does not work when it includes more than 100 partitions HOT 1
- Bug Hitting `ThrottlingException` on `GetWorkGroup` with threads turned up HOT 5
- [Bug] Iceberg table materialization shouldn't s3_data_naming=table
- [Bug] Adapter error when FIPS mode is enabled HOT 4
- [Bug] Resolution failure for `create_table_as` macro when upgrading to 1.7.2 HOT 1
- upgrade to support dbt-core v1.8.0 HOT 6
- [Feature] Control glue database/schema for tmp tables generated by incremental models HOT 1
- [Bug] force_batch deletes data from model_tmp_not_partitioned before coping to the final table HOT 2
- [Feature] Rename unique_key to unique_columns or merge_on_columns HOT 3
- [Feature] Support configurable management of Table Optimisers for Iceberg tables HOT 3
- [Bug] Error when Python Model Goes To Write To Database HOT 14
- [Feature] Custom strategy for incremental models when table type is iceberg
- [Bug] dbt source freshness expected a timestamp but received a string HOT 2
- [Feature] Athena dbt-external-tables impl as independent package HOT 5
- [Bug] Clone materialization raises an error when cloning Python models HOT 2
- TABLE_NOT_FOUND Error During Unit Testing in dbt-athena 1.8 Due to Jinja Macro Dependency HOT 3
- Hive vs Iceberg timestamps in unit tests HOT 4
- [Bug] TABLE_NOT_FOUND {{tmp_relation}} when there are zero batches to process in incremental model HOT 1
- [Feature] Allow to define a different schema for tmp tables created during table materialization
- [Lake Formation] Allow lf_tags_config.tags to set multiple values