Comments (3)
Did you consider building a custom macro to run OPTIMIZE in batches? That's what I do at the moment, and it's purely inspired by what @svdimchenko proposed for handling more than 100 partitions.
Re-running the OPTIMIZE might not work with partitioned datasets; I believe the best way is to run OPTIMIZE with specific WHERE statements.
Here is the code:
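To make the WHERE-per-partition idea concrete, here is a rough Python sketch of how a single partition's predicate could be rendered with type-aware literals (NULLs via `is`, strings quoted, dates/timestamps as typed literals). This is an illustrative helper, not adapter code; the actual implementation is the Jinja macro below.

```python
import datetime

def format_predicate(key: str, value) -> str:
    """Render one partition-key predicate for an Athena WHERE clause.
    Hypothetical helper mirroring the macro's type handling."""
    if value is None:
        return f"{key} is null"            # NULLs need IS, not =
    if isinstance(value, int) and not isinstance(value, bool):
        return f"{key}={value}"
    if isinstance(value, datetime.datetime):  # check before date: datetime subclasses date
        return f"{key}=TIMESTAMP'{value}'"
    if isinstance(value, datetime.date):
        return f"{key}=DATE'{value}'"
    if isinstance(value, str):
        return f"{key}='{value}'"
    raise TypeError(f"Need to add support for column type {type(value).__name__}")

# One partition row -> one AND-joined predicate group
row = {"region": "eu", "day": datetime.date(2024, 5, 1), "shard": None}
pred = " and ".join(format_predicate(k, v) for k, v in row.items())
print(pred)  # region='eu' and day=DATE'2024-05-01' and shard is null
```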
```jinja
{% macro optimize_by_partition(relation) -%}
    {%- set athena_partitions_limit = config.get('athena_optimize_partitions_limit', 50) | int -%}
    {%- set partitioned_by = config.get('partitioned_by') -%}
    {%- set partitioned_keys = adapter.format_partition_keys(partitioned_by) -%}
    {% do log('PARTITIONED KEYS: ' ~ partitioned_keys) %}

    {% call statement('get_partitions', fetch_result=True) %}
        select distinct {{ partitioned_keys }} from {{ relation }} order by {{ partitioned_keys }};
    {% endcall %}

    {%- set results = load_result('get_partitions') -%}
    {%- if results is not none %}
        {%- set table = results.table -%}
        {%- set rows = table.rows -%}
        {%- set partitions_batches = [] -%}
        {%- set partitions = {} -%}
        {% do log('TOTAL PARTITIONS TO PROCESS: ' ~ rows | length) %}

        {%- for row in rows -%}
            {%- set single_partition = [] -%}
            {%- for col in row -%}
                {%- set column_type = adapter.convert_type(table, loop.index0) -%}
                {%- set comp_func = '=' -%}
                {%- if col is none -%}
                    {%- set value = 'null' -%}
                    {%- set comp_func = ' is ' -%}
                {%- elif column_type == 'integer' -%}
                    {%- set value = col | string -%}
                {%- elif column_type == 'string' -%}
                    {%- set value = "'" + col + "'" -%}
                {%- elif column_type == 'date' -%}
                    {%- set value = "DATE'" + col | string + "'" -%}
                {%- elif column_type == 'timestamp' -%}
                    {%- set value = "TIMESTAMP'" + col | string + "'" -%}
                {%- else -%}
                    {%- do exceptions.raise_compiler_error('Need to add support for column type ' + column_type) -%}
                {%- endif -%}
                {%- set partition_key = adapter.format_one_partition_key(partitioned_by[loop.index0]) -%}
                {%- do single_partition.append(partition_key + comp_func + value) -%}
            {%- endfor -%}

            {%- set single_partition_expression = single_partition | join(' and ') -%}

            {# group partition predicates into batches of at most athena_partitions_limit #}
            {%- set batch_number = (loop.index0 / athena_partitions_limit) | int -%}
            {% if batch_number not in partitions %}
                {% do partitions.update({batch_number: []}) %}
            {% endif %}
            {%- do partitions[batch_number].append('(' + single_partition_expression + ')') -%}
            {%- if partitions[batch_number] | length == athena_partitions_limit or loop.last -%}
                {%- do partitions_batches.append(partitions[batch_number] | join(' or ')) -%}
            {%- endif -%}
        {%- endfor -%}

        {%- for batch in partitions_batches -%}
            {%- do log('BATCH PROCESSING: ' ~ loop.index ~ ' OF ' ~ partitions_batches | length) -%}
            {%- set optimize_partitions -%}
                optimize {{ relation.render_pure() }} rewrite data using bin_pack
                where {{ batch }}
            {%- endset -%}
            {%- do run_query(optimize_partitions) -%}
        {%- endfor -%}
    {% endif %}
{%- endmacro %}
```
Then in my post-hook I simply call: `{{ optimize_by_partition(this) }}`
If you like this approach, we could consider making this macro part of the adapter, reusing/refactoring code already available in the adapter itself.
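The batching step of the macro above can be sketched outside Jinja. This is a minimal Python sketch under my own assumptions (a plain list of already-rendered predicates and a hypothetical table name), not adapter code:

```python
def batch_predicates(predicates, limit=50):
    """Group per-partition predicates into OR-joined batches of at most
    `limit` partitions, as the macro does, so each OPTIMIZE statement
    stays under Athena's 100-partition limit."""
    batches = []
    for i in range(0, len(predicates), limit):
        chunk = predicates[i : i + limit]
        batches.append(" or ".join(f"({p})" for p in chunk))
    return batches

# 7 partitions with a batch limit of 3 -> 3 OPTIMIZE statements (3 + 3 + 1)
preds = [f"day=DATE'2024-01-{d:02d}'" for d in range(1, 8)]
for where in batch_predicates(preds, limit=3):
    # my_table is a placeholder for the rendered relation
    print(f"optimize my_table rewrite data using bin_pack where {where}")
```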
from dbt-athena.
Thanks a lot for the code, I will test it 🥳
Re-running the OPTIMIZE does work at some point; I tried it in the AWS console. For instance, if my table has 1000 partitions, the first 10 OPTIMIZE runs raise the error I showed, and the 11th one passes.
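That "run it until it passes" behavior could be wrapped in a simple retry loop; a minimal sketch, assuming `run_optimize` is a hypothetical callable that executes one OPTIMIZE statement and raises on the partition-limit error (how that error surfaces is an assumption here):

```python
def optimize_until_done(run_optimize, max_runs=20):
    """Re-run OPTIMIZE until it stops raising the partition-limit error.
    `run_optimize` is a hypothetical single-statement runner; returns
    the attempt number that finally succeeded."""
    for attempt in range(1, max_runs + 1):
        try:
            run_optimize()
            return attempt  # succeeded on this run
        except RuntimeError:
            continue  # more partitions left to compact; run again
    raise RuntimeError(f"OPTIMIZE still failing after {max_runs} runs")
```

With the 1000-partition example above, a runner that fails 10 times would return 11.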
Oh, I was not aware of that. Maybe it would then be worth adding what you describe: a built-in macro that "optimizes until it succeeds".
Related Issues (20)
- [Bug] `truncate()` partition transformation does not work when it includes more than 100 partitions HOT 1
- Bug Hitting `ThrottlingException` on `GetWorkGroup` with threads turned up HOT 5
- [Bug] Iceberg table materialization shouldn't s3_data_naming=table
- [Bug] Adapter error when FIPS mode is enabled HOT 4
- [Bug] Resolution failure for `create_table_as` macro when upgrading to 1.7.2 HOT 1
- upgrade to support dbt-core v1.8.0 HOT 6
- [Feature] Control glue database/schema for tmp tables generated by incremental models HOT 1
- [Bug] force_batch deletes data from model_tmp_not_partitioned before coping to the final table HOT 2
- [Feature] Rename unique_key to unique_columns or merge_on_columns HOT 3
- [Feature] Support configurable management of Table Optimisers for Iceberg tables HOT 3
- [Bug] Error when Python Model Goes To Write To Database HOT 14
- [Feature] Custom strategy for incremental models when table type is iceberg
- [Bug] dbt source freshness expected a timestamp but received a string HOT 2
- [Feature] Athena dbt-external-tables impl as independent package HOT 5
- [Bug] Clone materialization raises an error when cloning Python models HOT 2
- TABLE_NOT_FOUND Error During Unit Testing in dbt-athena 1.8 Due to Jinja Macro Dependency HOT 3
- Hive vs Iceberg timestamps in unit tests HOT 4
- [Bug] TABLE_NOT_FOUND {{tmp_relation}} when there are zero batches to process in incremental model HOT 1
- [Feature] Allow to define a different schema for tmp tables created during table materialization
- [Lake Formation] Allow lf_tags_config.tags to set multiple values