Git Product home page Git Product logo

Comments (3)

nicor88 avatar nicor88 commented on June 12, 2024

Did you consider to build a custom macro to run optimize in batches? that's what I do at the moment, and it's pure inspired by what @svdimchenko proposed on the way we handle more than 100 partitions.
Re-running the optimize might not work with partitioned datasets - I believe that the best way is to run "optimize" with specific where statements.

Here the code:

{% macro optimize_by_partition(relation) -%}
  {%- set athena_partitions_limit = config.get('athena_optimize_partitions_limit', 50) | int -%}
  {%- set partitioned_by = config.get('partitioned_by') -%}
  {%- set partitioned_keys = adapter.format_partition_keys(partitioned_by) -%}
  {% do log('PARTITIONED KEYS: ' ~ partitioned_keys) %}
  {% call statement('get_partitions', fetch_result=True) %}
      select distinct {{ partitioned_keys }} from {{ relation }} order by {{ partitioned_keys }};
  {% endcall %}

  {%- set results = load_result('get_partitions') -%}

  {%- if results is not none %}
    {%- set table = results.table -%}
    {%- set rows = table.rows -%}
    {%- set partitions_batches = [] -%}
    {%- set partitions = {} -%}
    {% do log('TOTAL PARTITIONS TO PROCESS: ' ~ rows | length) %}

    {%- for row in rows -%}
        {%- set single_partition = [] -%}
        {%- for col in row -%}

            {%- set column_type = adapter.convert_type(table, loop.index0) -%}
            {%- set comp_func = '=' -%}
            {%- if col is none -%}
                {%- set value = 'null' -%}
                {%- set comp_func = ' is ' -%}
            {%- elif column_type == 'integer' -%}
                {%- set value = col | string -%}
            {%- elif column_type == 'string' -%}
                {%- set value = "'" + col + "'" -%}
            {%- elif column_type == 'date' -%}
                {%- set value = "DATE'" + col | string + "'" -%}
            {%- elif column_type == 'timestamp' -%}
                {%- set value = "TIMESTAMP'" + col | string + "'" -%}
            {%- else -%}
                {%- do exceptions.raise_compiler_error('Need to add support for column type ' + column_type) -%}
            {%- endif -%}
            {%- set partition_key = adapter.format_one_partition_key(partitioned_by[loop.index0]) -%}
            {%- do single_partition.append(partition_key + comp_func + value) -%}
        {%- endfor -%}

        {%- set single_partition_expression = single_partition | join(' and ') -%}

        {%- set batch_number = (loop.index0 / athena_partitions_limit) | int -%}
        {% if not batch_number in partitions %}
            {% do partitions.update({batch_number: []}) %}
        {% endif %}

        {%- do partitions[batch_number].append('(' + single_partition_expression + ')') -%}
        {%- if partitions[batch_number] | length == athena_partitions_limit or loop.last -%}
            {%- do partitions_batches.append(partitions[batch_number] | join(' or ')) -%}
        {%- endif -%}
    {%- endfor -%}

    {%- for batch in partitions_batches -%}
      {%- do log('BATCH PROCESSING: ' ~ loop.index ~ ' OF ' ~ partitions_batches|length) -%}
      {%- set optimize_partitions -%}
        optimize {{relation.render_pure()}} rewrite data using bin_pack
        where {{ batch }}
      {%- endset -%}
      {%- do run_query(optimize_partitions) -%}
    {%- endfor -%}
  {% endif %}

{%- endmacro %}

then in my post-hook I simply call: {{ optimize_by_partition(this)}}

If you like this approach, we can consider to make such macro as part of the adapter, and we can re-use/refactor code available in the adapter itself.

from dbt-athena.

Jrmyy avatar Jrmyy commented on June 12, 2024

Thanks a lot for the code, I will test it 🥳

Re-running the optimize works at some point, I tried using AWS console. For instance if my table has 1000 partitions, running 10 times optimizes will raise the error I showed, and the 11th time passes.

from dbt-athena.

nicor88 avatar nicor88 commented on June 12, 2024

Oh I was not aware of that, maybe then could be worth it to add what you describe, a built in macro that "optimize till it should".

from dbt-athena.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.