
Comments (8)

JanKrl commented on August 30, 2024

That did the trick!
I also had to set --conf spark.sql.catalog.glue_catalog.warehouse=<s3-bucket>, because without it the session failed with: IllegalArgumentException: Cannot initialize GlueCatalog because warehousePath must not be null

For the sake of clarity, here is the full config:

type: glue
glue_version: "3.0"
query-comment: DBT model for Iceberg tables
role_arn: <role-arn>
region: eu-central-1
location: <s3-bucket>
schema: <schema-name>
session_provisioning_timeout_in_seconds: 120
workers: 2
worker_type: G.1X
idle_timeout: 5
datalake_formats: iceberg
conf: >
  spark.sql.defaultCatalog=glue_catalog
  --conf spark.sql.catalog.glue_catalog.warehouse=<s3-bucket>
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
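
The conf value above follows one pattern: the first setting is written bare and every later one is prefixed with --conf. As a minimal sketch (the helper name build_glue_conf is hypothetical and not part of dbt-glue), the same string can be generated from a dict so the prefix rule is applied consistently. It assumes the folded YAML block is read as a single space-separated string.

```python
def build_glue_conf(settings: dict[str, str]) -> str:
    """Join Spark settings into one string for the profile's `conf` key.

    The first key=value pair is emitted bare; each later pair is preceded
    by ' --conf ', matching the layout of the config shown above.
    """
    pairs = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(pairs)


# Values copied from the config above; <s3-bucket> stays a placeholder.
settings = {
    "spark.sql.defaultCatalog": "glue_catalog",
    "spark.sql.catalog.glue_catalog.warehouse": "<s3-bucket>",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}

conf = build_glue_conf(settings)
print(conf)
```

The printed string can be pasted as the conf value in the profile.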

from dbt-glue.

JanKrl commented on August 30, 2024

Another relevant finding for this issue: when a new table is created from a seed, a non-Iceberg table is created:
Input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
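
One way to tell the two kinds of table apart programmatically is to look at the table parameters in the Glue Data Catalog. The sketch below assumes that Iceberg tables registered through GlueCatalog carry a table_type parameter set to ICEBERG, and that the input has the shape of the "Table" entry returned by boto3's glue get_table call. The function name and both sample dicts are hypothetical.

```python
def is_iceberg_table(table: dict) -> bool:
    """Return True if a Glue table description looks like an Iceberg table.

    Assumption: Iceberg tables carry Parameters["table_type"] == "ICEBERG".
    """
    params = table.get("Parameters") or {}
    return params.get("table_type", "").upper() == "ICEBERG"


# Hypothetical table descriptions, for illustration only.
seed_table = {
    "Name": "my_seed",
    "StorageDescriptor": {
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    },
    "Parameters": {"classification": "parquet"},
}
iceberg_table = {
    "Name": "my_model",
    "Parameters": {"table_type": "ICEBERG"},
}

print(is_iceberg_table(seed_table))
print(is_iceberg_table(iceberg_table))
```

With real data, the dict would come from something like boto3.client("glue").get_table(DatabaseName=..., Name=...)["Table"].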

JanKrl commented on August 30, 2024

This article about DBT and Glue doesn't mention it specifically, but it seems that dbt-glue is not able to read Iceberg tables (InputFormat cannot be null). In their setup they use Hive tables for the intermediate stages and Iceberg only for the final layer.
Furthermore, it doesn't work on Glue 4.0, but it seems to work on Glue 3.0.

Can anyone confirm my conclusion that Iceberg tables can be used only in the final stage of the processing pipeline?

moryachok commented on August 30, 2024

I have the same issue with Iceberg. It may also be related to the fact that I use Lake Formation.

eshetben commented on August 30, 2024

hi, i have the same issue (dbt and dbt-glue 1.7, glue 4.0, with lake formation), so i tried replicating the dbt code and running it in a glue notebook, and i did get the exact same error in the notebook as well.

adding the glue_catalog. prefix to the table name did work for me in the notebook, but i couldn't really apply this solution to dbt, since i don't have control over that piece of code.

instead - i added these configs:

    .config("spark.sql.defaultCatalog", "glue_catalog") \
    .config("spark.sql.catalog.glue_catalog.default-namespace", "via_stage") \

that also worked in the notebook, since the job now used my catalog instead of the default one (named default).
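
for context, a fuller notebook setup might look like the sketch below. it combines the two configs above with the catalog settings quoted elsewhere in this thread; whether a given notebook needs all of them is an assumption, and the function names are hypothetical. pyspark is imported lazily so the settings can be inspected without a Spark installation.

```python
def iceberg_session_settings(
    warehouse: str, namespace: str, catalog: str = "glue_catalog"
) -> dict[str, str]:
    """Collect the Spark settings that point a session at an Iceberg Glue catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.default-namespace": namespace,
        # Use this catalog instead of the built-in one when no catalog is named.
        "spark.sql.defaultCatalog": catalog,
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


def build_session(settings: dict[str, str]):
    """Create (or reuse) a SparkSession with the given settings applied."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    builder = SparkSession.builder
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# <s3-bucket> is a placeholder; via_stage is the namespace used above.
settings = iceberg_session_settings("<s3-bucket>", "via_stage")
for key, value in settings.items():
    print(f"{key}={value}")
```

in a notebook, spark = build_session(settings) would then create the session.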

however, i still couldn't get dbt to work, even though i added these two configs in the profiles yaml.
i have a suspicion that dbt is not using these configs properly...

to conclude - i've identified 2 problems:

  1. session is using default catalog instead of glue_catalog
  2. configs might not be used properly by dbt

eshetben commented on August 30, 2024

update - got it working, don't know why it didn't work before...

the solution was adding the default configs -

        --conf spark.sql.defaultCatalog=glue_catalog
        --conf spark.sql.catalog.glue_catalog.default-namespace=<schema>

JanKrl commented on August 30, 2024

> update - got it working, don't know why it didn't work before...
>
> the solution was adding the default configs -
>
>         --conf spark.sql.defaultCatalog=glue_catalog
>         --conf spark.sql.catalog.glue_catalog.default-namespace=<schema>

When I try this, I get the error:
Catalog 'glue_catalog' plugin class not found: spark.sql.catalog.glue_catalog is not defined.
I tried with both Glue 3.0 and 4.0.

This is my conf now:

--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.defaultCatalog=glue_catalog
--conf spark.sql.catalog.glue_catalog.default-namespace=<schema-name>

eshetben commented on August 30, 2024

@JanKrl this is exactly what i have (only spark.sql.catalog.glue_catalog.warehouse might be missing) and it's working for me.

did you make sure to leave out the first --conf from the string? i made that mistake 😅 so no config was actually used

      conf: >
        spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
        --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
        ...
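
a quick check along these lines can catch that mistake before a run. this is only a sketch: the function name is hypothetical, it is not a dbt-glue feature, and the rules simply mirror the errors reported in this thread (a leading --conf, a default catalog that is never defined, a missing warehouse).

```python
def check_glue_conf(conf: str) -> list[str]:
    """Return a list of likely problems in a dbt-glue `conf` string."""
    problems = []
    text = " ".join(conf.split())  # collapse newlines and repeated spaces

    if text.startswith("--conf"):
        problems.append("value starts with --conf; the first setting should be bare")
        text = text[len("--conf"):].strip()

    # Split the remaining text into key=value pairs.
    pairs = [p.strip() for p in text.split(" --conf ") if p.strip()]
    settings = dict(p.split("=", 1) for p in pairs if "=" in p)

    default = settings.get("spark.sql.defaultCatalog")
    if default:
        catalog_key = f"spark.sql.catalog.{default}"
        if catalog_key not in settings:
            problems.append(f"{catalog_key} is not defined")
        elif f"{catalog_key}.warehouse" not in settings:
            problems.append(f"{catalog_key}.warehouse is not set")
    return problems


# Example: a leading --conf and no warehouse setting.
sample = """
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.defaultCatalog=glue_catalog
"""
for problem in check_glue_conf(sample):
    print(problem)
```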
