
Comments (3)

nicor88 commented on May 20, 2024

@cstruct here are your answers:

  • Who owns this org? For now, me and @thenaturalist, but we definitely plan to add more owners.
  • Who is part of the org? Me and @thenaturalist for now, plus three extra maintainers: @Jrmyy, @jessedobbelaere and @mattiamatrix.
  • Can one become a member? Of course; if you are interested, we start by giving write access to the repo.
  • How are decisions about the future of the project made? For now we discuss internally what to push first. @Jrmyy, @thenaturalist and I coordinated a lot on what we want to work on, but we might set up regular planning sessions.
  • What is expected of a contributor? We still need to put some stricter guidelines in place on how to contribute; generally a contributor might work on open issues or push features that are relevant for their use case.

We have a kick-off session with the maintainers; I or another maintainer will write back here after we discuss some open points.


brabster commented on May 20, 2024

Hi 👋 new-ish dbt-athena user here. I've been having some problems with dbt-athena updates lately, which I think boil down to that "future of the project" point with respect to what kind of PRs you accept, how they are implemented, and how you classify and handle the resulting breaking changes. It's a bit of a convoluted, multi-issue story, so I'll try to set it out for context before making some suggestions.

Context

I'm a consultant working with clients in AWS at the moment (more about that at https://tempered.works/ if it's useful) and using dbt with Athena - I've had good experiences with dbt on BigQuery over the past couple of years. I'm currently using dbt-athena 1.3.5 for multiple pipelines that run over 1k operations in a dbt build, and I've resurrected and fixed up a PR on dbt-external-tables so that I can make existing data in S3 accessible via Athena. I am also responsible for advising clients and fellow consultants on tech choices.

I always advocate for least-privilege permissions, in line with AWS best practice. Most if not all clients I'm aware of explicitly require a least-privilege approach and have processes in place to agree changes to permissions - typically these are time-consuming and there is always the chance that the client will decline the change.

When we set up the first dbt pipeline, we figured out quite quickly what permissions were needed to run dbt-athena, agreed on the permissions and got up and running. All was well for a couple of months. We try to keep our supply chains up to date (another typical infosec requirement), including packages like dbt-athena. When I tried to update from 1.3.5 a few weeks ago, my pipeline stopped working, throwing a permissions issue. The timing was problematic and there were no known vulnerabilities to address, so we raised a ticket to look at it as soon as time allowed.

Issues Since v1.3.5

Spin forward to a couple of weeks ago, when I picked up that ticket. I found the issue was down to a permission, athena:GetWorkGroup, that had not been needed in 1.3.5, so I raised issue #262. I (initially) accepted @nicor88's response:

IMHO it's better to maybe document a bit more the minimum permissions needed to work with this adapter.
Also, athena:GetWorkGroup is just a read permission.

Seems fair - I can probably make that argument without any problems. It's a PITA and I'd rather not spend time on it but OK.

As it happened, I'd already forked the repo, wanting to understand the problem and anticipating submitting a PR, and I disabled the problematic call.

Then my seeds blew up: now I needed S3 list and create permissions. Hence issue #279, tracking the problem down to PR#161, which added support for large seed files. A tricky one to fix, as the behaviour had been completely rewritten, but I did my best and added some tests in PR#283.

Next came issue #285, to deal with the need for S3 delete permissions to clean up Iceberg tables.

I haven't looked into it yet but I have a feeling that PR#249 in v1.4.5 will break me again as it'll need permissions to drop tables via Glue instead of via Athena.

😭 The point is - I'm not whining over a one-off. Each time I fixed something and rolled my fork forward, I found a new issue. Had I taken the original issue and recommendation at face value, I would simply have run into more permissions issues once athena:GetWorkGroup had been sorted. From a personal and professional perspective, that would make me look incompetent with my client.

Wider Supply Chain Problems

As I mentioned, I also raised PR#277 to make the dbt-athena maintainers aware of a dependency I'd introduced in dbt-external-tables. I was unsure how best to handle the coupling (I picked up an old PR against the original dbt-athena repo and updated it to work with a recent dbt-athena release). Now that I better understand the internals of dbt-athena, I think core external tables functionality should move into dbt-athena (you already have at least part of it in the recent CTAS support), but I'm hesitant to suggest or implement that given the recent breaks and the impact they might have on consumers of dbt-external-tables.

Suggestions

I'm aware of and very sympathetic to the challenges of open source maintenance, and I 👏 🙇 ❤️ you folks for taking that on. I want you to be successful and I applaud your goal of avoiding the need for any forks in the future.

Breaking Changes Policy

Is there any doubt about whether these permissions changes are "breaking changes"? For me, if my pipeline works under one version of dbt-athena and breaks under a future version, then it's a breaking change.

The most important property of software that I (and, by extension, my clients) need is reliability - far more important than new features. A simple policy around how breaking changes are handled could be helpful. My original assumption was that an issue highlighting a breaking change and a PR to fix it would be accepted (subject to normal PR quality considerations) as a priority, without argument. I have no problem raising PRs to patch up the problems, but I've stopped because I became unconvinced they would be accepted.

On that subject - I'm lucky in having a software development background, so rolling up my sleeves and diving in to fix stuff is no problem (subject to time constraints). When I move on to my next engagement, there may not be anyone remaining who has that skillset, which is a significant consideration for the advice I give to clients. I think it's fair to say that use of dbt-athena implies knowledge of SQL and of how Athena works, but does not imply any skills in Python or macro debugging. In the event of a breaking change, that might be catastrophic for a team lacking those skills. I'm not sure what specific actions to suggest here - but prioritising fixes for breaking changes over new functionality seems sensible?

Explicit Docs and Testing of the IAM Role

Specifying, in the codebase, an IAM policy for the role that runs dbt-athena should be pretty straightforward. That policy can then be used in testing to ensure the permissions footprint hasn't moved. I'll raise an issue for this, and I can use the policy my team uses to define that minimal permissions footprint - it'll help with my PRs! (DONE #291)
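
To make the idea concrete, here's a minimal sketch of what such a check could look like. The policy document and action list below are illustrative assumptions based on the permissions discussed in this thread, not the actual policy from #291; the test simulates them with the standard IAM policy simulator via boto3.

```python
# Sketch: assert that a checked-in least-privilege policy still covers the
# actions dbt-athena is known (from this thread) to call. The policy contents
# and the action list are illustrative assumptions, not the official footprint.
# Running the simulation requires credentials with iam:SimulateCustomPolicy.
import json

import boto3

# Hypothetical checked-in policy document (would live in the repo as JSON).
CANDIDATE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:GetWorkGroup",  # needed after 1.3.5 (issue #262)
                "s3:ListBucket",        # seeds after PR#161 (issue #279)
                "s3:PutObject",
                "s3:DeleteObject",      # Iceberg clean-up (issue #285)
                "glue:GetTable",
                "glue:GetDatabase",
            ],
            "Resource": "*",            # a real policy would scope resources
        }
    ],
}

# Actions the adapter is expected to call; kept in sync with each release.
EXPECTED_ACTIONS = ["athena:GetWorkGroup", "s3:ListBucket", "s3:DeleteObject"]


def test_policy_covers_expected_actions():
    iam = boto3.client("iam")
    result = iam.simulate_custom_policy(
        PolicyInputList=[json.dumps(CANDIDATE_POLICY)],
        ActionNames=EXPECTED_ACTIONS,
    )
    denied = [
        r["EvalActionName"]
        for r in result["EvaluationResults"]
        if r["EvalDecision"] != "allowed"
    ]
    assert not denied, f"Policy no longer covers: {denied}"
```

The useful property is that any change widening the adapter's permission footprint would have to touch the checked-in policy and the expected-actions list, so the break becomes visible at review time rather than at deploy time.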

Automated Integration Testing

I can't see evidence of automated integration testing taking place. I'm guessing that's because an AWS account for that purpose would be risky, given the inability to set an actual billing limit? I can raise an issue for that too and look into how to solve it safely (I'm curious about this myself!)
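
For what it's worth, the simplest shape I can imagine for such a test is just driving the dbt CLI against a tiny project in a sandbox account. The athena_ci target name and the ci profiles directory below are assumptions for illustration, not existing project configuration.

```python
# Sketch: one possible shape for an automated integration test, assuming a
# dedicated sandbox AWS account and a dbt target named "athena_ci".
import subprocess


def test_dbt_build_against_sandbox():
    # Runs the adapter end-to-end against a throwaway Athena workgroup.
    # Cost exposure is limited by keeping the test project tiny; AWS Budgets
    # can alert on spend but not hard-stop it, which is the billing-limit
    # concern mentioned above.
    subprocess.run(
        ["dbt", "build", "--profiles-dir", "ci", "--target", "athena_ci"],
        check=True,
    )
```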

Thanks!

Sorry about the word count here - aside from my own problems, I can see downloads of dbt-athena steadily increasing, and the versions downloaded last week ranged all the way from 1.3.1 to 1.4.5. I suspect these challenges are coming sooner or later, and hopefully I'm helping you get ahead of them.

Thanks for your hard work on dbt-athena!


brabster commented on May 20, 2024

Sorry, I missed a suggestion.

Documented Feature Acceptance Criteria

Having some criteria for whether a new feature (as opposed to a fix) will be accepted might be a good idea, not just for consistency, but to minimise the risk of breaks and keep complexity under control. It would also help avoid contributors investing time in something that won't be accepted.

I think I see a pattern in dbt-athena where new features are added with the best of intentions, but for convenience or based on a perception rather than evidence. In adding those features, there is often subtle creep in the permissions required, or there are unintended consequences that lead to subsequent conditional processing and exception handling.

I know this will come off as critical and I apologise - but I feel I need to call out an example to justify the claim.

I'll call out the recent switch to using Glue APIs to drop tables instead of dropping them via Athena SQL. As best I can trace the lineage of that change, I can't find any evidence of a real-world problem that it addresses (I can't think of a pipeline I've been exposed to where dropping tables was a performance problem - it's the tests and materializations that take the time). It did add a need for both Glue and S3 permissions - even in the case of Iceberg tables, where AWS assert that:

Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data in the table.

This led to a confusing conversation with @nicor88 here, because I expected dbt-athena to mirror Athena's behaviour, not Glue's.

So, to my understanding, a feature was added that didn't add any needed functionality or address any concrete issue, but it will break permissions-sensitive consumers (i.e. me), and it did add the need for confusing code to clean up Iceberg tables that AWS would have cleaned up had they been dropped via Athena SQL.
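
To make the permissions difference concrete, here is a rough sketch of the two drop paths as I understand them - my own illustration, not the adapter's actual code. The bucket/prefix handling is an assumption; the point is which IAM actions each path implies.

```python
# Illustration only (not dbt-athena's implementation): two ways to drop a
# table, and the permissions each path implies.
import boto3

athena = boto3.client("athena")
glue = boto3.client("glue")
s3 = boto3.client("s3")


def drop_via_athena_sql(database: str, table: str, workgroup: str) -> None:
    # Needs athena:StartQueryExecution. For Iceberg (managed) tables, Athena
    # also removes the table data, so no S3 permissions are required here.
    athena.start_query_execution(
        QueryString=f"DROP TABLE `{database}`.`{table}`",
        WorkGroup=workgroup,
    )


def drop_via_glue_api(database: str, table: str, bucket: str, prefix: str) -> None:
    # Needs glue:DeleteTable, plus s3:ListBucket and s3:DeleteObject to clean
    # up the data files that Glue leaves behind - the permissions creep
    # described above.
    glue.delete_table(DatabaseName=database, Name=table)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})
```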
