Comments (3)
@cstruct here your answers:
- Who owns this org: as now me and @thenaturalist , but we plan definitely to add more owners
- Who are part of the org: me and @thenaturalist as now, and we have 3 extra maintainers: @Jrmyy, @jessedobbelaere @mattiamatrix
- Can one become a member? of course, if you are interested, at start we give
writing
access to the repo. - How are decisions about the future of the project made? as now we are having internal communications on what to push first. Me and @Jrmyy and @thenaturalist coordinated a lot on what we want to work on. But we might plan to have regular planning sessions.
- What is expected of a contributor? We need to put some strict guidelines on how to contribute, generally a contributor might work on open issues, or push features that are relevant for her use case.
We have a kick off session with the maintainers, me or some other maintainers will write back here after we discuss some open points.
from dbt-athena.
Hi 👋 new-ish dbt-athena user here. I've been having some problems with dbt-athena updates lately, which I think boil down to that "future of the project" point in respect of what kind of PRs you accept, how they are implemented, and how you classify and handle resulting breaking changes. It's a bit of a convoluted and multi-issue story so I'll try and set it out for context before making some suggestions.
Context
I'm a consultant working with clients in AWS at the moment (more about that at https://tempered.works/ if it's useful) and using dbt with Athena - had a good experiences with dbt on BigQuery over the past couple of years. I'm currently using [email protected] for multiple pipelines that have over 1k operations in a dbt build
and I've resurrected and fixed up a PR on dbt-external-tables so that I can make existing data in S3 accessible via Athena. I am also responsible for advising clients and fellow consultants on tech choices.
I always advocate for least-privilege permissions, in line with AWS best practice. Most if not all clients I'm aware of explicitly require a least-privilege approach and have processes in place to agree changes to permissions - typically these are time-consuming and there is always the chance that the client will decline the change.
When we set up the first dbt pipeline, we figured out quite quickly what permissions were needed to run dbt-athena, agreed the permissions and got up and running. All was well for a couple of months. We try to keep our supply chains up to date (another typical infosec requirement), including packages like dbt-athena. When I tried to update from 1.3.5 a few weeks ago, my pipeline stopped working, throwing a permissions issue. The timing was problematic and there were no known vulnerabilities to address so we raised a ticket to look at as soon as time allowed.
Issues Since v1.3.5
Spin forward to now when I picked up that ticket a couple of weeks ago. I found the issue was down to a permission athena:GetWorkGroup
that had not been needed in 1.3.5, so I raised issue #262. I (initially) accepted @nicor88's response
IMHO is better to maybe document a bit more the minum permissions needed to work with this adapter.
Also, athena:GetWorkGroup is just a read permission.
Seems fair - I can probably make that argument without any problems. It's a PITA and I'd rather not spend time on it but OK.
As it happened, I'd already forked the repo wanting to understand the problem and anticipating submitted a PR and I disabled the problematic call.
Then my seeds blew up. Now I need S3 list and create permissions. Hence issue #279, tracking down the problem to PR#161 - to support large seed files. A tricky one to fix as the behaviour had been completely rewritten but I did my best and added some tests in PR#283.
Next, issue #285 to deal with the need for S3 delete permissions to clean up iceberg tables.
I haven't looked into it yet but I have a feeling that PR#249 in v1.4.5 will break me again as it'll need permissions to drop tables via Glue instead of via Athena.
😭 Point is - I'm not whining over a one-off. Each time I fixed something and rolled my fork forward I found a new issue. Had I taken the original issue and recommendation at face value, I would simply have run into more permissions issues once athena:getWorkGroup
had been sorted. From a personal and professional perspective, that would make me look incompetent with my client.
Wider Supply Chain Problems
As I mentioned, I also raised PR#277 to make the dbt-athena maintainers aware of a dependency I'd introduced in dbt-external-tables. I was unsure how best to handle the coupling (I picked up and old PR against the original dbt-athena
repo and updated to work with [email protected]
) - now that I understand better the internals of dbt-athena I think core external tables functionality should move into dbt-athena (you already have at least part of it in recent CTAS support) but I'm hesitant to suggest or implement given the recent breaks and the impact they might have on consumers of dbt-external-tables
.
Suggestions
I'm aware of and very sympathetic to the challenges of open source maintanance and I 👏 🙇 ❤️ you folks for taking that on. I want you to be successful and I applaud your goal of avoiding a need for any forks in the future.
Breaking Changes Policy
Is there any doubt about whether these permissions changes are "breaking changes"? For me - if my pipeline works under one version of dbt-athena and breaks under a future version then it's a breaking change.
The most important element of software that I (and by extension my clients) need is reliability - far more important than new features. A simple policy around how breaking changes are handled could be helpful - my original assumption was that an issue highlighting a breaking change and a PR to fix it would be accepted (subject to normal PR quality considerations) as a priority without argument. I have no problem raising PRs to patch up the problems but I've stopped as I became unconvinced they would be accepted.
On that subject - I'm lucky in having a software development background, so rolling up my sleeves and diving in to fix stuff is no problem (subject to time constraints). When I move on to my next engagement, there may not be anyone remaining who has that skillset, which is a significant consideration for the advice I give to clients. I think it's fair to say that use of dbt-athena implies knowledge of SQL and how Athena works, but does not imply any skills in Python or macro debugging. In the event of a breaking change, that might be catastrophic for a team lacking those skills. I'm not sure what specific actions I suggest here - but prioritising fixes to breaking changes over new functionality seems sensible?
Explicit Docs and Testing of IAM role
Specifying an IAM role that should run dbt-athena in the codebase should be pretty straightforward. That policy can be used in testing to ensure the permissions footprint hasn't moved. I'll raise an issue for this and I can use the policy my team uses to define that minimal permissions footprint - it'll help with my PRs! (DONE #291)
Automated Integration Testing
I can't see evidence of automated integration testing taking place. Guessing that's because an AWS account for that purpose would be risky given the inability to set an actual billing limit? I can raise an issue for that too and look into how to solve it safely (curious about this myself!)
Thanks!
Sorry about the word count here - aside from my own problems, I can see downloads of dbt-athena steadily increasing and a distribution of versions being downloaded last week all the way from 1.3.1 to 1.4.5. I suspect these challenges are coming sooner or later and hopefully I'm helping you get ahead of it.
Thanks for your hard work on dbt-athena!
from dbt-athena.
Sorry, missed a suggestion
Documented Feature Acceptance Criteria
Having some criteria over whether a new feature (as opposed to a fix) will be accepted might be a good idea, not just for consistency, but to minimise risk of breaks and keep the complexity under control. Also useful to help avoid contributors investing time in something that won't be accepted.
I think I see a pattern in dbt-athena where new features are added with the best of intentions, but for convenience or based on a perception without evidence. In adding those features, there are often subtle creeps in the permissions required or unintended consequences leading to subsequent conditional processing and exception handling.
I know this will come off as critical and I apologise - but I feel I need to call out an example to justify the claim.
I'll call out the recent switch to use Glue APIs to drop tables instead of dropping via Athena SQL. As best I can trace the lineage of that change, I can't find any evidence of a real-world problem that is addressed (I can't think of a pipeline I've been exposed to where dropping tables was a performance problem - it's the tests and materalizations that take the time). It did add a need for both Glue and S3 permissions - even in the case of Iceberg tables where AWS assert that:
Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data in the table.
This led to a confusing conversion for me with @nicor88 here because I expected dbt-athena to mirror Athena's behaviour, not Glue's.
So to my understanding, a feature was added that didn't add any needed functionality or address any concrete issue, but will break permissions-sensitive consumers (i.e. me) and did add a need for confusing code to clean up Iceberg tables when they would be cleaned up by AWS if dropped via Athena SQL.
from dbt-athena.
Related Issues (20)
- seeds with columns starting with underscore fail HOT 2
- Unable to compile test models in Athena HOT 1
- Add parameter `detailed_table_type` for `list_relations_without_caching` function
- Support column level statistics HOT 4
- Add default_lf_inherited_tags
- When using `force_batch=true` with incremental models, it will fail if there is no data to write HOT 1
- [Communication] Install Slack on the repository to enable release subscription HOT 4
- Prune old table version for incremental models. HOT 1
- Athena partitions limit fix (#360) fails with partitions defined as non-Athena functions HOT 4
- Log the query id HOT 2
- Post hook when too many partitions failed HOT 3
- Allow dbt-athena to grant permissions via lakeformation named resource method
- `external_location` ignored when `table_type='hive'` HOT 5
- Support multi-engine views in Athena
- Fine-Grain Permissions make re-creating tables difficult. HOT 9
- (Potential) bug with incremental iceberg tables HOT 6
- [Feature] set external_location for iceberg table HOT 1
- [Feature] Introduce ha flag for table materialisation and iceberg tables to allow users to control final location
- [Bug] Compilation Error when create view with lf-tags
- [Bug] pre_hook executed two times HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbt-athena.