Comments (6)
Hi @mjkanji, my suspicion is that much of this may be stemming from a synchronization issue that can crop up when using a combination of "wait for all parents to be updated" and "materialize on this cron schedule".
The first time this condition is evaluated, if the dbt assets have been updated less recently than their parents, then they will execute immediately, regardless of whether the fivetran assets have completed yet. Then, when the fivetran assets finish materializing, the dbt assets get sent back to the "parents updated more recently" state, so the next day at midnight they are instantly "ready to kick off" again, and so on.
We are working on implementing a "parents updated since latest cron schedule tick" type of rule to solve this issue, as it will force the downstream asset to wait for the upstream assets to have been materialized after midnight before kicking off (so an upstream materialization from yesterday will not allow the downstream asset to be materialized "ahead of schedule", even if the parent did indeed materialize more recently than the child).
from dagster.
Hi @OwenKephart, thank you for the reply! Is there an ETA for when the new rule will be released?
Additionally, while I think the interaction you mentioned can be part of the story, I don't think it's the entire story. That's because I have, on multiple occasions, run all DBT assets much later in the day, after the usual cron tick, as a single run. In this case, the Fivetran assets would have been materialized hours ago, so when the cron tick arrives the next day, the parents would have all been updated before the materialization of the DBT assets. Yet, the behaviour is seemingly the same and, additionally, the DBT assets are split across multiple runs.
There's also another peculiarity in my setup and I'm wondering if that may be causing some of this. The `IOManager` I'm using for my DBT assets counts the number of rows in the table/view as part of the `handle_output` method. I'm wondering if this is messing with the order of materialization times for assets.
For example, consider a setup where `A -> B -> C` and:
- `A` is a source asset (in DBT parlance) with 1B rows that's orchestrated by Dagster (i.e., outside of DBT).
- `B` is a DBT model/view that evaluates to `select * from A`.
- `C` is a DBT model/view that evaluates to `select * from B limit 10`.
In this setup, the row count operation for C will terminate almost immediately, but counting 1B rows for B will take a while. Would this, in turn, mean that the asset materialization event (and time) for B is later than the materialization time for C, even though DBT would have correctly materialized B before C?
In essence, I'm wondering if the row count operation could cause the same issue that you're identifying with the Fivetran parents, but within the DBT group itself. Or is the materialization time recorded by Dagster dependent on when the `dbt run` command sends a `SUCCESS` event, regardless of how long any processing by the IOManager takes afterwards?
In terms of next steps:
- Is there anything else you need from my end (e.g., logs, access) that might help you identify the root cause with more certainty?
- This orchestration issue is preventing me from putting Dagster into production on a well-past-overdue project for a client, and I really need a short-term solution. What would you recommend doing to enable the ideal setup I outlined above: all the assets in the Fivetran-Assets and Data-Pipeline groups are updated first, and then all DBT assets are updated as part of a single run? I'm thinking the Fivetran-Assets and Data-Pipeline groups can remain on their current AMP setup, and I'll need to disable AMP for the DBT assets and use a `job` instead. If so, how can I ensure the job only starts after all of the parents have been updated? Would this be a job for a `Sensor`?
- Finally, what about the assets that are downstream of DBT if I'm using a job? At the moment, there's only one downstream asset in the `appwrap` group, but it needs to be run monthly, instead of daily, and only after the daily DBT run has been completed on the first of the month. Would I use a sensor for that as well?
Hi @mjkanji -- the new rule will go out in either this week's or next week's release.
In terms of a short-term solution, considering the specific use case you have is fairly simple, I think the current-day approach would be to use a combination of a schedule (for the fivetran + data pipeline groups) and a run status sensor (for the dbt assets).
For the assets downstream of the dbt assets, that could also be accomplished with a sensor (which only fires if it's the first of the month).
Definitely interested in getting to the bottom of this issue, though. The strangest part to me here is the fact that the dbt assets are executing immediately upon the cron schedule ticking. Some useful information to help debug this would be a screenshot of the Automation tab on the Asset Details page of one of the triggered assets (preferably one of the root assets of the dbt project).
Hi @OwenKephart, apologies for the late reply.
After further investigation, I was able to determine the following:
- I had a bug in my DAG. I was under the impression that external assets do not need materialization events (and are considered always up-to-date). So, some of the DBT sources that are orchestrated outside Dagster were considered never materialized by Dagster, and their children were not updating correctly. I have now fixed that.
- The asset runs do respect the `AutoMaterializeRule.skip_on_not_all_parents_updated()` rule. In fact, the multiple runs are, partially, a result of this. As the `Data-Pipeline` and `Fivetran` group runs incrementally update their assets, the corresponding children in DBT start to materialize, even though the `Data-Pipeline` run is still ongoing and materializing other assets. Since only part of the DAG is eligible to be run, Dagster splits the DBT DAG into multiple chunks and materializes what it can at a given moment in time. Hence, there are a number of runs where a single staging asset is materialized (as soon as the source for it is materialized).
To accommodate the second point above, I staggered the DBT AMP cron tick to 1 AM so that the upstream groups have an hour to materialize all the DBT sources before the DBT runs start.
However, while the above issues reduce the number of split DBT runs, the DBT assets are still not all orchestrated in a single run, which is still confusing to me. I'm not quite sure why that is the case. Dagster seems to prefer to go through the DBT DAG in tranches/different levels of depth.
This makes me want a feature that allows defining an AMP at a group/AssetSelection level (instead of just the asset level), such that if any of the assets in the selection is not eligible for auto-materialization, then none of the other assets in the group can be auto-materialized, regardless of their own status/eligibility.
Hey @OwenKephart - whilst I can't speak for the synchronicity problem, I think I am seeing the split-job behaviour referred to here.
What I've seen, in a simplified view, is this: I have a linear pipeline of the form `A -> B -> C -> D -> E`, where `A` is a source and `B` through `E` are sequentially downstream dbt assets, and `A` materialises on a simple cron. With an `AutoMaterializeRule.skip_on_not_all_parents_updated()` AMP on the dbt assets, the `B` to `E` assets materialise individually, in separate jobs, after `A` is refreshed.
What I think is happening is that:
- `A` is updated.
- The AMP tick determines that the immediate parent of `B` is updated and so updates `B`. At this point the parent of `C` is not yet updated, so `C` and downstream get skipped.
- The next AMP tick determines that `B` is updated, and so triggers `C`. `D`, depending on `C` which hasn't updated yet, is skipped.
- This follows for all downstream assets and ultimately results in each "layer" of dependency in the dbt graph being a separate job.
So at this point I have a few questions:
- Is this expected behaviour for dbt with an AMP?
- If it is, is there a way to mitigate it so that jobs are executed with a "global parent" in mind? In this example there's no reason, from a dbt perspective, why the `B` through `E` assets can't be in a single job, because the dbt runtime will handle the lineage appropriately. I appreciate that this is simplified, and it gets a lot more nuanced when considering split and many-branched dbt lineages, but even in those cases I would expect (and prefer) the jobs to map to batches of models that share the same mutual parent.
What do you think?