Comments (6)
Hi @mjkanji, my suspicion is that much of this may be stemming from a synchronization issue that can crop up when using a combination of "wait for all parents to be updated" and "materialize on this cron schedule".
The first time this condition is evaluated, if the dbt assets have been updated less recently than their parents, then they will execute immediately, regardless of whether the fivetran assets have completed yet. Then, when the fivetran assets finish materializing, the dbt assets get sent back to the "parents updated more recently" state, so the next day at midnight they are instantly "ready to kick off" again, and so on.
We are working on implementing a "parents updated since latest cron schedule tick" type of rule to solve this issue, as it will force the downstream asset to wait for the upstream assets to have been materialized after midnight before kicking off (so an upstream materialization from yesterday will not allow the downstream asset to be materialized "ahead of schedule", even if the parent did indeed materialize more recently than the child).
from dagster.
Hi @OwenKephart, thank you for the reply! Is there an ETA for when the new rule will be released?
Additionally, while I think the interaction you mentioned can be part of the story, I don't think it's the entire story. That's because I have, on multiple occasions, run all DBT assets much later in the day, after the usual cron tick, as a single run. In this case, the Fivetran assets would have been materialized hours ago, so when the cron tick arrives the next day, the parents would have all been updated before the materialization of the DBT assets. Yet, the behaviour is seemingly the same and, additionally, the DBT assets are split across multiple runs.
There's also another peculiarity in my setup and I'm wondering if that may be causing some of this. The `IOManager` I'm using for my DBT assets counts the number of rows in the table/view as part of the `handle_output` method. I'm wondering if this is messing with the order of materialization times for assets.
For example, consider a setup where `A -> B -> C` and:
- `A` is a source asset (in DBT parlance) with 1B rows that's orchestrated by Dagster (i.e., outside of DBT).
- `B` is a DBT model/view that evaluates to `select * from A`.
- `C` is a DBT model/view that evaluates to `select * from B limit 10`.
In this setup, the row count operation for C will terminate almost immediately, but counting 1B rows for B will take a while. Would this, in turn, mean that the asset materialization event (and time) for B is later than the materialization time for C, even though DBT would have correctly materialized B before C?
In essence, I'm wondering if the row count operation could cause the same issue that you're identifying with the Fivetran parents, but within the DBT group itself. Or is the materialization time recorded by Dagster dependent on when the `dbt run` command sends a `SUCCESS` event, regardless of how long any processing by the IOManager takes afterwards?
In terms of next steps:
- Is there anything else you need from my end (e.g., logs, access) that might help you identify the root cause with more certainty?
- This orchestration issue is preventing me from putting Dagster into production on a well-past-overdue project for a client, and I really need a short-term solution. What would you recommend doing to enable the ideal setup I outlined above: all the assets in the Fivetran-Assets and Data-Pipeline groups are updated first, and then all DBT assets are updated as part of a single run? I'm thinking the Fivetran-Assets and Data-Pipeline groups can remain on their current AMP setup, and I'll need to disable AMP for the DBT assets and use a `job` instead. If so, how can I ensure the job only starts after all of the parents have been updated? Would this be a job for a `Sensor`?
- Finally, what about the assets that are downstream of DBT if I'm using a job? At the moment, there's only one downstream asset in the `appwrap` group, but it needs to be run monthly, instead of daily, and only after the daily DBT run has been completed on the first of the month. Would I use a sensor for that as well?
Hi @mjkanji -- the new rule will go out in either this week's or next week's release.
In terms of a short-term solution, considering the specific use case you have is fairly simple, I think the current-day approach would be to use a combination of a schedule (for the fivetran + data pipeline groups) and a run status sensor (for the dbt assets).
For the assets downstream of the dbt assets, that could also be accomplished with a sensor (which only fires if it's the first of the month).
Definitely interested in getting to the bottom of this issue, though. The strangest part to me here is the fact that the dbt assets are executing immediately upon the cron schedule ticking. Some useful information to help debug this would be a screenshot of the Automation tab on the Asset Details page of one of the triggered assets (preferably one of the root assets of the dbt project).
Hi @OwenKephart, apologies for the late reply.
After further investigation, I was able to determine the following:
- I had a bug in my DAG. I was under the impression that external assets do not need materialization events (and are considered always up-to-date). So, some of the DBT sources that are orchestrated outside Dagster were considered never materialized by Dagster, and their children were not updating correctly. I have now fixed that.
- The asset runs do respect the `AutoMaterializeRule.skip_on_not_all_parents_updated()` rule. In fact, the multiple runs are, partially, a result of this. As the `Data-Pipeline` and `Fivetran` group runs incrementally update their assets, the corresponding children in DBT start to materialize, even though the `Data-Pipeline` run is still ongoing and materializing other assets. Since only part of the DAG is eligible to be run, Dagster splits the DBT DAG into multiple chunks and materializes what it can at a given moment in time. Hence, there are a number of runs where a single staging asset is materialized (as soon as the source for it is materialized).
To accommodate the second point above, I staggered the DBT AMP cron tick to 1 AM so that the upstream groups have an hour to materialize all the DBT sources before the DBT runs start.
However, while the above issues reduce the number of split DBT runs, the DBT assets are still not all orchestrated in a single run, which is still confusing to me. I'm not quite sure why that is the case. Dagster seems to prefer to go through the DBT DAG in tranches/different levels of depth.
This makes me want a feature that allows defining an AMP at a group/AssetSelection level (instead of just the asset level), such that if any of the assets in the selection is not eligible for auto-materialization, then none of the other assets in the group can be auto-materialized, regardless of their own status/eligibility.
Hey @OwenKephart - whilst I can't speak for the synchronicity problem, I think I am seeing the split-job behaviour referred to here.
What I've seen, in a simplified view, is this: I have a linear pipeline of the form `A -> B -> C -> D -> E`, where `A` is a source and `B` through `E` are sequentially downstream dbt assets, and `A` materialises on a simple cron. With an `AutoMaterializeRule.skip_on_not_all_parents_updated()` AMP on the dbt assets, the `B` to `E` assets materialise individually, in separate jobs, after `A` is refreshed.
What I think is happening is that:
- `A` is updated.
- The AMP tick determines that the immediate parent of `B` is updated and so updates `B`. At this point the parent of `C` is not yet updated, so `C` and downstream get skipped.
- The next AMP tick determines that `B` is updated, and so triggers `C`. `D`, depending on `C` which hasn't updated yet, is skipped.
- This follows for all downstream assets and ultimately results in each "layer" of dependency in the dbt graph being a separate job.
So at this point I have a few questions:
- Is this expected behaviour for dbt with an AMP?
- If it is, is there a way to mitigate it so that jobs are executed with a "global parent" in mind? In this example there's no reason, from a dbt perspective, why the `B` through `E` assets can't be in a single job, because the dbt runtime will handle the lineage appropriately. I appreciate that this is simplified, and it gets a lot more nuanced when considering split and many-branched dbt lineages, but even in those cases I would expect (and prefer) the jobs to map to batches of models that share the same mutual parent.
What do you think?