dataform-co / dataform-scd Goto Github PK
View Code? Open in Web Editor NEWCommon data models for creating type-2 slowly changing dimensions tables from mutable data sources in Dataform.
Home Page: https://dataform.co
License: MIT License
Common data models for creating type-2 slowly changing dimensions tables from mutable data sources in Dataform.
Home Page: https://dataform.co
License: MIT License
Hi apologies if I'm duplicating things; I also filed this on the GCP issue tracker, but as I look into it further it looks like this may be something y'all are passing into dataform/core
so I thought probably I should file here and if you agree then you will raise it with core
? Anyway here goes:
Basically my dataset and dataform project are in a single region but when I run my dataform-scd script I get an error "dataset ... not found in location EU". So for some reason it's not respecting my single-region defaultLocation, and running instead in the surrounding multi-region. Everything else in the project seems to respect my defaultLocation, except this JS.
I got the sense it was an issue with dataform/core
because the word defaultLocation
only appears once in the entire dataform-scd repo (in the sample config), so I assume it's getting passed along into the dataform/core
library (perhaps publish
function?) which is actually executing the thing and somehow dropping the single-region-ness?
I've tried to set the location every way I know how and it keeps giving me this error, like the script is really intent on running in the multi-region... is this a bug? Seems like a bug? Or maybe there's a way to set updates.config
and view.config
so it will run in the single region? I tried explicitly setting the database and dataset there, but it didn't affect the outcome.
The SCD library doesn't seems working on handling new rows (insertion).
For example:
Existing source table has rows with id : 1,2,3
After initial Dataform run
The table_scd_updates has rows with id : 1,2,3
When there is a new row in the source table : 1,2,3,4
The current logic doesn't handle the new row (id 4).
File code : dataform-scd/index.js
Code line :
"// Create an incremental table with just pure updates, for a full history of the table."
...
"where ${timestamp} > (select max(${timestamp}) from ${ctx.self()})
"
The above logic will only insert new row when the id already exists from the initial SCD creation.
Is it intentional or bug?
How if adding additional clause e.g
"where ${timestamp} > (select max(${timestamp}) from ${ctx.self()})
"
OR ${uniqueKey} NOT IN (select distinct ${uniqueKey} from ${ctx.self()}))
This is an issue occurs both when backfilling records (when deploying the SCD script for the first time) and when adding new records with no entries on the updates table which have different creation and update times.
My use case is we have a users
table and users take actions
and the users table has some slowly changing fields like their preferred language and their country of residence. So we create the updates table, we create the view with scd_valid_to
and scd_valid_from
and then we can join any action to the state of the user that existed at the time of the action (in theory). e.g.
SELECT u.language, u.country, a.*
FROM ${ref("action_tool", "action")} action
JOIN ${ref("scd", "user")} user
ON (user.id = action.user_id)
WHERE action.created_at BETWEEN user.scd_valid_from AND user.scd_valid_to
This is very convenient and fast except it doesn't work in two specific situations.
Back-filling old data: Consider a user, who joined in 2020, took dozens of actions over the last 4 years, has a recent updated_at of some time in late 2023. When I run the SCD script for the first time I get a single history record in 2024 with an updated_at value of 2023, and the view with scd_valid_from shows the same date in 2023. This means that ALL the action records from 2020 to late 2023 will fail to join in the above query. (This is what I mean by orphan records.)
Changes before the first update record: Similarly, if the SCD script is active and running daily, and a user signs up by taking an action at 12:00, not specifying a language so our system defaults them to English, and then at 12:05 they go into their profile and set their language to Spanish, and then at 2:00 the SCD job runs... it will insert a record with created_at: 12:00, updated_at: 12:05, and the view will show scd_valid_from 12:05. So the very first action they ever took (a very important one, from a business perspective) will be orphaned in the same way scenario 1.
It would be possible to complexify the views I'm using to join actions to histories, but this defeats the purpose of the convenience view. It might also be possible to add a configuration option for a created_at field so that the view detects when it's dealing with the first ever record update and set scd_valid_from to the creation date (figuring maybe there were changes in between there but we don't know so we have to be okay with some loss of specificity in exchange for not breaking all our other queries).
I think a better solution would be to add a config option for a created_at field, and have the insert script:
select * EXCEPT(updated_at), created_at AS updated_at
With this logic the Scenario 2 situation would create two records, and the scd_valid_* view would look like this
created_at | updated_at | user_id | scd_valid_from | scd_valid_to |
---|---|---|---|---|
12:00 | 12:00 | 613 | 12:00 | 12:05 |
12:00 | 12:05 | 613 | 12:05 | NULL |
This way, trying to join the original action from 12:00 and the second action from 12:05 will both work, whereas under the current logic, the first one will not join.
When using hash value comparison, new record is not loaded, if the same hash value exist in older record.
Input data:
CUSTOMER_ID | CUSTOMER_ADDRESS | UPDATE_ON | Comment |
---|---|---|---|
1 | "ADDRESS_1" | 1/1/2000 | Initial address. |
1 | "ADDRESS_2" | 1/20/2000 | Customer changed address. |
1 | "ADDRESS_1" | 1/22/2000 | Customer changed back the address to the value from line 1. |
The third value in the table will not be processed. This is happening, because when generating "ids_to_update", the historical records are not filtered.
with ids_to_update as
(select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`
except distinct
(select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`)
In order for the process to work properly, only most recent record per KEY should be compared to the input data.
Possible solution is:
with ids_to_update as
(select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`
except distinct
(select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`
qualify row_number() over (partition by col1 order by updated_at desc) = 1)
Hi apologies if I'm duplicating things; I also filed this on the GCP issue tracker, but as I look into it further it looks like this may be something y'all are passing into dataform/core
so maybe a question for the library maintainers so I thought probably I should file here and if you agree then you will raise it with the other repo? Anyway here goes:
Basically my dataset and dataform project are in a single region but when I run my dataform-scd script I get an error "dataset ... not found in location EU". So for some reason it's not respecting my single-region defaultLocation, and running instead in the surrounding multi-region. Everything else in the project seems to respect my defaultLocation, except this JS.
I got the sense it was an issue with dataform/core
because the word defaultLocation
only appears once in the entire dataform-scd repo (in the sample config), so I assume it's getting passed along into the dataform/core
library (perhaps publish
function?) which is actually executing the thing and somehow dropping the single-region-ness?
I've tried to set the location every way I know how and it keeps giving me this error, like the script is really intent on running in the multi-region... is this a bug? Seems like a bug? Or maybe there's a way to set updates.config
and view.config
so it will run in the single region? I tried explicitly setting the database and dataset there, but it didn't affect the outcome.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.