dataform-co / dataform-scd Goto Github PK

View Code? Open in Web Editor NEW

19.0 19.0 13.0 19 KB

Common data models for creating type-2 slowly changing dimensions tables from mutable data sources in Dataform.

Home Page: https://dataform.co

License: MIT License

JavaScript 100.00%

hacktoberfest

dataform-scd's People

Contributors

Stargazers

Watchers

Forkers

bhadauriyaa genemerrill polinabee alexandrejsimon leonandrade 9werayut michaelsnook 350org nnikolov84

dataform-scd's Issues

SCD not running in my single region defaultLocation, "dataset not found in location EU"

Hi apologies if I'm duplicating things; I also filed this on the GCP issue tracker, but as I look into it further it looks like this may be something y'all are passing into dataform/core so I thought probably I should file here and if you agree then you will raise it with core? Anyway here goes:

Basically my dataset and dataform project are in a single region but when I run my dataform-scd script I get an error "dataset ... not found in location EU". So for some reason it's not respecting my single-region defaultLocation, and running instead in the surrounding multi-region. Everything else in the project seems to respect my defaultLocation, except this JS.

I got the sense it was an issue with dataform/core because the word defaultLocation only appears once in the entire dataform-scd repo (in the sample config), so I assume it's getting passed along into the dataform/core library (perhaps publish function?) which is actually executing the thing and somehow dropping the single-region-ness?

I've tried to set the location every way I know how and it keeps giving me this error, like the script is really intent on running in the multi-region... is this a bug? Seems like a bug? Or maybe there's a way to set updates.config and view.config so it will run in the single region? I tried explicitly setting the database and dataset there, but it didn't affect the outcome.

Handling new insertion on existing SCD table

The SCD library doesn't seems working on handling new rows (insertion).

For example:
Existing source table has rows with id : 1,2,3

After initial Dataform run
The table_scd_updates has rows with id : 1,2,3

When there is a new row in the source table : 1,2,3,4

The current logic doesn't handle the new row (id 4).

File code : dataform-scd/index.js
Code line :
"// Create an incremental table with just pure updates, for a full history of the table."
...
"where ${timestamp} > (select max(${timestamp}) from ${ctx.self()})"

The above logic will only insert new row when the id already exists from the initial SCD creation.

Is it intentional or bug?

How if adding additional clause e.g
"where ${timestamp} > (select max(${timestamp}) from ${ctx.self()})"
OR ${uniqueKey} NOT IN (select distinct ${uniqueKey} from ${ctx.self()}))

Updates table will create "blind spots" between creation and update times

This is an issue occurs both when backfilling records (when deploying the SCD script for the first time) and when adding new records with no entries on the updates table which have different creation and update times.

My use case is we have a users table and users take actions and the users table has some slowly changing fields like their preferred language and their country of residence. So we create the updates table, we create the view with scd_valid_to and scd_valid_from and then we can join any action to the state of the user that existed at the time of the action (in theory). e.g.

SELECT u.language, u.country, a.*
FROM ${ref("action_tool", "action")} action
JOIN ${ref("scd", "user")} user
  ON (user.id = action.user_id)
WHERE action.created_at BETWEEN user.scd_valid_from AND user.scd_valid_to

This is very convenient and fast except it doesn't work in two specific situations.

Back-filling old data: Consider a user, who joined in 2020, took dozens of actions over the last 4 years, has a recent updated_at of some time in late 2023. When I run the SCD script for the first time I get a single history record in 2024 with an updated_at value of 2023, and the view with scd_valid_from shows the same date in 2023. This means that ALL the action records from 2020 to late 2023 will fail to join in the above query. (This is what I mean by orphan records.)
Changes before the first update record: Similarly, if the SCD script is active and running daily, and a user signs up by taking an action at 12:00, not specifying a language so our system defaults them to English, and then at 12:05 they go into their profile and set their language to Spanish, and then at 2:00 the SCD job runs... it will insert a record with created_at: 12:00, updated_at: 12:05, and the view will show scd_valid_from 12:05. So the very first action they ever took (a very important one, from a business perspective) will be orphaned in the same way scenario 1.

It would be possible to complexify the views I'm using to join actions to histories, but this defeats the purpose of the convenience view. It might also be possible to add a configuration option for a created_at field so that the view detects when it's dealing with the first ever record update and set scd_valid_from to the creation date (figuring maybe there were changes in between there but we don't know so we have to be okay with some loss of specificity in exchange for not breaking all our other queries).

I think a better solution would be to add a config option for a created_at field, and have the insert script:

In the query that identifies IDs to insert, also identify whether the record is completely new to the updates table or whether it's simply being updated.
When the record is being updated only, use the present behavior.
When the record has being inserted for the first time, if the created_at and updated_at values are different, insert TWO updates, one normal one (like the current logic) and one simulated original record, e.g. select * EXCEPT(updated_at), created_at AS updated_at

With this logic the Scenario 2 situation would create two records, and the scd_valid_* view would look like this

created_at	updated_at	user_id	scd_valid_from	scd_valid_to
12:00	12:00	613	12:00	12:05
12:00	12:05	613	12:05	NULL

This way, trying to join the original action from 12:00 and the second action from 12:05 will both work, whereas under the current logic, the first one will not join.

Hash value comparison: new record is not loaded, if the same hash value exist in older record.

When using hash value comparison, new record is not loaded, if the same hash value exist in older record.

Input data:

CUSTOMER_ID	CUSTOMER_ADDRESS	UPDATE_ON	Comment
1	"ADDRESS_1"	1/1/2000	Initial address.
1	"ADDRESS_2"	1/20/2000	Customer changed address.
1	"ADDRESS_1"	1/22/2000	Customer changed back the address to the value from line 1.

The third value in the table will not be processed. This is happening, because when generating "ids_to_update", the historical records are not filtered.

with ids_to_update as 
   (select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`        
     except distinct         
   (select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`)

In order for the process to work properly, only most recent record per KEY should be compared to the input data.
Possible solution is:

with ids_to_update as 
   (select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`        
     except distinct         
   (select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`
      qualify row_number() over (partition by col1 order by updated_at desc) = 1)

SCD script is not running in my defaultLocation, creating error "dataset not found in location EU"

Hi apologies if I'm duplicating things; I also filed this on the GCP issue tracker, but as I look into it further it looks like this may be something y'all are passing into dataform/core so maybe a question for the library maintainers so I thought probably I should file here and if you agree then you will raise it with the other repo? Anyway here goes:

dataform-co / dataform-scd Goto Github PK

dataform-scd's People

Contributors

Stargazers

Watchers

Forkers

dataform-scd's Issues

SCD not running in my single region defaultLocation, "dataset not found in location EU"

Handling new insertion on existing SCD table

Updates table will create "blind spots" between creation and update times

Hash value comparison: new record is not loaded, if the same hash value exist in older record.

SCD script is not running in my defaultLocation, creating error "dataset not found in location EU"

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent