Git Product home page Git Product logo

dataform-scd's People

Contributors

alexandrejsimon avatar benbirt avatar dependabot[bot] avatar ekrekr avatar lewish avatar polinabee avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

dataform-scd's Issues

SCD not running in my single region defaultLocation, "dataset not found in location EU"

Hi apologies if I'm duplicating things; I also filed this on the GCP issue tracker, but as I look into it further it looks like this may be something y'all are passing into dataform/core so I thought probably I should file here and if you agree then you will raise it with core? Anyway here goes:

Basically my dataset and dataform project are in a single region but when I run my dataform-scd script I get an error "dataset ... not found in location EU". So for some reason it's not respecting my single-region defaultLocation, and running instead in the surrounding multi-region. Everything else in the project seems to respect my defaultLocation, except this JS.

I got the sense it was an issue with dataform/core because the word defaultLocation only appears once in the entire dataform-scd repo (in the sample config), so I assume it's getting passed along into the dataform/core library (perhaps publish function?) which is actually executing the thing and somehow dropping the single-region-ness?

I've tried to set the location every way I know how and it keeps giving me this error, like the script is really intent on running in the multi-region... is this a bug? Seems like a bug? Or maybe there's a way to set updates.config and view.config so it will run in the single region? I tried explicitly setting the database and dataset there, but it didn't affect the outcome.

Handling new insertion on existing SCD table

The SCD library doesn't seems working on handling new rows (insertion).

For example:
Existing source table has rows with id : 1,2,3

After initial Dataform run
The table_scd_updates has rows with id : 1,2,3

When there is a new row in the source table : 1,2,3,4

The current logic doesn't handle the new row (id 4).

File code : dataform-scd/index.js
Code line :
"// Create an incremental table with just pure updates, for a full history of the table."
...
"where ${timestamp} > (select max(${timestamp}) from ${ctx.self()})"

The above logic will only insert new row when the id already exists from the initial SCD creation.

Is it intentional or bug?

How if adding additional clause e.g
"where ${timestamp} > (select max(${timestamp}) from ${ctx.self()})"
OR ${uniqueKey} NOT IN (select distinct ${uniqueKey} from ${ctx.self()}))

Updates table will create "blind spots" between creation and update times

This is an issue occurs both when backfilling records (when deploying the SCD script for the first time) and when adding new records with no entries on the updates table which have different creation and update times.

My use case is we have a users table and users take actions and the users table has some slowly changing fields like their preferred language and their country of residence. So we create the updates table, we create the view with scd_valid_to and scd_valid_from and then we can join any action to the state of the user that existed at the time of the action (in theory). e.g.

SELECT u.language, u.country, a.*
FROM ${ref("action_tool", "action")} action
JOIN ${ref("scd", "user")} user
  ON (user.id = action.user_id)
WHERE action.created_at BETWEEN user.scd_valid_from AND user.scd_valid_to

This is very convenient and fast except it doesn't work in two specific situations.

  1. Back-filling old data: Consider a user, who joined in 2020, took dozens of actions over the last 4 years, has a recent updated_at of some time in late 2023. When I run the SCD script for the first time I get a single history record in 2024 with an updated_at value of 2023, and the view with scd_valid_from shows the same date in 2023. This means that ALL the action records from 2020 to late 2023 will fail to join in the above query. (This is what I mean by orphan records.)

  2. Changes before the first update record: Similarly, if the SCD script is active and running daily, and a user signs up by taking an action at 12:00, not specifying a language so our system defaults them to English, and then at 12:05 they go into their profile and set their language to Spanish, and then at 2:00 the SCD job runs... it will insert a record with created_at: 12:00, updated_at: 12:05, and the view will show scd_valid_from 12:05. So the very first action they ever took (a very important one, from a business perspective) will be orphaned in the same way scenario 1.

It would be possible to complexify the views I'm using to join actions to histories, but this defeats the purpose of the convenience view. It might also be possible to add a configuration option for a created_at field so that the view detects when it's dealing with the first ever record update and set scd_valid_from to the creation date (figuring maybe there were changes in between there but we don't know so we have to be okay with some loss of specificity in exchange for not breaking all our other queries).

I think a better solution would be to add a config option for a created_at field, and have the insert script:

  1. In the query that identifies IDs to insert, also identify whether the record is completely new to the updates table or whether it's simply being updated.
  2. When the record is being updated only, use the present behavior.
  3. When the record has being inserted for the first time, if the created_at and updated_at values are different, insert TWO updates, one normal one (like the current logic) and one simulated original record, e.g. select * EXCEPT(updated_at), created_at AS updated_at

With this logic the Scenario 2 situation would create two records, and the scd_valid_* view would look like this

created_at updated_at user_id scd_valid_from scd_valid_to
12:00 12:00 613 12:00 12:05
12:00 12:05 613 12:05 NULL

This way, trying to join the original action from 12:00 and the second action from 12:05 will both work, whereas under the current logic, the first one will not join.

Hash value comparison: new record is not loaded, if the same hash value exist in older record.

When using hash value comparison, new record is not loaded, if the same hash value exist in older record.

Input data:

CUSTOMER_ID CUSTOMER_ADDRESS UPDATE_ON Comment
1 "ADDRESS_1" 1/1/2000 Initial address.
1 "ADDRESS_2" 1/20/2000 Customer changed address.
1 "ADDRESS_1" 1/22/2000 Customer changed back the address to the value from line 1.

The third value in the table will not be processed. This is happening, because when generating "ids_to_update", the historical records are not filtered.

with ids_to_update as 
   (select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`        
     except distinct         
   (select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`)

In order for the process to work properly, only most recent record per KEY should be compared to the input data.
Possible solution is:

with ids_to_update as 
   (select col1, HASH_COLUMN from `....TEST.TEST_TABLE_SRC_01`        
     except distinct         
   (select col1, HASH_COLUMN from `....TEST.source_data_scd_updates`
      qualify row_number() over (partition by col1 order by updated_at desc) = 1)

SCD script is not running in my defaultLocation, creating error "dataset not found in location EU"

Hi apologies if I'm duplicating things; I also filed this on the GCP issue tracker, but as I look into it further it looks like this may be something y'all are passing into dataform/core so maybe a question for the library maintainers so I thought probably I should file here and if you agree then you will raise it with the other repo? Anyway here goes:

Basically my dataset and dataform project are in a single region but when I run my dataform-scd script I get an error "dataset ... not found in location EU". So for some reason it's not respecting my single-region defaultLocation, and running instead in the surrounding multi-region. Everything else in the project seems to respect my defaultLocation, except this JS.

I got the sense it was an issue with dataform/core because the word defaultLocation only appears once in the entire dataform-scd repo (in the sample config), so I assume it's getting passed along into the dataform/core library (perhaps publish function?) which is actually executing the thing and somehow dropping the single-region-ness?

I've tried to set the location every way I know how and it keeps giving me this error, like the script is really intent on running in the multi-region... is this a bug? Seems like a bug? Or maybe there's a way to set updates.config and view.config so it will run in the single region? I tried explicitly setting the database and dataset there, but it didn't affect the outcome.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.