dbt_mixpanel's Issues

[Feature] Improve run time of package with additional filter

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

When running within BigQuery, I noticed that the stg_mixpanel__event_tmp model does not apply the same filter that the mixpanel__event model does, i.e.:

where time >= {{ "'" ~ var('date_range_start',  '2010-01-01') ~ "'" }}

Applying this filter in the view can significantly reduce the resources required to run the model. Against our own data in BigQuery, a run of the original code processed 4.45 TB; applying this filter, with the configuration below in our dbt_project.yml file, brought that down to 276 GB, a drastic improvement.

This is especially relevant in development, where you can use Jinja to limit the number of records, greatly improving the time to test, e.g.:

vars:
   mixpanel:
      date_range_start: "{{ '2019-11-01' if target.name == 'production' else (modules.datetime.date.today() - modules.datetime.timedelta(7)) }}"

I believe the same variable-based filtering needs to be applied in the temporary table, since it is used as the source of the mixpanel__event model, to ensure the same amount of data is processed throughout.
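
As a sketch (not the package's actual SQL), the tmp model could apply the same variable-driven filter shown above:

select *
from {{ source('mixpanel', 'event') }}
where time >= {{ "'" ~ var('date_range_start',  '2010-01-01') ~ "'" }}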

Describe alternatives you've considered

No response

Are you interested in contributing this feature?

  • Yes.
  • Yes, but I will need assistance and will schedule time during your office hours for guidance.
  • No.

Anything else?

No response

Add deduping logic within README

Following a conversation with a customer about the deduping logic within mixpanel__events, we found it would be beneficial to clarify the reasoning behind our logic in the model and, more importantly, within the README.

[Feature] Update README

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

The README needs to be updated to the current format.

Describe alternatives you've considered

No response

Are you interested in contributing this feature?

  • Yes.
  • Yes, but I will need assistance and will schedule time during your office hours for guidance.
  • No.

Anything else?

No response

Package not working on Google BigQuery

I recently upgraded to the 0.2.0 release of the package, and it will no longer run for me.

However, if I run with a previous version of the package (the only change I make to the dbt project), it runs without error in BigQuery and produces the datasets.

The error I am receiving is:

Completed with 1 error and 0 warnings:

Runtime Error in model stg_mixpanel__event_tmp (models/staging/tmp/stg_mixpanel__event_tmp.sql)
  404 GET https://bigquery.googleapis.com/bigquery/v2/projects/backend-producti-b8633498/queries/d53025ae-64dc-4846-802b-2080411adce6?maxResults=0&location=EU&prettyPrint=false: Not found: Job backend-producti-b8633498:EU.d53025ae-64dc-4846-802b-2080411adce6
  
  (job ID: d53025ae-64dc-4846-802b-2080411adce6)

Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

I am running all my datasets within an EU region on GCP. My profiles.yml file is configured as follows (some fields redacted for security purposes):

default:
  target: default
  outputs:
    default:
      type: bigquery
      method: service-account
      project: xxxxxxxx
      threads: 1
      location: EU
      keyfile: /path/to/file/key_file.json
      timeout_seconds: 300
      priority: interactive
      schema: dbt_xxxx
      retries: 1

[Feature] Databricks Compatibility

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

For Databricks Compatibility, add the following:

  1. Buildkite testing:
    • Update pre-command (example)
    • Update pipeline.yml (example)
    • Update sample.profiles.yml (example)
    • Add the below to integration_tests/dbt_project.yml if it's not there:
dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']
  2. For source packages, update src yml so a database won't be passed to spark (example or use below):
sources: 
  - name: <name>
    database: "{% if target.type != 'spark' %}{{ var('<name>_database', target.database) }}{% endif %}"
  3. Update any incremental models to update partition_by for Databricks and add current strategies if not present:
config(
        materialized='incremental',
        unique_key='<original unique key>',
        partition_by={'field': '<original field>', 'data_type': '<original data type>'} if target.type not in ('spark','databricks') else ['<original field>'],
        incremental_strategy = 'merge' if target.type not in ('postgres', 'redshift') else 'delete+insert',
        file_format = 'delta' 
)

Describe alternatives you've considered

No response

Are you interested in contributing this feature?

  • Yes.
  • Yes, but I will need assistance and will schedule time during your office hours for guidance.
  • No.

Anything else?

No response

[Bug] Breaking change in stg_mixpanel__events.sql event table source

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

The hardcoded source table for events, introduced in commit 1b66f95 [https://github.com/fivetran/dbt_mixpanel/commit/1b66f95022f7e37c78863441b1fa66b48f36f8d8#diff-06f26e4fc8545509412ed16924c7ffacb49b0057ad1f6bb83d75ba1568197428], causes issues for users who set the event_table variable in their dbt_project.yml config.

Relevant error log or model output

No response

Expected behavior

The source event table should be determined by the global var event_table in the dbt_project.yml.
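
A minimal sketch of a staging model that honors the variable again, assuming a var() lookup with a source() fallback (hypothetical; the exact fix may differ):

select *
from {{ var('event_table', source('mixpanel', 'event')) }}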

dbt Project configurations

vars:
  event_table: "{{ ref('stg_mixpanel_union_old_new') }}"
  mixpanel__event_frequency_limit: 1000

Package versions

packages:
- package: fivetran/mixpanel
  version: [">=0.8.0", "<2.0.0"]

What database are you using dbt with?

snowflake

dbt Version

1.6

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.

[Feature] Postgres compatibility

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Postgres compatibility

Describe alternatives you've considered

No response

Are you interested in contributing this feature?

  • Yes.
  • Yes, but I will need assistance and will schedule time during your office hours for guidance.
  • No.

Anything else?

No response

Package fails if custom event properties contain spaces

I recently started using Mixpanel at our company and went to use this dbt package.

Using the event_properties_to_pivot configuration, I added 4 different properties, each of which contains spaces. This causes the generated SQL to be invalid, because the column names in the AS definitions are not quoted, e.g.:

replace(parse_json(event_properties):Company Name, '"', '') as Company Name,
replace(parse_json(event_properties):Region Code, '"', '') as Region Code,
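
For illustration, a hedged sketch of what correctly quoted output could look like on Snowflake, where both the JSON path key and the column alias need double quotes when they contain spaces:

replace(parse_json(event_properties):"Company Name", '"', '') as "Company Name",
replace(parse_json(event_properties):"Region Code", '"', '') as "Region Code",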

Environment:
dbt environment - dbt cloud
destination - Snowflake DB
Package version - 0.1.0

FEATURE - dedupe users who have multiple mixpanel accounts?

So, customers can store their internal user_ids (or whatever they might call them) in an event's properties. This is different from the event's distinct_id, which is a unique identifier for the Mixpanel account. One person can become associated with multiple distinct_ids in Mixpanel, perhaps by using a different device or triggering events while signed out.

Wondering if we should offer the ability to perform event-deduping and/or sessionization at the REAL user level (not the distinct_id level). Will check out our sample data to see if I can find any cases where doing so would make a significant impact.

datetime/timestamp issue

From email:

Hi Guys,

Just another thing that is hampering me right now. I am using the Mixpanel dbt package from Fivetran, but I am running into an error when using it:

Looking at the executed code, this is the part that is causing the error:

occurred_at >= coalesce((select cast( max(date_day) as timestamp

Looking at the source schema, the "time" column in "stg_mixpanel__event_tmp", which is renamed in a CTE, is of type DATETIME. This means we are trying to do a 'DATETIME' >= 'TIMESTAMP' comparison, which isn't allowed in BigQuery.

And then looking at the code in GitHub, this appears to be the affected line:

occurred_at >= coalesce((select cast( max(date_day) as {{ dbt_utils.type_timestamp() }} ) from {{ this }} ), '2010-01-01')

I am unsure myself how to fix this since this code will run across multiple database systems.

Can this be raised with your dbt team, please? Unfortunately this is really blocking me right now, so I am going to have to find a workaround.

Thanks,

Craig
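
One possible direction (untested, and not an official fix) would be to cast the column side to the same type so that both operands are timestamps:

cast(occurred_at as {{ dbt_utils.type_timestamp() }}) >= coalesce((select cast( max(date_day) as {{ dbt_utils.type_timestamp() }} ) from {{ this }} ), '2010-01-01')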

mixpanel__sessions throws LISTAGG limit error

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

When running the mixpanel__sessions model, Redshift throws the following error:

21:45:36  Database Error in model mixpanel__sessions (models/mixpanel__sessions.sql)
21:45:36    Result size exceeds LISTAGG limit
21:45:36    DETAIL:  
21:45:36      -----------------------------------------------
21:45:36      error:  Result size exceeds LISTAGG limit
21:45:36      code:      8001
21:45:36      context:   LISTAGG limit: 65535
21:45:36      query:     933154
21:45:36      location:  string_ops.cpp:116
21:45:36      process:   query3_246_933154 [pid=3755]
21:45:36      -----------------------------------------------
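
For reference, Redshift caps LISTAGG results at 65535 bytes. One hypothetical mitigation (not the package's fix; the relation and column names below are made up for illustration) is to cap how many values feed the aggregation per group:

with ranked as (
    select
        session_id,
        event_type,
        row_number() over (partition by session_id order by occurred_at) as event_index
    from events
)
select
    session_id,
    listagg(event_type, ', ') within group (order by event_index) as event_list
from ranked
where event_index <= 500  -- keeps the aggregated string under Redshift's 64 KB LISTAGG limit for typical event names
group by session_id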

Relevant error log or model output

No response

Expected behavior

The model should run without error

dbt Project configurations

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "dwh"
version: "1.0.0"
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: "curative-dwh"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

log-path:

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

on-run-start:
  # TODO: create script that runs on staging to make sure that the _0cp and _prod schemas are in sync with any new landing tables
  
  # when a new schema is created, fix permissions
  # NEED TO FIGURE OUT PERMISSIONS FOR THIS -->  "{{ update_permissions_for_lnd_schemas() }}"
  
on-run-end:
  # drop any models not in dbt project. only on staging and dev (not prod and ci; too risky)
  - "{{ drop_orphaned_relations(['base', 'curated', 'aggregated'], ['mixpanel', 'fivetran_log'], False if target.name in ['staging'] else True) }}"
  - "{{ re_data.save_test_history(results) if 'prod' == target.name }}"

vars:
  force_row_limit: -1 # if want to enforce a row limit, like we do for CI env

  mixpanel:
    mixpanel_database: dwh
    mixpanel_schema: lnd_mixpanel
    event_table: "{{ source('mixpanel', 'event') }}{{ limit_clause_by_env(10000) }}"
    date_range_start: "2022-03-30"

  greenhouse_database: dwh
  greenhouse_schema: lnd_greenhouse
  greenhouse_using_prospects: true
  greenhouse_using_eeoc: true
  greenhouse_using_app_history: false
  greenhouse_using_job_office: true
  greenhouse_using_job_department: true

  fivetran_log:
    fivetran_log_database: dwh
    fivetran_log_schema: lnd_fivetran
    fivetran_log_using_transformations: false # this will disable all transformation + trigger_table logic
    fivetran_log_using_triggers: false # this will disable only trigger_table logic
    log: "{{ source('fivetran_log', 'log') }}{{ limit_clause_by_env() }}"

models:
  +bind: false # Materialize all views as late-binding
  +post-hook:
    - "{{ redshift.compress_table(this.schema, this.table, drop_backup=False) if this.type == 'table' }}"
    - "{{ test_late_binding_view() }}"

  re_data:
    +enabled: "{{ 'prod' == target.name }}"
    +schema: base
    internal:
      +schema: base

  redshift:
    +bind: true
    +schema: base

  greenhouse_source:
    +schema: base

  mixpanel:
    +schema: base
    staging:
      +schema: base
      +docs:
        show: False

  fivetran_log:
    +schema: base
    staging:
      +schema: base
      +docs:
        show: False

  dwh:
    +materialized: view
    +re_data_monitored: true
    
    base:
      +schema: base
      
      app_public:
        staging:
          +re_data_monitored: false
          +materialized: ephemeral
          +docs:
            show: False
          
      revcycle:
        staging:
          +materialized: ephemeral
          +re_data_monitored: false

      ups:
        +enabled: False # busted for now
      safegraph:
        +enabled: False # need to setup
      rd:
        +enabled: False # need to setup

      twilio:
        twilio_call:
          +enabled: False
        twilio_role:
          +enabled: False
        twilio_role_permission:
          +enabled: False
        twilio_service:
          +enabled: False

    curated:
      +schema: curated
      core:
        cur_core_safegraph_visits_by_home_cbg:
          +enabled: False # need to setup

    aggregated:
      +schema: aggregated

sources:
  greenhouse_source:
    greenhouse:
      application_history:
        +enabled: False

  #dwh:
  #  landing:
  #    src_quadax:
  #      +re_data_monitored: true
  #    src_app_public:
  #      +re_data_monitored: false

seeds:
  +schema: base

tests:
  fivetran_log:
    +enabled: False

Package versions

  - package: dbt-labs/redshift
    version: 0.6.0

  - package: dbt-labs/dbt_utils
    version: 0.8.0

  - package: dbt-labs/dbt_external_tables
    version: 0.8.0

  - package: dbt-labs/codegen
    version: 0.5.0

  - package: fivetran/fivetran_utils
    version: 0.3.4

  - package: fivetran/fivetran_log
    version: 0.5.2

What database are you using dbt with?

redshift

dbt Version

1.0

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • Yes.
  • Yes, but I will need assistance and will schedule time during our office hours for guidance
  • No.

Any chance of supporting Postgres?

Hey Fivetran team,

Been loving Fivetran connectors and dbt transformers on top of those connectors. Set up our Stripe to use them and have very clean data that we can use in our dashboards.

Wanted to do the same for our Mixpanel data but, unfortunately, this package doesn't work on Postgres. Looking at it, there seem to be only a few things that make it incompatible with Postgres, such as how the interval is formatted and the use of an alias after FROM. Any chance you'd support Postgres, please?

Thanks for considering,
Nikhil
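
For illustration, the two Postgres quirks mentioned above look like this (hypothetical snippets, not lines from the package):

-- Postgres requires every derived table (subquery after FROM) to carry an alias:
select * from (select 1 as x) as t;

-- and interval literals on Postgres are quoted strings:
select now() - interval '30 minutes';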

[Bug] partition_by config parameter doesn't work with Snowflake

Is there an existing issue for this?

  • I have searched the existing issues

Describe the issue

In the mixpanel__<table> data models, the config block is set up to use partition_by, which is not a recognized parameter for Snowflake. For clustering to work in Snowflake, you need to use the cluster_by parameter.

The current value, partition_by={'field': 'date_day', 'data_type': 'date'} if target.type not in ('spark','databricks') else ['date_day'], contains logic that uses different values for Databricks vs. all other platforms. The issue is that Snowflake simply ignores this parameter completely.

To fix this issue, I followed the instructions for disabling the base dbt_mixpanel version of the model and wrote a copy of the model with the correct config block.

Here's the config block that ended up working for me. Without other platforms to test on, I do not know whether this would work elsewhere. I assume Databricks needs partition_by while Snowflake needs cluster_by.

{{
    config(
        materialized='incremental',
        unique_key='unique_event_id',
        cluster_by="date_day",
        incremental_strategy = 'delete+insert',
    )
}}
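
A hedged sketch of how a single cross-warehouse config block might accommodate both parameters (untested, and assuming adapters ignore config keys they don't recognize):

{{
    config(
        materialized='incremental',
        unique_key='unique_event_id',
        partition_by={'field': 'date_day', 'data_type': 'date'} if target.type not in ('spark','databricks','snowflake') else ['date_day'],
        cluster_by=['date_day'] if target.type == 'snowflake' else none,
        incremental_strategy = 'merge' if target.type not in ('postgres', 'redshift') else 'delete+insert'
    )
}}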

Relevant error log or model output

Using the default parameters on Snowflake, these are the last few lines of the compiled SQL.



    from dedupe

)

select * from pivot_properties


Expected behavior

Using the correct Snowflake parameters, these are the last few lines of the compiled SQL.
from dedupe

)

select * from pivot_properties
) order by (date_day)
);
alter table dbt.int_mixpanel__event cluster by (date_day);


dbt Project configurations

config-version: 2
name: 'mixpanel'
version: '0.8.0'
require-dbt-version: [">=1.3.0", "<2.0.0"]

Package versions

  - package: fivetran/mixpanel
    version: 0.8.0

What database are you using dbt with?

snowflake

dbt Version

1.6.2

Additional Context

No response

Are you willing to open a PR to help address this issue?

  • [ ] Yes.
  • [ ] Yes, but I will need assistance and will schedule time during our office hours (https://calendly.com/fivetran-solutions-team/fivetran-solutions-team-office-hours) for guidance
  • [x] No.
