fivetran / dbt_mixpanel
Fivetran's Mixpanel dbt package
Home Page: https://fivetran.github.io/dbt_mixpanel/
License: Apache License 2.0
When running in BigQuery, I noticed that the stg_mixpanel__event_tmp model does not apply the same filter that the mixpanel__event model does, i.e.:
where time >= {{ "'" ~ var('date_range_start', '2010-01-01') ~ "'" }}
Applying this filter in the view can significantly reduce the resource requirements for running the model. When running in BigQuery against our own data, a run of the original code would process 4.45 TB of data. Applying this filter, with the configuration below in our dbt_project.yml file, reduced that to 276 GB, which is a drastic improvement.
This is especially relevant in development, where you can use Jinja to limit the number of records processed, greatly improving iteration and testing times, e.g.:
vars:
  mixpanel:
    date_range_start: "{{ '2019-11-01' if target.name == 'production' else (modules.datetime.date.today() - modules.datetime.timedelta(7)) }}"
I believe applying the same variable-based filtering in the temporary table, since it is used as the source of the mixpanel__event model, is required to ensure the same amount of data is processed throughout the package.
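For illustration, a minimal sketch of what the staging tmp model could look like with the filter applied (the select and the source reference are assumptions based on the package's conventions, not its actual code):

-- stg_mixpanel__event_tmp (hypothetical sketch)
select *
from {{ var('event_table', source('mixpanel', 'event')) }}
-- apply the same date filter used in mixpanel__event
where time >= {{ "'" ~ var('date_range_start', '2010-01-01') ~ "'" }}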
Following a conversation with a customer about the deduping logic within the mixpanel__events model, we found it would be beneficial to clarify the reasoning behind our logic in the model and, more importantly, within the README.
The README also needs to be updated to the current format.
I recently upgraded to the 0.2.0 release of the package and it will no longer run for me.
However, if I run with a previous version of the package (the only change I make to the dbt project), it runs without error in BigQuery and produces the datasets.
The error I am receiving is:
Completed with 1 error and 0 warnings:
Runtime Error in model stg_mixpanel__event_tmp (models/staging/tmp/stg_mixpanel__event_tmp.sql)
404 GET https://bigquery.googleapis.com/bigquery/v2/projects/backend-producti-b8633498/queries/d53025ae-64dc-4846-802b-2080411adce6?maxResults=0&location=EU&prettyPrint=false: Not found: Job backend-producti-b8633498:EU.d53025ae-64dc-4846-802b-2080411adce6
(job ID: d53025ae-64dc-4846-802b-2080411adce6)
Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
I am running all my datasets within an EU region on GCP. My profiles.yml file is configured as follows (some fields redacted for security purposes):
default:
  target: default
  outputs:
    default:
      type: bigquery
      method: service-account
      project: xxxxxxxx
      threads: 1
      location: EU
      keyfile: /path/to/file/key_file.json
      timeout_seconds: 300
      priority: interactive
      schema: dbt_xxxx
      retries: 1
For Databricks compatibility, add the following dispatch config:
dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']
For source packages, update the src yml so a database won't be passed to Spark (example of use below):
sources:
  - name: <name>
    database: "{% if target.type != 'spark' %}{{ var('<name>_database', target.database) }}{% endif %}"
For incremental models, update the config block along these lines:
config(
    materialized='incremental',
    unique_key='<original unique key>',
    partition_by={'field': '<original field>', 'data_type': '<original data type>'} if target.type not in ('spark','databricks') else ['<original field>'],
    incremental_strategy = 'merge' if target.type not in ('postgres', 'redshift') else 'delete+insert',
    file_format = 'delta'
)
The introduction of the hardcoded source table for events in commit 1b66f95 (https://github.com/fivetran/dbt_mixpanel/commit/1b66f95022f7e37c78863441b1fa66b48f36f8d8#diff-06f26e4fc8545509412ed16924c7ffacb49b0057ad1f6bb83d75ba1568197428) introduced issues for users who set the event_table variable in their dbt_project.yml config.
The source event table can be set via the global var event_table in dbt_project.yml:
vars:
  event_table: "{{ ref('stg_mixpanel_union_old_new') }}"
  mixpanel__event_frequency_limit: 1000
packages:
  - package: fivetran/mixpanel
    version: [">=0.8.0", "<2.0.0"]
Database: snowflake
dbt version: 1.6
Postgres compatibility
Slack thread here: https://getdbt.slack.com/archives/CBSQTAPLG/p1619707327482400
I have recently started using Mixpanel at our company and went to use this dbt package.
Using the event_properties_to_pivot configuration, I added 4 different properties, each of which contains spaces. This causes the generated SQL to be invalid and fail to run, as the column names in the AS definitions are not quoted, e.g.:
replace(parse_json(event_properties):Company Name, '"', '') as Company Name,
replace(parse_json(event_properties):Region Code, '"', '') as Region Code,
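For illustration, quoting both the JSON path and the alias would make the generated Snowflake SQL valid (a sketch of the expected output, not the package's actual fix):

-- Snowflake needs double quotes around JSON path elements and aliases containing spaces
replace(parse_json(event_properties):"Company Name", '"', '') as "Company Name",
replace(parse_json(event_properties):"Region Code", '"', '') as "Region Code",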
Environment:
dbt environment - dbt cloud
destination - Snowflake DB
Package version - 0.1.0
so, customers can store their internal user_ids (or whatever they might call it) in an event's properties. this is different from the event's distinct_id, which is a unique identifier for the Mixpanel account. One person can get associated with multiple distinct_ids in Mixpanel, perhaps by using a different device or triggering events while signed out.
wondering if we should offer the ability to perform event-deduping and/or sessionization at the REAL user level (not the distinct_id level). will check out our sample data to see if i can find any cases where doing so would make a significant impact
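e.g. something like this could map each distinct_id back to the real user (rough sketch; the property name and staging model are assumptions):

-- hypothetical mapping of Mixpanel distinct_ids to an internal user id
select
    distinct_id,
    min(replace(parse_json(event_properties):user_id, '"', '')) as internal_user_id
from {{ ref('stg_mixpanel__event') }}
group by 1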
From email:
Hi Guys,
Just another thing that is hampering me right now. I am using the Mixpanel dbt package from Fivetran, but I am running into an error when using it:
Looking at the executed code, this is the part that is causing the error:
occurred_at >= coalesce((select cast(max(date_day) as timestamp
Looking at the source schema, in "stg_mixpanel__event_tmp" the "time" column, which is renamed in a CTE, is of type DATETIME. This means we are trying to do a DATETIME >= TIMESTAMP comparison, which isn't allowed in BigQuery.
And then looking at the code in GitHub, this appears to be the affected line:
dbt_mixpanel/models/mixpanel__event.sql
Line 22 in f1d866f
I am unsure myself how to fix this since this code will run across multiple database systems.
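One workaround could be to cast both sides of the comparison to the same type, e.g. (untested sketch; the {{ this }} reference is my assumption about the incremental logic):

cast(occurred_at as timestamp) >= coalesce(
    (select cast(max(date_day) as timestamp) from {{ this }}),
    cast('2010-01-01' as timestamp)
)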
Can this be raised with your dbt team please? Unfortunately this is really blocking me right now, so going to have to find a workaround
Thanks,
Craig
When running the mixpanel__sessions model, Redshift throws the following error:
21:45:36 Database Error in model mixpanel__sessions (models/mixpanel__sessions.sql)
21:45:36 Result size exceeds LISTAGG limit
21:45:36 DETAIL:
21:45:36 -----------------------------------------------
21:45:36 error: Result size exceeds LISTAGG limit
21:45:36 code: 8001
21:45:36 context: LISTAGG limit: 65535
21:45:36 query: 933154
21:45:36 location: string_ops.cpp:116
21:45:36 process: query3_246_933154 [pid=3755]
21:45:36 -----------------------------------------------
The model should run without error
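A possible mitigation (a sketch with an assumed column name, not the package's actual logic) is to truncate each aggregated value so the concatenated result stays under Redshift's 65535-byte LISTAGG limit:

-- truncate each event name before aggregating to stay under the 64K limit
listagg(distinct left(event_type, 50), ', ') as session_event_types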
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "dwh"
version: "1.0.0"
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: "curative-dwh"

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
log-path:
target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

on-run-start:
  # TODO: create script that runs on staging to make sure that the _0cp and _prod schemas are in sync with any new landing tables
  # when a new schema is created, fix permissions
  # NEED TO FIGURE OUT PERMISSIONS FOR THIS --> "{{ update_permissions_for_lnd_schemas() }}"

on-run-end:
  # drop any models not in dbt project. only on staging and dev (not prod and ci; too risky)
  - "{{ drop_orphaned_relations(['base', 'curated', 'aggregated'], ['mixpanel', 'fivetran_log'], False if target.name in ['staging'] else True) }}"
  - "{{ re_data.save_test_history(results) if 'prod' == target.name }}"

vars:
  force_row_limit: -1 # if want to enforce a row limit, like we do for CI env
  mixpanel:
    mixpanel_database: dwh
    mixpanel_schema: lnd_mixpanel
    event_table: "{{ source('mixpanel', 'event') }}{{ limit_clause_by_env(10000) }}"
    date_range_start: "2022-03-30"
  greenhouse_database: dwh
  greenhouse_schema: lnd_greenhouse
  greenhouse_using_prospects: true
  greenhouse_using_eeoc: true
  greenhouse_using_app_history: false
  greenhouse_using_job_office: true
  greenhouse_using_job_department: true
  fivetran_log:
    fivetran_log_database: dwh
    fivetran_log_schema: lnd_fivetran
    fivetran_log_using_transformations: false # this will disable all transformation + trigger_table logic
    fivetran_log_using_triggers: false # this will disable only trigger_table logic
    log: "{{ source('fivetran_log', 'log') }}{{ limit_clause_by_env() }}"

models:
  +bind: false # Materialize all views as late-binding
  +post-hook:
    - "{{ redshift.compress_table(this.schema, this.table, drop_backup=False) if this.type == 'table' }}"
    - "{{ test_late_binding_view() }}"
  re_data:
    +enabled: "{{ 'prod' == target.name }}"
    +schema: base
    internal:
      +schema: base
  redshift:
    +bind: true
    +schema: base
  greenhouse_source:
    +schema: base
  mixpanel:
    +schema: base
    staging:
      +schema: base
      +docs:
        show: False
  fivetran_log:
    +schema: base
    staging:
      +schema: base
      +docs:
        show: False
  dwh:
    +materialized: view
    +re_data_monitored: true
    base:
      +schema: base
      app_public:
        staging:
          +re_data_monitored: false
          +materialized: ephemeral
          +docs:
            show: False
      revcycle:
        staging:
          +materialized: ephemeral
          +re_data_monitored: false
      ups:
        +enabled: False # busted for now
      safegraph:
        +enabled: False # need to setup
      rd:
        +enabled: False # need to setup
      twilio:
        twilio_call:
          +enabled: False
        twilio_role:
          +enabled: False
        twilio_role_permission:
          +enabled: False
        twilio_service:
          +enabled: False
    curated:
      +schema: curated
      core:
        cur_core_safegraph_visits_by_home_cbg:
          +enabled: False # need to setup
    aggregated:
      +schema: aggregated

sources:
  greenhouse_source:
    greenhouse:
      application_history:
        +enabled: False
  #dwh:
  #  landing:
  #    src_quadax:
  #      +re_data_monitored: true
  #    src_app_public:
  #      +re_data_monitored: false

seeds:
  +schema: base

tests:
  fivetran_log:
    +enabled: False
- package: dbt-labs/redshift
  version: 0.6.0
- package: dbt-labs/dbt_utils
  version: 0.8.0
- package: dbt-labs/dbt_external_tables
  version: 0.8.0
- package: dbt-labs/codegen
  version: 0.5.0
- package: fivetran/fivetran_utils
  version: 0.3.4
- package: fivetran/fivetran_log
  version: 0.5.2
Database: redshift
dbt version: 1.0
Hey Fivetran team,
Been loving Fivetran connectors and dbt transformers on top of those connectors. Set up our Stripe to use them and have very clean data that we can use in our dashboards.
Wanted to do the same for our Mixpanel data but unfortunately, this package doesn't work for Postgres. Looking at it, there seem to be only a few things that make it incompatible with Postgres, like how the interval is formatted and the alias usage after FROM.
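For example, an expression like this would need a Postgres spelling (illustrative of the kind of difference, not quoted from the package):

-- Redshift / Snowflake style:
dateadd(minute, 30, occurred_at)
-- Postgres equivalent:
occurred_at + interval '30 minutes'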
Any chance you'd support Postgres, please?
Thanks for considering,
Nikhil
In the mixpanel__<table> data models, the config block is set up to use the partition_by parameter, which is not recognized by Snowflake. For clustering to work in Snowflake, you need to use the cluster_by parameter.
The current value, partition_by={'field': 'date_day', 'data_type': 'date'} if target.type not in ('spark','databricks') else ['date_day'], has some logic that uses different values for Databricks vs. all other platforms. The issue is that Snowflake just ignores this parameter completely.
To fix this issue, I followed the instructions for disabling the base dbt_mixpanel version of the model and wrote a copy of the model with the correct config block.
Here's the config block that ended up working for me. Without other platforms to test on, I do not know if this would work elsewhere. I assume Databricks needs partition_by while Snowflake needs cluster_by.
{{
    config(
        materialized='incremental',
        unique_key='unique_event_id',
        cluster_by="date_day",
        incremental_strategy = 'delete+insert',
    )
}}
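A sketch of a target-aware config that might serve both (an untested assumption, not the package's actual code; adapters generally ignore parameters they don't recognize):

{{
    config(
        materialized='incremental',
        unique_key='unique_event_id',
        partition_by={'field': 'date_day', 'data_type': 'date'} if target.type not in ('spark','databricks') else ['date_day'],
        cluster_by='date_day' if target.type == 'snowflake' else none,
        incremental_strategy='merge' if target.type not in ('postgres','redshift') else 'delete+insert',
        file_format='delta'
    )
}}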
Using the default parameters on Snowflake, these are the last few lines of the compiled SQL:
from dedupe
)
select * from pivot_properties
### Expected behavior
Using the correct Snowflake parameters, these are the last few lines of the compiled SQL:
from dedupe
)
select * from pivot_properties
) order by (date_day)
);
alter table dbt.int_mixpanel__event cluster by (date_day);
### dbt Project configurations
config-version: 2
name: 'mixpanel'
version: '0.8.0'
require-dbt-version: [">=1.3.0", "<2.0.0"]
### Package versions
- package: fivetran/mixpanel
version: 0.8.0
### What database are you using dbt with?
snowflake
### dbt Version
- installed: 1.6.2
### Additional Context
_No response_
### Are you willing to open a PR to help address this issue?
- [ ] Yes.
- [ ] Yes, but I will need assistance and will schedule time during our [office hours](https://calendly.com/fivetran-solutions-team/fivetran-solutions-team-office-hours) for guidance
- [X] No.