dbt-labs / dbt-bigquery
dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
Home Page: https://github.com/dbt-labs/dbt-bigquery
License: Apache License 2.0
Support connecting to BigQuery using credentials stored in an AWS credentials file.
None
This is how my org does gcloud auth, so we don’t have many alternatives here
Any user/company who runs Google cloud workloads from AWS and keeps credentials in AWS.
Yes
Support https://cloud.google.com/bigquery/docs/materialized-views-intro
{{ config(
materialized='view',
materialized_view='true'  # default 'false'
)}}
{{ config(
materialized='materialized-view',
)}}
Some powerful stuff here https://cloud.google.com/bigquery/docs/materialized-views-intro
BQ users
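For reference, the native DDL dbt would ultimately need to emit looks roughly like this (a sketch based on the linked BigQuery docs; the project, dataset, and column names are placeholders):

```sql
create materialized view `my-project.my_dataset.my_mv` as
select
  user_id,
  count(*) as n_events
from `my-project.my_dataset.events`
group by user_id;
```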
Running dbt deps with a profiles.yml which a) doesn't have a project, and b) is in an environment without valid gcloud auth (such as a docker build):
nimbus:
target: user
outputs:
user:
type: bigquery
method: oauth
dataset: my-dataset
timeout_seconds: 3600
threads: 24
# project: my-project
Raises an error:
=> ERROR [dev 11/12] RUN dbt deps 5.2s
------
> [dev 11/12] RUN dbt deps:
dbt-labs/dbt#36 1.615 Running with dbt=0.19.0
dbt-labs/dbt#36 4.925 Encountered an error:
dbt-labs/dbt#36 4.925 Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
------
executor failed running [/bin/sh -c dbt deps]: exit code: 2
These work:
• project: uncommented
• dbt deps --target=null
Does dbt deps need the database / project name at that stage? If so, this is the correct behavior. But if not, there's no need to break in a docker build.
Which database are you using dbt with?
The output of dbt --version:
installed version: 0.19.0
latest version: 0.19.0
Up to date!
Plugins:
- bigquery: 0.19.0
- snowflake: 0.19.0
- redshift: 0.19.0
- postgres: 0.19.0
The operating system you're using:
MacOS
The output of python --version:
Python 3.8.8
In dbt v0.16, dbt run will make query calls for _dbt_max_partition on all BigQuery partitioned incremental models.
In our situation, we have many very large partitioned incremental models that make no use of _dbt_max_partition. However, they still have to query _dbt_max_partition, which is very costly when the runs in production add up.
Could we disable _dbt_max_partition for models that have no use of it?
The request: allow the model config to disable _dbt_max_partition for a particular model.
for example,
{{ config(
materialized='table',
partition_by={
"field": "created_at",
"data_type": "timestamp",
"enable_dbt_max_partition": false
}
)}}
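For context, the declaration dbt emits on each incremental run of a partitioned model looks roughly like this (adapted from a generated script quoted later in this thread; table and column names are placeholders). The subquery is what incurs the extra cost:

```sql
declare _dbt_max_partition timestamp default (
  select max(created_at) from `my-project`.`my_dataset`.`my_model`
  where created_at is not null
);
```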
This is specific to the BigQuery database, and it's part of the new merge feature from dbt v0.16.
Everyone who owns very large BigQuery tables with incremental runs but doesn't use _dbt_max_partition.
dbt-core
When using the OAuth method to do a federated query for a table that points to a CSV file on Drive, I get an error.
When using a service account to access the same file, it works just fine.
• I have full permissions to the file
• All GDrive APIs in Google Cloud and Google Workspace are turned on
• I've also run the following command
gcloud auth application-default login \
--scopes=https://www.googleapis.com/auth/bigquery,\
https://www.googleapis.com/auth/drive.readonly,\
https://www.googleapis.com/auth/iam.test
• I've also tried revoking auth and re-authing
• Place a (CSV) file in a drive and share it with yourself
• Make a table in BigQuery that uses the Drive file as the source (federated)
• SELECT * the table via dbt
I expect it to be able to query the table.
Access Denied: BigQuery BigQuery: Access Not Configured. Drive API has not been used in project 764086051850 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/drive.googleapis.com/overview?project=764086051850 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
compiled SQL at ...
What's weird is that the project ID is not one of ours. When using a service account, the correct project is used. My profiles.yml is set up correctly, so I don't know where this project ID is coming from.
I've noticed this issue happening with a similar project ID on StackOverflow
The output of dbt --version:
installed version: 0.21.0
latest version: 1.0.0
Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- bigquery: 0.21.0
- snowflake: 0.21.0
- redshift: 0.21.0
- postgres: 0.21.0
The operating system you're using:
Mac OS Big Sur version 11.6
Python 3.8.5
The unique schema test cannot be used in BigQuery on columns whose name matches the table name. BigQuery yields the following error:
Grouping by expressions of type STRUCT is not allowed at [4:10]
create table <your dataset>.foo (
foo int,
bar string
);
-- this mimics this existing structure of a uniqueness test on the foo column in this table but fails with a:
-- Grouping by expressions of type STRUCT is not allowed at [4:10]
with validation_errors as (
select
foo, count(*)
from <your dataset>.foo
group by foo
having count(*) > 1
)
select *
from validation_errors;
Using an alias for the table name disambiguates the reference for BigQuery and allows the test to run successfully:
with validation_errors as (
select
model_table.foo, count(*)
from <your dataset>.foo as model_table
group by model_table.foo
having count(*) > 1
)
select *
from validation_errors;
Database Error in test dbt_utils_source_unique_combination_of_columns_... (models/....yml)
Grouping by expressions of type STRUCT is not allowed at [18:14]
compiled SQL at target/run/trumid_poc/models/....yml/schema_test/dbt_utils_source_unique_combin_1cc53959c9ef4a8b014db7bc87dadc5d.sql
Which database are you using dbt with?
The output of dbt --version:
installed version: 0.20.0-rc1
latest version: 0.19.1
Your version of dbt is ahead of the latest release!
Plugins:
- bigquery: 0.20.0rc1
- postgres: 0.20.0rc1
- redshift: 0.20.0rc1
- snowflake: 0.20.0rc1
The operating system you're using:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
on WSL
The output of python --version:
3.9.2
I technically found this issue in dbt_utils and will have to file an issue there as well, but it holds for a classic uniqueness test too.
Per @drewbanin's comment here, we might add a partition filter to the source as well as the destination in the merge statement generated by the insert_overwrite incremental strategy on BigQuery.
....
when not matched by source
and DBT_INTERNAL_DEST.id in (
1, 2, 3
)
then delete
when not matched then insert (`id`, `ts`)
values (`id`, `ts`)
....
when not matched by source
and DBT_INTERNAL_DEST.id in (
1, 2, 3
)
then delete
when not matched
-- this is new:
and DBT_INTERNAL_SOURCE.id in (
1, 2, 3
)
then insert (`id`, `ts`) values (`id`, `ts`)
I'm pretty sure there's no substantive difference here, but to Drew's point,
This just has the benefit of avoiding a weird failure mode if the user should specify a list of partitions to overwrite, but they generate a SQL select statement which returns an incongruent set of partitions.
BigQuery users of new incremental model functionality released in 0.16.0. Given that this is quite new behavior, we should seek user feedback on weird/unexpected edge cases.
Command executed: dbt test --model <model name>
Error Message:
SELECT list expression references column <column_name> which is neither grouped nor aggregated at [12:5]
compiled SQL <location>\unique_test_model_ID_Name.sql
To reproduce execute following commands:
dbt run --model test_model
dbt test --model test_model
test_model.sql
Select
'1' as ID
, 'abc' as Name
mart.yml
version: 2
models:
- name: test_model
columns:
- name: ID || Name
tests:
- unique
Test case should have executed successfully.
The output of dbt --version:
0.21.0
The test case was running fine with dbt version: 0.19.0
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
select
ID || Name as unique_field,
count(*) as n_records
from `<project_id>`.`<schema>`.`test_model`
where ID || Name is not null
group by ID || Name
having count(*) > 1
) dbt_internal_test
select count(*) as validation_errors
from (
select
ID || Name
from `<project_id>`.`<schema>`.`test_model`
where ID || Name is not null
group by ID || Name
having count(*) > 1
) validation_errors
It seems the aliasing of the concatenated (||) field creates an issue for BigQuery.
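One possible workaround while this is open (a sketch, not verified against the generated test): group by the select-list alias, which BigQuery permits, instead of repeating the expression:

```sql
select
  ID || Name as unique_field,
  count(*) as n_records
from `<project_id>`.`<schema>`.`test_model`
where ID || Name is not null
group by unique_field
having count(*) > 1
```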
Add some testing/validation checks for the adapter_response object.
Issue #68 and PR #79 brought up a gap in our validation checks for the adapter_response; it would be nice to add some.
Move the test/unit and test/integration tests related to bigquery to this repo.
I was trying to create a BigQuery table with a specified partition field and 'day' granularity. However, running with the below config only generates a table with the pseudo-column _PARTITIONTIME.
In similar fashion, the declared cluster_by does not show up in the final table.
{{ config(
materialized = 'table',
partition_by={
"field": "partition_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=["partition_date"],
partitions=dbt.partition_range(var('dates', default=yesterday())),
verbose=True
)}}
dbt run --full-refresh -m page_view_2 --vars 'dates: "20210701, 20210703"'
Running with dbt=0.20.2
Found 6 models, 0 tests, 0 snapshots, 0 analyses, 165 macros, 0 operations, 0 seed files, 2 sources, 0 exposures
16:39:11 | Concurrency: 1 threads (target='dev')
16:39:11 |
16:39:11 | 1 of 1 START table model wc_data_core_MB_DEV.page_view_2............. [RUN]
16:39:12 | -> Running for day 20210701
16:39:21 | -> Running for day 20210702
16:39:30 | -> Running for day 20210703
16:39:36 | 1 of 1 OK created table model wc_data_core_MB_DEV.page_view_2........ [CREATED 3 PARTITIONS in 24.92s]
16:39:36 |
16:39:36 | Finished running 1 table model in 26.05s.
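One hunch, not verified: the legacy partitions / dbt.partition_range config takes an older code path (note the CREATED 3 PARTITIONS log line) that builds ingestion-time partitioned tables and ignores partition_by and cluster_by. Dropping those keys might produce the expected column-partitioned, clustered table:

```sql
{{ config(
    materialized = 'table',
    partition_by = {
      "field": "partition_date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by = ["partition_date"]
) }}
```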
The output of dbt --version:
installed version: 0.20.2
Plugins:
- snowflake: 0.20.2
- postgres: 0.20.2
- bigquery: 0.20.2
- redshift: 0.20.2
The operating system you're using: Mac OS Monterey
The output of python --version:
Python 3.6.15
The column type is overridden as a BigQueryColumn in BigQuery dbt projects.
However, column names in nested fields are not quoted, which is fine for most columns. But in cases where reserved keywords are used as column names, the generated SQL becomes invalid.
To work around this issue, all columns should be quoted.
There is already a function called "quoted".
In my opinion, line 86 should be changed, so that column names are quoted.
@property
def quoted(self):
return '`{}`'.format(self.column)
def literal(self, value):
return "cast({} as {})".format(value, self.dtype)
@property
def data_type(self) -> str:
if self.dtype.upper() == 'RECORD':
subcols = [
"{} {}".format(col.name, col.data_type) for col in self.fields
]
field_type = 'STRUCT<{}>'.format(", ".join(subcols))
else:
field_type = self.dtype
if self.mode.upper() == 'REPEATED':
return 'ARRAY<{}>'.format(field_type)
else:
return field_type
We encountered the issue when running the dbt_utils.union_relations macro, which joined multiple nested tables that had column names such as from and where deep in their nested field structure.
All column names (including the ones from nested fields) should be quoted in order to prevent issues with reserved keywords.
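To illustrate the proposed fix, here is a standalone sketch (not dbt's actual BigQueryColumn class, just a simplified stand-in) showing data_type backtick-quoting nested field names so reserved keywords stay valid in generated SQL:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """Simplified stand-in for a BigQuery column/field description."""
    name: str
    dtype: str                         # e.g. 'STRING', 'RECORD'
    mode: str = "NULLABLE"             # or 'REPEATED'
    fields: List["Field"] = field(default_factory=list)


def quoted(name: str) -> str:
    # Backtick-quote an identifier, as the existing `quoted` property does.
    return "`{}`".format(name)


def data_type(col: Field) -> str:
    # Recursively render a field's SQL type, quoting every nested field name.
    if col.dtype.upper() == "RECORD":
        subcols = [
            "{} {}".format(quoted(c.name), data_type(c)) for c in col.fields
        ]
        field_type = "STRUCT<{}>".format(", ".join(subcols))
    else:
        field_type = col.dtype
    if col.mode.upper() == "REPEATED":
        return "ARRAY<{}>".format(field_type)
    return field_type
```

With this change, a nested field named from renders as STRUCT<`from` STRING> instead of the invalid STRUCT<from STRING>.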
The output of dbt --version:
installed version: 0.19.1
latest version: 0.21.0
Plugins:
- redshift: 0.19.1
- snowflake: 0.19.1
- bigquery: 0.19.1
- postgres: 0.19.1
The operating system you're using:
Ubuntu
The output of python --version:
Python 3.8.10
When running an incremental model with the insert_overwrite strategy and the require_partition_filter=true option, it seems that the require_partition_filter option is applied to the temporary table.
Query error: Cannot query over table 'sandbox-project.logs.event__dbt_tmp' without a filter over column(s) 'event_time' that can be used for partition elimination at [75:46]
insert_overwrite
# test_insert_overwrite.sql
{{
config(
materialized="incremental",
incremental_strategy='insert_overwrite',
require_partition_filter=true,
alias="test_insert_overwrite",
partition_by={
"field": "event_time",
"data_type": "timestamp",
"granularity": "day",
},
)
}}
SELECT
CURRENT_TIMESTAMP() AS event_time
dbt run -s test_insert_overwrite twice
Completed with 1 error and 0 warnings:
Database Error in model test_insert_overwrite (models/test_insert_overwrite.sql)
Query error: Cannot query over table 'sandbox-project.jaffle_shop.test_insert_overwrite__dbt_tmp' without a filter over column(s) 'event_time' that can be used for partition elimination at [35:46]
compiled SQL at target/run/jaffle_shop/models/test_insert_overwrite.sql
The query below was generated on the second execution of the incremental model. As we can see, the query that creates the temporary table test_insert_overwrite__dbt_tmp includes require_partition_filter=true, because the macro that generates it is create_table_as. The create_table_as macro automatically applies all the same options even to the temporary table.
-- generated script to merge partitions into `sandbox-project`.`jaffle_shop`.`test_insert_overwrite`
declare dbt_partitions_for_replacement array<timestamp>;
declare _dbt_max_partition timestamp default (
select max(event_time) from `sandbox-project`.`jaffle_shop`.`test_insert_overwrite`
where event_time is not null
);
-- 1. create a temp table
create or replace table `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`
partition by timestamp_trunc(event_time, day)
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour),
require_partition_filter=True
)
as (
SELECT
CURRENT_TIMESTAMP() AS event_time
);
-- 2. define partitions to update
set (dbt_partitions_for_replacement) = (
select as struct
array_agg(distinct timestamp_trunc(event_time, day))
from `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`
);
-- 3. run the merge statement
merge into `sandbox-project`.`jaffle_shop`.`test_insert_overwrite` as DBT_INTERNAL_DEST
using (
select * from `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`
) as DBT_INTERNAL_SOURCE
on FALSE
when not matched by source
and timestamp_trunc(DBT_INTERNAL_DEST.event_time, day) in unnest(dbt_partitions_for_replacement)
then delete
when not matched then insert
(`event_time`)
values
(`event_time`)
;
-- 4. clean up the temp table
drop table if exists `sandbox-project`.`jaffle_shop`.`test_insert_overwrite__dbt_tmp`
The output of dbt --version:
installed version: 0.21.0
latest version: 0.21.0
Up to date!
Plugins:
- bigquery: 0.21.0
- snowflake: 0.21.0
- redshift: 0.21.0
- postgres: 0.21.0
The operating system you're using:
The output of python --version:
Following the examples in:
I currently have dbt in a Docker image with a BigQuery warehouse, spun up as an Airflow task deployed in Kubernetes. When I perform a backfill and there are concurrent dbt pods (up to 8) running, I see the following error from dbt:
Encountered an error:
HTTPSConnectionPool(host='bigquery.googleapis.com', port=443): Max retries exceeded with url: /bigquery/v2/projects/<DEDACTED>/datasets/<DEDACTED>/tables?maxResults=100000&prettyPrint=false (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))
I believe this log is from the requests library used by dbt; however, it is not clear why the retries are occurring in the first place (i.e., what BQ limits or quotas are causing them).
Which database are you using dbt with?
bigquery
The output of dbt --version:
installed version: 0.19.1
Plugins:
- bigquery: 0.19.1
The operating system you're using:
Docker image
The output of python --version:
python:3.8.5
The BigQueryConnectionManager.dataset method is firing a deprecation warning due to its call to bigquery.Client.dataset:
[...]
2021-12-30 18:02:49.877082 (ThreadPoolExecutor-1_9): Client.dataset is deprecated and will be removed in a future version. Use a string like 'my_project.my_dataset' or a cloud.google.bigquery.DatasetReference object, instead.
[...]
From the current API docs:
Client.dataset
Deprecated since version 1.24.0: Construct a DatasetReference using its constructor or use a string where previously a reference object was used.
Given this and other updates to the BigQuery API, I think it would be healthy to do two things:
• Update BigQueryConnectionManager.dataset to directly create a bigquery.DatasetReference, and rename it dataset_ref to be clear about what's being returned.
• Update BigQueryConnectionManager.table_ref to directly create a bigquery.TableReference, and make it a static method.
With these changes, we can quit passing a conn around just to deliver a reference, and we'll meet (my interpretation of) BQ API best practice by working with unconnected references where possible and deferring 'connected' operations to a client.
n/a
This will benefit maintainers by simplifying the signatures of two core methods.
Put me in, coach!
Picking up from dbt-labs/dbt-core#2928, which added support for two new configs in dbt-bigquery: require_partition_filter and partition_expiration_days.
Let's ensure that require_partition_filter works with all the permutations of incremental models on BigQuery. Anyone is welcome to pick this up as a contribution for v0.20.0!
merge strategy
We need the merge condition to be:
on
DBT_INTERNAL_SOURCE.[unique_key] = DBT_INTERNAL_DEST.[unique_key]
and DBT_INTERNAL_DEST.[partition_col] is not null
This could be accomplished by passing an additional predicate to get_merge_sql here, something like:
{% set is_partition_filter_required = config.get('require_partition_filter', false) %}
{% set predicates = [] %}
{% if is_partition_filter_required %}
{% set partition_filter %} ({{ partition_by.field }} is not null or {{ partition_by.field }} is null) {% endset %}
{% do predicates.append(partition_filter) %}
{% endif %}
{% set build_sql = get_merge_sql(target_relation, source_sql, unique_key, dest_columns, predicates) %}
This is a bit of a hack—filtering only in this sense—but to be honest there isn't any straightforward way dbt can know in advance the specific partitions it's merging into. For that, you should use...
insert_overwrite strategy
The require_partition_filter config works just fine with the "static" insert_overwrite strategy—when the user supplies the values in advance via the partitions config—which is also the most performant approach for updating very large datasets. (It would still be a good idea to add a test for this.)
For the "dynamic" insert_overwrite strategy, the current error comes in step 2.
We need to either:
• Add where {{ partition_by.field }} is not null, thereby satisfying the filter requirement
• Not set require_partition_filter on the temporary table. Should this be a more general rule, that if temporary = True, the create table statement shouldn't set require_partition_filter = True? I think it would make good sense.
• Should we test require_partition_filter for all types of incremental models? I think we definitely should!
• Should we also test partition_expiration_days for incremental models, with require_partition_filter set? Personally, I'm not so sure.
The fine folks at BigQuery have been releasing a lot of solid SQL-standard syntax of late, which means a lucky contributor gets to delete a bunch of code.
The most obvious candidates to me are create schema and drop schema, which no longer need to be python methods—in fact, they no longer need to be bigquery__ macros at all, since the default implementations will do just fine.
Less pressing, still promising: alter table add columns.
I'm sure there are more such opportunities!
When has there ever been a better first issue?
Given an incremental model partitioned on a string column:
{{
config(
materialized='incremental',
partition_by={'field': 'string_column', 'data_type' : 'string'},
incremental_strategy='insert_overwrite',
)
}}
SELECT '20201201' AS string_column
Running an incremental dbt run on this model produces an erroneous definition of the partitions to update. It tries to use string_trunc(string_column, day), a function that does not exist in BigQuery. Code in the /target/run/ directory:
...
-- 2. define partitions to update
set (dbt_partitions_for_replacement) = (
select as struct
array_agg(distinct string_trunc(string_column, day)) -- HERE NON-EXISTENT FUNCTION
from MODEL_NAME
);
-- 3. run the merge statement
merge into
...
when not matched by source
and string_trunc(DBT_INTERNAL_DEST.string_column, day) in unnest(dbt_partitions_for_replacement) -- HERE NON-EXISTENT FUNCTION
then delete
...
It causes dbt to fail the run with the following error:
Query error: Function not found: string_trunc at [51:34]
As of dbt version 0.18.2, such a model ran correctly:
-- 2. define partitions to update
set (dbt_partitions_for_replacement) = (
select as struct
array_agg(distinct string_column)
from MODEL_NAME
);
...
-- 3. run the merge statement
merge into MODEL_NAME
...
when not matched by source
and DBT_INTERNAL_DEST.string_column in unnest(dbt_partitions_for_replacement)
then delete
when not matched then insert
...
No response
No response
- OS: MacOS 12.0.1
- Python: 3.8.12
- dbt: official docker image version 0.21.0
bigquery
No response
Can an optional, default policy tag for BigQuery be supported? Ideally it could be set at the project, directory, and individual model levels.
The purpose is to prevent accidental leaking of sensitive data. I currently set policy tags in sources (outside dbt) and views automatically inherit them. However, table materializations remove the tags. Being able to set policy tags with dbt is amazing, but a developer may forget to set one for a column in its schema.yml. Or the model does a select * from source ... and the developer doesn't realize the source contains sensitive data. By setting a default policy tag, accidental sensitive data leaks can be prevented.
It would be amazing if a list of column types could be given too. For example, booleans are rarely sensitive but strings can be. So being able to say "please set a default policy tag for string columns" would be amazing. This would be a nice to have, but not strictly necessary.
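A hypothetical shape for such a config (these keys do not exist today; default_policy_tag and policy_tag_column_types are invented names for illustration):

```yaml
# dbt_project.yml (hypothetical keys, for illustration only)
models:
  my_project:
    +default_policy_tag: "projects/p/locations/us/taxonomies/1/policyTags/2"
    +policy_tag_column_types: ["string"]  # only tag string columns by default
```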
Manage policy tags outside dbt, which is error prone in my experience.
BigQuery users who utilize policy tags.
Possibly
Reduce the BigQuery OAuth scopes down to the minimal set needed, in both dbt and dbt Cloud.
Currently, the dbt BigQuery connector requests these three OAuth scopes:
The BigQuery scope is needed to access the database, but the cloud-platform and drive scopes are probably too broad. These scopes were originally added to address issue dbt-labs/dbt-core#502, primarily to allow access for reading from Google Sheets. However, I don't immediately see a need for the cloud-platform scope, which gives access to a wide range of GCP resources, such as the following:
Similarly, the drive scope has this access:
This app wants permission to access everything in your Google Drive. It will be able to do the same things you can do, including:
See your files
Upload and download your files
Delete your files
See the names and emails of people you share files with
Share and stop sharing your files with others
Remove people from your files
Organize your Drive
There may be private information in your Google Drive, like financial records, medical reports, photos or tax info.
I would think that minimally, these scopes could be reduced to the 'read-only' variants, and could probably be reduced further depending on the access needed for external tables. Maybe something like:
But I don't know yet whether these scopes are too restrictive.
Also note that dbt Cloud has the same list of OAuth scopes, so whatever is changed in dbt should also be changed in dbt Cloud.
import json
import re
from contextlib import contextmanager
from dataclasses import dataclass
from functools import lru_cache
import agate
from requests.exceptions import ConnectionError
from typing import Optional, Any, Dict, Tuple
import google.auth
import google.auth.exceptions
import google.cloud.bigquery
import google.cloud.exceptions
from google.api_core import retry, client_info
from google.auth import impersonated_credentials
from google.oauth2 import (
credentials as GoogleCredentials,
service_account as GoogleServiceAccountCredentials
)
from dbt.utils import format_bytes, format_rows_number
> from dbt.clients import agate_helper, gcloud
E ImportError: cannot import name 'gcloud' from 'dbt.clients' (/Users/kwigley/workspace/dbt-bigquery/.tox/py38-bigquery/lib/python3.8/site-packages/dbt/clients/__init__.py)
Currently, labels are applied when a table is created. However, for incremental models, the labels are not updated if they change later, so the labels become out of sync with the dbt project.
An alternative is to write a post-hook which applies the labels to the relation.
I'd prefer to contribute this feature back into the dbt project so other people can benefit from it.
This feature is specific to BigQuery
It will benefit users that access tables in BigQuery; the labels offer great metadata.
We also plan to have this visible in our metadata/catalog service.
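For the time being, the post-hook alternative mentioned above could look roughly like this (a sketch; the label keys and values are placeholders, using BigQuery's alter table ... set options (labels = ...) DDL):

```sql
{{ config(
    materialized = 'incremental',
    post_hook = "alter table {{ this }} set options (labels = [('team', 'data'), ('env', 'prod')])"
) }}
```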
I am having issues with my dbt and BQ setup. In general everything works fine, but when BQ returns an error due to failing SQL, the dbt CLI is not able to fetch the error from BQ:
▶ dbt run --models asset.dda.global_dda_dv
Running with dbt=0.19.1
...
18:37:34 | Concurrency: 1 threads (target='dev')
18:37:34 |
18:37:34 | 1 of 1 START view model dbt_bart.global_dda_dv....................... [RUN]
18:37:35 | 1 of 1 ERROR creating view model dbt_bart.global_dda_dv.............. [ERROR in 1.27s]
18:37:35 |
18:37:35 | Finished running 1 view model in 3.85s.
Completed with 1 error and 0 warnings:
Runtime Error in model global_dda_dv (models/asset/dda/global_dda_dv.sql)
404 GET https://bigquery.googleapis.com/bigquery/v2/projects/project-xyz/queries/8b5ace67-828a-4328-88d1-4c0ebb8b67e8?maxResults=0&location=EU&prettyPrint=false: Not found: Job project-xyz:EU.8b5ace67-828a-4328-88d1-4c0ebb8b67e8
(job ID: 8b5ace67-828a-4328-88d1-4c0ebb8b67e8)
Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
At the same time, I am able to get this particular error with bq show -j <job_id>.
I am using OAuth via gcloud for authorisation, here’s my config:
dde-dbt: # this needs to match the profile: in your dbt_project.yml file
target: dev
outputs:
dev:
type: bigquery
method: oauth
project: project-xyz
dataset: dbt_bart
threads: 1
timeout_seconds: 300
location: EU
priority: interactive
retries: 1
Run dbt run with a failing SQL model.
The dbt CLI should be able to find the BQ job and fetch the error message.
Which database are you using dbt with?
The output of dbt --version:
installed version: 0.19.1
latest version: 0.19.1
Up to date!
Plugins:
- bigquery: 0.19.1
- snowflake: 0.19.1
- redshift: 0.19.1
- postgres: 0.19.1
The operating system you're using: macOS Catalina 10.15.7
The output of python --version: Python 3.7.3
BigQuery Authorized views cannot be dynamically granted.
This config should grant access to the underlying tables:
{{ config(
grant_access_to=[
{'project': ref('pharmacy_claims_recon_yearmonth').database , 'dataset': ref('pharmacy_claims_recon_yearmonth').schema}
]
)
}}
select *
from {{ ref('pharmacy_claims_recon_yearmonth') }}
The access grant successfully works with ref('pharmacy_claims_recon_yearmonth').database.
The access grant DOES NOT work with ref('pharmacy_claims_recon_yearmonth').schema. Dataset will work with a variable declared with the {% set variable='value' %} command, but no variation on the schema property appears to work, including explicitly casting it to a string like ref('pharmacy_claims_recon_yearmonth').schema|string.
Expected: the access grant works correctly off of the schema value of a relation.
2021-12-10T23:49:04.768410Z: Began running node model.vida.v_pharmacy_claims_recon_yearmonth
2021-12-10T23:49:04.768702Z: 23:49:04 | 1 of 1 START view model dbt_bliyanage_clinops.v_pharmacy_claims_recon_yearmonth [RUN]
2021-12-10T23:49:04.768933Z: Acquiring new bigquery connection "model.vida.v_pharmacy_claims_recon_yearmonth".
2021-12-10T23:49:04.769013Z: Compiling model.vida.v_pharmacy_claims_recon_yearmonth
2021-12-10T23:49:04.775412Z: Writing injected SQL for node "model.vida.v_pharmacy_claims_recon_yearmonth"
2021-12-10T23:49:04.790750Z: finished collecting timing info
2021-12-10T23:49:04.813317Z: Writing runtime SQL for node "model.vida.v_pharmacy_claims_recon_yearmonth"
2021-12-10T23:49:04.826940Z: Opening a new connection, currently in state closed
2021-12-10T23:49:04.827102Z: On model.vida.v_pharmacy_claims_recon_yearmonth: /* {"app": "dbt", "dbt_version": "0.20.2", "profile_name": "user", "target_name": "default", "node_id": "model.vida.v_pharmacy_claims_recon_yearmonth"} */
create or replace view `test-dbt-vida`.`dbt_bliyanage_clinops`.`v_pharmacy_claims_recon_yearmonth`
OPTIONS()
as
select *
from `test-dbt-vida`.`dbt_bliyanage_claims`.`pharmacy_claims_recon_yearmonth`;
2021-12-10T23:49:05.944456Z: Access entry <AccessEntry: role=None, view={'projectId': 'test-dbt-vida', 'datasetId': 'dbt_bliyanage_clinops', 'tableId': 'v_pharmacy_claims_recon_yearmonth'}> already exists in dataset
2021-12-10T23:49:05.944813Z: finished collecting timing info
2021-12-10T23:49:05.945195Z: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '27b7581e-1241-47f3-ad36-bfb5e7099b84', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fc1c9436e80>]}
2021-12-10T23:49:05.945507Z: 23:49:05 | 1 of 1 OK created view model dbt_bliyanage_clinops.v_pharmacy_claims_recon_yearmonth [OK in 1.18s]
2021-12-10T23:49:05.945589Z: Finished running node model.vida.v_pharmacy_claims_recon_yearmonth
The output of dbt --version:
In build configs we use: 0.20.2
Here are a couple other observations from troubleshooting this:
• The log reports Access entry <AccessEntry: role=None, view={'projectId': 'test-dbt-vida', 'datasetId': 'dbt_bliyanage_clinops', 'tableId': 'v_pharmacy_claims_recon_yearmonth'}> already exists in dataset, even if there is no authorization.
In an ideal world, we would just be able to specify grant_access, and dbt would be able to resolve all refs in a particular query and grant access to the appropriate underlying datasets without them having to be explicitly named by a person.
If a model is materialized as incremental, its strategy is set to insert_overwrite, its on_schema_change is set to any value except ignore, and the compiled model refers to _dbt_max_partition, the model will fail to run with the following error:
Unrecognized name: _dbt_max_partition at [X:Y]
Here's a minimal example model:
{{
config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by={
'field': 'date',
'data_type': 'date',
'granularity': 'day'
},
on_schema_change='append_new_columns',
)
}}
SELECT * FROM source_table
{% if is_incremental() %}
WHERE date > _dbt_max_partition
{% endif %}
If is_incremental() is True, the model will compile to the following:
SELECT * FROM source_table
WHERE date > _dbt_max_partition
When the BigQuery adapter builds the larger SQL script it needs to run, it hits this `run_query()` command, which (if I'm understanding correctly) is mainly used to compute the update's schema so that it can be compared against the destination table's existing schema. When it tries to run that query, it has not yet declared `_dbt_max_partition` (that declaration happens later, in `bq_insert_overwrite()`), which triggers the error.
I expect the temporary relation to be created after `_dbt_max_partition` has been defined, which ought to allow the rest of the incremental update to proceed as normal.
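The expected ordering can be illustrated with a minimal Python sketch that assembles the script so the `declare` precedes the temp-table creation. This is an illustration only: the function, table, and field names are placeholders, not dbt's actual macro output.

```python
def build_incremental_script(compiled_sql: str, target_table: str,
                             partition_field: str) -> str:
    # Illustrative sketch: declare _dbt_max_partition *before* the statement
    # that materializes the compiled model SQL, so references to it inside
    # `compiled_sql` resolve instead of raising "Unrecognized name".
    declare = (
        f"declare _dbt_max_partition timestamp default (\n"
        f"  select max({partition_field}) from {target_table}\n"
        f");"
    )
    create_tmp = f"create temp table model__dbt_tmp as (\n{compiled_sql}\n);"
    return "\n".join([declare, create_tmp])
```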
The output of `dbt --version`:
installed version: 0.21.0
latest version: 0.21.0
Up to date!
Plugins:
- bigquery: 0.21.0
- snowflake: 0.21.0
- redshift: 0.21.0
- postgres: 0.21.0
The operating system you're using:
macOS Mojave 10.14.6
The output of `python --version`:
Python 3.7.10
In BigQuery, the `execution_project` might be different from the `project`. Sometimes we need this information for context logging, monitoring, and alerting (for example, alerting the dbt user if they are using an undesired `execution_project` in their development environment).
It would be great if `execution_project` were accessible under the `target` object, e.g.:
target.execution_project
No response
Teams using dbt-BigQuery that want to closely monitor whether dbt users are using `execution_project` correctly, and to warn them at run time.
yes
No response
When doing `dbt run` with this script
{{ config(materialized='table') }}
SELECT customer_id FROM `hello-data-pipeline.adwords.google_ads_campaign_stats`
I get the following error
15:41:51 | 2 of 3 START table model staging_benjamin.yo......................... [RUN]
15:41:51 | 2 of 3 ERROR creating table model staging_benjamin.yo................ [ERROR in
0.32s]
Runtime Error in model yo (models/yo.sql)
404 Not found: Dataset hello-data-pipeline:staging_benjamin was not found in location EU
(job ID: 4020ce42-ec44-438e-8ea0-9f4fe4adcc8c)
NB: BigQuery does not show any error when running this query in the BigQuery editor.
NB 2: dbt does not show any error when running the SQL directly in the script editor.
NB 3: There is no error when doing `dbt run` with the example project, or with a public BigQuery dataset like this:
{{ config(materialized='table') }}
SELECT *
FROM `bigquery-public-data.stackoverflow.posts_questions`
ORDER BY view_count DESC
LIMIT 1
dbt should run without error.
Which database are you using dbt with?
BigQuery
The operating system you're using:
dbt Cloud
Others
Data imported from Fivetran
Ran `dbt test` with BigQuery as the connection using Airflow, and got an Error on the custom test. All standard dbt tests worked fine.
Error message:
[2021-11-29, 17:33:46 UTC] {subprocess.py:89} INFO - Invalid project ID 'simple_bigquery_example_dag.[correct-project-id-here-redacted]'. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash.
Seems like the dataset name from the test's SQL query got prepended to the project ID name.
SQL Query:
SELECT * FROM simple_bigquery_example_dag.forestfires
Schema:
version: 2
models:
- name: forestfire_test
description: "A dbt model of forestfire data."
columns:
- name: id
description: "The primary key for this table"
tests:
- not_null
- unique
- name: month
tests:
- not_null
- accepted_values:
values: ['aug', 'mar', 'sep']
- name: ffmc
tests:
- ffmc_value_check
Where `ffmc_value_check` is the failing test.
Query for the test:
{% test ffmc_value_check(model, column_name) %}
SELECT {{ column_name }}
FROM simple_bigquery_example_dag.{{ model }}
HAVING NOT({{ column_name }} >= 90)
{% endtest %}
The DAG from this pull request can be used to replicate the error:
astronomer/airflow-data-quality-demo#9
Steps: run the test for `simple_bigquery_example_dag.forestfire_test` (this can also be done by running the BigQuery example DAG here). The test is expected to Fail, not to Error; the project ID is not expected to have the `simple_bigquery_example_dag.` prefix.
The output of `dbt --version`:
dbt==0.21.0
The operating system you're using:
MacOS with Docker
The output of `python --version`:
python 3.9
In BigQuery, each time SQL is executed, a job is created and assigned a unique ID (example: `job_9JVepH9O1bzemNfB6xZzxz_mfFkY`). When dbt runs, it checks for the results/status of these jobs.
We've found the `run_results.json` file extremely useful, and being able to tie these results (errors, compiled SQL, variables, tags, etc.) back to the specific job that executed in BigQuery for additional information and troubleshooting would make `run_results.json` even more useful.
You could try to correlate dbt logs with BigQuery logs in Stackdriver Logging and approximate which job corresponded to a given model creation by timestamp proximity or by matching the compiled SQL, but those methods are not deterministic.
This is specific to BigQuery for now, but I am sure there are parallels in other databases (transaction ID? query ID?).
Another aspect to consider is whether to provide multiple `jobId`s if a macro called in the model executes additional queries against the database.
Anyone that wants to tie run results back to specific jobs for additional debugging would benefit from this feature.
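One deterministic alternative would lean on the job labels dbt can already attach (see the query-comment `job-label` feature): query `INFORMATION_SCHEMA.JOBS_BY_USER` for jobs carrying a given invocation id. A sketch of the query builder; the label key `dbt_invocation_id` and the region qualifier are assumptions for illustration, not confirmed dbt behavior.

```python
def jobs_for_invocation_sql(invocation_id: str, region: str = "region-us") -> str:
    # Hypothetical helper: build a query that finds BigQuery jobs labeled
    # with a dbt invocation id. The label key "dbt_invocation_id" is assumed.
    return (
        "select job_id, creation_time, query\n"
        f"from `{region}`.INFORMATION_SCHEMA.JOBS_BY_USER\n"
        "where exists (\n"
        "  select 1 from unnest(labels) as label\n"
        f"  where label.key = 'dbt_invocation_id' and label.value = '{invocation_id}'\n"
        ")"
    )
```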
dbt's support for "ingestion-time" partitioned tables is a relic of BigQuery's historical lack of support for table partitioning DDL. BigQuery has supported column-level partitioning for some time now, so we should remove the largely-unused ingestion-time partitioning table logic from dbt.
Methods to remove:
As part of this change, we should also remove the BigQuery-specific `timeout_seconds` config, which is only used as part of creating ingestion-time partitioned tables via the BigQuery HTTP API. We should consider supporting a dbt-wide `timeout_seconds` config in the future (it would apply to any query invoked by dbt), so let's continue to support the config in `profiles.yml` but render a warning at run-start indicating that the config is unused:
Warning: The configured target specifies a timeout_seconds config, which is currently unused. Support for this config may be removed in a future release of dbt.
BQ-only... we should be able to delete a whole lot of code!
dbt maintainers/contributors, and users who (rightfully) misunderstand the nature of the `timeout_seconds` config.
When doing an incremental update of an integer-partitioned BigQuery table with the new insert_overwrite merge strategy, dbt calculates which partitions it should replace. In the process it can generate a huge BigQuery ARRAY value.
The cause is that dbt does not take the `interval` parameter of the partitioning specification into account. The generated SQL selects `array_agg(distinct PARTITION_FIELD)` when calculating the partitions to replace. This selects ALL distinct partition-field values of the incremental update, even if those values fall in the same partition, so a potentially huge array is created. With enough data in the table this even causes an error, because BigQuery's maximum array size (100 MB) is exceeded.
Note that this bug is not triggered by time partitioned tables because for those all partition fields are dates and the partition size is always one day (i.e. there is only one valid value per partition).
Generate a model with:
{{ config(
    materialized = 'incremental',
    unique_key = 'partkey',
    partition_by = {
        "field": "partkey",
        "data_type": "int64",
        "range": {
            "start": 0,
            "end": 100000,
            "interval": 1000
        }
    },
    incremental_strategy = 'insert_overwrite'
) }}
Run the model once with empty source data.
Add 1,000 records to the source data, with values for partkey from 0 to 999.
Generate the incremental update code for the model and look at the values in the `dbt_partitions_for_replacement` internal variable.
You will see that it contains all 1,000 values for partkey from 0 to 999, even though they are all inside the same single partition.
dbt should ensure that no huge temporary array variable is generated.
The dbt_partitions_for_replacement array should have at most as many elements as the number of partitions being updated. In my opinion the way to go would be to store only the starting value of each partition in the array, and then modify the merge clause to use a range for each partition.
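The suggested fix can be illustrated in a few lines of Python: bucket each partition-field value down to the start of its range partition before deduplicating, so the array never holds more elements than there are partitions. A sketch under the report's repro numbers:

```python
def partition_starts(values, start, interval):
    # Collapse raw partition-field values to the starting value of the
    # range partition each one falls in, then deduplicate and sort.
    return sorted({start + ((v - start) // interval) * interval for v in values})

# 1,000 distinct values that all live in the 0-999 partition collapse
# to a single array element instead of 1,000:
print(partition_starts(range(1000), start=0, interval=1000))  # [0]
```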
N/A
Which database are you using dbt with?
The output of `dbt --version`:
installed version: 0.16.0
latest version: 0.16.0
Up to date!
The operating system you're using:
macOS
The output of `python --version`:
Python 3.7.4
Currently, the dbt-bigquery `profiles.yml` has two configurations: timeouts and retries.
We at Semios have been using dbt extensively to run high-frequency data pipelines on BigQuery (some of them run every 10 minutes). Because of strict SLA requirements, we need more granular control over BigQuery's timeout and retry mechanisms to minimize intermittent query failures and their impact on our pipeline.
We found that the existing configs are not enough to mitigate all intermittent query failures: for example, the `Deadline of 120.0s exceeded` 503 errors raised by the BigQuery JobInsert API, which occur from time to time and have impacted several dbt-BigQuery users in the dbt community.
We would like to suggest a more comprehensive BigQuery retry config to minimize all these errors.
Before I propose the suggestion, let me explain how a BigQuery query works and how we would like to retry.
At its core, a BigQuery query is made in two steps in dbt:
def _query_and_results(self, client, sql, conn, job_params, timeout=None):
"""Query the client and wait for results."""
# Cannot reuse job_config if destination is set and ddl is used
job_config = google.cloud.bigquery.QueryJobConfig(**job_params) # <--- Step 1
query_job = client.query(sql, job_config=job_config) # <--- Step 2
iterator = query_job.result(timeout=timeout)
return query_job, iterator
In the first step, `client.query()` submits a query to the BQ JobInsert API server. When it succeeds, the BQ server creates a new BigQuery query job and returns the query job id to the client as part of the `query_job` object. This step should be very quick, normally under a few seconds; however, in some rare cases it can take much longer, even up to 4 minutes according to the BigQuery engineering team.
In the second step, `query_job.result()` awaits BigQuery executing (running) the query and returns the results to the client as an iterator. Depending on the complexity of the query, this step can take a long time, from tens of seconds to tens of minutes.
Currently, the dbt `retries` config is used to set the retries of the overall `_query_and_results`, and the dbt `timeout` config is used only to control step 2, `query_job.result(timeout=timeout)`. These configs make sense on their own.
However, you might notice that there is no control over the timeout of `query()` in step 1, other than relying on its default value (https://github.com/googleapis/python-bigquery/blob/1762e5bd98ab089abf65a0d146234c77217b8dbe/google/cloud/bigquery/client.py#L3193). When the BQ JobInsert API sharding cluster is unstable, it can take up to 4 minutes to create the query job id in some rare cases. Because `query()` had a client-side default deadline of 120s (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry)`), the client quits at 120s while the server side is still waiting for the job creation. That's why several dbt community members experienced the `Deadline of 120.0s exceeded` errors (https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400).
The BQ team has since patched `query()` to use a client-side default deadline of 10 minutes (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry, deadline=600.0)`). This means that, by default, `query()` can take up to 4 minutes to make one attempt to create a new job id; if that fails, the client side has up to 10 minutes to retry. This works fine for users without stringent timing requirements, who won't mind waiting 10 minutes to create a new BQ job id and then another 10 minutes to get the result.
Unfortunately, it doesn't work for us. In situations like this, we would rather fail faster and try again. It should not take more than 30 seconds to create a new job id; if it takes too long, we would rather let `query()` time out and fail so that we can retry.
In order to gain fine control over the timeout mechanisms of step 1 and step 2, we would like to propose the following four dbt configs:
job_creation_timeout_seconds   # specific to initiating the BQ job; controls the timeout of step 1, query()
job_execution_timeout_seconds  # specific to awaiting the job result; controls the timeout of step 2, result()
job_retry_deadline_seconds     # controls the retry deadline of the overall query, _query_and_results()
job_retries                    # controls the retries of the overall query, _query_and_results()
For example, we could set the configs below to fail faster on the step of BQ job creation, while allowing queries with long-running results.
job_creation_timeout_seconds=30
job_execution_timeout_seconds=1200
job_retry_deadline_seconds=1500
job_retries=3
These settings would allow us to control the timeout behaviors of step 1 and step 2 on their own, hence maximizing our chances to mitigate different kinds of intermittent errors.
NOTE: `job_execution_timeout_seconds` is a renaming of the current `timeout` config, and `job_retries` is a renaming of the current `retries` config.
Currently, we have to override the dbt-core code to allow `query()` to fail faster and retry, and to mitigate the 503 error.
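A pure-Python sketch of how the four proposed configs could divide responsibility. The function and parameter names here belong to this proposal, not to dbt or the BigQuery client; `submit_job` stands in for `client.query()` and the returned job's `result()` stands in for step 2.

```python
import time

def query_and_results(submit_job, job_creation_timeout, job_execution_timeout,
                      job_retries, job_retry_deadline):
    # Sketch: step 1 gets its own (short) timeout, step 2 gets its own
    # (long) timeout, and retries wrap the whole operation, bounded by an
    # overall deadline, mirroring the proposed configs above.
    start = time.monotonic()
    for attempt in range(job_retries + 1):
        try:
            job = submit_job(timeout=job_creation_timeout)         # step 1
            return job, job.result(timeout=job_execution_timeout)  # step 2
        except Exception:
            if attempt == job_retries or time.monotonic() - start >= job_retry_deadline:
                raise
```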
dbt-bigquery users who need finer control over BigQuery query behavior.
yes
this issue has been discussed by several dbt community members in slack
https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400
As BigQuery bills users based on the volume of data processed, it's fairly important for users to be aware of this while developing models and pipelines. In the BQ UI this is clearly presented after each query run, and the same information is available in the `jobs.query` API response. You can find `totalBytesProcessed` documented in the linked API documentation.
It would be helpful to see this output in the JSON logging - example:
{"timestamp": "2020-01-30", "message": "13:25:37 | 24 of 27 OK created incremental model dbt_andy.customers............. [CREATE TABLE (450) in 2.82s]", "channel": "dbt", "level": 11, "levelname": "INFO", "extra": {"unique_id": "model.customers", "bytes_processed": 4882}}
And also at the end of a run with the default log formatter:
13:16:52 | Finished running 23 incremental models, 4 table models, 1 hook in 35.87s
13:16:52 | 750mb of data processed
If you have enabled Export to BigQuery within your Google Cloud billing project, it is possible to query how much data the service account your dbt user runs as has processed. This could then be queried and monitored outside of dbt, but it would be much more helpful to have an indication of query size in real time; this is an important optimisation when building BigQuery pipelines.
This is very important for all users of BigQuery who have enough data stored to have a reasonable spend, but who are not so large that they have purchased commitments (flat fee / all you can eat) from Google. Even then, it's still best practice to pay attention to how much data you are processing...
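Rendering the byte counts for the proposed summary line is a small helper; a sketch (the function name and the binary-unit scaling are assumptions, not dbt's formatter):

```python
def format_bytes_processed(n: float) -> str:
    # Render a totalBytesProcessed value the way the proposed summary
    # line might, scaling through binary units.
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"

print(format_bytes_processed(4882))  # 4.8 KB
```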
This is a follow-up from a discussion over the topic with @jtcohen6.
Let's assume we have a model with incremental materialization and we're going for an incremental run.
merge into mydataset.newtable as DBT_INTERNAL_DEST
using (
SELECT * from mydataset.newtable__dbt_tmp
) as DBT_INTERNAL_SOURCE
on FALSE
when not matched by source
and DBT_INTERNAL_DEST._PARTITIONTIME in unnest([TIMESTAMP("2021-11-30")])
then delete
when not matched then insert
(`_PARTITIONTIME`, `transaction_id`)
values
(`_PARTITIONTIME`, `transaction_id`)
However, the merge first deletes the data and then inserts it. On big partitions this can take a while, resulting in a long operation compared to other approaches, such as:
- writing the query results directly to the destination partition (mydataset.newtable$20211130) while using the WRITE_TRUNCATE setting;
- copying the partition: bq cp mydataset.newtable__dbt_tmp$20211130 mydataset.newtable$20211130.
As far as I know, the fastest approach is to use `bq cp` (it might require benchmarking though), unless you don't need a temporary table (i.e. no column change and a single partition), in which case you would use the input query and write to the destination table with the partition decorator directly.
The main inconvenience of that approach is that, even though the `insert_overwrite` strategy isn't fully atomic (the temporary table is created and remains even if the `MERGE` fails), the `MERGE` query is atomic across all partitions.
So to keep the same behavior, it requires either having a single partition, or accepting the tradeoff of breaking the atomicity of the partition-replacement step.
Therefore it could be relevant to have a `copy_partitions` config to activate that approach. It could also be a whole new incremental strategy, if relevant.
I had a production case, described in the thread, where, using a single partition, the `MERGE` version (insert_overwrite) took 43 minutes while the query with WRITE_TRUNCATE took 26 minutes for the same result.
At Teads, our internal BigQuery query wrapper tool uses a select with write_truncate; we don't use a temporary table as an intermediate step because we only process a single partition per query (a specific case compared to the dbt approach). That performance overhead is quite a deal breaker for using dbt as-is for those queries.
Of course, it would be much better if Google could make a server-side optimisation on `MERGE` queries when it detects that pattern.
It will benefit anyone using the `insert_overwrite` incremental strategy with large partitions, where deletes/inserts take long to process.
Yes
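The partition-decorator addressing that both `bq cp` and a WRITE_TRUNCATE copy rely on can be sketched as follows; the helper name is illustrative, and the commented client calls are an assumption about how a google-cloud-bigquery copy would look, not code from dbt.

```python
from datetime import date

def partition_decorator(table: str, day: date) -> str:
    # BigQuery addresses one daily partition as table$YYYYMMDD,
    # e.g. mydataset.newtable$20211130.
    return f"{table}${day:%Y%m%d}"

src = partition_decorator("mydataset.newtable__dbt_tmp", date(2021, 11, 30))
dst = partition_decorator("mydataset.newtable", date(2021, 11, 30))
# A copy with google-cloud-bigquery would then look roughly like (untested):
#   job_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
#   client.copy_table(src, dst, job_config=job_config).result()
```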
This is a follow-up from a discussion over the topic with @jtcohen6.
Ingestion-time partitioned tables are supported with the `table` materialization right now, but that support is meant to be deprecated.
The specificity of those tables is that the partitioning field is `_PARTITIONTIME` or `_PARTITIONDATE`, a "pseudo column".
Since that column doesn't really exist as a column within the table, Google doesn't allow the same kinds of operations on it.
Indeed if you would like to use
partition_by={
"field": "_PARTITIONTIME",
"data_type": "timestamp"
}
BigQuery doesn't let you create an ingestion-time partitioned table using the usual dbt approach (https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_an_ingestion-time_partitioned_table):
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
DATE_TRUNC(_PARTITIONTIME, DAY)
AS (
SELECT TIMESTAMP("2021-11-01") as _PARTITIONTIME, 1 as transaction_id
)
will fail, as _PARTITIONTIME is not among the defined columns.
So the required approach is to first run:
CREATE TABLE
mydataset.newtable (transaction_id INT64)
PARTITION BY
DATE_TRUNC(_PARTITIONTIME, DAY)
And then:
INSERT INTO mydataset.newtable (_PARTITIONTIME, transaction_id)
SELECT TIMESTAMP("2021-11-01"), 1
Once we move to the merge part, it's indeed possible to insert data with
merge into mydataset.newtable as DBT_INTERNAL_DEST
using (
SELECT TIMESTAMP("2021-11-03") as _PARTITIONTIME, 3 as transaction_id
) as DBT_INTERNAL_SOURCE
on FALSE
when not matched by source
and DBT_INTERNAL_DEST._PARTITIONTIME in unnest([TIMESTAMP("2021-11-03")])
then delete
when not matched then insert
(`_PARTITIONTIME`, `transaction_id`)
values
(`_PARTITIONTIME`, `transaction_id`)
Practically, it won't be 100% straightforward, as it requires using
SELECT _PARTITIONTIME, * FROM mydataset.newtable__dbt_tmp
because `_PARTITIONTIME` is not directly a column and is therefore not included in `SELECT *`; but otherwise it appears to work.
Practically, I think everything can be done within incremental.sql
Alternatives are:
At Teads, we use a lot of ingestion-time partitioned tables, as they were created prior to BigQuery's column-type partitioning feature. Migrating would be an option, but the overhead introduced for selecting data from a column-type partitioned table is "a drag" for making that move.
It would benefit anyone using ingestion-time partitioned tables.
Yes
Hi everyone.
I did not find this issue reported anywhere yet, which seems odd to me, and I don't know if it is on my side.
Using BigQuery and dbt `0.20.2`.
Having defined a test for a model like this:
- name: base_test_model
tests:
- unique:
column_name: "concat(id, field, changed_at)"
This was working before dbt `0.20.0`. Now I get an error:
Database Error in test unique_base_test_id_field_changed_at (models\base\schema.yml)
SELECT list expression references column id which is neither grouped nor aggregated at [12:5] compiled SQL at
I checked in projects where I use dbt `0.19.x`, and the test above compiled to this:
select count(*) as validation_errors
from (
select
concat(id, field, changed_at)
from base_test
where concat(id, field, changed_at) is not null
group by concat(id, field, changed_at)
having count(*) > 1
) validation_errors
Now it compiles to this:
select
count(*) as failures
from (
select
concat(id, field, changed_at) as unique_field,
count(*) as n_records
from base_test
where concat(id, field, changed_at) is not null
group by concat(id, field, changed_at)
having count(*) > 1
) dbt_internal_test
which BigQuery seems not to like. Using the query above but removing the `unique_field` alias works for BigQuery.
With Postgres and MySQL, both of the above queries run. I did not test Snowflake or Redshift.
Add the following roles in IAM to the dbt service account:
In dbt_project.yml
, add:
models:
+persist_docs:
relation: true
columns: true
vars:
policy_tag_ids:
pt_name: 'projects/my-project/locations/<location>/taxonomies/<organization>/policyTags/<tag>'
In schema.yml
, add:
models:
- name: my_model
columns:
- name: name
policy_tags:
- '{{ var("policy_tag_ids")["pt_name"] }}'
$ dbt run -m my_model
Expected: a column-level policy tag on the `my_model` table in BigQuery's schema tab.
Actual: nothing shows. In the BigQuery cloud audit logs (both data access and activity), there isn't any mention of policy tags.
From the original pull request that implemented this functionality, the test class seems to follow the same steps as the ones taken here.
Thanks as ever for such a brilliant app!
Changing the GCP project (e.g. with `gcloud config set project {project}`) and running a dbt command (e.g. `dbt run`) will not update the project with partial parsing. Instead, it's necessary to run `dbt clean`, as specified in the parsing known limitations.
dbt clean
gcloud config set project project_foo
dbt compile
rg 'project_foo' | wc -l # returns a big number
gcloud config set project project_bar
dbt compile
rg 'project_foo' | wc -l # returns a big number, should return zero
rg 'project_bar' | wc -l # returns zero, should return a big number
I can understand why dbt doesn't track the implicit GCP project in its cache invalidation logic, and so this is a hard problem to solve.
But I'm not sure it's a necessary problem to have. Why does dbt materialize the project name in the queries? If it didn't materialize the projects (arguably by respecting that abstraction), BQ would resolve the implicit project without it being specified, and this would work.
The output of `dbt --version`:
1.0.1
The operating system you're using:
MacOS
The output of `python --version`:
3.9.9
Prompted by brooklyn-data/dbt_artifacts#6.
Access to a macro such as:
{% do adapter.upload_file(file_path, destination) %}
would enable dbt_artifacts to become compatible with BigQuery. It looks as though the function to wrap is load_table_from_file.
dbt-labs/dbt-core#3145 did something really cool: By setting in dbt_project.yml:
query-comment:
job-label: true
dbt will supply the key-value pairs from a query-comment dictionary (or the full value as one key-value pair, if the query comment is a non-dict string) to BigQuery as a job label for the queries representing node execution.
In order to really test that this is working, and to avoid regression, we should add an integration test that:
- runs with a `query-comment` and checks that the labels registered appropriately, by querying `INFORMATION_SCHEMA.JOBS_BY_USER` (we'd want that query to filter on `user_email` and `creation_time`);
- with a different `query-comment`, checks the same;
- runs with `job-label: false` and ensures that only the default job label (`invocation_id`) is applied.
Alternative: not integration testing this.
dbt-labs/dbt-core#3145 added unit testing for the Python dict/string handling functions, so the nuts and bolts are in place. In order for this to be a feature that dbt-bigquery users can feel comfortable relying on, we need to ensure that it will continue working in future releases, which may touch `query-comment` functionality in indirect ways.
When I run `dbt seed` with a dataset name that includes capital letters, it fails, e.g. when using "Bananas" instead of "bananas".
It should work fine.
No response
10:45:53 | Concurrency: 1 threads (target='caoilte')
10:45:53 |
10:45:53 | 1 of 1 START seed file caoilte_Bananas_raw.seed_top1300 [RUN]
Unhandled error while executing seed.fruit.seed_top1300
404 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/banana-project/jobs?uploadType=resumable: Not found: Dataset banana-project:caoilte_Bananas_raw
10:45:53 | 1 of 1 ERROR loading seed file caoilte_Bananas_raw.seed_top1300 [ERROR in 0.27s]
10:45:53 |
10:45:53 | Finished running 1 seed in 5.72s.
Completed with 1 error and 0 warnings:
404 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/banana-project/jobs?uploadType=resumable: Not found: Dataset banana-project:caoilte_Bananas_raw
### Environment
- OS: Linux
- Python: 3.8.6
- dbt: 0.21.0
bigquery
dbt-bigquery works fine with capitalised model names (it was trying to make the seed name consistent with a capitalised model name that surfaced this bug).
I created a row-level access policy in BigQuery for the user dbt uses to connect. When I try to run `dbt run -m +dim_table`, tables created using a reference to the table with the row access policy fail with a macro error. I took the compiled target SQL for that table, and running it directly in BigQuery works fine.
I am trying to use this process, adding the dbt user to the row access policy, to more easily apply, across other tables, this restriction on rows I don't want to show in Metabase.
Create a row access policy using
CREATE OR REPLACE ROW ACCESS POLICY teste
ON `tableA`
GRANT TO ("serviceAccount:dbt_service_account@project_id.iam.gserviceaccount.com")
FILTER USING (id != "99");
where the service account is the same one dbt uses to connect.
Then create a SQL file with:
select
*
from
{{ ref('tableA') }}
Without the row access policy everything works fine; with the row access policy I get an error.
I expect the table to be created, and querying table A for id 99 to return an empty result. But instead I get this error:
Completed with 1 error and 0 warnings:
Compilation Error in macro statement (macros/core.sql)
bad operand type for abs(): 'NoneType'
> in macro materialization_table_bigquery (macros/materializations/table.sql)
> called by macro statement (macros/core.sql)
Done. PASS=8 WARN=0 ERROR=1 SKIP=1 TOTAL=10
I tried both
installed version: 0.19.1
Plugins:
- redshift: 0.19.1
- postgres: 0.19.1
- bigquery: 0.19.1
- snowflake: 0.19.1
and in
installed version: 0.21.0
Plugins:
- redshift: 0.21.0
- postgres: 0.21.0
- bigquery: 0.21.0
- snowflake: 0.21.0
The operating system you're using:
debian buster
The output of python --version
:
python 3.8
Add any other context about the problem here.
A workaround was needed here to get the correct package version for setup.py: #82. A more permanent fix, like we did in dbt-core, is needed.
We use labels on Google Cloud assets for inventory-tracking purposes. For our BigQuery datasets, we need labels at the schema/dataset level.
It looks like it's possible to add other BigQuery schema-level configuration to the profile, e.g. dataset location.
Are there any plans to support labels at the dataset level?
We tested table-level labels, implemented as per dbt-labs/dbt-core#1942, but our inventory-tracking service (Vanta, for SOC 2 purposes) expects the labels at the dataset level.
Other BigQuery users who are working with labels at the dataset level.
Change the full-refresh behaviour for BigQuery (or at least make it configurable).
Current behaviour: a full refresh runs a single `create or replace` statement.
Proposed behaviour: create the new table under a temporary name (e.g. `model_table_name__tmp`); replace `model_table_name` only if the full refresh is successful, by copying `model_table_name__tmp` to `model_table_name`; then drop `model_table_name__tmp`.
This solution has two advantages:
No alternative solution found.
I am using BigQuery; I'm not sure whether this could be extended to other adapters.
Anyone that needs to run very big full refreshes. This issue affects them linearly with the query execution time.
The solution reduces the cost to the copy time, which is usually much faster than query execution for complex queries.
With some guidance.
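The full-refresh proposal above can be sketched as a flow; the callables are injected stand-ins for the adapter's statement execution, a truncating copy, and a drop, and all names here are illustrative rather than dbt's.

```python
def full_refresh_via_tmp(run_query, copy_table, drop_table, table, model_sql):
    # Sketch of the proposed behaviour: build the new table under a __tmp
    # name first, and only touch the real table once the expensive query
    # has succeeded, so a failed refresh leaves the old table intact.
    tmp = f"{table}__tmp"
    run_query(f"create or replace table {tmp} as ({model_sql})")  # expensive step
    copy_table(tmp, table)   # only reached if the query above succeeded
    drop_table(tmp)          # clean up the temporary table
```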
When running dbt tests in a container, I intermittently receive authentication errors, which cause the running tasks to fail.
Run `dbt test [my_model]` (expect ~288 tests to run).
I expect all tests to succeed (verified by running locally).
"exc_info": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 206, in open\n handle = cls.get_bigquery_client(connection.credentials)\n File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 187, in get_bigquery_client\n creds = cls.get_bigquery_credentials(profile_credentials)\n File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 157, in get_bigquery_credentials\n credentials, project_id = google.auth.default(scopes=cls.SCOPE)\n File \"/usr/local/lib/python3.7/site-packages/google/auth/_default.py\", line 354, in default\n raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)\ngoogle.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.7/site-packages/dbt/task/base.py\", line 333, in safe_run\n result = self.compile_and_execute(manifest, ctx)\n File \"/usr/local/lib/python3.7/site-packages/dbt/task/base.py\", line 276, in compile_and_execute\n result = self.run(ctx.node, manifest)\n File \"/usr/local/lib/python3.7/site-packages/dbt/task/base.py\", line 378, in run\n return self.execute(compiled_node, manifest)\n File \"/usr/local/lib/python3.7/site-packages/dbt/task/test.py\", line 84, in execute\n failed_rows = self.execute_schema_test(test)\n File \"/usr/local/lib/python3.7/site-packages/dbt/task/test.py\", line 64, in execute_schema_test\n fetch=True,\n File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/base/impl.py\", line 227, in execute\n fetch=fetch\n File 
\"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 275, in execute\n query_job, iterator = self.raw_execute(sql, fetch=fetch)\n File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 248, in raw_execute\n client = conn.handle\n File \"/usr/local/lib/python3.7/site-packages/dbt/contracts/connection.py\", line 71, in handle\n self._handle.resolve(self)\n File \"/usr/local/lib/python3.7/site-packages/dbt/contracts/connection.py\", line 96, in resolve\n return self.opener(connection)\n File \"/usr/local/lib/python3.7/site-packages/dbt/adapters/bigquery/connections.py\", line 210, in open\n gcloud.setup_default_credentials()\n File \"/usr/local/lib/python3.7/site-packages/dbt/clients/gcloud.py\", line 24, in setup_default_credentials\n run_cmd('.', [\"gcloud\", \"auth\", \"application-default\", \"login\"])\n File \"/usr/local/lib/python3.7/site-packages/dbt/clients/system.py\", line 436, in run_cmd\n out, err)\ndbt.exceptions.CommandResultError: Got a non-zero returncode running: ['gcloud', 'auth', 'application-default', 'login']"
Which database are you using dbt with?
The output of dbt --version:
installed version: 0.18.1
The operating system you're using:
The output of python --version:
Python 3.7.7
When calling load_result() from a statement block connected to a BigQuery database, an UPDATE operation does not return an AdapterResponse populated on par with other DML operations.
It seems like UPDATE should be included here:
Moreover, why do SELECT statements not also get a filled-out AdapterResponse?
{% macro test_response_object() %}
{% call statement('test') %}
... UPDATE STATEMENT ...
{% endcall %}
{% set result = load_result('test') %}
{{ dbt_utils.log_info(result) }}
{% endmacro %}
From the command line:
dbt run-operation test_response_object
Result:
{'response': BigQueryAdapterResponse(_message='OK', code=None, rows_affected=None, bytes_processed=None), 'data': [], 'table': <agate.table.Table object at 0x114f1cd90>}
The BigQueryAdapterResponse fields should be populated with the values reported by the BigQuery API. Here is an example of what you get when running a DELETE:
{'response': BigQueryAdapterResponse(_message='DELETE (1.0 rows, 352.5 KB processed)', code='DELETE', rows_affected=1, bytes_processed=360929), 'data': [], 'table': <agate.table.Table object at 0x1065e82b0>}
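For illustration, here is a sketch of how an adapter could build such a response uniformly for every DML statement type, using fields the BigQuery API reports on a completed query job (statement type, DML-affected row count, bytes processed). The class and function names below are hypothetical, not dbt-bigquery's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for BigQueryAdapterResponse, with the same fields
# as the objects shown in the issue above.
@dataclass
class AdapterResponseSketch:
    _message: str
    code: Optional[str] = None
    rows_affected: Optional[int] = None
    bytes_processed: Optional[int] = None

def response_from_job(statement_type, num_dml_affected_rows, total_bytes_processed):
    # Treat all DML statement types uniformly: UPDATE and MERGE get the
    # same handling that DELETE and INSERT already receive.
    if statement_type in ("INSERT", "UPDATE", "DELETE", "MERGE"):
        kb = total_bytes_processed / 1024
        return AdapterResponseSketch(
            _message=f"{statement_type} ({num_dml_affected_rows} rows, {kb:.1f} KB processed)",
            code=statement_type,
            rows_affected=num_dml_affected_rows,
            bytes_processed=total_bytes_processed,
        )
    # Non-DML statements fall back to the bare "OK" response seen above.
    return AdapterResponseSketch(_message="OK")

# Feeding in the numbers from the DELETE example, but for an UPDATE:
print(response_from_job("UPDATE", 1, 360929))
```

Under this sketch an UPDATE would yield rows_affected=1 and bytes_processed=360929 rather than the empty response shown in the issue.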
The output of dbt --version:
installed version: 0.21.0
latest version: 0.21.0
Up to date!
Plugins:
- bigquery: 0.21.0
- snowflake: 0.21.0
- redshift: 0.21.0
- postgres: 0.21.0
The operating system you're using:
macOS v 11.6 (20G165)
The output of python --version:
Python 3.8.11