Comments (10)
@dondaum Looks reasonable from what I can tell. I'd suggest trying to get some eyes on it from the committers.
As a workaround, setting `force_rerun=False` and setting `retries` should allow another retry to pick up and monitor the original job instead of starting a new one.
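In operator terms, that workaround might look like the sketch below. This is illustrative only: the task id and query are placeholders, not taken from the issue.

```python
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Hypothetical task illustrating the workaround: with force_rerun=False,
# a retried task attempt reattaches to the still-running BigQuery job
# instead of submitting a new one, so the retry simply resumes monitoring.
watch_job = BigQueryInsertJobOperator(
    task_id="example_query",  # placeholder task id
    configuration={"query": {"query": "SELECT 1", "useLegacySql": False}},
    force_rerun=False,  # reuse the original job on retry
    retries=3,          # give a retry the chance to pick the job up
    deferrable=True,
)
```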
What `location` do you provide as an input? (Checking if it might be related to #37282.)
> As a workaround, setting `force_rerun=False` and setting `retries` should allow another retry to pick up and monitor the original job instead of starting a new one.
@collinmcnulty Thanks, but this doesn't occur often enough to warrant that change, and it would likely have its own side effects, because forcing a rerun is what we want to do in most scenarios.
> What `location` do you provide as an input? (Checking if it might be related to #37282.)
It's EU, so I don't think it would be.
I may have some time to look at it. A few questions:
- What authentication method do you use: Application Default Credentials, a service account, or a credential configuration file?
- Are you running other GCP-related asynchronous tasks on the triggerer when you see these exceptions (e.g. multiple BigQuery tasks at the same time)?
- It seems that the exception is thrown after about 30 minutes. Do these exceptions only occur after some time, or is it random and they also occur after a few minutes?
> I may have some time to look at it. A few questions:
> - What authentication method do you use: Application Default Credentials, a service account, or a credential configuration file?
> - Are you running other GCP-related asynchronous tasks on the triggerer when you see these exceptions (e.g. multiple BigQuery tasks at the same time)?
> - It seems that the exception is thrown after about 30 minutes. Do these exceptions only occur after some time, or is it random and they also occur after a few minutes?
@dondaum This will be using ADC as authentication. Yes, there could be multiple BigQuery tasks running at the same time. I don't think that time is a factor; the 30 mins in the log I gave might just be how long the query ran for, because I have examples that occur within a couple of minutes.
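For context on why concurrent tasks could matter here: all deferred tasks hand their triggers to the triggerer, which polls them side by side on a single asyncio event loop. A toy sketch of that pattern, where `poll_job` is purely illustrative and not the provider's actual trigger code:

```python
import asyncio

async def poll_job(job_id: int, delay: float) -> str:
    # Stand-in for one deferrable trigger polling one BigQuery job.
    # On the real triggerer, many of these coroutines share one event loop.
    await asyncio.sleep(delay)
    return f"job-{job_id}: DONE"

async def main() -> list:
    # Several "jobs" are awaited concurrently, like multiple deferred
    # BigQueryInsertJobOperator tasks running at the same time.
    return await asyncio.gather(*(poll_job(i, 0.01) for i in range(3)))

results = asyncio.run(main())
```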
I tried to reproduce the exact error, but with no success. I used the following DAG:
```python
import datetime
import os

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

WAIT_QUERY = """
DECLARE retry_count INT64;
DECLARE success BOOL;
DECLARE size_bytes INT64;
DECLARE row_count INT64;
DECLARE DELAY_TIME DATETIME;
DECLARE WAIT STRING;

SET retry_count = 2;
SET success = FALSE;

WHILE retry_count <= 3 AND success = FALSE DO
  BEGIN
    SET row_count = (with a as (SELECT 1 as b) SELECT * FROM a WHERE 1 = 2);
    IF row_count > 0 THEN
      SELECT 'Table Exists!' as message, retry_count as retries;
      SET success = TRUE;
    ELSE
      SELECT 'Table does not exist' as message, retry_count as retries, row_count;
      SET retry_count = retry_count + 1;
      -- WAITFOR DELAY '00:00:10';
      SET WAIT = 'TRUE';
      SET DELAY_TIME = DATETIME_ADD(CURRENT_DATETIME, INTERVAL 90 SECOND);
      WHILE WAIT = 'TRUE' DO
        IF (DELAY_TIME < CURRENT_DATETIME) THEN
          SET WAIT = 'FALSE';
        END IF;
      END WHILE;
    END IF;
  END;
END WHILE;
"""

with DAG(
    dag_id=os.path.splitext(os.path.basename(__file__))[0],
    schedule=None,
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
    tags=["testing"],
) as dag:
    for i in range(10):
        bq_task = BigQueryInsertJobOperator(
            task_id=f"debug_query_{i}",
            configuration={
                "query": {
                    "query": WAIT_QUERY,
                    "useLegacySql": False,
                    "priority": "BATCH",
                }
            },
            location="europe-west3",
            deferrable=True,
        )
```
Also, I set the retry option in the GCP connection to 0 so as not to implicitly retry on failure.
Could you perhaps create a DAG that reproduces the error? And could you also check which apache-airflow-providers-google version you are using?
My setup:
Apache Airflow
version | 2.7.3
executor | LocalExecutor
task_logging_handler | airflow.utils.log.file_task_handler.FileTaskHandler
sql_alchemy_conn | postgresql+psycopg2://airflow:airflow@postgres/airflow
dags_folder | /opt/airflow/dags
plugins_folder | /opt/airflow/plugins
base_log_folder | /opt/airflow/logs
remote_base_log_folder |
System info
OS | Linux
architecture | x86_64
uname | uname_result(system='Linux', node='42d8cf034cfc', release='5.10.16.3-microsoft-standard-WSL2', version='#1 SMP Fri Apr 2 22:23:49 UTC 2021', machine='x86_64')
locale | ('en_US', 'UTF-8')
python_version | 3.11.6 (main, Nov 1 2023, 14:02:22) [GCC 10.2.1 20210110]
python_location | /usr/local/bin/python
Tools info
git | NOT AVAILABLE
ssh | OpenSSH_8.4p1 Debian-5+deb11u2, OpenSSL 1.1.1w 11 Sep 2023
kubectl | NOT AVAILABLE
gcloud | NOT AVAILABLE
cloud_sql_proxy | NOT AVAILABLE
mysql | mysql Ver 8.0.35 for Linux on x86_64 (MySQL Community Server - GPL)
sqlite3 | 3.34.1 2021-01-20 14:10:07 10e20c0b43500cfb9bbc0eaa061c57514f715d87238f4d835880cd846b9ealt1
psql | psql (PostgreSQL) 16.0 (Debian 16.0-1.pgdg110+1)
Paths info
airflow_home | /opt/airflow
system_path | /root/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
python_path | /home/airflow/.local/bin:/usr/local/lib/python311.zip:/usr/local/lib/python3.11:/usr/local/lib/python3.11/lib-dynload:/home/airflow/.local/lib/python3.11/site-packages:
| /usr/local/lib/python3.11/site-packages:/opt/airflow/dags:/opt/airflow/config:/opt/airflow/plugins
airflow_on_path | True
Providers info
apache-airflow-providers-amazon | 8.10.0
apache-airflow-providers-apache-beam | 5.6.2
apache-airflow-providers-celery | 3.6.0
apache-airflow-providers-cncf-kubernetes | 8.0.1
apache-airflow-providers-common-sql | 1.11.1
apache-airflow-providers-daskexecutor | 1.1.0
apache-airflow-providers-dbt-cloud | 3.7.0
apache-airflow-providers-docker | 3.8.0
apache-airflow-providers-elasticsearch | 5.1.0
apache-airflow-providers-ftp | 3.7.0
apache-airflow-providers-google | 10.16.0
apache-airflow-providers-grpc | 3.3.0
apache-airflow-providers-hashicorp | 3.6.4
apache-airflow-providers-http | 4.10.0
apache-airflow-providers-imap | 3.5.0
apache-airflow-providers-microsoft-azure | 8.1.0
apache-airflow-providers-mysql | 5.5.4
apache-airflow-providers-odbc | 4.1.0
apache-airflow-providers-openlineage | 1.2.0
apache-airflow-providers-postgres | 5.10.2
apache-airflow-providers-redis | 3.4.0
apache-airflow-providers-sendgrid | 3.4.0
apache-airflow-providers-sftp | 4.9.0
apache-airflow-providers-slack | 8.3.0
apache-airflow-providers-snowflake | 5.1.0
apache-airflow-providers-sqlite | 3.7.1
apache-airflow-providers-ssh | 3.10.1
@dondaum Thanks, but, as I mentioned, it isn't possible to replicate this consistently. The error returned is an HTTP 502 from Google, which means the problem was ultimately on their side: the trigger hits it while trying to obtain impersonated credentials in order to check the status of a BigQuery job. It doesn't have anything to do with query times or whether the table exists or not.
Perhaps it is possible to simulate the exception that is received by the trigger though?
```
('Unable to acquire impersonated credentials', '<!DOCTYPE html>\n<html lang=en>\n <meta charset=utf-8>\n <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">\n <title>Error 502 (Server Error)!!1</title>\n <style>\n {margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px} > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_colo
```
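One way to simulate that class of failure in isolation is a fake credential fetch that raises a transient 502 a couple of times before succeeding; a retry wrapper then recovers, which is roughly the behaviour a provider-side fix would add. All names here (`TransientServerError`, `get_credentials_with_retry`, `flaky_fetch`) are illustrative, not provider APIs:

```python
import time

class TransientServerError(Exception):
    """Stand-in for Google's intermittent 502 while acquiring impersonated credentials."""

def get_credentials_with_retry(fetch, max_attempts=3, base_delay=0.01):
    """Retry a flaky call with exponential backoff; a sketch, not the provider's code."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientServerError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the 502 to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))

# Fake backend: raises a 502 twice, then hands back credentials.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientServerError("502 Server Error")
    return "impersonated-credentials"

creds = get_credentials_with_retry(flaky_fetch)
```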
Oh, and for clarity on the Google provider version:

```
$ pip freeze | grep apache-airflow-providers-google
apache-airflow-providers-google==10.16.0
```
@nathadfield Thanks. I think I've got it now.
I worked on a change that adds a retry in such cases. Could you perhaps have a look and check?