
Comments (14)

LamaAni avatar LamaAni commented on September 24, 2024 1

Could you give me a bit more details about what you expect in experimental/ ? Just the DAG code + YAML file, or more of a test like test_job_runner.py ?

Experimental is a folder for experimental work and tests. It will be removed in the future once the operator has more people involved. Until then, as long as you document your test and put it under your own folder, I'm OK with it. Please make it as clear as possible and keep the code minimal. See examples for examples ;)

from kubernetesjoboperator.

LamaAni avatar LamaAni commented on September 24, 2024

Hi, I'll test it out on my local system, but it might take a few days. Connection reset means the connection was closed by the server, which is odd. Would you be able to specify:

  1. Which Python version?
  2. Where is Airflow executed? (Local, on the cluster ... GCP, AWS, custom?)
  3. Other server info if possible.
  4. The error log.


odesenfans avatar odesenfans commented on September 24, 2024

Sure, here you go:

  1. Python version 3.7.9
  2. The connection reset by peer occurs on a server running on Azure. I only ran the sample locally, but I can spin it up on the VM if you believe it will help identify the issue.
  3. Output of uname -a:
     Linux DTCODSDEV002 5.4.0-1031-azure #32~18.04.1-Ubuntu SMP Tue Oct 6 10:03:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
     If you want specific info, let me know.
  4. The error log for the code sample is here: https://github.com/LamaAni/KubernetesJobOperator/files/5996463/test_connection_reset_dev_test_connection_reset_task_2021-02-15T05_00_00%2B00_00_4.log

Note that this one is not a connection reset by peer issue and looks more like a code issue, so it might help on the way to the main problem. The initial log (connection reset by peer) is here:
connection_reset_by_peer.log


LamaAni avatar LamaAni commented on September 24, 2024

Hi,

Maybe there are connection issues with the underlying library. I'll test it. For reference, the longest-running job tested with it ran for ~17 hours. Also, it should withstand a connection reset.


LamaAni avatar LamaAni commented on September 24, 2024

I checked your DAG in this branch: https://github.com/LamaAni/KubernetesJobOperator/tree/test_issue_33

There seem to be no issues on the execution side, and the operation completed successfully. From your log, it seems that the error you are seeing matches this Stack Overflow post.

I got this from lines 343-348 of the log:

  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)

urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

[2021-02-17 14:45:13,431] {job_runner.py:253} ERROR - {job-runner}: Execution timeout... deleting resources

It seems that the connection is closed prematurely by the server and urllib3 raises an error as a result. I am not sure why.
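
For context, the "InvalidChunkLength(got length b'', ...)" message suggests the client read an empty line where an HTTP chunk-size line was expected, which is what typically happens when the server closes a chunked response mid-stream. Below is a minimal Python sketch of that parsing step. It is a simplified illustration, not urllib3's actual code, and parse_chunk_size is a hypothetical helper.

```python
def parse_chunk_size(line: bytes) -> int:
    """Parse the size field of an HTTP/1.1 chunk header line.

    A chunk header looks like b"1a\r\n" or b"1a;ext=value\r\n": a
    hexadecimal length, optionally followed by extensions after ';'.
    """
    # Drop any chunk extensions, then surrounding whitespace / CRLF.
    size_field = line.split(b";", 1)[0].strip()
    # An empty field (connection closed before the header arrived)
    # cannot be parsed as hex, so int() raises ValueError here.
    return int(size_field, 16)


if __name__ == "__main__":
    print(parse_chunk_size(b"1a\r\n"))
    try:
        # Reading from a socket the peer has closed yields b"".
        parse_chunk_size(b"")
    except ValueError as error:
        print("invalid chunk length:", error)
```

In other words, an empty chunk-size line is usually a symptom of the connection being dropped, not of a malformed payload.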

As a possible solution, we can force a reconnection in the underlying client, which would solve this issue. I think this should be added only if this kind of communication issue is common. Moreover, I am not sure that is the right way to go, since this seems to be a server issue.
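
A forced reconnection of this kind could look roughly like the Python sketch below. This is only an illustration under stated assumptions, not the operator's implementation: stream_with_reconnect, open_stream and make_flaky_stream are hypothetical names, and the sketch retries on the built-in ConnectionError; a real client would pass the transport's own exception types (for example urllib3.exceptions.ProtocolError) through retry_on.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def stream_with_reconnect(
    open_stream: Callable[[], Iterable[str]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_reconnects: int = 3,
    backoff_seconds: float = 0.0,
) -> Iterator[str]:
    """Yield items from open_stream(), reopening it if the connection breaks."""
    reconnects = 0
    while True:
        try:
            for item in open_stream():
                yield item
            return  # the stream ended normally
        except retry_on:
            reconnects += 1
            if reconnects > max_reconnects:
                raise  # give up and surface the original error
            # Linear backoff before opening a new connection.
            time.sleep(backoff_seconds * reconnects)


def make_flaky_stream(fail_times: int) -> Callable[[], Iterator[str]]:
    """Build a fake stream that is reset by the peer on its first fail_times opens."""
    state = {"calls": 0}

    def open_stream() -> Iterator[str]:
        state["calls"] += 1
        call = state["calls"]

        def events() -> Iterator[str]:
            yield f"event-{call}"
            if call <= fail_times:
                raise ConnectionResetError("connection reset by peer")

        return events()

    return open_stream


if __name__ == "__main__":
    for event in stream_with_reconnect(make_flaky_stream(fail_times=2)):
        print(event)
```

Note that naively reopening a stream can re-deliver or skip events; a real watch would need to resume from the last position it saw, which this sketch does not attempt.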

I will leave this issue open until you have some conclusions on your end. I am very interested in this issue, since it may recur and be hard to detect on many servers.

If you could,

  1. Test on a local Kubernetes cluster (Docker Desktop or minikube)
  2. Test on a Kubernetes cluster without providing creds, i.e. run in-cluster (maybe the creds are invalidated?)
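
To confirm which of the two setups a given run is actually using, one option is to check for the service-account environment that Kubernetes injects into pods. The Python sketch below is an assumption-laden illustration (running_in_cluster is a hypothetical helper, not part of the operator); it relies on the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT variables and the default service-account token path.

```python
import os
from typing import Mapping, Optional

# Default location where Kubernetes mounts the pod's service-account token.
SERVICE_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def running_in_cluster(
    env: Optional[Mapping[str, str]] = None,
    token_path: str = SERVICE_TOKEN_PATH,
) -> bool:
    """Return True if the process appears to run inside a Kubernetes pod."""
    env = os.environ if env is None else env
    has_service_env = bool(env.get("KUBERNETES_SERVICE_HOST")) and bool(
        env.get("KUBERNETES_SERVICE_PORT")
    )
    # Both the API server address and a mounted token are needed
    # for in-cluster credentials to work.
    return has_service_env and os.path.isfile(token_path)


if __name__ == "__main__":
    source = "in-cluster service account" if running_in_cluster() else "kubeconfig file"
    print("Expected credentials source:", source)
```

Logging this at the start of a test run makes it easier to tell whether a failure happened with kubeconfig credentials or with in-cluster ones.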

If you can, please keep me up to date on the progress. If connection issues are common, I will address them in the underlying REST API, but if this is a server issue, then it should be made known.

Tests I did

To execute the code using the local debug,

  1. Have airflow installed
  2. Have KJO installed

From the repo root, run (there may be local changes):

mkdir -p ./.local
./tests/local_airflow/configure
./tests/local_airflow/start initdb
python ./tests/dags/test_log_run_issue.py


odesenfans avatar odesenfans commented on September 24, 2024

Ok thanks a lot for testing. I'll give it a try using the options you propose here, anyway I planned to move this Airflow instance in-cluster somewhere in the future so now might be the right time to give it a try.

I'll first test with minikube and see where it goes from there! I'll keep you posted, might take a few days.


odesenfans avatar odesenfans commented on September 24, 2024

Update: I reproduced the issue on minikube using the sample DAG. The output is essentially the same. Note that the delay seems to be around 5 minutes 30 seconds in this case. This could be explained by the fact that both Docker and minikube run in VMs on macOS.

Setup:

  • Airflow 1.10.13, official Docker image, running in Docker Compose
  • Minikube v1.17.1
  • macOS Mojave 10.14.6

Log: test_connection_reset_dev_test_connection_reset_task_2021-02-17T20_48_22.116360+00_00_1.log

I'll try to run this in cluster now.


LamaAni avatar LamaAni commented on September 24, 2024

Interesting. Are you using a Kubernetes config file?
Also, can you try Docker Desktop?
Finally, would you mind opening a PR with your test code under experimental? I'll add you as a collaborator if you wish.


odesenfans avatar odesenfans commented on September 24, 2024

Yes sure, I'll contribute the code.

What do you mean by "try Docker Desktop"? I am already using it, with compose on top to manage my containers (DB + scheduler + webserver).

I'm using a Kubernetes config file; its contents are generated by minikube on init:

apiVersion: v1
clusters:
- cluster:
    certificate-authority: /home/airflow/.minikube/ca.crt
    server: https://192.168.64.2:8443
  name: minikube
contexts:
- context:
    cluster: minikube
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /home/airflow/.minikube/profiles/minikube/client.crt
    client-key: /home/airflow/.minikube/profiles/minikube/client.key


odesenfans avatar odesenfans commented on September 24, 2024

Could you give me a bit more details about what you expect in experimental/ ? Just the DAG code + YAML file, or more of a test like test_job_runner.py ?


LamaAni avatar LamaAni commented on September 24, 2024

Docker desktop: https://www.docker.com/products/docker-desktop

Can you try in-cluster instead? Maybe the configuration loader is getting a timeout when refreshing the token.


LamaAni avatar LamaAni commented on September 24, 2024

Hi,

Any update on this?


odesenfans avatar odesenfans commented on September 24, 2024

Hi, I first worked on fixing my setup and testing the deployment in Kubernetes. The issue is definitely related to Docker, as everything now works fine in-cluster. I could not find a clear issue in the Docker bug tracker, but a few posts on Stack Overflow point in that direction. I'll still upload the code and add a note in the README to document this issue; maybe it will help other users.


LamaAni avatar LamaAni commented on September 24, 2024

Hi, nice!

Can we close this issue then?
