
Comments (7)

Amr-MKamal commented on May 21, 2024

Update 2: After scaling the VM's SSD again to ~4 TB and restarting the cluster, it has kept running successfully since then. However, the cluster seems to have used an extra ~700 GB so far, so the suggested 500 GB is clearly not enough for the given example.
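
A minimal sketch for checking free disk space before starting a long run, using only the Python standard library. The 700 GB default is just the extra usage reported above and is an assumption, not an official requirement; the function names are hypothetical.

```python
import shutil


def free_space_gb(path="/"):
    """Return the free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_headroom(path="/", required_gb=700.0):
    """Return True if at least `required_gb` GB are free at `path`.

    The 700 GB default is an assumption taken from the usage reported in this thread.
    """
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"Free: {free_space_gb('/'):.1f} GB, enough headroom: {has_headroom('/')}")
```

Running this against the mount point that holds the cluster data before submitting a workflow gives an early warning instead of a failure hours into the run.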

from farmvibes-ai.

Amr-MKamal commented on May 21, 2024

Update 3: The spaceeye.spaceeye.spaceeye task has finally finished, but upon completion it threw the following error and the workflow failed:
"reason": "KeyError: 'b806ef61-0164-4ef9-9cec-ccda9682ac9c'\n File \"/opt/conda/lib/python3.8/site-packages/vibe_common/messaging.py\", line 491, in accept_or_fail_event\n return success_callback(message)\n\n File \"/opt/conda/lib/python3.8/site-packages/vibe_server/orchestrator.py\", line 422, in success_callback\n self.inqueues[str(message.run_id)].put(message)\n",
After rerunning the same workflow, the task spaceeye.spaceeye.spaceeye was done and spaceeye.spaceeye.split was queued, pending nothing. So I restarted the cluster, but after the restart it now responds with HTTPError: 500 Server Error for both the API and the Python client. Stop and start also get stuck here:
image

renatolfc commented on May 21, 2024

The KeyError you saw was due to the restart of the cluster, in which the orchestrator lost the context about the existing run.
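
To illustrate the failure mode, here is a toy sketch, not the actual vibe_server code. Only the `self.inqueues[...]` dict lookup comes from the traceback; the class and method names are hypothetical. A per-run queue held only in memory disappears when the process restarts, so a late message for the old run id hits a missing key.

```python
import queue


class ToyOrchestrator:
    """Hypothetical stand-in that keeps one in-memory queue per run id."""

    def __init__(self):
        self.inqueues = {}

    def start_run(self, run_id):
        self.inqueues[run_id] = queue.Queue()

    def on_success(self, run_id, message):
        # Plain dict lookup, as in the traceback: raises KeyError for unknown runs.
        self.inqueues[run_id].put(message)

    def on_success_safe(self, run_id, message):
        # Defensive variant: report unknown runs instead of raising.
        run_queue = self.inqueues.get(run_id)
        if run_queue is None:
            return False
        run_queue.put(message)
        return True


if __name__ == "__main__":
    before = ToyOrchestrator()
    before.start_run("run-1")
    before.on_success("run-1", "op finished")  # works: the run is known

    after_restart = ToyOrchestrator()  # a fresh process has an empty dict
    try:
        after_restart.on_success("run-1", "op finished")
    except KeyError as err:
        print("KeyError for run id:", err)
```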

Can you please share the log files stored in ~/.cache/farmvibes-ai/logs? Also, if possible, please also share the output of docker logs k3d-farmvibes-ai-server-0 and df -h.

Thanks!

Amr-MKamal commented on May 21, 2024

That makes sense, because eventually everything I ran was queued, pending nothing. Which file exactly?

I will assume terravibes-orchestrator.log. It's a very large file, so if you're looking for a specific error it will be easier to search for it:

{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "INFO", "msg": "Changed op spaceeye.spaceeye.group_s2 status to pending. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,480", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "ERROR", "msg": "Marking op spaceeye.spaceeye.group_s2 as failed, but it didn't have a start time set. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,480", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "INFO", "msg": "Workflow 76fd498a-c476-492b-9735-d5e9bab49f2c changed status to failed\nReason: WORKFLOW_FAILED status propagated from workflow level", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,480", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "Starting new HTTP connection (1): 127.0.0.1:3500", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,481", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "http://127.0.0.1:3500 \"GET /v1.0/state/statestore/76fd498a-c476-492b-9735-d5e9bab49f2c?metadata.partitionKey=eywa HTTP/1.1\" 200 2967", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,482", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "Starting new HTTP connection (1): 127.0.0.1:3500", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,493", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "DEBUG", "msg": "http://127.0.0.1:3500 \"POST /v1.0/state/statestore HTTP/1.1\" 204 0", "scope": "urllib3.connectionpool", "time": "2023-02-28 09:21:15,494", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "INFO", "msg": "Changed op spaceeye.preprocess.cloud.merge status to pending. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,496", "type": "log", "ver": "dev"}
{"app_id": "terravibes-orchestrator", "instance": "terravibes-orchestrator-656cbf6fff-kjb9b", "level": "ERROR", "msg": "Marking op spaceeye.preprocess.cloud.merge as failed, but it didn't have a start time set. (run id: 76fd498a-c476-492b-9735-d5e9bab49f2c)", "scope": "vibe_server.orchestrator.WorkflowStateUpdate", "time": "2023-02-28 09:21:15,496", "type": "log", "ver": "dev"}
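
Since the log above is one JSON object per line with `level`, `time`, and `msg` fields, a short script can pull out only the ERROR entries instead of sharing the whole file. This is a minimal sketch; the helper names are hypothetical and the log path is whatever file sits in the logs directory.

```python
import json


def iter_records(lines, level="ERROR"):
    """Yield parsed log records whose "level" field equals `level`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and record.get("level") == level:
            yield record


def print_errors(path):
    """Print the timestamp and message of every ERROR record in the file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for record in iter_records(fh):
            print(record.get("time"), record.get("msg"))
```

Calling `print_errors` on the orchestrator log and redirecting the output to a file should produce something far smaller than the original log.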

And for df -h:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root       3.9T  2.7T  1.3T  70% /
tmpfs            32G  4.0K   32G   1% /dev/shm
tmpfs            13G  1.9M   13G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda15      105M  6.1M   99M   6% /boot/efi
/dev/sdb1       126G   28K  120G   1% /mnt
tmpfs           6.3G  128K  6.3G   1% /run/user/124
tmpfs           6.3G  140K  6.3G   1% /run/user/1000

This is after I destroyed the cluster and renamed the old cache file (the only way new workflows can work).
And for docker logs k3d-farmvibes-ai-server-0:

Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
I0228 22:41:04.152242       7 trace.go:205] Trace[81119723]: "GuaranteedUpdate etcd3" type:*core.ConfigMap (28-Feb-2023 22:41:03.618) (total time: 533ms):
Trace[81119723]: ---"Transaction committed" 533ms (22:41:04.152)
Trace[81119723]: [533.657546ms] [533.657546ms] END
I0228 22:41:04.152421       7 trace.go:205] Trace[956170375]: "Update" url:/api/v1/namespaces/dapr-system/configmaps/operator.dapr.io,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election,audit-id:2e418883-fb2b-443c-a6c2-6fcd08f30186,client:10.42.0.62,accept:application/json, */*,protocol:HTTP/2.0 (28-Feb-2023 22:41:03.618) (total time: 534ms):
Trace[956170375]: ---"Object stored in database" 533ms (22:41:04.152)
Trace[956170375]: [534.096151ms] [534.096151ms] END
W0228 22:41:07.369204       7 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"

Update 4: As a last try, I reran the workflow with a much smaller input region inside the given input region (only 20 km). Luckily it worked and completed successfully in 7 hours.

renatolfc commented on May 21, 2024

I think you can attach the files if you drag and drop them to the text area.

For the log files, at least the terravibes-orchestrator.log, but ideally all of them in the directory.

For the output of docker logs, you can save it to a file with docker logs k3d-farmvibes-ai-server-0 > docker-logs.txt.

Amr-MKamal commented on May 21, 2024

docker-logs.txt

I already tried that: terravibes-orchestrator.log is a 15.3 MB file, and GitHub refused it even after I changed the extension to .txt. Also, terravibes-restapi.log is around 700 MB. If these files are still important, I can try uploading them to a public cloud.
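
One option worth trying before resorting to external hosting is compressing the log, since repetitive JSON logs usually shrink a lot. The sketch below uses only the standard library; `gzip_file` is a hypothetical helper, and whether the compressed result is small enough to attach is an assumption that has to be checked.

```python
import gzip
import os
import shutil


def gzip_file(src, dst=None):
    """Compress `src` into `dst` (default: src + ".gz") and return (dst, size in bytes)."""
    dst = dst or src + ".gz"
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large logs are fine
    return dst, os.path.getsize(dst)


if __name__ == "__main__":
    path, size = gzip_file("terravibes-orchestrator.log")
    print(f"{path}: {size / 1e6:.1f} MB")
```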

Amr-MKamal commented on May 21, 2024

Update 5: As explained above, I was able to work around the error, especially by working with smaller areas. However, I still get this error when the resources already exist, which eventually requires the notebook to be run multiple times; on the next run the task completes immediately. This makes it harder to integrate with as an API:
"reason": "RuntimeError: Received unsupported message header=MessageHeader(type=<MessageType.error: 'error'>, run_id=UUID('18e9dbdc-1f9c-4ffa-bd85-4455e2a0518c'), id='00-18e9dbdc1f9c4ffabd854455e2a0518c-140e43094eee796e-01', parent_id='00-18e9dbdc1f9c4ffabd854455e2a0518c-b87c84a62262a97c-01', version='1.0', created_at=datetime.datetime(2023, 3, 8, 23, 38, 16, 110840)) content=ErrorContent(status=<OpStatusType.failed: 'failed'>, ename=\"<class 'RuntimeError'>\", evalue='Traceback (most recent call last):\\n File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/worker.py\", line 123, in run_op\\n return factory.build(spec).run(input, cache_info)\\n File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/ops.py\", line 99, in run\\n items_out = self.storage.store(run_id, stac_results, cache_info)\\n File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/storage/local_storage.py\", line 135, in store\\n raise LocalResourceExistsError(\\nvibe_agent.storage.local_storage.LocalResourceExistsError: Op output already exists in storage for download_sentinel2_from_pc with id 23bdca2b-99ce-498e-bd19-ced731cc545e.\\n', traceback=[' File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/worker.py\", line 309, in run_op_from_message\\n out = self.run_op_with_retry(content, message.run_id)\\n', ' File \"/opt/conda/lib/python3.8/site-packages/vibe_agent/worker.py\", line 402, in run_op_with_retry\\n raise RuntimeError(\"\".join(ret.format()))\\n']). Aborting execution.", "status": "failed"
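
Until the underlying issue is fixed, the manual "run it again" step could be automated on the caller's side. This is only a sketch of that workaround: `submit` is a hypothetical callable that runs the workflow and returns a dict with `status` and `reason` keys shaped like the output above, and the `"done"` status value is an assumption. It is not part of the FarmVibes.AI client.

```python
import time


def run_until_done(submit, max_attempts=3, delay_s=0.0):
    """Call `submit()` until it reports "done" or the attempts run out.

    Retries only when the failure reason mentions LocalResourceExistsError,
    the error seen in this thread; any other failure is returned immediately.
    Returns (number of attempts made, last result dict).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result = {}
    for attempt in range(1, max_attempts + 1):
        result = submit()
        if result.get("status") == "done":
            break
        if "LocalResourceExistsError" not in str(result.get("reason", "")):
            break  # unrelated failure: do not retry blindly
        if attempt < max_attempts:
            time.sleep(delay_s)
    return attempt, result
```

Limiting the retry to this one error keeps genuine failures visible instead of hiding them behind repeated submissions.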
