Comments (13)
🤖
Based on the changes made between Flyte 1.10.0 and Flyte 1.10.7, it seems like there has been a shift in how ephemeral storage is handled. Specifically, the term Storage
was replaced with Ephemeral Storage
in the Helm Chart as per this pull request. Additionally, the option to use storage as a task resource was removed as per this pull request.
These changes could potentially be the reason why your ephemeral storage request is not being respected. However, without more information, it's hard to say for certain.
To help troubleshoot this issue, could you please provide the following information:
- The version of the Flytekit you're using.
- The configuration of your Flyte deployment, specifically the settings related to resource allocation and limits.
- The logs from the Flyte Propeller, which is the component responsible for translating Flyte tasks into Kubernetes pods.
This information will help us understand the context in which the task is being executed and why the ephemeral storage request might not be respected.
In the meantime, you might want to consider downgrading to Flyte 1.10.0 if that version was working as expected for your use case. Please note that this is just a workaround and not a long-term solution. We'll continue to investigate this issue to find a proper fix.
For more information about resource configuration in Flyte, you can refer to the Flyte documentation.
Sources
About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
from flyte.
Hi @Future-Outlier,
okay, I will then request help in the Slack channel then
from flyte.
Thank you for opening your first issue here! 🛠
from flyte.
It's fixed here.
#4929
Thank you.
from flyte.
Hi @Future-Outlier,
thanks for your quick response. I am addressing my response to the questions asked by the bot to you.
- We were using Flyte 1.10.7
- I am not allowed to send the complete configuration, but I can tell, we have no resource defaults set on ephemeral storage
{ "insertId": "1qgk89aismv5o2by", "jsonPayload": { "json": { "node": "n0", "res_ver": "254879967", "tasktype": "helix-task", "src": "container_helper.go:329", "wf": "ndas:dev:msp_export_workflow.msp_export_specific_session_workflow", "routine": "worker-24", "exec_id": "fda883097f3dd44f992e", "ns": "ndas-dev" }, "msg": "Adjusted container resources [{map[cpu:{{2 0} {<nil>} 2 DecimalSI} ephemeral-storage:{{20971520 0} {<nil>} 20Mi BinarySI} memory:{{2 9} {<nil>} 2G DecimalSI}] map[cpu:{{2 0} {<nil>} 2 DecimalSI} ephemeral-storage:{{20971520 0} {<nil>} 20Mi BinarySI} memory:{{2 9} {<nil>} 2G DecimalSI}] []}]", "level": "info", "ts": "2024-02-27T13:59:37Z" }, "resource": { "type": "k8s_container", "labels": { "container_name": "flytepropeller", "project_id": "mb-adas-workflow-p-56e1", "pod_name": "flytepropeller-67ff8fdc9f-nbtbs", "cluster_name": "mbadas-prod-gke-workflow", "location": "europe-west4", "namespace_name": "flyte" } }, "timestamp": "2024-02-27T13:59:37.057502579Z", "severity": "INFO", "labels": { "k8s-pod/app_kubernetes_io/managed-by": "Helm", "k8s-pod/app_kubernetes_io/name": "flytepropeller", "k8s-pod/pod-template-hash": "67ff8fdc9f", "k8s-pod/helm_sh/chart": "flyte-core-v1.10.7", "k8s-pod/app_kubernetes_io/instance": "flyte-core", "compute.googleapis.com/resource_name": "gke-mbadas-prod-gke--mbadas-prod-gke--d5166da6-59c5" }, "logName": "projects/mb-adas-workflow-p-56e1/logs/stderr", "receiveTimestamp": "2024-02-27T13:59:39.647297684Z" }
{ "insertId": "a56xdll71u0g5c71", "jsonPayload": { "msg": "The resource requirement for creating Pod [ndas-dev/fda883097f3dd44f992e-n0-6] is [[{[memory]: [2G]} {[cpu]: [2]} {[ephemeral-storage]: [20Mi]}]]\n", "level": "info", "json": { "tasktype": "helix-task", "routine": "worker-24", "res_ver": "254879967", "src": "plugin_manager.go:179", "node": "n0", "exec_id": "fda883097f3dd44f992e", "ns": "ndas-dev", "wf": "ndas:dev:msp_export_workflow.msp_export_specific_session_workflow" }, "ts": "2024-02-27T13:59:37Z" }, "resource": { "type": "k8s_container", "labels": { "location": "europe-west4", "cluster_name": "mbadas-prod-gke-workflow", "pod_name": "flytepropeller-67ff8fdc9f-nbtbs", "namespace_name": "flyte", "project_id": "mb-adas-workflow-p-56e1", "container_name": "flytepropeller" } }, "timestamp": "2024-02-27T13:59:37.058019995Z", "severity": "INFO", "labels": { "k8s-pod/app_kubernetes_io/instance": "flyte-core", "k8s-pod/helm_sh/chart": "flyte-core-v1.10.7", "compute.googleapis.com/resource_name": "gke-mbadas-prod-gke--mbadas-prod-gke--d5166da6-59c5", "k8s-pod/app_kubernetes_io/name": "flytepropeller", "k8s-pod/pod-template-hash": "67ff8fdc9f", "k8s-pod/app_kubernetes_io/managed-by": "Helm" }, "logName": "projects/mb-adas-workflow-p-56e1/logs/stderr", "receiveTimestamp": "2024-02-27T13:59:39.647297684Z" }
We already downgraded and since then everything works fine again. But is is sad, that we cannot update to a more recent version.
Looking at the piece of code, from which the logs are generated, I wonder why the resource requests from our task are overwritten at all. That should not happen and I am concerned that it will still happen even with #4929
from flyte.
If you set the configuration with ephemeral storage no limit, did you restart the flyte cluster deployment?
from flyte.
@Future-Outlier, I maybe do not understand your last statement correctly, but what we did was to limit the ephemeral storage use in a task decorator to 100 GB, then starting the workflow with pyflyte. In the end the limit of the Kubernetes container was set to only 20 MiB.
It is not an option for us to fully lift the limit on the ephemeral storage.
from flyte.
@robert-ulbrich-mercedes-benz, I mean that did you
- edit the
task resource
field in the configmap - restart you cluster with command like
kubectl rollout restart deployment flyte-sandbox -n flyte
?
from flyte.
Hi @Future-Outlier ,
it is hard for me to understand what your actual point is. At least for me it would make things easier if you could please provide a few more sentences about your intentions.
We deployed Flyte using the Flyte Core Helm Chart, so we do not have a flyte-sandbox deployment that I could restart. Which exact config map are your referring to? There are quite a lot of them for the Flyte Core helm deployment. I could find the flyte-admin-base-config
which has a key ``task_resource_defaults.yaml`:
task_resource_defaults.yaml:
----
task_resources:
defaults:
cpu: 500m
ephemeralStorage: 500Mi
memory: 500Mi
storage: 500Mi
limits:
cpu: 128
ephemeralStorage: 20Mi
gpu: 16
memory: 2000Gi
storage: 10000Gi
This is the config on Flyte v1.10.7 in our dev environment. The question again is: Why is our config provided in the task overwritten by the defaults? It should rather be the other way around in my eyes
from flyte.
Hi, @robert-ulbrich-mercedes-benz sorry for the misunderstanding.
I didn't recognize that you are Flyte Core helm deployment.
I haven't used this before, so can't respond to you immediately.
Can you come to the Slack channel and discuss this with us?
Someone know how to fix it, thank you
from flyte.
@robert-ulbrich-mercedes-benz , I was not able to repro the issue using the example in the description, only if I used pod templates. For example, let's say a task f
is defined like so:
@task(
task_config=Pod(
pod_spec=V1PodSpec(
containers=[
V1Container(
name="primary",
resources=V1ResourceRequirements(
requests={"ephemeral-storage": "1Gi"},
limits={"ephemeral-storage": "1Gi"},
),
),
],
),
),
)
def f():
pass
If the configured default ephemeral storage is set to any value, then that's what's used (since that's what passes the validation).
from flyte.
Hi @eapolinario,
we also have default pod templates configured for our workflows. But those pod templates do not configure default ephemeral storage. But as mentioned in this ticket, there are task resource defaults.
As mentioned we can easily reproduce the issue.
Best regards
Rob Ulbrich
from flyte.
@robert-ulbrich-mercedes-benz , just to be clear, we have an outstanding bug in Flyte that essentially does not validate ephemeral storage values in pod templates. The moment we added a default value for ephemeral storage as a task resource default, that value was the one used by the task resource validation, regardless of the original value defined in the pod template.
Just to be crystal clear, let's say we have this task:
@task(
task_config=Pod(
pod_spec=V1PodSpec(
containers=[
V1Container(
name="primary",
resources=V1ResourceRequirements(
requests={"ephemeral-storage": "1Gi"},
limits={"ephemeral-storage": "1Gi"},
),
),
],
),
),
)
def f():
...
This is the values as shown in the registered task template:
And task resource defaults are defined as such:
task_resources:
defaults:
ephemeralStorage: 500Mi
limits:
ephemeralStorage: 20Mi
Notice that we're just using the (non-sensical) values defined in 1.10.7.
Upon running that task f
we see the following resources in the pod:
resources:
limits:
cpu: "2"
ephemeral-storage: 20Mi
memory: 200Mi
requests:
cpu: "2"
ephemeral-storage: 20Mi
memory: 200Mi
The values for cpu
and memory
are coming from the defaults.
This bug is being fixed by #5019. After that's released we'll see an error during registration if the values defined in the pod template do not pass task validation. FYI: we are planning a release, 1.11.0, for next week.
from flyte.
Related Issues (20)
- [BUG] `PanderaTransformer::to_python_value()` seems to be returning an incorrect type HOT 2
- [BUG] flytectl upgrade is broken after moving to the monorepo HOT 2
- [BUG] Pin fsspec<2024.5.0 HOT 2
- [BUG] Namespace creation fails with default pod template HOT 5
- [BUG] flytectl demo start fails with "Error: malformed version" HOT 2
- [Docs] Clarify PodTemplate restrictions and behavior HOT 2
- [Docs] Prevent using mutable default arguments in flytesnacks HOT 1
- [Core feature] Replace `os.path` with `pathlib` for flytekit HOT 1
- Obfuscate sensitive data in TaskConfig HOT 4
- [BUG] Fix non thread safe token cache behavior HOT 1
- [Core feature] Flyteadmin SMPT email publisher HOT 1
- [BUG] rshift '>>' operator doesn't work properly with remoteEntities HOT 2
- [Core feature] Allow flytectl to set a targetExecutionCluster HOT 1
- [BUG] Boolean values within pydantic base model being casted to scalar value HOT 1
- [Housekeeping] Files used in `data_types_and_io.normalize_csv_file` and `data_types_and_io.download_and_normalize_csv_files` are no longer accessible HOT 6
- [Core feature] Default task resource behavior should apply for node level overrides HOT 3
- [Core feature] Update/register multiple launch plans with different inputs HOT 1
- [BUG] (Kubeflow) PyTorchPlugin sets Replicas to 0 casuing infinite loop HOT 3
- [BUG] regression: envFrom provided from pod template is discarded HOT 1
- [BUG] Relaunch workflow converts large numbers in array of structs to objects. HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from flyte.