Comments (19)

hosunhc commented on May 18, 2024

run.log
The workload agenda is here:

workloads:
- name: stress-ng
  iterations: 5
  params:
    cleanup_assets: true
    duration: 10
    extra_args: '--cpu-method gcd --taskset 5,7 -l 100'
    stressor: cpu
    threads: 2
    uninstall: false
  runtime_parameters:
    A55_frequency: 1328000
    A76_frequency: 1328000
    X1_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 0
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 0
      /sys/devices/system/cpu/cpu7/online: 1

from workload-automation.

hosunhc commented on May 18, 2024

Yep, I've been using your branch rather than the upstream implementation.

marcbonnici commented on May 18, 2024

Hi, thanks for reporting this. That should not be the case, so it sounds like we might have a bug somewhere.

As a workaround, could you try explicitly specifying the frequency of each enabled core and see if that allows you to make progress?

e.g.

cpu2_frequency: 1328000
cpu5_frequency: 1328000
cpu6_frequency: 1745000

hosunhc commented on May 18, 2024

Thanks for the quick response. Still does not seem to work:


workloads:
- name: stress-ng
  iterations: 10
  params:
    cleanup_assets: true
    duration: 10
    extra_args: '--cpu-method callfunc --taskset 6,7 -l 100'
    stressor: cpu
    threads: 2
    uninstall: false
  runtime_parameters:
    # A55_frequency: 1328000
    # A76_frequency: 1328000
    # X1_frequency: 1745000
    cpu2_frequency: 1328000
    cpu5_frequency: 1328000
    cpu6_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 1

With the output as below:

INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. retrying...
INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. retrying...
INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. Max retries exceeded.

marcbonnici commented on May 18, 2024

Hmm, I see. It seems this is happening because WA resolves the cluster to its first CPU without checking whether that CPU is online, when it should pick the first "online" CPU in the cluster.
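
The difference between the two resolution strategies can be sketched as below. This is only an illustration, not WA's actual code: the function name `first_online_cpu` is hypothetical, and the cluster layout (CPUs 4 and 5 in one cluster, with CPU 4 offline) is an assumption chosen to match the "CPU4" error in the log above.

```python
from typing import Iterable, Optional, Set


def first_online_cpu(cluster_cpus: Iterable[int], online_cpus: Set[int]) -> Optional[int]:
    """Return the lowest-numbered CPU of the cluster that is online, or None."""
    for cpu in sorted(cluster_cpus):
        if cpu in online_cpus:
            return cpu
    return None


# Assumed layout for illustration: one cluster made of CPUs 4 and 5,
# with CPU 4 hotplugged out and CPUs 0, 5 and 7 online.
cluster = [4, 5]
online = {0, 5, 7}

naive_choice = cluster[0]                            # always the first CPU, online or not
checked_choice = first_online_cpu(cluster, online)   # skips offline CPUs
```

With this input the naive choice lands on the offline CPU 4, while the checked version falls through to CPU 5. Returning `None` when nothing in the cluster is online lets the caller decide how to report that case.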

If you don't need particular CPUs online, only a given number per cluster, one potential workaround is to online the first CPU of each cluster, which should allow WA's resolution to function as intended.
E.g. for your first example:

    sysfile_values:
      /sys/devices/system/cpu/cpu0/online: 1
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 1
      /sys/devices/system/cpu/cpu5/online: 0
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 0

hosunhc commented on May 18, 2024

Ah, I see. I was hoping that wasn't the case, as I would prefer the flexibility of choosing particular CPUs.

marcbonnici commented on May 18, 2024

I think I've found the problem (and a few others in the process). Would you be able to try out this [1] branch on your setup and let me know if this resolves the issue for you?

[1] https://github.com/marcbonnici/workload-automation/tree/cpu_domain_fix

hosunhc commented on May 18, 2024

Okay, so I switched branches, used setup.py, and followed the installation with:

cd workload-automation
sudo -H python setup.py install

The reported version is 3.4.0.dev1+7c432d74, but the issue still seems to occur.

marcbonnici commented on May 18, 2024

Hmm, thanks for trying that out.
Do you have your run.log available, so we can see whether there are any further hints in there?

marcbonnici commented on May 18, 2024

Thanks, would you be able to pull my branch again and see if this resolves this problem for you?

hosunhc commented on May 18, 2024

Still seems to be happening.
run.log
Also in case you need the agenda:
stressng_w_10iter.txt

scojac01 commented on May 18, 2024

Hi hosunhc, what happens if you try to explicitly set the frequency for each online CPU, rather than the cluster frequency?

e.g.

  runtime_parameters:
    cpu0_frequency: 1328000
    cpu5_frequency: 1328000
    cpu7_frequency: 1745000
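
If the set of online CPUs changes often, the per-CPU parameters and the matching hotplug entries could be generated rather than written by hand. The sketch below is an assumption-laden illustration: the helper name `build_runtime_parameters` is hypothetical, and it assumes that CPUs 1 to 7 are the ones controlled through `sysfile_values`, as in the agendas earlier in this thread. It only builds a plain dictionary shaped like the `runtime_parameters` section shown above.

```python
from typing import Dict, Iterable


def build_runtime_parameters(frequencies: Dict[int, int],
                             hotplug_cpus: Iterable[int]) -> dict:
    """Build per-CPU frequency entries plus matching online/offline sysfile entries.

    A CPU is kept online only if a frequency was requested for it.
    """
    params: dict = {
        f"cpu{cpu}_frequency": freq for cpu, freq in sorted(frequencies.items())
    }
    params["sysfile_values"] = {
        f"/sys/devices/system/cpu/cpu{cpu}/online": int(cpu in frequencies)
        for cpu in hotplug_cpus
    }
    return params


# Frequencies taken from the example above; CPUs 1-7 are assumed to be
# the hotpluggable ones, mirroring the agendas posted earlier.
runtime_parameters = build_runtime_parameters(
    {0: 1328000, 5: 1328000, 7: 1745000},
    hotplug_cpus=range(1, 8),
)
```

The resulting dictionary could then be dumped to YAML and pasted into the agenda, which keeps the frequency entries and the hotplug entries consistent with each other.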

marcbonnici commented on May 18, 2024

Right, it looks like the next issue here is that WA queries the device at the time it validates the input parameters, and the device state can change before the parameters are committed to the device.

At that point the A76 cluster (for example) resolves to both CPUs 4 and 5 (if both are online at that time), so WA picks the first CPU. The error is generated later because, as part of the sysfile setting, that CPU is turned off before WA can actually commit the frequency.

I think Scott's workaround should work, as it does not rely on this resolution. However, I've also updated my branch again to change the order in which the sysfile runtime parameters are set on the device, so that any frequency configuration happens before we offline CPUs. Would you be able to check whether this one gets things working for you?
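
The reordering idea can be illustrated with a small sketch. It is not the code from the branch: the `(kind, target, value)` tuples and the `order_changes` helper are hypothetical, and the only point is that frequency changes are committed before any sysfile write that could take a CPU offline.

```python
from typing import List, Tuple

# Each pending change is (kind, target, value); the "frequency" and
# "sysfile" labels are hypothetical names used only for this sketch.
Change = Tuple[str, str, int]


def order_changes(changes: List[Change]) -> List[Change]:
    """Place frequency changes before sysfile writes, keeping relative order otherwise."""
    priority = {"frequency": 0, "sysfile": 1}
    # sorted() is stable, so changes of the same kind keep their original order.
    return sorted(changes, key=lambda change: priority[change[0]])


pending = [
    ("sysfile", "/sys/devices/system/cpu/cpu4/online", 0),
    ("frequency", "cpu4", 1328000),
]
ordered = order_changes(pending)
```

Here the frequency for CPU 4 is applied while that CPU is still online, and only afterwards is it hotplugged out.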

hosunhc commented on May 18, 2024

So I tried both Scott's method and the normal cluster method, and they both work great! There was one instance using the A76 method where the first iteration ran fine but the remaining four iterations hit the same CPU issue, but this only happened once. If that error persists, I'll open a new issue, but at the moment I think it's fixed! Thanks!

marcbonnici commented on May 18, 2024

Thanks for confirming, I'm glad we finally have a working setup for you.

I think I might know what could cause the issue with the cluster approach, but I would need to look into it further, so I'll keep this issue open for now as well.

hosunhc commented on May 18, 2024

So it seems that this could be a more persistent issue.
I attached the run log below:
run.log

marcbonnici commented on May 18, 2024

I think the issue here is the combination of cluster names with hotplugging and multiple iterations. The resolution of the CPUs is still performed at the start of the run, so when WA tries to configure the device on subsequent iterations we run into the same problem.

Does using the cpuX_frequency notation still work here?
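
The stale-resolution problem described above can be reproduced in a few lines. This is a simplified model, not WA's implementation: the `CLUSTERS` table and the `resolve` helper are hypothetical, and the A76 layout (CPUs 4 and 5) follows the example given earlier in the thread.

```python
from typing import Dict, List, Optional, Set

# Assumed cluster layout, following the A76 example earlier in the thread.
CLUSTERS: Dict[str, List[int]] = {"A76": [4, 5]}


def resolve(cluster: str, online: Set[int]) -> Optional[int]:
    """Return the first CPU of the cluster that is currently online, or None."""
    return next((cpu for cpu in CLUSTERS[cluster] if cpu in online), None)


online = {0, 4, 5, 7}
cached = resolve("A76", online)   # resolved once, at the start of the run

online.discard(4)                 # the first iteration's sysfile_values offline cpu4

stale_ok = cached in online       # the cached CPU is no longer usable
fresh = resolve("A76", online)    # re-resolving per iteration picks a live CPU
```

The cached result still points at CPU 4 after it has been taken offline, whereas resolving again before each iteration yields CPU 5. The `cpuX_frequency` notation sidesteps the problem because it involves no cluster-to-CPU resolution at all.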

hosunhc commented on May 18, 2024

Yep, using cpuX_frequency works great.

marcbonnici commented on May 18, 2024

OK, thanks for confirming. It looks like making the cluster parameters work in combination with hotplugging would require some more invasive changes to the runtime parameter mechanism.

Just to double check, are you still using my topic branch to get things working on your end rather than the upstream implementation? If so I'll look at merging those changes so we at least have a workable solution upstream as well.
