
tsp_xeonphi's People

Contributors

dhavals2912, ltc-india, parthsl, sourabhjains

tsp_xeonphi's Issues

Asymmetric per-core task placement drops performance

Running the algorithm with multiple OpenMP threads scales performance linearly with the number of threads in SMT=1 mode. Hence, on Firestone with 20 cores, 20 threads gives better performance than fewer threads.
In SMT=4 or SMT=8 mode this scaling is not linear: adding one more thread gives poorer performance than 20 threads.
Using a thread count that is a multiple of the number of cores gives better results.
screenshot from 2018-02-13 14-31-47

With 21 threads, task placement becomes asymmetric, so the faster threads must stall while the slower threads synchronise. This drops the performance of the overall execution.
Cores 8 and 9 run the slower threads, making the other threads stall at synchronisation.
screenshot from 2018-02-13 14-30-43


When spawning 21 threads, OpenMP distributes tasks equally among all threads. With 20 cores and SMT=8, exactly one core gets 2 threads running in parallel, with double the tasks of the other cores. This slows that core down.
Performance should be recovered by giving each core an equal share of tasks and using all SMT threads so that instructions exploit all computation units in parallel.

xlC CPU pinning conflicts with the OpenMP environment

OpenMP provides the environment variable OMP_PROC_BIND to control the thread-affinity policy. Setting it to spread distributes the threads uniformly across all the chips.
But with xlC, a binary tasksetted to a CPU disobeys the mask and always remains pinned to cpu0.


An xlC-compiled binary stays pinned to cpu0 even when tasksetted to cpu6.
Note: the images number CPUs from 1, not from 0.
screenshot from 2018-02-12 14-44-52
xlC-compiled binary tasksetted to cpu6 with OMP_PROC_BIND unset
screenshot from 2018-02-12 14-44-22

This leads to the following issue:
With 20 spawned threads, one thread still runs on CPU0 even when tasksetted to use only CPUs 8-159.
These threads are spread uniformly across both chips.
screenshot from 2018-02-12 14-45-58
With OMP_PROC_BIND set, the 20 spawned threads are placed irregularly, degrading performance.
screenshot from 2018-02-12 14-46-21

Whereas GCC honours the taskset exactly, with or without the OMP_PROC_BIND variable set.
GCC tasksetting to CPUs 8-160
screenshot from 2018-02-12 14-47-22
GCC tasksetting to CPUs 8-160
screenshot from 2018-02-12 14-46-59

Is there any takeaway from xlC for GCC, such as a compiler flag that improves GCC's performance, or is this an xlC-specific benefit?

Required performance counter for selecting the best Hyper-Threading (HT) mode on Intel Xeon Phi

The algorithm gives its best result in SMT-4 mode on POWER8, as shown in the image below. This is reflected in the front-end stall cycles reported by the perf tool, which are lowest in SMT-4 mode.


smt

The same holds for Intel: HT-2 mode is best for the proposed algorithm, showing the lowest execution time among all modes.
ht_modes

Intel Xeon Phi does not support a front-end stall-cycles counter. Is there any performance counter on the Intel Xeon Phi 2 (Knights Landing) architecture that can indicate HT-2 is best in this case? Or is there another way to show that HT-2 is best?

GCC gives up the CPU during synchronisation whereas xlC uses spin locks

When running a multi-threaded application with a uniformly balanced load, the faster threads must wait for the slower threads to synchronise. This synchronisation is handled differently by gcc- and xlC-compiled binaries.

The GCC-compiled binary gives up its CPUs during synchronisation and critical regions.
screenshot from 2018-02-12 15-17-36
Here CPUs 48 and 55 are the slower ones, so the others do not reach 100% CPU utilisation.
screenshot from 2018-02-12 15-22-39

The xlC-compiled binary uses spin locks for synchronisation.
screenshot from 2018-02-12 15-19-12
Here CPUs 8 and 9 are the slower ones, so the others show more spinlock instruction hits.
screenshot from 2018-02-12 15-28-59

GCC gives up the CPU and context-switches, which makes performance poorer.
Is there any flag in GCC that can reduce this context switching during synchronisation?

Optimal Working Set indicates the threshold where cache misses increase with working-set size

Finding the optimal working set helps decide the largest workload that fully fits in the cache. Larger workloads may not fit, which increases cache misses and hence execution time. This might allow configuring the architecture to optimise for a specific workload.


The graph below shows the ratio of cache misses to cache references (in percent) versus the size of the input workload. The cache misses also depend on the number of iterations the workload runs for: each iteration loads the same data, so more iterations mean a higher cache-miss rate.
plot 18

From the graph, there is a sudden increase in the cache-miss rate between input sizes of 7000 and 11000. A working set of size 7000 fits properly in the cache, whereas larger sizes drive the cache-miss rate up.

Cache-miss rate shows unexpected results

The task is to find the input size at which the cache starts thrashing. The graph plots the cache-miss rate: the ratio of cache misses to cache references, aggregated over all running threads.
The POWER8 graph shows the expected result, a big jump from 7397 to 11849, indicating the data no longer fits in the cache at 11849.
wss

Question 1: Why does the cache-miss rate decrease with further increases in input size?
Question 2: On Intel Xeon Phi, the cache-miss rate decreases from 7397 to 11849. How is that possible?

Note: On Intel Xeon Phi the binary takes 6 s for input 7397 and 21 s for 11849, while POWER8 takes roughly the same times for these runs with the same accuracy.

CUDA implementation of 2-opt shows wrong output

The CUDA code for the 2-opt method has a float-to-integer conversion issue, because of which it gives an incorrect tour length. The fix requires changing certain int variables to float.
