
tsp_xeonphi's People

Contributors

dhavals2912, ltc-india, parthsl, sourabhjains

tsp_xeonphi's Issues

Asymmetric per-core task placement drops performance

Running the algorithm with multiple OpenMP threads scales performance linearly with the number of threads in SMT=1 mode. Hence, on Firestone with 20 cores, 20 threads gives better performance than fewer threads.
In SMT=4 or SMT=8 mode this scaling is not linear: adding one more thread gives poorer performance than 20 threads.
Using a thread count that is a multiple of the number of cores gives better results.
screenshot from 2018-02-13 14-31-47

With 21 threads, task placement becomes asymmetric, so the faster threads must stall while the slower threads synchronise. This drops the performance of the overall execution.
Cores 8 and 9 run the slower threads, making the other threads stall at synchronisation.
screenshot from 2018-02-13 14-30-43


When spawning 21 threads, OpenMP distributes tasks equally among all threads. With 20 cores and SMT=8, exactly one core gets 2 threads running in parallel, with double the tasks of the other cores. This slows that core down.
Performance should be recovered by giving each core an equal share of tasks and using all SMT threads so that instructions exploit all computation units in parallel.

xlC CPU pinning conflicts with the OpenMP environment

OpenMP provides the environment variable OMP_PROC_BIND to control the thread-affinity policy. Setting it to spread distributes the threads uniformly across all the chips.
But with xlC, a binary tasksetted to a CPU disobeys the mask and always remains pinned to cpu0.


An xlC-compiled binary stays pinned to cpu0 even when tasksetted to cpu6.
Note: the images number CPUs from 1, not from 0.
screenshot from 2018-02-12 14-44-52
xlC-compiled binary tasksetted to cpu6 with OMP_PROC_BIND unset
screenshot from 2018-02-12 14-44-22

This leads to the following issue:
With 20 spawned threads, one thread still runs on CPU0 even when tasksetted to use only CPUs 8-159.
These threads are spread uniformly across both chips.
screenshot from 2018-02-12 14-45-58
With OMP_PROC_BIND set, the 20 spawned threads are placed irregularly, degrading performance.
screenshot from 2018-02-12 14-46-21

Whereas GCC honours the taskset exactly, with or without the OMP_PROC_BIND variable set.
GCC tasksetting to CPUs 8-160
screenshot from 2018-02-12 14-47-22
GCC tasksetting to CPUs 8-160
screenshot from 2018-02-12 14-46-59

Is there any takeaway from xlC for GCC, such as a compiler flag that improves GCC's performance, or is this an xlC-specific benefit?

Required performance counter for selecting the best Hyper-Threading (HT) mode on Intel Xeon Phi

The algorithm gives its best result in SMT-4 mode on POWER8, as shown in the image below. This is reflected in the front-end stall cycles reported by the perf tool, which are lowest in SMT-4 mode.


smt

The same holds for Intel: HT-2 mode is best for the proposed algorithm, showing the lowest execution time among all modes.
ht_modes

Intel Xeon Phi does not support a front-end stall-cycles counter. Is there any performance counter on the Intel Xeon Phi 2 (Knights Landing) architecture that can indicate HT-2 is best in this case? Or is there another way to show that HT-2 is best?

GCC gives up the CPU during synchronisation whereas xlC uses spin locks

When running a multi-threaded application with a uniformly balanced load, the faster threads must wait for the slower threads to synchronise. This synchronisation is handled differently by gcc- and xlC-compiled binaries.

The GCC-compiled binary gives up its CPUs during synchronisation and critical regions.
screenshot from 2018-02-12 15-17-36
Here CPUs 48 and 55 are the slower ones, so the others do not reach 100% CPU utilisation.
screenshot from 2018-02-12 15-22-39

The xlC-compiled binary uses spin locks for synchronisation.
screenshot from 2018-02-12 15-19-12
Here CPUs 8 and 9 are the slower ones, so the others show more spinlock instruction hits.
screenshot from 2018-02-12 15-28-59

GCC gives up the CPU and context-switches, which makes performance poorer.
Is there any flag in GCC that can reduce this context switching during synchronisation?

Optimal Working Set indicates the threshold where cache misses increase with working-set size

Finding the optimal working set helps decide the largest workload that fully fits in the cache. Larger workloads may not fit, which increases cache misses and hence execution time. This might allow configuring the architecture to optimise for a specific workload.


The graph below shows the ratio of cache misses to cache references (in percent) versus the size of the input workload. The cache misses also depend on the number of iterations the workload runs for: each iteration loads the same data, so more iterations mean a higher cache-miss rate.
plot 18

From the graph, there is a sudden increase in the cache-miss rate between input sizes of 7000 and 11000. A working set of size 7000 fits properly in the cache, whereas larger sizes drive the cache-miss rate up.

Cache-miss rate shows unexpected results

The task is to find the input size at which the cache starts thrashing. The graph plots the cache-miss rate: the ratio of cache misses to cache references, aggregated over all running threads.
The POWER8 graph shows the expected result, a big jump from 7397 to 11849, indicating the data no longer fits in the cache at 11849.
wss

Question 1: Why does the cache-miss rate decrease with further increases in input size?
Question 2: On Intel Xeon Phi, the cache-miss rate decreases from 7397 to 11849. How is that possible?

Note: On Intel Xeon Phi the binary takes 6 s for input 7397 and 21 s for 11849, while POWER8 takes roughly the same times for these runs with the same accuracy.

CUDA implementation of 2-opt shows wrong output

The CUDA code for the 2-opt method has a float-to-integer conversion issue, because of which it gives an incorrect tour length. The fix requires changing certain int variables to float.
