
gam's People

Contributors

cac2003, guowentian, ooibc88


gam's Issues

Abnormal memory access latency when using multiple servers

Hi @ooibc88 @cac2003 @guowentian

I ran some performance benchmarks on GAM that yield unexpected latency numbers when I increase the number of servers, and I was hoping to get some insights from you regarding them. Below are details on the experimental setup, methodology and results.

Experiment setup:

  1. Two servers VM1 and VM2 with 512MB of local memory, and all memory used as cache.
  2. One server VM3 with all available DRAM used as local memory (~10GB), and no cache.

Therefore VM1 and VM2 fetch data from VM3, and keep it in their local cache.

Method:

I replayed several memory traces captured from different applications against GAM under the two scenarios listed below, and recorded the execution time for each. The memory footprint of the application (~1GB) is larger than the local cache size (512MB), so there are evictions along with invalidations. All memory accesses are 1 byte.

Scenario 1: Replay the memory traces for 10 threads on VM1, keeping VM2 idle.
Scenario 2: Replay the memory traces for 10 threads on VM1 and 10 threads on VM2; this means that there are invalidations between the VMs due to shared memory accesses.
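
For concreteness, the replay loop in both scenarios looks roughly like the sketch below. The GAlloc handle, GAddr type, and the Read/Write calls follow GAM's published API (their exact signatures may differ in the current tree); the TraceOp format and ReplayThread helper are my own.

    #include <cstdint>
    #include <vector>

    #include "gallocator.h"  // GAM's allocator interface (assumed header path)

    // Hypothetical trace entry: every access touches 1 byte at some offset
    // into a ~1GB region obtained from alloc->Malloc(...).
    struct TraceOp {
      bool is_read;
      uint64_t offset;
    };

    void ReplayThread(GAlloc* alloc, GAddr base,
                      const std::vector<TraceOp>& ops) {
      char byte = 0;
      for (const TraceOp& op : ops) {
        if (op.is_read)
          alloc->Read(base + op.offset, &byte, 1);   // may miss the cache and fetch from VM3
        else
          alloc->Write(base + op.offset, &byte, 1);  // asynchronous under PSO
      }
    }

Ten instances of this loop run as threads on each active VM.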

Results:

I expected Scenario 2 to be slower due to more invalidations between VM1 and VM2, but found Scenario 2 was actually faster than Scenario 1.

To understand the results better, I profiled the memory access latency in GAM, separating the latency for local and remote memory accesses (as shown in the table below; only measured for read operations, since write operations are always asynchronous under the PSO model).

             Local access latency (us)   Remote access latency (us)
Scenario 1   2.2                         299
Scenario 2   1.4                         84
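
The latencies above were gathered by timestamping each read on the client side, roughly as in the sketch below (std::chrono-based; GAlloc/GAddr as in the earlier sketch, with signatures unverified; how a sample is attributed to local vs. remote is not shown here, since that requires knowing whether the address hit the local cache):

    #include <chrono>
    #include <cstddef>

    // Times a single GAM read and returns the latency in microseconds.
    double TimedReadMicros(GAlloc* alloc, GAddr addr, void* buf, size_t len) {
      auto t0 = std::chrono::steady_clock::now();
      alloc->Read(addr, buf, len);
      auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double, std::micro>(t1 - t0).count();
    }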

Even though there are invalidations in Scenario 2, the remote access latency is lower in Scenario 2 than in Scenario 1. There is also a slight speed-up in local memory accesses in Scenario 2.

Despite extensive profiling, I was unable to explain this strange behavior. Is this expected, and if so, why? Thank you for taking the time to read this issue; I would really appreciate any help!

Unable to allocate hash table

Thanks for providing your source code online for reproduction!
I tried to run the benchmark script (scripts/benchmark-all.sh), but unfortunately it reports that it is unable to allocate the hash table: [worker.cc:92-Worker()] Unable to allocate hash table!!!!
It looks like htable = sb.sb_aligned_malloc(NBKT * BKT_SIZE, BKT_SIZE); does not succeed, but I do not know why. I tried some dirty fixes, but finally I gave up.
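From the error message, the log fires when the aligned allocation returns null. A minimal reconstruction of the failing path (only the quoted call is from the source; the surrounding code and the explanation are my assumptions):

    // Around worker.cc:92 (paraphrased, not the actual source):
    void* htable = sb.sb_aligned_malloc(NBKT * BKT_SIZE, BKT_SIZE);
    if (htable == nullptr) {
      // sb_aligned_malloc appears to carve the table out of the worker's
      // pre-registered memory region, so a null return usually means the
      // configured local memory size is smaller than NBKT * BKT_SIZE plus
      // alignment padding. Increasing the memory size given to the worker
      // is the first thing to check.
    }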
Do you know what I should change? Thanks in advance!

Failed to run benchmark-all.sh

Hi @cac2003 @guowentian @ooibc88

Since an InfiniBand network is not available to us, I adapted GAM to run on RoCE (thanks @charles-typ).
However, we have some problems when running ./scripts/benchmark-all.sh.
Experiment Setup (3 VMs):

  • GCC: 10.3.1
  • Kernel: 5.10.0
  • remote_ratio > 0

Here is the output (including some customized logs):

[10744] 03 Feb 16:46:12.129 - [benchmark.cc:658-main()] #Node ID = 1
[6344] 03 Feb 16:46:13.088 - [benchmark.cc:658-main()] #Node ID = 2
[5008] 03 Feb 16:46:14.101 - [benchmark.cc:658-main()] #Node ID = 3
cannot find the key(2) for hash table widCliMap (key not found in table)
cannot find the key(1) for hash table widCliMap (key not found in table)
[5008] 03 Feb 16:46:17.101 - [benchmark.cc:668-main()] Get 1 on node 3
[5008] 03 Feb 16:46:17.101 - [benchmark.cc:668-main()] Get 2 on node 3
[5008] 03 Feb 16:46:17.101 - [benchmark.cc:668-main()] Get 3 on node 3
[5008] 03 Feb 16:46:17.101 - [benchmark.cc:671-main()] ###All workers started, reported by node 3###
[5012] 03 Feb 16:46:17.102 - [benchmark.cc:144-Init()] start init
cannot find the key(2) for hash table widCliMap (key not found in table)
[10746] 03 Feb 16:46:17.101 - [master.cc:226-ProcessRequest()] unrecognized work request 1
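
The interleaved "cannot find the key" messages look like lookups into widCliMap racing with node registration at startup. A defensive sketch, assuming widCliMap behaves like a std::unordered_map from worker ID to client connection (the Client type and GetClientBlocking helper are hypothetical, and real code would also need whatever lock guards the map):

    #include <chrono>
    #include <thread>
    #include <unordered_map>

    struct Client;  // stands in for GAM's client-connection type
    extern std::unordered_map<int, Client*> widCliMap;

    Client* GetClientBlocking(int wid) {
      for (;;) {
        auto it = widCliMap.find(wid);  // non-throwing lookup
        if (it != widCliMap.end()) return it->second;
        // The peer may simply not have registered yet, so retry briefly
        // instead of failing hard on a missing key.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
      }
    }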

Does anyone have experience with this?
Thanks!!!

Adding MFence to enforce SC consistency doesn't work as expected

Hi @ooibc88 @cac2003 @guowentian

I have been trying to understand the impact of stronger consistency guarantees on application performance in GAM. To this end, I tried to enforce SC consistency by adding an MFence operation after each write (as suggested in Section 4 of the paper: "For example, sequential consistency can be easily achieved by inserting MFence following each Write operation."). Below are details on the experimental setup, methodology and results.

Experiment setup:

  1. Two servers VM1 and VM2 with 512MB of local memory, and all memory used as cache.
  2. One server VM3 with all available DRAM used as local memory (~10GB), and no cache.

Therefore VM1 and VM2 fetch data from VM3 and keep it in their local cache.

Method:

I replayed several memory traces captured from different applications against GAM under the two scenarios listed below, and recorded the execution time for each. The memory footprint of the application (~1GB) is larger than the local cache size (512MB), so there are evictions along with invalidations. All memory accesses are 1 byte.

Scenario 1: Run an application with 10 threads on VM1 under PSO consistency.
Scenario 2: Run an application with 10 threads on VM1, enforcing SC with memory fences.
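
For concreteness, the per-write change for Scenario 2 follows the paper's suggestion. A minimal sketch (the Write/MFence names follow the paper's API; their exact signatures, and the precise behavior of MFence in the code base, are exactly what I am unsure about):

    // Scenario 1 (PSO): the write returns once buffered locally and
    // completes asynchronously.
    alloc->Write(addr, &value, sizeof(value));

    // Scenario 2 (SC via fencing): an MFence after each write should order
    // it before any subsequent operation, per Section 4 of the paper.
    alloc->Write(addr, &value, sizeof(value));
    alloc->MFence();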

Result:

I expected Scenario 2 to be slower since writes can no longer be asynchronous. However, Scenario 2 was actually faster than Scenario 1 (by 5%-10%).

Questions:

Is the MFence operation fully supported in the current code base?
Are there any benchmarks in the repo that compare SC and PSO consistency?

Thank you for taking the time to read this issue; I would really appreciate any help!
