
in-network-hadoop-ndn-cachesimulator's Introduction

This is a caching simulator developed to evaluate the performance of caching when using the Hadoop Distributed File System (HDFS) over the Named Data Networking (NDN) information-centric networking (ICN) architecture. It replays traces of read operations collected from HDFS clusters and evaluates them in a fat-tree network.

Our results were published in the paper "On the Power of In-Network Caching in the Hadoop Distributed File System" by E. Newberry and B. Zhang, which appeared in ACM ICN 2019.

Repository Structure

  • src contains the source code of the simulator.
  • topologies contains various topologies, although only the fat-tree (k=8 and 128 nodes) topology works currently.
  • traces contains various traces, although many of the ones in the "old" subdirectory may not work with the current simulator because they omit critical information (e.g., the starting offset of the read).

Requirements

This simulator requires no dependencies beyond a modern version of Java and its standard library.

Compiling

The simulator can be compiled by running make. After modifying the code, run make clean before running make again. Please note that make clean will erase the generated "build" directory and all of its contents.

Running

NOTE: Currently, this simulator is hardcoded to use 128 end hosts and a fat tree k factor of 8. However, these values should be trivial to modify in src/CacheSim.java.

The simulator must be run from within the "build" directory generated by running make. It can be invoked in the following way:

java CacheSim [topologyFile] [traceFile] [edgeCachePolicy] [aggrCachePolicy] [coreCachePolicy] [cacheBlockSize] [cacheMaxBlocks] (randomSeed) (nameNode)

The simulator will assign hosts to random positions in the cluster, based on the provided seed (or, if none is given, the default seed chosen by the Random() constructor). The NameNode argument was added as a workaround to fit a cluster with 128 DataNodes and 1 NameNode onto a fat-tree with 128 end hosts -- if specified, any host in the trace with this ID will be added as an additional node in the first pod of the fat-tree.
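The seeded placement described above can be sketched as a deterministic shuffle: the same seed always yields the same host-to-position mapping, which makes runs reproducible. The class and method names below are illustrative, not the simulator's actual API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HostPlacement {
    // Shuffle host IDs 0..numHosts-1 into fat-tree positions using the
    // given seed; identical seeds produce identical placements.
    public static List<Integer> placeHosts(int numHosts, long seed) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < numHosts; i++) positions.add(i);
        Collections.shuffle(positions, new Random(seed));
        return positions;
    }

    public static void main(String[] args) {
        List<Integer> a = placeHosts(128, 42L);
        List<Integer> b = placeHosts(128, 42L);
        System.out.println(a.equals(b)); // same seed -> same placement
    }
}
```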

As explained in our paper, the simulator segments HDFS blocks into chunks ("cache blocks") of the size specified on the command line, starting from the beginning of the block. Read operations with the same source and destination host will be ignored and not included in the results, since they do not generate network traffic for the HDFS data they transfer. Moreover, the end location of the last cache block in an HDFS block will be rounded up, even if there is no data past that point, since there is no way to determine the actual size of the block from the input traces.
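The chunking rule above can be expressed as simple index arithmetic: a read covering [offset, offset + size) touches every cache block whose range it overlaps, and the last block's end is implicitly rounded up to a block boundary. This is a minimal sketch with hypothetical method names, not the simulator's internal code.

```java
public class CacheBlocks {
    // Index of the first cache block touched by a read at this offset.
    public static long firstBlock(long offset, long cacheBlockSize) {
        return offset / cacheBlockSize;
    }

    // Index of the last cache block touched; the block's end is rounded
    // up to a boundary, matching the behavior described in the text.
    public static long lastBlock(long offset, long size, long cacheBlockSize) {
        return (offset + size - 1) / cacheBlockSize;
    }

    public static void main(String[] args) {
        // A 128 KiB read at a 256 KiB offset with 64 KiB cache blocks
        // spans cache blocks 4 and 5.
        long first = firstBlock(262144, 65536);       // -> 4
        long last = lastBlock(262144, 131072, 65536); // -> 5
        System.out.println(first + " " + last);
    }
}
```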

The simulator will output results for each router in the topology in the following format after it completes its evaluations:

cacheBlockAccesses,cacheBlockHits,cacheBlockHitRatio,byteAccesses,byteHits,byteHitRatio

However, if there was no network read activity for a router, the simulator will output "No Activity" instead of these counts. In the results, routers are grouped by layer of the fat-tree. Additionally, the total network traffic in bytes (with traffic counted once for each link it traverses) will be output at the end.

Trace Format

Input traces to this simulator have one operation per line, with the following information written for each operation (separated by spaces):

  • Timestamp - We used this value to sort the operations chronologically before using them as input to the simulator, as the simulator will process operations in the order they appear in the file. As such, this can be any value if the exact timestamps are unavailable.
  • Operation type (READ or WRITE) - This is used to separate out read and write operations, since we are only interested in read operations.
  • Block ID (integer) - Unique HDFS block ID
  • Size (integer) - Size of the read operation (in bytes)
  • Offset (integer) - Starting offset of the read operation within the HDFS block (in bytes)
  • Source (integer) - Unique identifier of the reading client
  • Destination (integer) - Unique identifier of the DataNode storing the read block

For example, a read operation for a 128KB read starting at a 256KB offset in block 987654321 being read by host 123 from the DataNode running on host 456 could have the following format:

2019-09-19-11-44-00-00 READ 987654321 131072 262144 123 456
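A line in this format can be parsed by splitting on whitespace in the field order listed above. The record class below is a hypothetical sketch, not the simulator's own parser.

```java
public class TraceLine {
    public final String timestamp, opType;
    public final long blockId, size, offset;
    public final int source, destination;

    // Fields follow the trace format: timestamp, operation type,
    // block ID, size, offset, source host, destination host.
    public TraceLine(String line) {
        String[] f = line.trim().split("\\s+");
        timestamp = f[0];
        opType = f[1];
        blockId = Long.parseLong(f[2]);
        size = Long.parseLong(f[3]);
        offset = Long.parseLong(f[4]);
        source = Integer.parseInt(f[5]);
        destination = Integer.parseInt(f[6]);
    }

    public static void main(String[] args) {
        TraceLine t = new TraceLine(
            "2019-09-19-11-44-00-00 READ 987654321 131072 262144 123 456");
        System.out.println(t.opType + " block " + t.blockId
            + " src " + t.source + " dst " + t.destination);
    }
}
```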

Please note that our simulator does not take the block version ("generation stamp") into account. This is because none of our traces featured operations with multiple generation stamp values associated with a single block ID.

Topology Format

Originally, the topology file was used to allow the simulator to support multiple topologies. However, the latest version of the simulator has been hardcoded for the fat-tree topology. Instead, this file is used to validate the fat-tree generated at runtime to ensure that it is correct. The file we used to evaluate a fat-tree with k=8 can be found in topologies/fattree-128node-k8.txt.

Topology files have the following format:

  • Line 1: number of nodes (vertices) in the topology
  • Line 2: number of links (edges) in the topology
  • Lines 3+: source and destination pairs of nodes indexed from zero (separated by a space), one per line - These links are bidirectional, so one does not need to specify both "4 5" and "5 4".
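A reader for this format can be sketched as follows: consume the node and link counts, then one "src dst" pair per link, validating that each index is in range. The class and method names are illustrative, not the simulator's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TopologyFile {
    // Parse the topology format: node count, link count, then one
    // space-separated "src dst" pair per line. Links are bidirectional,
    // so each pair is stored once.
    public static List<int[]> readLinks(String contents) {
        Scanner sc = new Scanner(contents);
        int numNodes = sc.nextInt();
        int numLinks = sc.nextInt();
        List<int[]> links = new ArrayList<>();
        for (int i = 0; i < numLinks; i++) {
            int src = sc.nextInt(), dst = sc.nextInt();
            if (src < 0 || src >= numNodes || dst < 0 || dst >= numNodes)
                throw new IllegalArgumentException("node index out of range");
            links.add(new int[]{src, dst});
        }
        return links;
    }

    public static void main(String[] args) {
        List<int[]> links = readLinks("6\n2\n4 5\n0 1\n");
        System.out.println(links.size()); // 2
    }
}
```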

in-network-hadoop-ndn-cachesimulator's People

Contributors: eric135, navsing

in-network-hadoop-ndn-cachesimulator's Issues

TinyLFU?

Hi!

I was happily surprised to see that you adapted policies from my cache simulator for your research. Hopefully it's been helpful.

My evaluation led to adopting TinyLFU after making some modifications. I found it often performed as well as or better than LIRS, but it is far simpler to implement and make efficient. Implementing LIRS from the papers and original C code is more reverse engineering than understanding, though it does perform extremely well. I've been very happy with TinyLFU and contributed towards a longer paper, so I would find it interesting to see how it does in your analysis.

Also, when appropriate please send me your paper / results.

Cheers,
Ben

Missing spark-traces-aws-cluster-2017-11-22?

Hi,

I am very interested in your project and tried running the runTests.sh script, but the repository seems to be missing the spark-traces-aws-cluster-2017-11-22 directory, which appears to be expected above the root directory of this project.
It would be very helpful if you could provide that directory and its contents, or a short readme on how to run your application.

Best
