
datamigrationbenchmarkingtool's Introduction

Manual for using the data migration benchmarking tool

Overview

Welcome to DMBench, a versatile and scalable solution designed to streamline data migration across diverse scenarios. The framework orchestrates the migration journey from a source machine to a target machine through a robust Controller and Migration Engine, and leverages Docker containers for seamless deployment.

Issues

If you encounter any issues or have questions while using our system, please don't hesitate to submit an issue on our GitHub repository. Your contributions are crucial in improving the user experience.

Tests

Running Tests

To run tests for this project, please refer to the test.md file. The test.md file provides detailed instructions for each test scenario, including prerequisites, dependencies, and execution steps. Follow the outlined procedures to ensure the correctness and functionality of the project. If you encounter any issues or have questions related to the testing process, feel free to open an issue for assistance.

Physical Requirements

  • Data Source: Where the data journey begins. The Migration Engine grabs data from here to send it off to the target machine.
  • Data Target: The final stop for migrated data. This is where data ends up after the Migration Engine does its job, finding its new home.
  • Controller: The mastermind behind all experiments. It configures the Migration Engine for each experiment, tweaking its parameters, then kicks off and oversees the migration, keeping an eye on performance through the migration logs. It also tracks resource usage via cAdvisor and node-exporter, which run on the same machine as the Migration Engine.
  • Databases: Two key players in the framework. Prometheus, the time-series database, gathers resource data from cAdvisor and node-exporter. The second database, MongoDB, stores all performance data from the experiments, so Prometheus focuses on resource metrics while MongoDB keeps the broader performance results neatly organized.
  • Logs Reporter: The organizer of experiment logs, with two parts. The first part is a Kafka cluster that acts as a staging area for all logs: both the Controller and the Migration Engine publish their logs here, and a dedicated consumer collects them. The second part is the parser, which extracts data from the logs and makes it human-readable. Parsed information goes into JSON and CSV files before finding a permanent home in the MongoDB database. This two-part design ensures a smooth process for handling, understanding, and gaining insights from experiment logs.

Prerequisites

For easy setup and deployment, all components of the framework are packaged as Docker containers. To run the tool, you'll need five machines, each with specific dependencies:

  • Data Source Machine
  • Data Target Machine
  • Controller & Migration Engine: Docker
  • Kafka Cluster: Docker, Python
  • Databases: Docker

While it's possible to deploy everything on one machine, it's recommended to use separate machines, preferably in different locations. This setup adds a touch of realism to the migration process, accounting for potential network delays in the evaluation.

Setting up the environment

In our Git repository, you'll find a dedicated deployment folder. Inside this folder, there are distinct subfolders—databases, controller, and reporter—each designed for download onto their respective machines.

Configuration

In this section, we’ll configure each component of the framework deployed on each machine, assuming you’ve already downloaded the corresponding folders onto each machine.

Databases
For this machine, the only necessary configuration is to open the file `prometheus.yml` and, in the following jobs, replace 'localhost' in the targets with the IP address of the Controller & Migration Engine machine:
- job_name: 'node-exporter'
  static_configs:
    - targets: ['<Controller-IP>:9100']

- job_name: 'cadvisor'
  static_configs:
    - targets: ['<Controller-IP>:8080']

Replace <Controller-IP> with the actual IP address of your Controller & Migration Engine machine. node-exporter listens on port 9100 by default and cAdvisor on port 8080; if the docker-compose.yml on the Controller machine maps different ports for these exporters, use those instead.

Kafka Cluster

For this machine, we have to configure two subfolders.

Kafka Cluster

  1. Change the current working directory to the Kafka cluster folder.
  2. Edit docker-compose.yml :

    In docker-compose.yml, replace 192.168.122.145 with your machine's public IP address in the following environment variables (a sketch of the edited block follows this list):
        KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka1:19092,EXTERNAL://192.168.122.145:9092,DOCKER://host.docker.internal:29092
        KAFKA_JMX_HOSTNAME: 192.168.122.145
  3. Run pip install -r requirements.txt.
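
For orientation, here is a minimal sketch of how the edited environment block might look in docker-compose.yml (the broker service name kafka1 is inferred from the INTERNAL listener and may differ in the repository's file; 203.0.113.10 stands in for your machine's public IP):

    services:
      kafka1:
        environment:
          # Replace 203.0.113.10 with your machine's public IP address
          KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka1:19092,EXTERNAL://203.0.113.10:9092,DOCKER://host.docker.internal:29092
          KAFKA_JMX_HOSTNAME: 203.0.113.10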

Logs Reporter

  1. Change the current working directory to the logsParser folder.

  2. Open the file config.ini; you have to edit the following parameters:

    • host = 192.168.122.1: Replace this with the IP address of your databases machine.
    • user = root: The default username for the MongoDB instance. If you want to change it, you also have to change MONGO_INITDB_ROOT_USERNAME in docker-compose.yml on the databases machine.
    • password = example: The default password for the MongoDB instance. If you want to change it, you also have to change MONGO_INITDB_ROOT_PASSWORD in docker-compose.yml on the databases machine.
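
For reference, a minimal sketch of what the parser's config.ini might contain (the section name [database] is an assumption for illustration; keep whatever section names the repository's file already uses):

    [database]
    ; IP address of the databases machine
    host = 192.168.122.1
    ; MongoDB credentials; must match MONGO_INITDB_ROOT_USERNAME and
    ; MONGO_INITDB_ROOT_PASSWORD in docker-compose.yml on the databases machine
    user = root
    password = example
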
Controller

The Controller relies on a configuration file named "config.ini", which provides the essential settings for the Migration Engine and guides users in dockerizing their own migration engine.

The "config.ini" file consists of two integral parts:

  1. First Part:

    • This part is transmitted unaltered to the Migration Engine: its content is copied as-is into the config.ini generated for the migration engine.

    • [[targetServer]]

      • In this section, the user can put any information needed to connect to the target Server.
        • host
        • user
        • password
    • [[sourceServer]]

      • In this section, the user can put any information needed to connect to the source Server.
        • host
        • user
        • password
    • [[KafkaCluster]]

      • In this section, the user should only change the host value to the IP address of the reporter machine; the other variables should keep their default values.
        • host=192.168.122.143; change this to the reporter machine's IP
        • port=9092
        • performanceBenchmarkTopic=performanceBenchmark
        • frameworkTopicName=framework
    • [[migrationEnvironment]]

      • In this section, the user provides the information needed for the migration.
        • migrationEngineDockerImage: the name of the docker image the user created for the migration engine.
        • loggingId: In case the user needs all the logs and information collected during the monitoring to be assigned to a certain Id; this can be left empty.
        • numberofexperiments: how many times each experiment is repeated with the same configuration (for the accuracy of the results).
  2. Second Part:

    • The second part lists all the parameter values for the migration scenarios the user wishes to evaluate. The Controller systematically selects each parameter combination and passes it to the Migration Engine one at a time.

    • [[experiment]]

      • This section shows example parameters for a file migration engine; the user can define parameters according to their own engine.
        • file = file1.csv, file2.txt, file3.java
        • limit = 1048576, 1048576
        • compressiontype = None, gzip, lz4
        • stream = 3, 2, 1

    The Controller enumerates all possible combinations of these parameters and, for each combination, generates a configuration file for the Migration Engine. As an illustration, one generated combination for the second part could look like this (a complete config.ini sketch follows after this example):

    • [[experiment]]
      • file = file1.csv
      • limit = 1048576
      • compressiontype = None
      • stream = 3
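
    Putting both parts together, a complete config.ini for the Controller might look like the following sketch (standard single-bracket INI section headers, as used by Python's configparser, are assumed; the connection details and the image name are illustrative placeholders, and the keys follow the descriptions above):

        [targetServer]
        host = 192.0.2.20
        user = target-user
        password = target-password

        [sourceServer]
        host = 192.0.2.10
        user = source-user
        password = source-password

        [KafkaCluster]
        ; IP address of the reporter (Kafka) machine
        host = 192.168.122.143
        port = 9092
        performanceBenchmarkTopic = performanceBenchmark
        frameworkTopicName = framework

        [migrationEnvironment]
        ; Docker image built by the user for the migration engine (placeholder name)
        migrationEngineDockerImage = myuser/my-migration-engine:latest
        ; optional id attached to all collected logs; may be left empty
        loggingId = run-001
        ; how many times each parameter combination is repeated
        numberofexperiments = 3

        [experiment]
        file = file1.csv, file2.txt, file3.java
        limit = 1048576, 1048576
        compressiontype = None, gzip, lz4
        stream = 3, 2, 1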

Dockerizing the migration engine

In this section, we look at the migration engine Docker image referenced in the configuration section. To test your migration engine with the framework, you must dockerize it according to the guidelines below.

Your engine can assume that the data source and target are already operational and prepared for migration when the container runs. In addition, your Docker container should expect the configuration file generated by the Controller, as outlined in the configuration section, and the migration should commence as soon as the container is started. A rough Dockerfile sketch is shown below.
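
As a rough sketch only (the base image, file layout, and the configs/config.ini path are assumptions, not the framework's documented contract), a Python-based migration engine could be dockerized like this:

    # Dockerfile sketch for a hypothetical Python migration engine
    FROM python:3.11-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY . .
    # The engine reads the config.ini generated by the Controller; the
    # configs/config.ini path is an assumption -- use whatever path your
    # engine and the Controller agree on (for example via a volume mount).
    CMD ["python", "main.py", "--config", "configs/config.ini"]

The important points are that the container starts the migration as soon as it is launched and that it knows where to find the Controller-generated config.ini.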

Running the experiment

In this process, we will proceed step by step, emphasizing the importance of executing each component in a specified order. It is crucial to ensure that your source and target systems are operational and prepared for the migration process.

Let's begin systematically:

Databases

Start by initiating the databases.
  • On the Databases machine, change the current working directory to the databases folder.
  • Run the following command:
    docker-compose up
  • This will start Prometheus and MongoDB along with Grafana. Grafana serves as a dashboard for monitoring your migration engine's resource consumption in real time. (An optional sanity check is sketched after this step.)
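  • Optionally, you can sanity-check the stack from another machine before moving on (these commands assume the services' default ports, 9090 for Prometheus and 3000 for Grafana, and that the repository's docker-compose.yml keeps those mappings):
    # Prometheus readiness endpoint
    curl -s http://<Databases-IP>:9090/-/ready
    # Grafana health endpoint
    curl -s http://<Databases-IP>:3000/api/health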
Kafka Cluster

Next, launch the Kafka cluster.
  • On the Kafka cluster machine, change the current working directory to the reporter/kafkacluster folder.
  • Run the following command:
    docker-compose up
Kafka Consumer

After ensuring that Kafka is up and ready, start the Kafka cluster's consumer (a conceptual sketch of what the consumer does follows this step).
  • On the Kafka cluster machine, change the current working directory to the reporter/kafkacluster folder.
  • Run the following command:
    python consumer.py
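  • For context, the consumer's job is to drain the topics that the Controller and Migration Engine log to and keep those messages on the Kafka machine until the parser runs. Conceptually, a minimal consumer looks something like the sketch below (this is not the repository's consumer.py; the topic names are the defaults from the KafkaCluster section of config.ini, and writing to a local text file is purely illustrative):

    # Conceptual sketch of a Kafka log consumer (not the repository's consumer.py).
    # Assumes the kafka-python package and the default topic names from config.ini.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "performanceBenchmark",
        "framework",
        bootstrap_servers="localhost:9092",  # the Kafka cluster machine itself
        auto_offset_reset="earliest",
    )

    with open("experiment_logs.txt", "a") as log_file:  # illustrative local storage
        for message in consumer:
            log_file.write(f"{message.topic}: {message.value.decode('utf-8')}\n")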
Controller

Finally, initiate the Controller, which will orchestrate and commence all experiments.

  • On the controller machine, change the current working directory to the controller folder.
  • Run the following command:
    docker pull "<your migration engine image>"
    docker compose up
  • This will initiate the controller along with cAdvisor and node-exporter.
    • cAdvisor is a daemon that collects, aggregates, processes, and exports information about the running containers, i.e., the Controller and the Migration Engine.
    • node-exporter monitors the host system on which the containers are deployed. (Example Prometheus queries over these metrics are shown after this step.)
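  • As an example, once the Controller is running you can watch resource usage in Grafana or the Prometheus UI with the exporters' standard metrics (the container name is a placeholder; adjust it to your migration engine's container name):

    # CPU usage of the migration engine container (cAdvisor metric)
    rate(container_cpu_usage_seconds_total{name="<migration-engine-container>"}[1m])

    # Memory usage of the same container (cAdvisor metric)
    container_memory_usage_bytes{name="<migration-engine-container>"}

    # Overall host CPU utilization (node-exporter metric)
    1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))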
Parser

Upon completion of the experiment, indicated by the termination of the Controller container, go to the Kafka cluster machine and run the parser.
  • During the experiment, resource consumption metrics were stored in Prometheus as they were produced, but the performance benchmark logs remain local to the Kafka machine.
  • The parser's role is twofold: it renders the performance benchmark logs human-readable and exports them to JSON and CSV files. It also archives these logs in the MongoDB database for later analysis and reference.
  • On the Kafka cluster machine, change the current working directory to the reporter/logsParser folder.
  • Run the following command:
    python main.py

Result

After following all the steps above, you have a full set of data from your experiments. Every performance benchmark, including the migration time per experiment and, where applicable, the compression time, is stored in the MongoDB database. At the same time, Prometheus holds the detailed resource consumption data, which you can easily visualize with the Grafana dashboard. Together, they put all the important information at your fingertips for a deep dive into your experiments.
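
As a starting point for digging into the benchmark data, the records can be pulled straight out of MongoDB, for example with pymongo (the database and collection names below are placeholders; use the names configured for your migration engine in the parser's config.ini, the default root/example credentials unless you changed them, and MongoDB's default port 27017 assuming docker-compose keeps that mapping):

    # Sketch: fetch performance benchmark results from MongoDB with pymongo.
    # Database and collection names are placeholders; match your parser configuration.
    from pymongo import MongoClient

    client = MongoClient("mongodb://root:example@192.168.122.1:27017/")  # databases machine IP
    collection = client["benchmark"]["experiments"]  # hypothetical database/collection names

    for record in collection.find().limit(5):
        print(record)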

datamigrationbenchmarkingtool's Issues

Extract total transfer time calculation in Main.py in a separate method

By giving a proper name to the method, you'd increase the understandability of the code. As it is, it is not clear why this code cannot go in the Experiment class. Does it count the total time between all experiments? In this case, is it appropriate to call it "Total transfer time"? This needs to be clear.

Editing configuration files

Within the configuration files there were example usernames and directories which matched those of the author's local machine. I would suggest replacing these with something user-agnostic, which may make it clearer to the reader what these values should represent. These are (under [targetServer]):

  • username = fareshamouda
  • dataFolder_path = /Users/fareshamouda/Desktop/Desktop/dev/DataMigrationBenchmarkingTool/deployment/tests/target/

refine the architecture

Refine the architecture: use object-oriented programming to make the project extensible, so that it is easier to add features and change old ones.

Add documentation to classes

This is critical given that this is a framework. Every file/class/method needs to have proper structured documentation.

In Python, reStructuredText is the format and Sphinx is the tool to generate HTML or LaTeX documentation for code.

configparser.NoSectionError: No section: 'localServer'

After building the image and trying to run it interactively, this error pops up. It looks like it cannot find the file 'config1.ini'. I have renamed 'config.ini' in the configs folder to 'config1.ini', but the error still comes up. Is it possible that the Dockerfile was not set up correctly?

Unable to Connect to Port 22

When running the controller, it fails to start with an error indicating an inability to connect to port 22. This issue is likely due to SSH not being enabled or properly configured on the target machine.

Consistent naming

Be consistent with naming. There is "Main.py" in controller and "main.py" in migrationEngine. Also, some names are in camel case (like migrationEngine), some are in Pascal case (like KafkaLogger.py), and others are in flat case (like clearcache.sh).

Avoid hardcoded variables and paths as much as possible

See Main.py for example.

The configparser initialization reads "configs/config.ini". If this is a path that is set up automatically when the tool is installed, that's OK; if it is, or can be, specified by the user, that can cause problems. In either case, it is a good idea to add a file with literals and "constants" where you can keep all of these values centrally, so they can be reused and modified with little impact.

See an example of such a practice here: https://developer.android.com/guide/topics/resources/string-resource

Organize the architecture of the tool

It is a good practice to organize code in folders/packages, even if you have few files. On one hand, this will give a view about the dependencies between files, but it will also give a perspective about the general architecture of the tool. Remember that developers may not need to extend everything, so named packages provide a good guide of what they can and what they need to extend and they are linked better to documentation.

One example is your Experiment class that makes a reference to the clearcache.sh file. However, the two files are not in the same folder. If you put everything under src this may imply that the clearcache script is needed by all other files. If you create separate packages though, you'd organize your files better.

Generate graphs for the user

Generate graphs that show how the mean values of all the different groups of variables behave, so that the user can follow the elbow method to choose the best configuration.

An R script could be useful.

Read-Only File System During Docker Mount Creation

When attempting to run the controller with Docker Compose, an error occurs: Error response from daemon: error while creating mount source path '/var/lib/docker': mkdir /var/lib/docker: read-only file system.

Avoid duplication

One example is that the clearcache.sh file appears in multiple copies. Do they work differently? Can they be accessed from a central place in the project?

Add to documentation

Add documentation on how to add the migration engine's logger and logs parser, and how to change the logs parser's config.ini by choosing the migration engine and database name.
