Hadoop Cluster
We created a small Hadoop cluster for testing purposes.
Architecture of a Hadoop Cluster
Before we start, we have to understand the different components of a Hadoop cluster.
Basically, there are two main components:
HDFS
HDFS is a distributed file system designed to run on commodity hardware, with a master/slave architecture.
The NameNode is a single node that manages the file system namespace and regulates access to files by clients. There are also many DataNodes, which manage the storage attached to the nodes they run on.
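Once the cluster described below is running, you can see this master/slave split in practice by asking the NameNode for a cluster report that lists the registered DataNodes (a minimal sketch, using the container name introduced later in this README):
docker exec hadoop-namenode bin/hdfs dfsadmin -report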
YARN
YARN splits up the functionalities of resource management and job scheduling/monitoring into separate daemons.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
The NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
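Similarly, once the cluster is up you can list the NodeManagers registered with the ResourceManager (a minimal sketch, using the container name introduced later in this README):
docker exec hadoop-resourcemanager bin/yarn node -list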
Docker images
We built several small images for the Hadoop cluster (see the build note after the list):
- base/Dockerfile - base image with the Hadoop binaries installed; all other images build on it
- namenode/Dockerfile - HDFS master node
- datanode/Dockerfile - HDFS slave node(s)
- resourcemanager/Dockerfile - YARN resource manager node
- nodemanager/Dockerfile - YARN node manager node
- historyserver/Dockerfile - job history server
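The images are built automatically on the first docker-compose up, but you can also build them up front and check which images were produced with standard Docker commands (the grep pattern assumes the image names contain "hadoop", which may not match your compose file):
docker-compose build
docker images | grep hadoop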
Quick start
Start the Hadoop cluster
To start the Hadoop cluster, run the following command:
docker-compose up
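If you prefer to keep the terminal free, you can also start the cluster in the background and then inspect the container status; this is standard docker-compose usage, not anything specific to this setup:
docker-compose up -d
docker-compose ps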
You can check whether the containers are running and expose their web UI pages (see the sketch after this list):
- For HDFS:
Please note that we had to remap the default DataNode web UI port 9864 to 9801, 9802, 9803 and 9804 for the individual DataNode instances.
- For YARN:
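A quick way to verify that the pages respond is to curl them from the host. The DataNode port below is one of the remapped ports mentioned above; the NameNode and ResourceManager ports are the Hadoop 3 defaults (9870 and 8088) and are only an assumption about what the compose file publishes:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870   # NameNode web UI (assumed default port)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9801   # DataNode 1 web UI (remapped port)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088   # ResourceManager web UI (assumed default port)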
Stop the Hadoop cluster
To stop the Hadoop cluster, run the following command:
docker-compose down
IMPORTANT: If you bring down the running containers with CTRL+C, by killing their processes, etc., you may have problems starting the Hadoop cluster the next time:
Attaching to hadoop-base, hadoop-namenode, hadoop-datanode4,
hadoop-datanode1, hadoop-datanode3, hadoop-datanode2,
hadoop-historyserver, hadoop-nodemanager, hadoop-resourcemanager
...
hadoop-namenode | namenode is running as process 1. Stop it first.
hadoop-datanode4 | datanode is running as process 1. Stop it first.
hadoop-datanode1 | datanode is running as process 1. Stop it first.
hadoop-datanode3 | datanode is running as process 1. Stop it first.
hadoop-datanode2 | datanode is running as process 1. Stop it first.
hadoop-historyserver | historyserver is running as process 6. Stop it first.
hadoop-nodemanager | nodemanager is running as process 1. Stop it first.
hadoop-base exited with code 0
hadoop-namenode exited with code 1
hadoop-datanode4 exited with code 1
hadoop-datanode2 exited with code 1
hadoop-nodemanager exited with code 1
hadoop-datanode3 exited with code 1
...
To fix this, run docker-compose down. If that does not help, try:
docker-compose build --no-cache
docker-compose up
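The "is running as process N. Stop it first." messages usually indicate stale PID files left over from the interrupted run. If rebuilding does not help either, it may be worth removing the leftover containers explicitly before starting again (standard docker/docker-compose commands; the grep pattern assumes the container names contain "hadoop"):
docker ps -a | grep hadoop
docker-compose rm -f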
Examples
- Let's check the Hadoop version:
docker exec hadoop-namenode bin/hadoop version
You should see something like:
Hadoop 3.1.1
Source code repository https://github.com/apache/hadoop -r 2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c
Compiled by leftnoteasy on 2018-08-02T04:26Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using /usr/lib/hadoop-3.1.1/share/hadoop/common/hadoop-common-3.1.1.jar
- List files in HDFS:
docker exec hadoop-namenode bin/hadoop fs -ls /
Found 1 items
drwxrwx--- - root supergroup 0 2018-08-30 14:29 /tmp
- Insert data into HDFS:
First, create a new input directory:
docker exec hadoop-namenode bin/hadoop fs -mkdir /input
Second, put some data into the Hadoop file system:
docker exec hadoop-namenode sh -c "bin/hadoop fs -put etc/hadoop/c*.xml /input"
Third, check from any node whether the files exist in the input directory:
docker exec hadoop-datanode2 bin/hadoop fs -ls /input
Found 9 items
-rw-r--r-- 1 root supergroup 8260 2018-08-30 14:47 /input/capacity-scheduler.xml
-rw-r--r-- 1 root supergroup 992 2018-08-30 14:47 /input/core-site.xml
-rw-r--r-- 1 root supergroup 10431 2018-08-30 14:47 /input/hadoop-policy.xml
-rw-r--r-- 1 root supergroup 867 2018-08-30 14:47 /input/hdfs-site.xml
-rw-r--r-- 1 root supergroup 620 2018-08-30 14:47 /input/httpfs-site.xml
-rw-r--r-- 1 root supergroup 3518 2018-08-30 14:47 /input/kms-acls.xml
-rw-r--r-- 1 root supergroup 682 2018-08-30 14:47 /input/kms-site.xml
-rw-r--r-- 1 root supergroup 1272 2018-08-30 14:47 /input/mapred-site.xml
-rw-r--r-- 1 root supergroup 2501 2018-08-30 14:47 /input/yarn-site.xml
- Retrieve data from HDFS:
View some data with the cat command:
docker exec hadoop-datanode2 bin/hadoop fs -cat /input/httpfs-site.xml
Copy files from HDFS into the local filesystem:
docker exec hadoop-namenode bin/hadoop fs -get /input /tmp
- Execute a MapReduce job:
docker exec hadoop-resourcemanager bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep /input /output 'dfs[a-z.]+'
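The grep example writes its results to the /output directory given on the command line. Assuming the job finishes successfully, you can inspect the results like any other HDFS data:
docker exec hadoop-namenode bin/hadoop fs -ls /output
docker exec hadoop-namenode bin/hadoop fs -cat /output/*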