bd-infra's Introduction

bd-infra

A development big data infrastructure with docker-compose.
In this platform, you will have HDFS, Hive, Spark, Hue, Zeppelin, Kafka, Zookeeper, and Streamsets connected together.
Just run docker-compose up and enjoy!

bd-infra's People

Contributors

msemn avatar

bd-infra's Issues

zeppelin | Apache Zeppelin requires either Java 8 update 151 or newer

zeppelin | - Setting hadoop.proxyuser.hue.groups=*
zeppelin | Configuring hdfs
zeppelin | - Setting dfs.webhdfs.enabled=true
zeppelin | - Setting dfs.permissions.enabled=false
zeppelin | Configuring yarn
zeppelin | - Setting yarn.timeline-service.enabled=true
zeppelin | - Setting yarn.resourcemanager.system-metrics-publisher.enabled=true
zeppelin | - Setting yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
zeppelin | - Setting yarn.log.server.url=http://historyserver:8188/applicationhistory/logs/
zeppelin | - Setting yarn.resourcemanager.fs.state-store.uri=/rmstate
zeppelin | - Setting yarn.timeline-service.generic-application-history.enabled=true
zeppelin | - Setting yarn.log-aggregation-enable=true
zeppelin | - Setting yarn.resourcemanager.hostname=resourcemanager
zeppelin | - Setting yarn.resourcemanager.resource.tracker.address=resourcemanager:8031
zeppelin | - Setting yarn.timeline-service.hostname=historyserver
zeppelin | - Setting yarn.resourcemanager.scheduler.address=resourcemanager:8030
zeppelin | - Setting yarn.resourcemanager.address=resourcemanager:8032
zeppelin | - Setting yarn.nodemanager.remote-app-log-dir=/app-logs
zeppelin | - Setting yarn.resourcemanager.recovery.enabled=true
zeppelin | Configuring httpfs
zeppelin | Configuring kms
zeppelin | Configuring mapred
zeppelin | Configuring hive
zeppelin | - Setting datanucleus.autoCreateSchema=false
zeppelin | sed: can't read /opt/hive/conf/hive-site.xml: No such file or directory
zeppelin | - Setting javax.jdo.option.ConnectionPassword=hive
zeppelin | sed: can't read /opt/hive/conf/hive-site.xml: No such file or directory
zeppelin | - Setting hive.metastore.uris=thrift://hive-metastore:9083
zeppelin | sed: can't read /opt/hive/conf/hive-site.xml: No such file or directory
zeppelin | - Setting javax.jdo.option.ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore
zeppelin | sed: can't read /opt/hive/conf/hive-site.xml: No such file or directory
zeppelin | - Setting javax.jdo.option.ConnectionUserName=hive
zeppelin | sed: can't read /opt/hive/conf/hive-site.xml: No such file or directory
zeppelin | - Setting javax.jdo.option.ConnectionDriverName=org.postgresql.Driver
zeppelin | sed: can't read /opt/hive/conf/hive-site.xml: No such file or directory
zeppelin | Configuring for multihomed network
zeppelin | Apache Zeppelin requires either Java 8 update 151 or newer
zeppelin exited with code 1
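
The log above only states the requirement (Java 8 update 151 or newer); it does not show which Java version the container actually found. As a rough aid for anyone debugging this, the sketch below compares a version string, such as the one printed by `java -version`, against that threshold. The function name `meets_java8u151` is made up for this example, and this is not Zeppelin's own check.

```python
import re


def meets_java8u151(version: str) -> bool:
    """Return True if a Java version string is Java 8 update 151 or newer.

    Handles the legacy scheme ("1.8.0_151") and the newer one ("11.0.2").
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?(?:_(\d+))?", version)
    if not match:
        raise ValueError(f"unrecognised Java version: {version!r}")
    first, second, _patch, update = (int(g) if g else 0 for g in match.groups())
    # In the legacy scheme the real major version is the second number.
    major = second if first == 1 else first
    if major != 8:
        return major > 8
    return update >= 151


if __name__ == "__main__":
    # Illustrative version strings, not taken from the report.
    for candidate in ("1.8.0_131", "1.8.0_151", "11.0.2"):
        print(candidate, meets_java8u151(candidate))
```

Whether a newer major version is acceptable to the Zeppelin image in this compose file is an assumption here; the log message itself only says "or newer".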

Ports are not available

When I run docker compose up (as administrator), I get the following error:
Error response from daemon: Ports are not available: exposing port TCP 0.0.0.0:50070 -> 0.0.0.0:0: listen tcp 0.0.0.0:50070: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
I am using Windows 11. IIS is not installed, and the port is free: it is not assigned to any other service.
I have tested the same Docker file on Linux, and it works there.
Could you please tell me what the problem is and how I can solve it?
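
One possible cause, which this report does not confirm, is that port 50070 falls inside a TCP range that Windows has reserved (for example for Hyper-V or WSL 2). Such a port looks free but cannot be bound. The reserved ranges can be listed with `netsh interface ipv4 show excludedportrange protocol=tcp`. The sketch below parses that command's output and checks a port against it; the sample output and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_excluded_ranges(netsh_output: str) -> List[Tuple[int, int]]:
    """Extract (start, end) port pairs from netsh excludedportrange output."""
    ranges = []
    for line in netsh_output.splitlines():
        parts = line.split()
        # Data rows begin with two integers; headers and separators do not.
        if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():
            ranges.append((int(parts[0]), int(parts[1])))
    return ranges


def is_excluded(port: int, ranges: List[Tuple[int, int]]) -> bool:
    """Return True if the port lies inside any reserved range (inclusive)."""
    return any(start <= port <= end for start, end in ranges)


# Hypothetical sample output, for illustration only.
SAMPLE_OUTPUT = """
Protocol tcp Port Exclusion Ranges

Start Port    End Port
----------    --------
     49690       49789
     50060       50159

* - Administered port exclusions.
"""

if __name__ == "__main__":
    reserved = parse_excluded_ranges(SAMPLE_OUTPUT)
    print(is_excluded(50070, reserved))
```

If the port does turn out to be reserved, mapping the namenode UI to a different host port in the compose file would be one way around it.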

Fails to start on M1 Mac, missing ARM image

$ docker-compose up
Status: Downloaded newer image for openkbs/docker-spark-bde2020-zeppelin:latest
Pulling database (mysql:5.7)...
5.7: Pulling from library/mysql
ERROR: no matching manifest for linux/arm64/v8 in the manifest list entries
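
The error means that the manifest list for mysql:5.7 has no linux/arm64 entry. A minimal sketch of how to confirm this from the JSON printed by `docker manifest inspect <image>`: the function name `supports_platform` and the trimmed sample document are assumptions made for this example.

```python
import json


def supports_platform(manifest_list_json: str, os_name: str, architecture: str) -> bool:
    """Return True if a manifest list contains an entry for os/architecture."""
    document = json.loads(manifest_list_json)
    for entry in document.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == architecture:
            return True
    return False


# Trimmed, hypothetical manifest list with a single amd64 entry.
SAMPLE_MANIFEST = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
    ],
})

if __name__ == "__main__":
    print(supports_platform(SAMPLE_MANIFEST, "linux", "arm64"))
```

When only an amd64 image exists, running it under emulation may work, either by adding `platform: linux/amd64` to the service in the compose file or by setting the `DOCKER_DEFAULT_PLATFORM` environment variable. Neither has been tested against this repository.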

Kerberos integration?

Dear all,
I'm new to Hadoop and Spark. I'm trying to use your setup to write integration tests for a Spark job that works with ClickHouse and HDFS.
One of the issues I faced is that there appears to be no integration with Kerberos.

Is there any hint or example of how to integrate Kerberos auth with the current configuration?

hue crashes after a few seconds

Hello,

Thanks for sharing this great repo. Unfortunately, it seems that the hue service is crashing and I cannot access any UIs (not even the namenode UI). I am running the following commands in Ubuntu 20.04 inside WSL2:

git clone https://github.com/m-semnani/bd-infra.git
cd bd-infra
docker-compose up -d

Then, after everything is installed and the services are up and running, I do a simple docker ps and see the following results:

CONTAINER ID        IMAGE                                             COMMAND                  CREATED             STATUS                            PORTS                                                      NAMES
d1cac433a6ee        bde2020/hive:2.3.2-postgresql-metastore           "entrypoint.sh /bin/…"   6 seconds ago       Up 4 seconds                      0.0.0.0:10000->10000/tcp, 10002/tcp                        hive-server
603604a7ce37        bde2020/hive:2.3.2-postgresql-metastore           "entrypoint.sh /opt/…"   7 seconds ago       Up 5 seconds                      10000/tcp, 0.0.0.0:9083->9083/tcp, 10002/tcp               hive-metastore
9964d2f49f5b        bde2020/hive-metastore-postgresql:2.3.0           "/docker-entrypoint.…"   8 seconds ago       Up 6 seconds                      5432/tcp                                                   hive-metastore-postgresql
494f5edaed7e        bde2020/spark-worker:2.4.0-hadoop2.7              "/bin/bash /worker.sh"   8 seconds ago       Up 6 seconds                      0.0.0.0:8081->8081/tcp                                     spark-worker
3d42461686fb        gethue/hue:20191107-135001                        "./startup.sh"           8 seconds ago       Up 7 seconds                      0.0.0.0:8888->8888/tcp                                     hue
ef5213648cee        bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8   "/entrypoint.sh /run…"   9 seconds ago       Up 7 seconds (health: starting)   0.0.0.0:50075->50075/tcp                                   datanode
82dc9e4ab2ab        bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8   "/entrypoint.sh /run…"   10 seconds ago      Up 8 seconds (health: starting)   0.0.0.0:50070->50070/tcp                                   namenode
51271761043f        wurstmeister/kafka:2.12-2.3.0                     "start-kafka.sh"         10 seconds ago      Up 8 seconds                      0.0.0.0:9092->9092/tcp                                     bd-infra_kafka_1
8c6f51379a48        bde2020/spark-master:2.4.0-hadoop2.7              "/bin/bash /master.sh"   10 seconds ago      Up 8 seconds                      0.0.0.0:7077->7077/tcp, 6066/tcp, 0.0.0.0:8080->8080/tcp   spark-master
b8ca46f10106        wurstmeister/zookeeper:3.4.6                      "/bin/sh -c '/usr/sb…"   10 seconds ago      Up 7 seconds                      22/tcp, 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp         bd-infra_zookeeper_1

As we can see, the hue service is up and running with container ID 3d42461686fb. However, after I wait a few seconds and run docker ps again, I see that the hue container is no longer running:

CONTAINER ID        IMAGE                                             COMMAND                  CREATED             STATUS                    PORTS                                                      NAMES
d1cac433a6ee        bde2020/hive:2.3.2-postgresql-metastore           "entrypoint.sh /bin/…"   29 seconds ago      Up 28 seconds             0.0.0.0:10000->10000/tcp, 10002/tcp                        hive-server
603604a7ce37        bde2020/hive:2.3.2-postgresql-metastore           "entrypoint.sh /opt/…"   30 seconds ago      Up 29 seconds             10000/tcp, 0.0.0.0:9083->9083/tcp, 10002/tcp               hive-metastore
9964d2f49f5b        bde2020/hive-metastore-postgresql:2.3.0           "/docker-entrypoint.…"   31 seconds ago      Up 30 seconds             5432/tcp                                                   hive-metastore-postgresql
494f5edaed7e        bde2020/spark-worker:2.4.0-hadoop2.7              "/bin/bash /worker.sh"   31 seconds ago      Up 30 seconds             0.0.0.0:8081->8081/tcp                                     spark-worker
ef5213648cee        bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8   "/entrypoint.sh /run…"   32 seconds ago      Up 31 seconds (healthy)   0.0.0.0:50075->50075/tcp                                   datanode
82dc9e4ab2ab        bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8   "/entrypoint.sh /run…"   33 seconds ago      Up 32 seconds (healthy)   0.0.0.0:50070->50070/tcp                                   namenode
51271761043f        wurstmeister/kafka:2.12-2.3.0                     "start-kafka.sh"         33 seconds ago      Up 32 seconds             0.0.0.0:9092->9092/tcp                                     bd-infra_kafka_1
8c6f51379a48        bde2020/spark-master:2.4.0-hadoop2.7              "/bin/bash /master.sh"   33 seconds ago      Up 31 seconds             0.0.0.0:7077->7077/tcp, 6066/tcp, 0.0.0.0:8080->8080/tcp   spark-master
b8ca46f10106        wurstmeister/zookeeper:3.4.6                      "/bin/sh -c '/usr/sb…"   33 seconds ago      Up 31 seconds             22/tcp, 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp         bd-infra_zookeeper_1
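
Comparing two such listings by eye is error-prone. As a small sketch, the container names from each listing (for example from `docker ps --format '{{.Names}}'`) can be diffed to show which containers stopped in between. The helper name `stopped_containers` is made up for this example; the names are copied from the two listings above.

```python
from typing import List


def stopped_containers(before: List[str], after: List[str]) -> List[str]:
    """Return names present in the first listing but missing from the second."""
    still_running = set(after)
    return [name for name in before if name not in still_running]


# Container names from the first and second `docker ps` listings.
FIRST_LISTING = [
    "hive-server", "hive-metastore", "hive-metastore-postgresql",
    "spark-worker", "hue", "datanode", "namenode",
    "bd-infra_kafka_1", "spark-master", "bd-infra_zookeeper_1",
]
SECOND_LISTING = [
    "hive-server", "hive-metastore", "hive-metastore-postgresql",
    "spark-worker", "datanode", "namenode",
    "bd-infra_kafka_1", "spark-master", "bd-infra_zookeeper_1",
]

if __name__ == "__main__":
    print(stopped_containers(FIRST_LISTING, SECOND_LISTING))
```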

Also, when I try to access the UI of the namenode on localhost:50070 and the UI of hue on localhost:8888, I get the error This site can’t be reached (which is expected for the hue service, but surprisingly also happens for the namenode).

The only thing I changed compared to your docker-compose file is the volume path, from /tmp to ./tmp, but I get the same problem in either case. Do you have any recommendations on how to fix this issue?

Update 1: Below are the logs of the hue container of a few seconds after running docker-compose up --build -d and then docker-compose logs hue:

Attaching to hue
hue                          | [21/Mar/2021 10:38:00 ] settings     INFO     Welcome to Hue 4.5.0
hue                          | [21/Mar/2021 03:38:04 -0700] decorators   INFO     AXES: BEGIN LOG
hue                          | [21/Mar/2021 03:38:04 -0700] decorators   INFO     Using django-axes 2.2.0
hue                          | Traceback (most recent call last):
hue                          |   File "./build/env/bin/hue", line 11, in <module>
hue                          |     load_entry_point('desktop', 'console_scripts', 'hue')()
hue                          |   File "/usr/share/hue/desktop/core/src/desktop/manage_entry.py", line 225, in entry
hue                          |     execute_from_command_line(sys.argv)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/management/__init__.py", line 364, in execute_from_command_line
hue                          |     utility.execute()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/management/__init__.py", line 356, in execute
hue                          |     self.fetch_command(subcommand).run_from_argv(self.argv)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/management/base.py", line 283, in run_from_argv
hue                          |     self.execute(*args, **cmd_options)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/management/base.py", line 327, in execute
hue                          |     self.check()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/management/base.py", line 359, in check
hue                          |     include_deployment_checks=include_deployment_checks,
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/management/base.py", line 346, in _run_checks
hue                          |     return checks.run_checks(**kwargs)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/checks/registry.py", line 81, in run_checks
hue                          |     new_errors = check(app_configs=app_configs)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/core/checks/model_checks.py", line 30, in check_all_models
hue                          |     errors.extend(model.check(**kwargs))
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/models/base.py", line 1284, in check
hue                          |     errors.extend(cls._check_fields(**kwargs))
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/models/base.py", line 1359, in _check_fields
hue                          |     errors.extend(field.check(**kwargs))
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/models/fields/__init__.py", line 913, in check
hue                          |     errors = super(AutoField, self).check(**kwargs)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/models/fields/__init__.py", line 219, in check
hue                          |     errors.extend(self._check_backend_specific_checks(**kwargs))
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/models/fields/__init__.py", line 322, in _check_backend_specific_checks
hue                          |     return connections[db].validation.check_field(self, **kwargs)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/mysql/validation.py", line 49, in check_field
hue                          |     field_type = field.db_type(self.connection)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/models/fields/__init__.py", line 644, in db_type
hue                          |     return connection.data_types[self.get_internal_type()] % data
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/utils/functional.py", line 35, in __get__
hue                          |     res = instance.__dict__[self.name] = self.func(instance)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/mysql/base.py", line 174, in data_types
hue                          |     if self.features.supports_microsecond_precision:
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/utils/functional.py", line 35, in __get__
hue                          |     res = instance.__dict__[self.name] = self.func(instance)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/mysql/features.py", line 53, in supports_microsecond_precision
hue                          |     return self.connection.mysql_version >= (5, 6, 4) and Database.version_info >= (1, 2, 5)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/utils/functional.py", line 35, in __get__
hue                          |     res = instance.__dict__[self.name] = self.func(instance)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/mysql/base.py", line 385, in mysql_version
hue                          |     with self.temporary_connection() as cursor:
hue                          |   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
hue                          |     return self.gen.next()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/base/base.py", line 591, in temporary_connection
hue                          |     cursor = self.cursor()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/base/base.py", line 254, in cursor
hue                          |     return self._cursor()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/base/base.py", line 229, in _cursor
hue                          |     self.ensure_connection()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/base/base.py", line 213, in ensure_connection
hue                          |     self.connect()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/utils.py", line 94, in __exit__
hue                          |     six.reraise(dj_exc_type, dj_exc_value, traceback)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/base/base.py", line 213, in ensure_connection
hue                          |     self.connect()
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/base/base.py", line 189, in connect
hue                          |     self.connection = self.get_new_connection(conn_params)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/Django-1.11.22-py2.7.egg/django/db/backends/mysql/base.py", line 274, in get_new_connection
hue                          |     conn = Database.connect(**conn_params)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/MySQL_python-1.2.5-py2.7-linux-x86_64.egg/MySQLdb/__init__.py", line 81, in Connect
hue                          |     return Connection(*args, **kwargs)
hue                          |   File "/usr/share/hue/build/env/local/lib/python2.7/site-packages/MySQL_python-1.2.5-py2.7-linux-x86_64.egg/MySQLdb/connections.py", line 193, in __init__
hue                          |     super(Connection, self).__init__(*args, **kwargs2)
hue                          | django.db.utils.OperationalError: (2005, "Unknown MySQL server host 'database' (0)")

There are actually more logs, but they are just a repetition of the above. My initial thought was that maybe there is something wrong in the database section of the hue-overrides.ini file, but the host and port name seem to make sense to me. If anyone could share some more insights on this, I'd highly appreciate it.

Thanks,
Kevin
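
The last line of the traceback says that the hostname `database` could not be resolved, and neither docker ps listing above shows a MySQL container. This suggests, though the report does not confirm it, that the `database` service was not running when hue started. A minimal sketch of a wait-until-resolvable check that could run before starting a dependent service: the function name `wait_for_host` and its parameters are assumptions made for this example, not part of the repository.

```python
import socket
import time
from typing import Callable


def wait_for_host(
    host: str,
    attempts: int = 30,
    delay: float = 1.0,
    resolve: Callable[[str], str] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry name resolution until it succeeds or attempts run out."""
    for attempt in range(attempts):
        try:
            resolve(host)
            return True
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt < attempts - 1:
                sleep(delay)
    return False


if __name__ == "__main__":
    # "database" is the hostname from the traceback above.
    if not wait_for_host("database", attempts=5):
        print("host 'database' is not resolvable; is the service running?")
```

Resolution succeeding only shows that the container exists on the network. It does not show that MySQL is accepting connections, so a real readiness check would also need to connect to the database port.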
