
bigdata_docker's Introduction

BIG DATA ECOSYSTEM WITH DOCKER

An environment for studying the main big data frameworks on Docker.
This setup creates containers running HDFS, HBase, Hive, Presto, Spark, Jupyter, Hue, MongoDB, Metabase, NiFi, Kafka, MySQL, and ZooKeeper, with the following architecture:

Ecosystem (architecture diagram)

REQUIRED SOFTWARE

To create and use the environment, you will need Git and Docker.

SETUP

NOTE: This step only needs to be performed once. After the environment has been created, use docker-compose to start the containers as shown in the STARTING THE ENVIRONMENT section.

Creating the docker directory:

NOTE: Create a directory named docker

  • Suggestion on Windows:

    • Create the docker directory at the root of your drive, e.g. C:\docker
  • Suggestion on Linux:

    • Create the directory in the user's home, e.g. /home/user/docker

In a terminal (or Command Prompt), inside the docker directory, clone the project from GitHub:

      git clone https://github.com/fabiogjardim/bigdata_docker.git

The bigdata_docker directory will contain the following objects:

      ls

STARTING THE ENVIRONMENT

On Windows, open PowerShell; on Linux, open a terminal.

In the terminal, inside the bigdata_docker directory, run docker-compose:

      docker-compose up -d

Check the images and containers:

     docker image ls

     docker container ls

TROUBLESHOOTING

On Windows, open the Docker Quickstart Terminal.

Stop a container

     docker stop [container name]

Stop all containers

     docker stop $(docker ps -a -q)

Remove a container

     docker rm [container name]

Remove all containers

     docker rm $(docker ps -a -q)

Inspect a container's details

     docker container inspect [container name]

Start a container

     docker-compose up -d [container name]

Start all containers

     docker-compose up -d

View a container's logs

     docker container logs [container name]

Framework WebUI access

Shell access

HDFS
      docker exec -it datanode bash
HBase
      docker exec -it hbase-master bash
Sqoop
      docker exec -it datanode bash
Kafka
      docker exec -it kafka bash

JDBC access

MySQL
      jdbc:mysql://database/employees
Hive
      jdbc:hive2://hive-server:10000/default
Presto
      jdbc:presto://presto:8080/hive/default
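The JDBC strings above all follow the same engine://host:port/database pattern. A small illustrative Python helper (hypothetical, not part of this repository) can assemble them from the service names used in this stack:

```python
def jdbc_url(engine, host, port=None, database=""):
    """Build a JDBC connection string like the ones listed above.

    engine, host, port, and database come from the service
    definitions in this stack (e.g. hive2 / hive-server:10000).
    """
    hostpart = f"{host}:{port}" if port else host
    return f"jdbc:{engine}://{hostpart}/{database}"

# Reproduce the connection strings listed above:
print(jdbc_url("mysql", "database", database="employees"))
# jdbc:mysql://database/employees
print(jdbc_url("hive2", "hive-server", 10000, "default"))
# jdbc:hive2://hive-server:10000/default
print(jdbc_url("presto", "presto", 8080, "hive/default"))
# jdbc:presto://presto:8080/hive/default
```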

Users and passwords

Hue
User: admin
Password: admin
Metabase
User: [email protected]
Password: bigdata123
MySQL
User: root
Password: secret
MongoDB
User: root
Password: root
Authentication Database: admin

Images

Docker Hub

Official Documentation

bigdata_docker's People

Contributors

fabiogjardim


bigdata_docker's Issues

Problem starting the mysql image

Good morning, friends,

I am trying to start the mysql image, but after it starts it keeps restarting. Looking at the log, I see the following error:

`2020-05-29 01:57:46+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.29-1debian10 started.
2020-05-29 01:57:48+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2020-05-29 01:57:48+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.29-1debian10 started.
2020-05-29T01:57:48.720344Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2020-05-29T01:57:48.732126Z 0 [Note] mysqld (mysqld 5.7.29) starting as process 1 ...
2020-05-29T01:57:48.749124Z 0 [Note] InnoDB: PUNCH HOLE support available
2020-05-29T01:57:48.749141Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-05-29T01:57:48.749144Z 0 [Note] InnoDB: Uses event mutexes
2020-05-29T01:57:48.749146Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2020-05-29T01:57:48.749148Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2020-05-29T01:57:48.749324Z 0 [Note] InnoDB: Number of pools: 1
2020-05-29T01:57:48.749400Z 0 [Note] InnoDB: Using CPU crc32 instructions
2020-05-29T01:57:48.750569Z 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2020-05-29T01:57:48.759106Z 0 [Note] InnoDB: Completed initialization of buffer pool
2020-05-29T01:57:48.761304Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-05-29T01:57:48.809724Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2020-05-29T01:57:48.822850Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 155984619
2020-05-29T01:57:48.822870Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 155984628
2020-05-29T01:57:48.822874Z 0 [Note] InnoDB: Database was not shutdown normally!
2020-05-29T01:57:48.822876Z 0 [Note] InnoDB: Starting crash recovery.
2020-05-29T01:57:49.364512Z 0 [ERROR] InnoDB: Operating system error number 1 in a file operation.
2020-05-29T01:57:49.364549Z 0 [ERROR] InnoDB: Error number 1 means 'Operation not permitted'
2020-05-29T01:57:49.364555Z 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2020-05-29T01:57:49.364559Z 0 [ERROR] InnoDB: File ./ibtmp1: 'delete' returned OS error 101.
2020-05-29T01:57:49.364563Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-05-29T01:57:49.365233Z 0 [ERROR] InnoDB: Operating system error number 1 in a file operation.
2020-05-29T01:57:49.365244Z 0 [ERROR] InnoDB: Error number 1 means 'Operation not permitted'
2020-05-29T01:57:49.365247Z 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2020-05-29T01:57:49.365250Z 0 [ERROR] InnoDB: File ./ibtmp1: 'stat' returned OS error 101.
2020-05-29T01:57:49.365275Z 0 [ERROR] InnoDB: os_file_get_status() failed on './ibtmp1'. Can't determine file permissions
2020-05-29T01:57:49.365278Z 0 [ERROR] InnoDB: Could not create the shared innodb_temporary.
2020-05-29T01:57:49.365280Z 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2020-05-29T01:57:49.566478Z 0 [ERROR] InnoDB: Operating system error number 1 in a file operation.
2020-05-29T01:57:49.566527Z 0 [ERROR] InnoDB: Error number 1 means 'Operation not permitted'
2020-05-29T01:57:49.566563Z 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2020-05-29T01:57:49.566568Z 0 [ERROR] InnoDB: File ./ibtmp1: 'delete' returned OS error 101.
2020-05-29T01:57:49.566573Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
2020-05-29T01:57:49.566576Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2020-05-29T01:57:49.566581Z 0 [ERROR] Failed to initialize builtin plugins.
2020-05-29T01:57:49.566583Z 0 [ERROR] Aborting

2020-05-29T01:57:49.566587Z 0 [Note] Binlog end
2020-05-29T01:57:49.566638Z 0 [Note] Shutting down plugin 'CSV'
2020-05-29T01:57:49.569736Z 0 [Note] mysqld: Shutdown complete`

This is how the image is configured in my .yml:

      database:
        image: fjardim/mysql
        container_name: database
        hostname: database
        ports:
          - "33061:3306"
        deploy:
          resources:
            limits:
              memory: 500m
        command: mysqld --innodb-flush-method=O_DSYNC --innodb-use-native-aio=OFF --init-file /data/application/init.sql
        volumes:
          - /c/docker/bigdata_docker/data/mysql/data:/var/lib/mysql
          - /c/docker/bigdata_docker/data/init.sql:/data/application/init.sql
        environment:
          MYSQL_ROOT_USER: root
          MYSQL_ROOT_PASSWORD: secret
          MYSQL_DATABASE: hue
          MYSQL_USER: root
          MYSQL_PASSWORD: secret

Any idea what might be causing the error? Note that I am using Windows 10 and Docker Desktop to run everything.

Thank you!

move instead of rm

In one part of the code you ask to rename a file, but the command given is a move. I believe the correct one would be rename.

Question about docker

Hello Fábio,

Please, according to the ecosystem image, will each of the items be placed in its own container? For example, would MongoDB and Mongo Express run in separate containers or in the same one?

Thank you very much,

Daniel Adorno Gomes

Windows 10 Enterprise 64-bit

Hello... Unfortunately my OS, Windows 10 Enterprise 64-bit, does not support virtualization.

I currently use Docker Desktop for Windows, which depends on Hyper-V; when Hyper-V is active, it is incompatible with VirtualBox...

Unfortunately, since this is a corporate computer, I cannot change the BIOS to enable virtualization.

Given this scenario, any suggestions? Unfortunately I don't know much about Docker, but I imagine there must be some alternative.

fatal: repository 'https://github.com/fabiobjardim/bigdata_docker.git/' not found

Hello Fábio, how are you?

I tried to run this command, following the instructions on your page at https://github.com/fabiogjardim/bigdata_docker, but it did not work:

C:\docker>git clone http://github.com/fabiobjardim/bigdata_docker.git
Cloning into 'bigdata_docker'...
info: please complete authentication in your browser...
remote: Repository not found.
fatal: repository 'https://github.com/fabiobjardim/bigdata_docker.git/' not found

Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog

Fábio,

your distribution fit me like a glove; thank you very much.

However, I get an error whenever I try to connect Spark to Hive. It shows the message
Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog. This happens both from Jupyter and directly in pyspark inside the spark VM.

I have already reinstalled everything (deleted all docker images and started only bigdata_docker), checked for conflicting ports, and increased Docker's resources to 4 CPUs, 16 GB of memory, and 4 GB of swap, and nothing changed. I found nothing relevant searching the web.

I am running on an iMac (24 GB RAM) with macOS Catalina 10.15.4 and Docker 2.2.0.5.

Everything else works; Hue, Presto, and Metabase access Hive normally.

I would appreciate any idea of what is wrong. I did not change any of your configuration or the images'.

root@jupyter-spark:/opt/spark/conf# pyspark
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/04/11 17:11:24 WARN spark.SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
20/04/11 17:11:25 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 3.5.3 (default, Sep 27 2018 17:25:39)
SparkSession available as 'spark'.

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("show databases").show()
Traceback (most recent call last):
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.sql.
: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:192)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:103)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.org$apache$spark$sql$hive$HiveSessionStateBuilder$$externalCatalog(HiveSessionStateBuilder.scala:39)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$1.apply(HiveSessionStateBuilder.scala:54)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$1.apply(HiveSessionStateBuilder.scala:54)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:90)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:90)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:247)
at org.apache.spark.sql.execution.command.ShowDatabasesCommand$$anonfun$2.apply(databases.scala:44)
at org.apache.spark.sql.execution.command.ShowDatabasesCommand$$anonfun$2.apply(databases.scala:44)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.command.ShowDatabasesCommand.run(databases.scala:44)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
... 36 more
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:71)
... 41 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 42 more

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/context.py", line 358, in sql
return self.sparkSession.sql(sqlQuery)
File "/opt/spark/python/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"

ingest data / demo example

Hi,

Hope you are all well !

Is it possible to provide an example of ingesting a csv file into this stack ?

Thanks in advance for any insights or inputs on that issue.

Cheers,
X
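For reference, one common way to ingest a CSV into a stack like this is to copy the file into the datanode container and put it into HDFS with the standard `hdfs dfs` CLI. The sketch below is an assumption about that workflow, not an official example from this repository; it prepares a small hypothetical CSV locally and prints the docker/HDFS commands that would complete the ingestion (it does not execute them):

```python
import csv

# 1. Write a small sample CSV locally (hypothetical data).
rows = [["id", "name"], ["1", "alice"], ["2", "bob"]]
with open("people.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# 2. Commands to copy the file into the datanode container and
#    put it into HDFS. Run them only with the stack up; they are
#    printed here, not executed.
hdfs_steps = [
    "docker cp people.csv datanode:/tmp/people.csv",
    "docker exec datanode hdfs dfs -mkdir -p /user/demo",
    "docker exec datanode hdfs dfs -put -f /tmp/people.csv /user/demo/",
]
for step in hdfs_steps:
    print(step)
```

Once the file is in HDFS, it can be exposed to Hive (and therefore to Presto, Hue, and Spark) with a CREATE EXTERNAL TABLE pointing at /user/demo.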
