
yelp_challenge's Introduction

Prerequisites

Docker is necessary for building.

  • Docker Version: 18.09.2

Setting Up Environment and running

Creating docker machines and running containers

To start the Docker containers and run the Spark tasks, make the scripts executable and run run.sh with the path to the data file:

	  chmod +x *.sh
	  ./run.sh TAR_GZ_FILE_LOCATION

Testing db tables

To test the database, open a shell in the Cassandra container and run the cqlsh client:

	>docker exec -ti  yelp_challenge_cassandra_1 bash
	>cqlsh
	>describe test.tables;
	>select * from test.photo limit 10;
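
The same kind of check can also be run without an interactive session. The following is a minimal sketch, assuming the container name shown above and that cqlsh is on the container's PATH; the helper name run_cql is hypothetical and not part of this project.

```shell
# Run a single CQL statement inside the Cassandra container and exit.
# run_cql is a hypothetical helper, not a script shipped with the project.
run_cql() {
  container="$1"   # container name, e.g. yelp_challenge_cassandra_1
  statement="$2"   # CQL statement, terminated with a semicolon
  # cqlsh -e executes the given statement and then quits
  docker exec "$container" cqlsh -e "$statement"
}

# Usage (requires the containers started by ./run.sh):
#   run_cql yelp_challenge_cassandra_1 'select * from test.photo limit 10;'
```

The -ti flags are dropped here because no interactive terminal is needed for a one-off statement.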

Build steps for the container image (optional, since the image is already available on Docker Hub)

The image used to create the container named "task" (the task service in docker-compose.yml) is available on Docker Hub and can be pulled with: docker pull guherbozdogan2/repo:latest

The image was generated from this Scala sbt project with the sbt Docker plugin (by running sbt docker:publishLocal inside the project folder). The command generates a Dockerfile and the lib and bin folders it needs, populated with the applications. The Docker Hub image above was produced from this repository with the steps below. These steps are not needed to run the project; executing ./run.sh is enough, since the image already exists on Docker Hub as guherbozdogan2/repo:latest.

    >sbt update clean compile package
    >sbt docker:publishLocal
    >cd docker/target/stage
Then change the line 'ENTRYPOINT ["/opt/docker/bin/yelp-conversion-task"]' in the generated Dockerfile to 'ENTRYPOINT /opt/docker/bin/main-task && /opt/docker/bin/main-test-task' and build the local image with:
    >docker build -t repo .
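
The manual ENTRYPOINT edit can also be scripted and run before docker build. This is a minimal sketch, assuming the generated Dockerfile contains a single ENTRYPOINT line; the helper name patch_entrypoint is hypothetical.

```shell
# Rewrite the ENTRYPOINT line of a Dockerfile to the shell form given above.
# patch_entrypoint is a hypothetical helper, not part of the project.
patch_entrypoint() {
  dockerfile="$1"
  tmp="$dockerfile.tmp"
  # "\&" keeps a literal "&" in the sed replacement text.
  # A temporary file is used instead of "sed -i", whose syntax differs
  # between GNU and BSD sed.
  sed 's|^ENTRYPOINT .*|ENTRYPOINT /opt/docker/bin/main-task \&\& /opt/docker/bin/main-test-task|' \
    "$dockerfile" > "$tmp" && mv "$tmp" "$dockerfile"
}

# Usage, from the stage directory:
#   patch_entrypoint Dockerfile
```

The shell form of ENTRYPOINT is what makes the && chaining work, because Docker runs a shell-form entry point through /bin/sh -c.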

Current Limitations/problems

  • The Spark application currently runs as a standalone Spark application. A better setup would build a new Docker image based on Spark images for the Spark executors (such as the Docker-based Spark executors of the Mesos project) and launch the job with spark-submit in client or cluster mode.
  • Unit tests are missing
  • Integration test cases have low coverage and should be automated with a test framework
  • CI tool integration for builds and tests is missing, as is log server integration for analyzing the Spark and Docker Compose logs

Queries in integration tests

The integration tests check the following two cases for each table. The test task runs right after the migration completes, and the results are currently written to the console as Success/Error log lines prefixed with "******************".

  • The cardinality of the RDD data equals that of the Cassandra table, grouped by partition key
  • 1000 sampled records (with different partition keys, taken from the RDD) also exist in the Cassandra table
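
The logic of the two checks can be illustrated on plain text files. This is only a file-based stand-in, not the project's Spark code: it assumes each input file holds one partition key per row, and the function and file names are hypothetical.

```shell
# File-based stand-in for the two checks; neither Spark nor Cassandra is
# involved. Each input file holds one partition key per row.

# Check 1: both sides have the same number of rows for every partition key.
same_cardinality() {
  left=$(sort "$1" | uniq -c)    # per-key row counts of the first file
  right=$(sort "$2" | uniq -c)   # per-key row counts of the second file
  [ "$left" = "$right" ]
}

# Check 2: every sampled key also appears in the table export.
sample_exists() {
  # grep prints the sampled keys that have no exact match in the table file
  missing=$(grep -vxFf "$2" "$1")
  [ -z "$missing" ]
}

# Usage with hypothetical exports:
#   same_cardinality rdd_keys.txt table_keys.txt && echo "cardinality ok"
#   sample_exists sampled_keys.txt table_keys.txt && echo "sample ok"
```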

Sample outputs in console after executing ./run.sh:

******************Testing whether cardinality of RDD is same as CQL Table's cardinality in table:business
******************Success in:Testing whether cardinality of RDD is same as CQL Table's cardinality in table:business
******************Testing randomly sampled 1000 records existence in table:business
******************Success in: Testing randomly sampled 1000 records existence in table:business
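
Because every test line carries the asterisk prefix, the console output can be filtered to get a quick summary. This is a minimal sketch, assuming the line formats shown in the sample above; the helper name summarize_tests and the file name run.log are hypothetical.

```shell
# Count started and successful integration checks in console output read
# from stdin. summarize_tests is a hypothetical helper.
summarize_tests() {
  awk '
    /^\*+Testing/     { started++ }     # a check was started
    /^\*+Success in:/ { succeeded++ }   # a check reported success
    END { printf "started=%d succeeded=%d\n", started, succeeded }
  '
}

# Usage:
#   ./run.sh TAR_GZ_FILE_LOCATION 2>&1 | tee run.log | summarize_tests
```

If the two counts differ, at least one check did not report success, and the full log is the place to look for the corresponding Error line.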

