tiwarishubham635 / apache-spark

This project forked from yash-sethia/apache-spark

Managing distributed databases using the PySpark API of Apache Spark for customer data of an e-commerce website, and predicting the yearly amount spent using linear regression

Home Page: https://tiwarishubham635.github.io/Apache-Spark/

Jupyter Notebook 15.68% HTML 84.32%
apache-spark linear-regression pyspark

Ecommerce Sales Prediction using Apache Spark

Contributor(s): Yash Sethia, Ritesh Kumar, Shubham

About

Apache Spark, written in Scala, is a general-purpose distributed data processing engine. In other words, it loads big data, performs computations on it in a distributed way, and then stores the results.

Apache Spark contains libraries for data analysis, machine learning, graph analysis, and streaming live data. Spark is generally faster than Hadoop MapReduce, because MapReduce writes intermediate results to disk (that is, many I/O operations) whereas Spark tries to keep intermediate results in memory whenever possible. Moreover, Spark evaluates operations lazily: it maintains a series of transformations to be performed without actually performing them until we ask for the results, and it optimizes them just before computing the final result. This way, Spark can find the best path by looking at all the required transformations together (for example, reducing two separate steps of adding 5 and 20 to each element of the dataset into a single step of adding 25, or skipping operations on the part of the dataset that will eventually be filtered out of the final result).
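
The lazy-evaluation idea described above can be sketched in plain Python. This is a toy illustration only: `LazyDataset`, its methods, and the sample numbers are hypothetical names made up for this example, and the code does not reflect how Spark's optimizer is actually implemented. It records transformations without running them, fuses consecutive additions into one step, and skips the remaining steps for elements that a filter rejects.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""
    data: List[int]
    # Recorded plan of ("add", n) or ("filter", predicate) steps; nothing runs yet.
    plan: List[Tuple[str, object]] = field(default_factory=list)

    def add(self, n: int) -> "LazyDataset":
        # Transformation: only record the step and return a new dataset.
        return LazyDataset(self.data, self.plan + [("add", n)])

    def filter(self, predicate: Callable[[int], bool]) -> "LazyDataset":
        # Transformation: only record the step and return a new dataset.
        return LazyDataset(self.data, self.plan + [("filter", predicate)])

    def optimized_plan(self) -> List[Tuple[str, object]]:
        # Fuse consecutive additions, e.g. add 5 then add 20 becomes add 25.
        fused: List[Tuple[str, object]] = []
        for op, arg in self.plan:
            if op == "add" and fused and fused[-1][0] == "add":
                fused[-1] = ("add", fused[-1][1] + arg)
            else:
                fused.append((op, arg))
        return fused

    def collect(self) -> List[int]:
        # Action: only now is the (optimized) plan executed.
        plan = self.optimized_plan()
        result = []
        for x in self.data:
            keep = True
            for op, arg in plan:
                if op == "add":
                    x = x + arg
                elif not arg(x):
                    # Rejected element: skip all remaining steps for it.
                    keep = False
                    break
            if keep:
                result.append(x)
        return result


ds = LazyDataset([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).add(5).add(20)
fused = ds.optimized_plan()   # filter step followed by a single add-25 step
values = ds.collect()         # computation happens here
print(len(ds.plan), len(fused), values)
```

In PySpark, the corresponding distinction is between transformations, which only describe work, and actions, which trigger it.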

This makes Spark one of the most popular tools for big data analytics today. PySpark is the Python interface for Apache Spark. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.

In this Notebook, we have the data of an Ecommerce website with the following fields:

  • Email ID of the user
  • Address of the user
  • Average Session Length of the user
  • Time spent by the user on the app
  • Time spent by the user on the website
  • Length of the membership of the user
  • Yearly amount spent by the user

The database, however, is distributed. We therefore use the PySpark API of Apache Spark to handle this data, draw meaningful inferences from it, and try to predict the yearly amount spent by each user using linear regression.
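
For reference, a linear regression model for this prediction task can be written as below. The symbols are chosen here for illustration. The choice of features is an assumption: the four numeric fields are taken as predictors, since the email ID and address are not numeric. The notebook's actual feature set and any regularization settings are not stated in this README.

```latex
% Predicted yearly amount spent for one user (assumed feature set)
\hat{y} = \beta_0
        + \beta_1 \, x_{\text{session}}
        + \beta_2 \, x_{\text{app}}
        + \beta_3 \, x_{\text{web}}
        + \beta_4 \, x_{\text{membership}}

% Ordinary least squares: choose the coefficients that minimize
% the squared error over the n users in the training data
\min_{\beta_0, \dots, \beta_4} \; \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
```

Here y_i is the observed yearly amount spent by user i, and the x terms are that user's average session length, time on the app, time on the website, and length of membership.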

To run this on your system:

  • Clone this repository
  • In a terminal, navigate to the folder containing this repository and run the following command:
    jupyter notebook
  • This will open the base directory of this repository in your browser. Now, open the notebook file
  • Run all the cells of the notebook to see the results
