click-through-rate-prediction

Try Kaggle's Click Through Rate Prediction with Spark Pipeline API

The purpose of this Spark application is to test the Spark Pipeline API with real data for SPARK-13239. To do so, we applied the ML Pipeline API to Kaggle's click-through rate prediction contest.

Build & Run

You can build this Spark application with sbt clean assembly, and then run it with the following command.

$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.kaggle.ClickThroughRatePredictionWitLogisticRegression \
  /path/to/click-through-rate-prediction-assembly-1.0.jar \
  --train=/path/to/train \
  --test=/path/to/test \
  --result=/path/to/result.csv
  • --train: the training data you downloaded
  • --test: the test data you downloaded
  • --result: result file

Spark writes a DataFrame as a directory of part files, one per partition, rather than as a single file. This application therefore sets the number of partitions of the result DataFrame to 1, so that the result is aggregated into one file. You can find the result CSV file as part-00000 under the path you set with the --result option.

The Kaggle Contest

Predict whether a mobile ad will be clicked

In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.

https://www.kaggle.com/c/avazu-ctr-prediction

Approach

  1. Extract features from the categorical variables with OneHotEncoder and StringIndexer
  2. Train a model with LogisticRegression and CrossValidator
    • The Evaluator of CrossValidator is a BinaryClassificationEvaluator with its default settings.

We merged the training data with the test data in the feature extraction phase, because the test data includes values that do not exist in the training data. Fitting the feature extractors on the merged data avoids errors about unseen values in each variable when extracting features from the test data.
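The effect of merging can be illustrated with a minimal sketch in plain Python. This is not the application's Spark code: fit_index and one_hot are hypothetical helpers that only imitate the idea of indexing categories by frequency and then one-hot encoding them, and the column values are made up for illustration.

```python
from typing import Dict, Iterable, List


def fit_index(values: Iterable[str]) -> Dict[str, int]:
    """Map each distinct category to an integer index, most frequent first."""
    counts: Dict[str, int] = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    # Sort by descending frequency, then by name so that ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}


def one_hot(value: str, index: Dict[str, int]) -> List[int]:
    """Encode one category as a 0/1 vector; raises KeyError for unseen values."""
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec


# Hypothetical values of one categorical column.
train_values = ["banner", "banner", "video"]
test_values = ["banner", "popup"]  # "popup" never occurs in the training data

# An index fitted on the training data alone has no entry for "popup".
train_only_index = fit_index(train_values)
unseen = [v for v in test_values if v not in train_only_index]

# Fitting on the merged data gives every value in either set an index.
merged_index = fit_index(train_values + test_values)
encoded_test = [one_hot(v, merged_index) for v in test_values]
```

With the index fitted on the training values alone, encoding "popup" would raise a KeyError, which is the kind of failure the merge is meant to avoid. Only the feature values are shared here; the labels of the test data are not needed to build the index.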

Result

I got a score of 0.3998684 with the following parameter set.

  • Logistic Regression
    • featuresCol: features
    • fitIntercept: true
    • labelCol: label
    • maxIter: 100
    • predictionCol: prediction
    • probabilityCol: probability
    • rawPredictionCol: rawPrediction
    • regParam: 0.001
    • standardization: true
    • threshold: 0.22
    • tol: 1.0E-6
    • weightCol:
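
Of the parameters above, threshold is the one that turns a predicted probability into a class label. As a rough sketch in plain Python, assuming the rule that label 1 is predicted when the estimated probability of label 1 is strictly greater than the threshold (predict_labels and the probabilities are hypothetical, not taken from the application):

```python
from typing import List


def predict_labels(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Predict 1 where the probability of label 1 exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical predicted click probabilities.
probs = [0.10, 0.22, 0.30, 0.60]

default_labels = predict_labels(probs)                        # threshold 0.5
low_threshold_labels = predict_labels(probs, threshold=0.22)  # threshold from the list above
```

A lower threshold labels more impressions as clicks, which can matter when clicks are much rarer than non-clicks.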

TODO

We should offer more Evaluators, such as log-loss. spark.ml does not offer a logistic loss evaluator as of Spark 1.6, and we might get a better score with a log-loss evaluator.
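
For reference, the metric such an evaluator would compute can be sketched in a few lines of plain Python. This is only an illustration of binary logistic loss, not a spark.ml Evaluator; the function name log_loss and the clipping constant eps are assumptions made for this example.

```python
import math
from typing import Sequence


def log_loss(labels: Sequence[int], probs: Sequence[float], eps: float = 1e-15) -> float:
    """Mean binary logistic loss for 0/1 labels and predicted probabilities of label 1."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one example is required")
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

For example, predicting 0.5 for every example gives a loss of ln 2 regardless of the labels, and confident wrong predictions are penalized much more heavily than hesitant ones.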

Contributors

yu-iskw