
cricket-world-cup-2019's Introduction


How To Use The Project


Step 1: Install R Studio

Step 2: Download ODI Matches - Data

Step 3: Clean Data and Get Format ready

Step 4: Clone/Download the Repository

Step 5: Make necessary changes [e.g. add new match data to the WC_Train.csv file]

Step 6: Do the necessary exploratory data analysis (EDA)

Step 7: Run Random Forest Model

Step 8: Store Results in Random Forest Prediction.csv

Step 9: Run Logistic Regression Model

Step 10: Store Results in Logistic Regression Prediction.csv

Step 11: Run Compare Model Predict

Step 12: Store Models vs Actual Results in Compare Predict - RF vs. LR


Objective

To predict the ICC World Cup 2019 cricket matches, based on each team's past performance.


Approach


Data Collection

In this study, our approach is to predict ICC WC 2019 matches from the results of past ODI matches. We expect stronger teams such as Australia, India and New Zealand to perform better, and weaker teams such as Pakistan and West Indies to struggle. This is not our opinion: it is what our study of past ODI match data reveals about the strong and weak contenders for World Cup 2019.

Hence, we decided to study ODI matches played from 2007 to 2018. We collected the dataset from HowStats.

For data collection, we extracted ODI matches year by year [since 1987] and stored the dataset in Excel sheets. For the study itself, however, we considered only ODI matches played from 2007 to 2018, because we believe very old match results [such as those from the early 1990s] should not have a significant impact on team-wise performance at the 2019 WC. Hence, we decided to study only recent team-wise performance.


Data Cleaning

After extracting the data from HowStats, we stored the datasets in an Excel file, one sheet per year.

For cleaning, we used the 'Text to Columns' function very frequently [basically, a few Excel functions were enough to clean the entire dataset].

NOTE: Due to a lack of data for Afghanistan's matches, we decided to exclude Afghanistan from the study. [Had we included Afghanistan in the WC 2019 prediction study, the model would probably have shown Afghanistan losing every match, and could have become biased!]


Exploratory Data Analysis

For the WC 2019 match prediction study we decided to use data from 2007 to 2018. Many studies find that more data makes a model better, which is true. For the objective of this study, however, we deliberately limited the number of observations, because we feel that team performance in the early 1990s (and especially the players who had a significant impact on winning or losing a particular match) should not drive predictions for 2019. West Indies, for example, was a star-performing team, but for the last decade and longer it has barely been able to win consistently.

We also assume that the more matches a team plays, the greater its ODI experience, and that this contributes to the team's overall performance.

For the training dataset, we chose 983 observations, where most of the variables are factors.
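The experience assumption above can be checked by counting how often each team appears in the match list. The following is a minimal sketch on made-up rows; only the column names Trim.Team.A and Trim.Team.B are taken from the modelling code in this README, and the object name matches.played is hypothetical.

```r
# Toy match list. The rows are invented for illustration; only the
# column names follow the ones used elsewhere in this README.
matches <- data.frame(
  Trim.Team.A = c("India", "Australia", "India", "England"),
  Trim.Team.B = c("Australia", "England", "England", "India"),
  stringsAsFactors = FALSE
)

# A team's experience is the number of matches it appears in,
# regardless of whether it is listed as Team A or Team B.
matches.played <- table(c(matches$Trim.Team.A, matches$Trim.Team.B))

# Most experienced teams first.
matches.played <- sort(matches.played, decreasing = TRUE)
print(matches.played)
```

On the real data the same count could be restricted to 2007 to 2018 before tabulating.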

> dim(wc)   ## Dimensions of the dataset
> str(wc)   ## Structure of the dataset

Hence, before building the supervised learning models, we converted the factors into dummy variables. Exploring the data with the rpivotTable(wc) function, we found some interesting patterns.

As the pivot chart shows, over the last 2 years (2017 & 2018) England and India have delivered winning performances and are trending in the top positions.

Similarly, you can see that the 2011 World Cup final was between India and Sri Lanka. In that cluster of years Australia was the top contender for the final, so how did Sri Lanka reach it? India knocked Australia out in the 2nd quarter-final, and Sri Lanka faced New Zealand in the semi-final and won by 5 wickets.

Similarly, for World Cup 2015, the bar chart shows how New Zealand emerged from 2012 to 2014 and challenged Australia in the 2015 WC final.

In World Cup 2019, the strong contenders for the title are India, England, New Zealand and South Africa.


Build Random Forest Model

After successfully loading the dataset into R, we created the train variable for cricket matches from 2007 to 2018.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

wc = read.csv('WC_Train.csv')

## Data From 2007 World Cup till 2018 Cricket Matches

train = wc[which(wc$Year >= 2007 & wc$Year <=2018),]

For the supervised learning technique RF, we converted the Team A and Team B categorical variables into dummy variables.

## Create dummy variables for Team A and Team B (TRAIN)

Team.A.matrix = model.matrix(~ Trim.Team.A - 1, data = train)
train = data.frame(train, Team.A.matrix)

Team.B.matrix = model.matrix(~ Trim.Team.B - 1, data = train)
train = data.frame(train, Team.B.matrix)
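To make the effect of model.matrix() concrete, here is a small sketch on a made-up factor. The toy data frame is an assumption for illustration only; the formula is the same one used above.

```r
# Made-up Team A column with three teams.
toy <- data.frame(
  Trim.Team.A = factor(c("India", "England", "India", "Australia"))
)

# "- 1" drops the intercept, so every factor level gets its own
# 0/1 indicator column instead of one level being absorbed as baseline.
Team.A.matrix <- model.matrix(~ Trim.Team.A - 1, data = toy)

# Column names are the variable name followed by the level name.
print(colnames(Team.A.matrix))

# Append the indicator columns to the original data, as done for train.
toy <- data.frame(toy, Team.A.matrix)
print(toy)
```

Each row has exactly one indicator set to 1, the one for that row's team.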

As discussed earlier, the target variable in the study is Team.A.Won: it is '1' when the team listed as Team A won a particular match and '0' when Team A lost it. Here, '0' means that Team B won the match. With the randomForest() library function we built a random forest model on the train dataset. After tuning the model, we predicted results with type 'class' and type 'prob'.

print(wc.rf.tune)

test1$Team.A.Win = predict(wc.rf.tune, test1, type = 'class')
test1$Team.A.Score = predict(wc.rf.tune, test1, type = 'prob')

The results are stored in the Random Forest Prediction.csv file.
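The fitting step itself is not shown above, so the following is only a sketch of how a classification forest and its two kinds of prediction might look. It assumes the randomForest package is installed; the data, the predictors x1 and x2, and the ntree and mtry values are all made up and are not the settings used in this project.

```r
library(randomForest)  # assumed to be installed

set.seed(2019)

# Made-up training data: two hypothetical numeric predictors and a
# 0/1 outcome. The outcome must be a factor for classification.
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(
  Team.A.Won = factor(as.integer(x1 + rnorm(n) > 0)),
  x1 = x1,
  x2 = x2
)

# Fit the forest; ntree and mtry are illustrative values only.
fit <- randomForest(Team.A.Won ~ ., data = toy, ntree = 200, mtry = 1)
print(fit)  # includes the out-of-bag error estimate

# Hypothetical upcoming matches to score.
new.matches <- data.frame(x1 = c(-1, 0, 1), x2 = c(0.5, 0, -0.5))

# Predicted class per match (a factor with levels "0" and "1").
pred.class <- predict(fit, new.matches, type = "response")

# Matrix of class probabilities: one row per match, one column per class.
pred.prob <- predict(fit, new.matches, type = "prob")
print(pred.prob)
```

Because type = 'prob' returns one column per class, a probability matrix stored in a single data frame column is likely to be written to CSV as two columns, one per class.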


Random Forest Results

The random forest model had a high error rate, and even after tuning the model we were not able to reduce the error.

We were therefore not fully satisfied with the results, and decided to work on the supervised learning technique Logistic Regression to predict the ICC Cricket World Cup 2019 matches.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

After 26 June, match results are stored in the Random Forest Prediction after 25th June Matches.csv file.


Build Logistic Regression Model

Similarly, for Logistic Regression we created a train dataset of ODI matches from 2007 to 2018, and created dummy variables so that the target variable Team.A.Won could be modelled against all the independent variables.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

logit = Team.A.Won ~ .  # A few variables are not significant; however, because they represent teams, we decided to keep all variables.

logit.plot = glm(logit, data = train, family = binomial)

summary(logit.plot)

We also found that a few dummy variables among the independent variables are not significant for the study [like Bangladesh and West Indies]. In the end, we decided to keep the dummy variables for all teams in the study.

Based on the model logit.plot we predicted the matches in the test1 file for the 2019 World Cup, and stored the results in the Logistic Regression Prediction.csv file. We also evaluated the Logistic Regression model. However, we believe the correct evaluation of the model is the actual match result.
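The prediction step is described but not shown, so here is a minimal sketch of scoring new matches with a fitted glm(). The data and the single predictor strength are hypothetical; only the names Team.A.Won, predict.logit and the 0.5 cut-off echo this README.

```r
set.seed(1)

# Made-up training data with one hypothetical numeric predictor.
n <- 100
strength <- rnorm(n)
toy <- data.frame(
  Team.A.Won = rbinom(n, 1, plogis(strength)),
  strength = strength
)

# Same model form as above: outcome against all other columns.
fit <- glm(Team.A.Won ~ ., data = toy, family = binomial)

# Hypothetical upcoming matches to score.
new.matches <- data.frame(strength = c(-1, 0, 1))

# type = "response" returns probabilities that Team A wins;
# the default would return values on the log-odds scale instead.
predict.logit <- predict(fit, newdata = new.matches, type = "response")

# Turn probabilities into a 0/1 call with a 0.5 cut-off.
Team.A.Win <- as.integer(predict.logit > 0.5)

print(data.frame(new.matches, predict.logit, Team.A.Win))
```

Keeping the probability next to the 0/1 call makes it easy to spot matches the model considers close.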


Logistic Regression Results

To evaluate the model we plotted the ROC curve and calculated the accuracy of the predicted results.

## Model Evaluation 

m3.matrix = confusion.matrix(test1$Team.A.Win, predict.logit, threshold = 0.5)
m3.matrix

library(pROC)
m3.roc = roc(test1$Team.A.Win, predict.logit)
m3.roc
plot(m3.roc)

## ON RESULT RATIOS DATA SET
accuracy.logit<-sum(diag(m3.matrix))/sum(m3.matrix)
accuracy.logit
[1] 0.7567568 

As shown, the model accuracy is about 75.7%; the predicted results for the WC 2019 matches are in the Logistic Regression Prediction.csv file.

NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.

After 26 June, match results are stored in the Logistic Regression Prediction after 25th June Matches.csv file.
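The accuracy formula sum(diag(m)) / sum(m) is the share of matches on the diagonal of the confusion matrix, that is, the matches where the predicted and actual outcomes agree. The sketch below reproduces the calculation in base R on made-up values; it uses table() rather than confusion.matrix() and is not the project's data.

```r
# Made-up actual outcomes and predicted win probabilities.
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3)

# Apply the same 0.5 threshold used in the evaluation above.
pred.class <- as.integer(predicted > 0.5)

# Cross-tabulate actual vs predicted. Fixing the factor levels keeps
# the table 2 x 2 even if one class is never predicted.
conf <- table(
  Actual    = factor(actual, levels = c(0, 1)),
  Predicted = factor(pred.class, levels = c(0, 1))
)
print(conf)

# Diagonal cells are correct predictions (true negatives and true positives).
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)  # 6 of the 8 toy matches are called correctly: 0.75
```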


Compare Model Performance

With the two supervised learning techniques we built models that can predict the outcome of WC 2019 matches before the actual match starts, and we compared the model results against the actual match results.

Hence, we loaded the results of both models, RF and LR, into Compare Predict - RF vs. LR

colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Win'] = 'RF Team.A.Win'
colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Win.1'] = 'LR Team.A.Win'

colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Score.1'] = 'Prob % RF Team.A.Win'
colnames(ComparePredict)[colnames(ComparePredict) == 'predict.logit'] = 'Prob % LR Team.A.Win'

In the same .csv file we also manually entered the actual match results.

Update (25/06/2019)

  • RF predicted 15 matches correctly out of 23
  • LR predicted 15 matches correctly out of 23

Note: Afghanistan's matches and matches abandoned due to rain are not included in the result score.

However, a few matches were very close calls in terms of each team's % probability of winning.
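The correct-match counts above can be computed directly once the actual results sit next to the predictions. This is a minimal sketch on made-up rows; the column names are simplified, hypothetical versions of the ones in the comparison file, and NA is assumed to mark an abandoned match.

```r
# Made-up comparison table; NA in the actual column marks a match
# abandoned due to rain, which is left out of the score.
ComparePredict <- data.frame(
  RF.Team.A.Win     = c(1, 0, 1, 1, 0),
  LR.Team.A.Win     = c(1, 1, 1, 0, 0),
  Actual.Team.A.Win = c(1, 0, NA, 1, 0)
)

# Keep only matches that produced a result.
played <- ComparePredict[!is.na(ComparePredict$Actual.Team.A.Win), ]

# Count the matches each model called correctly.
rf.correct <- sum(played$RF.Team.A.Win == played$Actual.Team.A.Win)
lr.correct <- sum(played$LR.Team.A.Win == played$Actual.Team.A.Win)

cat("RF:", rf.correct, "of", nrow(played), "\n")
cat("LR:", lr.correct, "of", nrow(played), "\n")
```

Filtering on the probability columns, for example values near 50%, would list the close calls.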


LICENSE

This Project/Repository is licensed under the MIT license.


Acknowledgement

This Project/Repository is part of Great Learning - Cricket World Cup Challenge.

Contributors

rutvijbhutaiya
