Football

This program ranks the best football players based on an analysis of matches (fixtures) downloaded from the Sportmonks API.

What counts as a "good player"? Football is a team sport, so a player's quality is defined here as how much he increases the probability that his team wins.

In other words, this program finds the players who increase their team's probability of winning the most.

Installation

Install the required packages from the requirements.txt file using pip3 (pip3 install -r requirements.txt).
Create an account on Sportmonks.com and put your API token into the api_token.txt file.

Usage

Before you run the program, you need a subscription on Sportmonks.com and an API token so that the program can connect to the Sportmonks API. You also need Python 3 installed.

To run the program, execute (from the project root folder):

python3 football/main.py

How does the algorithm work

General idea

The program uses linear regression to accomplish this task.

If you're not familiar with linear regression, read about it first. Either of these links gives a simple explanation (choose whichever suits you better):

https://medium.com/simple-ai/linear-regression-intro-to-machine-learning-6-6e320dbdaf06

https://www.spss-tutorials.com/multiple-linear-regression/

The algorithm is trained to predict the result of a match (y), taking the players who played in the match as input (x).

Let's suppose that:

y - a real number representing the margin by which the local team beat the visitor team in the given match. So y > 0 if the local team won, y = 0 in case of a draw, and y < 0 if the visitor team won.
x[i] = 1 if the player with id = i (every player has an id assigned) played in the local team, x[i] = -1 if the player with id = i played in the visitor team, and x[i] = 0 if the player with id = i didn't play in the match.
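
The encoding of x described above can be sketched in a few lines of Python. This is only an illustration: the function name encode_lineup, its parameters, and the player ids are made up for this example and are not taken from the project's code.

```python
def encode_lineup(local_ids, visitor_ids, n_players):
    """Return x as a list where x[i - 1] holds the entry for player id i.

    Hypothetical helper: ids are assumed to run from 1 to n_players.
    """
    x = [0] * n_players            # 0 = did not play
    for pid in local_ids:
        x[pid - 1] = 1             # played in the local team
    for pid in visitor_ids:
        x[pid - 1] = -1            # played in the visitor team
    return x


x = encode_lineup(local_ids=[2, 3, 4], visitor_ids=[7, 8, 9], n_players=10)
print(x)  # [0, 1, 1, 1, 0, 0, -1, -1, -1, 0]
```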

If we have a model like this:

y = x[1] * w[1] + x[2] * w[2] + ... + x[n] * w[n]

and we train w (the weights) to give correct predictions of y (the result of the match converted to a single real number) based on x (who played in the match and for which team), then after training, the better the player with id = i is, the higher w[i] will be (according to the given input data).
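
To make the training step concrete, here is a minimal sketch that fits w by gradient descent on the weighted squared error. It assumes nothing about how the project itself fits the regression: the function names, the learning rate, the number of epochs, and the three toy matches are all invented for illustration.

```python
def predict(x, w):
    """Model output: y = x[1] * w[1] + ... + x[n] * w[n]."""
    return sum(xi * wi for xi, wi in zip(x, w))


def fit_weights(samples, n_players, learning_rate=0.05, epochs=2000):
    """Fit w on samples of the form (x, y, sample_weight).

    Plain gradient descent on the weighted mean squared error;
    a stand-in for whatever solver the real program uses.
    """
    w = [0.0] * n_players
    total_weight = sum(sw for _, _, sw in samples)
    for _ in range(epochs):
        grad = [0.0] * n_players
        for x, y, sw in samples:
            error = predict(x, w) - y
            for i, xi in enumerate(x):
                if xi:
                    grad[i] += 2 * sw * error * xi / total_weight
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    return w


# Toy data (made up): three one-player "teams" playing each other.
samples = [
    ([1, -1, 0], 2.0, 1.0),   # player 1 beat player 2 by 2
    ([0, 1, -1], 1.0, 1.0),   # player 2 beat player 3 by 1
    ([1, 0, -1], 3.0, 1.0),   # player 1 beat player 3 by 3
]
w = fit_weights(samples, n_players=3)
ranking = sorted(range(1, 4), key=lambda pid: w[pid - 1], reverse=True)
print(ranking)  # [1, 2, 3]
```

In this toy data only the differences between weights are determined by the matches, which is enough for a ranking: the player with the highest w[i] comes first.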

Why?

We want to learn how the presence of a player in a team affects the result of the match. If the player with id = i plays in the local team, then x[i] = 1, so he contributes w[i] to the predicted result (y): the better he is, the more he pushes y up. If he doesn't play, then x[i] = 0 and he has no influence on y. If he plays in the visitor team, then x[i] = -1, so he contributes -w[i]: the better he is, the more he pushes y down. This way, after training on many matches, w[i] will be high if he's good and low if he's weak.

The above model wouldn't be ideal for predicting match results, because linear regression can't represent interactions between players, for example that two players are especially strong together (such as Xavi and Iniesta). The model assumes that the result of the match is the sum of the individual strengths of all players. So if we wanted to predict matches, a neural network would be a better choice, because it can represent that two players are good together (and we could feed it more input variables than just the players). But if we want to learn how individual players influence the result, this model fits the task well.

Data set

The algorithm analyzes football matches. For each match, it is given the following information:

Players playing from the first minute in the local team.
Players playing from the first minute in the visitor team.
Result of the match.
Match length (it's sometimes longer than 90 minutes due to extra time).
Substitutions made (who came in, who came out, in which minute).
Goals (which team scored, in which minute).
The date of the match.

Substitutions and goals

In the "General idea" section, I explained how the algorithm uses the first three pieces of information (the players starting in the local team, the players starting in the visitor team, and the result of the match). But we also have the following data about every match:

...

Match length (it's sometimes longer than 90 minutes due to extra time).
Substitutions made (who came in, who came out, in which minute).
Goals (which team scored, in which minute).

To make use of this information, every match is divided into smaller matches (smaller data samples), with the minutes at which substitutions were made as delimiters.

In each sample, x[i] is set as described in "General idea", according to whether and for which team the player with id = i played during the period that the sample represents.

The y variable is set to the result of the match counting only the goals scored in the period that the data sample represents. y is then multiplied by (match_length / sample_time), because it represents how decisively the local team beat the visitor team, and scoring the same number of goals in a short time is a more decisive win than scoring them over a long time. The sample weight of the data sample equals the length of the period divided by the match length.

The above explanation may not be clear, so here is an example.

Let's suppose that the data is like this:

Local team players: 2, 3, 4.
Visitor team players: 7, 8, 9.
The result of the match: 2:1 (local team won).
Substitutions:

a) Player 4 was replaced by player 5 in the 30th minute.

b) Player 7 was replaced by player 6 in the 60th minute.

Goals:

a) The first goal was scored in the 20th minute by the visitor team.

b) The second goal was scored in the 70th minute by the local team.

c) The third goal was scored in the 80th minute by the local team.

We divide this match into three data samples:

The first one represents the period from minute 0 to minute 30 (the time of the first substitution).
The second one represents the period from minute 30 to minute 60 (the time of the second substitution).
The third one represents the period from minute 60 to minute 90.

The first data sample will be like this:

x = [0, 1, 1, 1, 0, 0, -1, -1, -1, 0] (indexed from 1) (because these players played from minute 0 to minute 30)

y = (-1) * (90 / 30) = -3 (the result was -1, but it was in only 30 minutes, so we multiply it by 3)

sample_weight = 30 / 90 = 1/3 (it is only 1/3 of the match so importance of this sample is 1/3)

The second data sample will be like this:

x = [0, 1, 1, 0, 1, 0, -1, -1, -1, 0] (indexed from 1) (the player 4 was substituted by player 5 in the local team)

y = 0 * (90 / 30) = 0 (the result is 0 because there were no goals in this part of match)

sample_weight = 30 / 90 = 1/3 (it is only 1/3 of the match so importance of this sample is 1/3)

The third data sample will be like this:

x = [0, 1, 1, 0, 1, -1, 0, -1, -1, 0] (indexed from 1) (the player 7 was substituted by player 6 in the visitor team)

y = 2 * (90 / 30) = 6 (the local team won 2:0 in this part, so the result is 2 multiplied by 3 because it's only 1/3 of the match)

sample_weight = 30 / 90 = 1/3 (it is only 1/3 of the match so importance of this sample is 1/3)
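
The splitting procedure can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the project's implementation: the function name split_match, the tuple layouts for substitutions and goals, and the rule that a goal scored exactly in a substitution minute counts toward the period ending at that minute are all choices made for this example.

```python
def split_match(local, visitor, substitutions, goals, match_length, n_players):
    """Split one match into (x, y, sample_weight) samples.

    substitutions: list of (minute, player_out, player_in)
    goals: list of (minute, team), team being "local" or "visitor"
    """
    local, visitor = set(local), set(visitor)
    boundaries = sorted({minute for minute, _, _ in substitutions} | {match_length})
    samples = []
    start = 0
    for end in boundaries:
        duration = end - start
        if duration > 0:
            x = [0] * n_players
            for pid in local:
                x[pid - 1] = 1
            for pid in visitor:
                x[pid - 1] = -1
            # Goal difference inside this period only (assumed: start < minute <= end).
            diff = sum(1 if team == "local" else -1
                       for minute, team in goals if start < minute <= end)
            y = diff * match_length / duration      # scale up to a full match
            samples.append((x, y, duration / match_length))
        # Apply the substitutions that close this period.
        for minute, player_out, player_in in substitutions:
            if minute == end:
                team = local if player_out in local else visitor
                team.discard(player_out)
                team.add(player_in)
        start = end
    return samples


samples = split_match(
    local=[2, 3, 4], visitor=[7, 8, 9],
    substitutions=[(30, 4, 5), (60, 7, 6)],
    goals=[(20, "visitor"), (70, "local"), (80, "local")],
    match_length=90, n_players=10,
)
print([y for _, y, _ in samples])  # [-3.0, 0.0, 6.0]
```

The third value is positive because both goals after the 60th minute were scored by the local team.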

Date of the match

The date of the match only affects the sample weight. If we want to know who the best player is right now, a match played yesterday matters more than a match played 10 years ago. Cristiano Ronaldo was not as good 10 years ago as he is now, so to estimate how good he is now, yesterday's match should count for more than a match from 10 years ago.
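
The exact weighting formula is not given here, so the following is only one possible sketch: an exponential decay with a one-year half-life, multiplied by the period-length weight from the previous section. The decay shape, the half-life value, the multiplication, and the function names are all assumptions made for illustration.

```python
from datetime import date


def recency_weight(match_date, today, half_life_days=365.0):
    """Assumed decay: the weight halves every half_life_days."""
    age_days = (today - match_date).days
    return 0.5 ** (age_days / half_life_days)


def sample_weight(period_minutes, match_length, match_date, today):
    """Assumed combination: period share of the match times recency."""
    return (period_minutes / match_length) * recency_weight(match_date, today)


today = date(2024, 1, 1)
recent = sample_weight(30, 90, date(2023, 12, 31), today)   # yesterday
old = sample_weight(30, 90, date(2014, 1, 1), today)        # 10 years ago
print(recent > old)  # True
```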
