In this notebook, we walk through a basic workflow for participating in a Kaggle competition.
Specifically, we will cover:
- Training a basic model on Kaggle training data.
- Handling missing values.
- Generating predictions for Kaggle test data.
- Saving predictions to a .csv file for submission.
The Kaggle competition we will be working on is Spaceship Titanic. If you do not have a Kaggle account yet, you will need to create one to participate.
Please begin by reviewing the material on the Kaggle competition before following the instructions below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
Begin by reading in the training data.
df = pd.read_csv('data/train.csv')
df.head()
 | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
df.isna().sum()
PassengerId 0
HomePlanet 157
CryoSleep 170
Cabin 153
Destination 150
Age 139
VIP 146
RoomService 142
FoodCourt 140
ShoppingMall 160
Spa 142
VRDeck 128
Name 155
Transported 0
dtype: int64
We handle missing values after the train-test split, so that the imputation values are learned from the training data only.
The target variable is Transported. Separate the target from the features and perform the train-test split.
model_1_df = df.copy()
# Target
y_1 = model_1_df['Transported']
# Single Feature
X_1 = model_1_df[['Spa']]
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, random_state=42)
# Replace missing values with the median
imputer = SimpleImputer(strategy='median')
# Fit imputer to the independent variable
# using only the training data
imputer.fit(X_train)
# Replace missing values in the training and test data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
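The split-then-impute steps above can also be bundled into a single estimator, which makes it harder to fit the imputer on test rows by accident. The following is a minimal sketch, assuming scikit-learn's `make_pipeline` and a synthetic stand-in for the Spa column. The data and the names `X_demo`, `y_demo`, and `pipe` are illustrative and are not part of the competition files.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 'Spa' feature and 'Transported' target (assumption)
rng = np.random.default_rng(42)
spa = rng.exponential(scale=300.0, size=200)
target = spa < 150.0  # hypothetical relationship, for illustration only
spa[rng.choice(200, size=20, replace=False)] = np.nan  # inject missing values

X_demo = pd.DataFrame({'Spa': spa})
y_demo = pd.Series(target, name='Transported')

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The imputer is fitted inside the pipeline, on the training rows only
pipe = make_pipeline(SimpleImputer(strategy='median'), LogisticRegression())
pipe.fit(X_tr, y_tr)

# The same training median is reused when predicting on held-out rows
demo_preds = pipe.predict(X_te)
learned_median = pipe.named_steps['simpleimputer'].statistics_[0]
print('Median learned from training rows:', learned_median)
```

Calling `pipe.predict` on any new dataframe with a Spa column applies the stored training median before the model sees the data.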
Fit a basic Logistic Regression model
model_1 = LogisticRegression()
model_1.fit(X_train, y_train)
LogisticRegression()
Evaluate model performance.
train_preds = model_1.predict(X_train)
test_preds = model_1.predict(X_test)
train_score = accuracy_score(y_train, train_preds)
test_score = accuracy_score(y_test, test_preds)
print('Train score:', train_score)
print('Test score:', test_score)
Train score: 0.6308029487945805
Test score: 0.6266427718040621
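An accuracy score is easier to interpret next to a simple baseline. As a rough sketch, assuming scikit-learn's `DummyClassifier` and a small made-up label array (`X_toy` and `y_toy` are illustrative, not competition data), a majority-class baseline can be scored like this:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: three True, two False (made up for illustration)
y_toy = np.array([True, True, True, False, False])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_score = accuracy_score(y_toy, baseline.predict(X_toy))
print('Baseline accuracy:', baseline_score)
```

Fitting the same baseline on `X_train` and `y_train` would give a reference point for the scores printed above.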
Kaggle competitions will always provide you with a test dataset that contains all of the independent variables in the training data, but does not contain the target column.
The idea is to build a model on the training data so that it predicts accurately when the target value is unknown.
Import the testing data.
test_df = pd.read_csv('data/test.csv')
test_df.head(3)
 | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6189_01 | Earth | False | G/1004/P | TRAPPIST-1e | 3.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Eulah Garnes |
1 | 6354_01 | Earth | False | F/1315/P | TRAPPIST-1e | 48.0 | False | 410.0 | 2108.0 | 0.0 | 0.0 | 0.0 | Megany Carreralend |
2 | 1704_02 | Mars | False | D/60/S | 55 Cancri e | 18.0 | False | 86.0 | 1164.0 | 516.0 | 0.0 | 0.0 | Allota Fincy |
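One quick way to confirm that the test data differs from the training data only by the missing target is to compare column names. This is a small sketch on toy dataframes (`train_toy` and `test_toy` are made-up stand-ins). The same comparison could be run on `df` and `test_df`.

```python
import pandas as pd

# Toy stand-ins for the training and test dataframes (illustrative values)
train_toy = pd.DataFrame({
    'PassengerId': ['0001_01'],
    'Spa': [0.0],
    'Transported': [False],
})
test_toy = pd.DataFrame({
    'PassengerId': ['0013_01'],
    'Spa': [12.0],
})

# Columns present in training data but absent from test data
missing_in_test = [c for c in train_toy.columns if c not in test_toy.columns]
print(missing_in_test)
```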
Repeat the same preprocessing steps as before.
test_X = test_df[['Spa']]
# Impute using fitted imputer
test_X = imputer.transform(test_X)
Create final predictions
final_preds = model_1.predict(test_X)
Save predictions
The Kaggle competition provides the following instructions for submitting predictions:
Your submission should be in the form of a csv file with two columns:
- PassengerId
- Transported
The PassengerId column should be the PassengerId column found in the predictors dataset.
For example, if I were to submit a csv of predictions where I predict the mean for every observation, the first three rows of the submission would look like this:
PassengerId | Transported |
---|---|
0013_01 | True |
0018_01 | False |
0019_01 | True |
It is recommended that you save your predictions to csv using the DataFrame.to_csv method and that you read the saved file back into a notebook to make sure the file is structured as intended.
The easiest way to build the submission is to add the predictions to the original dataframe and then isolate the columns we want.
# Add predictions to the test dataframe
test_df['Transported'] = final_preds
# Isolate the columns we want in our submission
submission_df = test_df[['PassengerId', 'Transported']]
Check the shape. The shape of our submission must be (2000, 2).
submission_df.shape
(2000, 2)
Now we just need to save the submission to a .csv file.
In this case, you should set index=False.
submission_df.to_csv('sample_submission.csv', index=False)
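To check that a saved file is structured as intended, one option is to read it back and inspect it. The sketch below uses a three-row toy submission and a temporary directory, so the file names and values are illustrative rather than the real submission. It also shows what happens when the index is written by default.

```python
import os
import tempfile

import pandas as pd

# Toy submission with the two required columns (illustrative values)
toy_submission = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01', '0019_01'],
    'Transported': [True, False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, 'toy_submission.csv')
    bad_path = os.path.join(tmp_dir, 'toy_submission_with_index.csv')

    # index=False keeps the row index out of the file
    toy_submission.to_csv(good_path, index=False)
    # Default behaviour writes the index as an extra, unnamed column
    toy_submission.to_csv(bad_path)

    # Read IDs as strings so their formatting is preserved
    check_df = pd.read_csv(good_path, dtype={'PassengerId': str})
    with_index_df = pd.read_csv(bad_path, dtype={'PassengerId': str})

print(list(check_df.columns), check_df.shape)
print(list(with_index_df.columns), with_index_df.shape)
```

The file written without `index=False` gains a third column, which would not match the two-column format the competition asks for.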
Once you have saved your predictions to a csv file, you can submit them here