Kaggle Starter Code

In this notebook, we walk through a basic workflow for participating in a Kaggle competition.

Specifically, we will cover:

Training a basic model on Kaggle training data.
Handling missing values.
Generating predictions for Kaggle test data.
Saving predictions to a .csv file for submission.

The Kaggle competition we will be completing is the Spaceship Titanic. If you do not have a Kaggle account yet, you will need to create one to participate.

Please begin by reviewing the material on the Kaggle competition before following the instructions below.

Develop a model

Import Packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

Begin by reading in the training data.

df = pd.read_csv('data/train.csv')
df.head()


	PassengerId	HomePlanet	CryoSleep	Cabin	Destination	Age	VIP	RoomService	FoodCourt	ShoppingMall	Spa	VRDeck	Name	Transported
0	0001_01	Europa	False	B/0/P	TRAPPIST-1e	39.0	False	0.0	0.0	0.0	0.0	0.0	Maham Ofracculy	False
1	0002_01	Earth	False	F/0/S	TRAPPIST-1e	24.0	False	109.0	9.0	25.0	549.0	44.0	Juanna Vines	True
2	0003_01	Europa	False	A/0/S	TRAPPIST-1e	58.0	True	43.0	3576.0	0.0	6715.0	49.0	Altark Susent	False
3	0003_02	Europa	False	A/0/S	TRAPPIST-1e	33.0	False	0.0	1283.0	371.0	3329.0	193.0	Solam Susent	False
4	0004_01	Earth	False	F/1/S	TRAPPIST-1e	16.0	False	303.0	70.0	151.0	565.0	2.0	Willy Santantines	True

df.isna().sum()

PassengerId       0
HomePlanet      157
CryoSleep       170
Cabin           153
Destination     150
Age             139
VIP             146
RoomService     142
FoodCourt       140
ShoppingMall    160
Spa             142
VRDeck          128
Name            155
Transported       0
dtype: int64
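Raw counts are easier to compare across columns as a share of rows. The following is a minimal sketch on a made-up toy frame (`toy_df` and its values are assumptions for illustration, not the competition data); the same expression can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration)
toy_df = pd.DataFrame({
    'Spa': [0.0, 549.0, np.nan, 3329.0],
    'Age': [39.0, np.nan, np.nan, 33.0],
    'Transported': [False, True, False, False],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```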

We handle missing values after the train-test split, so that the imputer is fit on the training rows only.

Preprocessing

The target variable is Transported. Separate the target from the features and perform a train-test split.

model_1_df = df.copy()

# Target
y_1 = model_1_df['Transported']

# Single Feature
X_1 = model_1_df[['Spa']]

X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, random_state=42)

# Replace missing values with the median
imputer = SimpleImputer(strategy='median')
# Fit imputer to the independent variable
# using only the training data
imputer.fit(X_train)
# Replace missing values in the training and test data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
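One possible alternative to calling the imputer by hand is to chain it with the model in a scikit-learn pipeline, which keeps the "fit on training data only" rule in one place. This is a sketch on synthetic data, not the notebook's method: `X_toy`, `y_toy`, and the relationship between the feature and the target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the 'Spa' feature and the target
rng = np.random.default_rng(0)
spa = rng.exponential(scale=300.0, size=200)
target = spa < 200.0          # made-up relationship, for illustration only
spa[::10] = np.nan            # knock out every 10th value

X_toy = pd.DataFrame({'Spa': spa})
y_toy = pd.Series(target, name='Transported')

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

# fit() learns the median from X_tr only; predict/score reuse it on X_te
pipe = make_pipeline(SimpleImputer(strategy='median'), LogisticRegression())
pipe.fit(X_tr, y_tr)

learned_median = pipe.named_steps['simpleimputer'].statistics_[0]
print('Median learned from training rows:', learned_median)
print('Held-out accuracy:', pipe.score(X_te, y_te))
```

With a pipeline, the same fitted object can later be applied to the competition test features without repeating the imputation step.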

Fit a basic Logistic Regression model

model_1 = LogisticRegression()
model_1.fit(X_train, y_train)

LogisticRegression()

Evaluate model performance.

train_preds = model_1.predict(X_train)
test_preds = model_1.predict(X_test)

train_score = accuracy_score(y_train, train_preds)
test_score = accuracy_score(y_test, test_preds)

print('Train score:', train_score)
print('Test score:', test_score)

Train score: 0.6308029487945805
Test score: 0.6266427718040621
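Accuracy scores like these are easier to judge next to a baseline that always predicts the most common class. A minimal sketch using scikit-learn's `DummyClassifier` follows; the labels are made up for illustration, and in the notebook `y_train` and `y_test` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels standing in for y_train / y_test
y_train_toy = np.array([True, True, True, False, False, True])
y_test_toy = np.array([True, False, True, False])

# The features are ignored by this strategy, so placeholders are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_toy, y_train_toy)

baseline_score = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print('Baseline test score:', baseline_score)
```

If a model's test score is not clearly above this baseline, the chosen features are adding little.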

Create submission predictions

Kaggle competitions will always provide you with a test dataset that contains all of the independent variables in the training data, but does not contain the target column.

The idea is that you want to build a model using the training data so it can predict accurately when we do not know the target value.

Import testing data

test_df = pd.read_csv('data/test.csv')
test_df.head(3)


	PassengerId	HomePlanet	CryoSleep	Cabin	Destination	Age	VIP	RoomService	FoodCourt	ShoppingMall	Name
0	6189_01	Earth	False	G/1004/P	TRAPPIST-1e	3.0	False	0.0	0.0	0.0	Eulah Garnes
1	6354_01	Earth	False	F/1315/P	TRAPPIST-1e	48.0	False	410.0	2108.0	0.0	Megany Carreralend
2	1704_02	Mars	False	D/60/S	55 Cancri e	18.0	False	86.0	1164.0	516.0	Allota Fincy

Repeat the same preprocessing steps as before

test_X = test_df[['Spa']]

# Impute using fitted imputer
test_X = imputer.transform(test_X)

Create final predictions

final_preds = model_1.predict(test_X)

Save predictions

The Kaggle competition provides the following instructions for submitting predictions:

Your submission should be in the form of a csv file with two columns.

PassengerId
Transported

The PassengerId column should be the PassengerId column found in the predictors dataset.

For example, if I were to submit a csv of predictions where I predict the mean for every observation, the first three rows of the submission would look like this:

PassengerId	Transported
0013_01	True
0018_01	False
0019_01	True

It is recommended that you save your predictions to csv using the DataFrame's to_csv method and that you import the saved file into a notebook to make sure the file is structured as intended.

The easiest way to build the submission is to add the predictions to the original dataframe and then isolate the columns we want.

# Add predictions to the test dataframe
test_df['Transported'] = final_preds
# Isolate the columns we want in our submission
submission_df = test_df[['PassengerId', 'Transported']]

Check the shape. The shape of our submission must be (2000, 2).

submission_df.shape

(2000, 2)

Now we just need to save the submission to a .csv file.

In this case, you should set index=False so that the dataframe index is not written to the file as an extra column.

submission_df.to_csv('sample_submission.csv', index=False)
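The recommendation above to re-import the saved file can be sketched as follows. The frame `toy_submission` and the temporary file path are hypothetical stand-ins so the example runs on its own; in the notebook you would read back 'sample_submission.csv' instead.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for submission_df (IDs taken from the example above)
toy_submission = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01', '0019_01'],
    'Transported': [True, False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_submission.csv')
    toy_submission.to_csv(path, index=False)
    # Read PassengerId explicitly as text to keep it exactly as written
    check_df = pd.read_csv(path, dtype={'PassengerId': str})

# Two columns, same number of rows, and only boolean predictions
assert list(check_df.columns) == ['PassengerId', 'Transported']
assert check_df.shape == toy_submission.shape
assert check_df['Transported'].isin([True, False]).all()
print(check_df.head())
```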

Submit Predictions

Once you have saved your predictions to a csv file, you can submit them on the competition's Kaggle page.

coopam / bsc-ds-2022-spaceship-titanic
