
bsc-ds-2022-spaceship-titanic's Introduction

Kaggle Starter Code

In this notebook, we walk through a basic workflow for participating in a Kaggle competition.

Specifically, we will cover:

  • Training a basic model on Kaggle training data.
  • Handling missing values.
  • Generating predictions for Kaggle test data.
  • Saving predictions to a .csv file for submission.

The Kaggle competition we will be working on is Spaceship Titanic. If you do not have a Kaggle account yet, you will need to create one to participate.

Please begin by reviewing the material on the Kaggle competition before following the instructions below.

Develop a model

Import Packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

Begin by reading in the training data.

df = pd.read_csv('data/train.csv')
df.head()
   PassengerId  HomePlanet  CryoSleep  Cabin  Destination   Age    VIP  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name  Transported
0      0001_01      Europa      False  B/0/P  TRAPPIST-1e  39.0  False          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy        False
1      0002_01       Earth      False  F/0/S  TRAPPIST-1e  24.0  False        109.0        9.0          25.0   549.0    44.0       Juanna Vines         True
2      0003_01      Europa      False  A/0/S  TRAPPIST-1e  58.0   True         43.0     3576.0           0.0  6715.0    49.0      Altark Susent        False
3      0003_02      Europa      False  A/0/S  TRAPPIST-1e  33.0  False          0.0     1283.0         371.0  3329.0   193.0       Solam Susent        False
4      0004_01       Earth      False  F/1/S  TRAPPIST-1e  16.0  False        303.0       70.0         151.0   565.0     2.0  Willy Santantines         True
df.isna().sum()
PassengerId       0
HomePlanet      157
CryoSleep       170
Cabin           153
Destination     150
Age             139
VIP             146
RoomService     142
FoodCourt       140
ShoppingMall    160
Spa             142
VRDeck          128
Name            155
Transported       0
dtype: int64
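
Raw counts are easier to judge as a share of rows. As a minimal sketch, the snippet below computes the fraction of missing values per column on a tiny made-up dataframe (`toy_df` and its values are illustrative, not the competition data):

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe standing in for the training data.
toy_df = pd.DataFrame({
    'Spa': [0.0, 549.0, np.nan, 3329.0],
    'Age': [39.0, np.nan, np.nan, 33.0],
    'PassengerId': ['0001_01', '0002_01', '0003_01', '0003_02'],
})

# isna() gives a boolean frame; its column means are the missing fractions.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `df` would report the share of missing values in each column.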

Handle missing values after the train-test split, so that the imputation statistics are computed from the training data only.

Preprocessing

The target variable is Transported. Separate the target from the features and perform a train-test split.

model_1_df = df.copy()

# Target
y_1 = model_1_df['Transported']

# Single Feature
X_1 = model_1_df[['Spa']]

X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, random_state=42)
# Replace missing values with the median
imputer = SimpleImputer(strategy='median')
# Fit the imputer to the independent variable
# using only the training data
imputer.fit(X_train)
# Replace missing values in the training and test data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
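
The fit-on-train, transform-everything pattern above can also be bundled with the model in a scikit-learn Pipeline. The sketch below is an alternative arrangement, not the notebook's own method, and it runs on a synthetic `Spa` column with made-up values rather than the competition data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the competition data (values are made up).
rng = np.random.default_rng(0)
spa = rng.exponential(scale=300.0, size=200)
transported = spa < 150               # toy rule so the feature carries signal
spa[rng.random(200) < 0.1] = np.nan   # blank out roughly 10% of the values

demo_df = pd.DataFrame({'Spa': spa, 'Transported': transported})
X_demo = demo_df[['Spa']]
y_demo = demo_df['Transported']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# fit() learns the median from X_tr only, then trains the model on the
# imputed training data; predict() reuses that stored median.
pipe = make_pipeline(SimpleImputer(strategy='median'), LogisticRegression())
pipe.fit(X_tr, y_tr)

learned_median = pipe.named_steps['simpleimputer'].statistics_[0]
demo_test_preds = pipe.predict(X_te)
```

With this arrangement, new data only needs a single `pipe.predict(...)` call, which makes it harder to forget the imputation step or to refit it on test data by accident.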

Fit a basic Logistic Regression model

model_1 = LogisticRegression()
model_1.fit(X_train, y_train)
LogisticRegression()

Evaluate model performance.

train_preds = model_1.predict(X_train)
test_preds = model_1.predict(X_test)

train_score = accuracy_score(y_train, train_preds)
test_score = accuracy_score(y_test, test_preds)

print('Train score:', train_score)
print('Test score:', test_score)
Train score: 0.6308029487945805
Test score: 0.6266427718040621
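
An accuracy score is easier to interpret next to a naive baseline. As a rough sketch, the snippet below compares a logistic regression with scikit-learn's DummyClassifier, which always predicts the most frequent training class. The data are synthetic and all variable names are illustrative:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic one-feature classification problem (made-up data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=400)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Baseline: ignore the features and predict the majority class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

baseline_score = accuracy_score(y_te, baseline.predict(X_te))
model_score = accuracy_score(y_te, model.predict(X_te))
print('Baseline accuracy:', baseline_score)
print('Model accuracy:', model_score)
```

If a model scores close to the baseline, the chosen feature adds little beyond the class balance.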

Create submission predictions

Kaggle competitions will always provide you with a test dataset that contains all of the independent variables in the training data, but does not contain the target column.

The idea is that you want to build a model using the training data so it can predict accurately when we do not know the target value.

Import testing data

test_df = pd.read_csv('data/test.csv')
test_df.head(3)
   PassengerId  HomePlanet  CryoSleep     Cabin  Destination   Age    VIP  RoomService  FoodCourt  ShoppingMall  Spa  VRDeck                Name
0      6189_01       Earth      False  G/1004/P  TRAPPIST-1e   3.0  False          0.0        0.0           0.0  0.0     0.0        Eulah Garnes
1      6354_01       Earth      False  F/1315/P  TRAPPIST-1e  48.0  False        410.0     2108.0           0.0  0.0     0.0  Megany Carreralend
2      1704_02        Mars      False    D/60/S  55 Cancri e  18.0  False         86.0     1164.0         516.0  0.0     0.0        Allota Fincy

Repeat the same preprocessing steps as before

test_X = test_df[['Spa']]
# Impute using fitted imputer
test_X = imputer.transform(test_X)

Create final predictions

final_preds = model_1.predict(test_X)

Save predictions

The Kaggle competition provides the following instructions for submitting predictions:


Your submission should be in the form of a CSV file with two columns.

  1. PassengerId
  2. Transported

The PassengerId column should be the PassengerId column found in the predictors dataset.

For example, if I were to submit a CSV of predictions where I predict the mean for every observation, the first three rows of the submission would look like this:

PassengerId Transported
0013_01 True
0018_01 False
0019_01 True

It is recommended that you save your predictions to a CSV file using DataFrame.to_csv and then import the saved file into a notebook to make sure the file is structured as intended.


The easiest way to do this is to add the predictions to the original dataframe and then isolate the columns we want.

# Add predictions to the test dataframe
test_df['Transported'] = final_preds
# Isolate the columns we want in our submission
submission_df = test_df[['PassengerId', 'Transported']]

Check the shape. The shape of our submission must be (2000, 2).

submission_df.shape
(2000, 2)

Now we just need to save the submission to a .csv file.

In this case, you should set index=False so that the dataframe's row index is not written to the file as an extra column.

submission_df.to_csv('sample_submission.csv', index=False)
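
The instructions recommend reading the saved file back in to confirm its structure. A minimal sketch of such a check follows. The helper `check_submission` and the three-row `demo_submission` are hypothetical, and the file is written to a temporary directory so that nothing in the working folder is touched:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for submission_df, using the example rows shown above.
demo_submission = pd.DataFrame({
    'PassengerId': ['0013_01', '0018_01', '0019_01'],
    'Transported': [True, False, True],
})


def check_submission(path, expected_rows):
    """Read a saved submission and verify its basic structure."""
    # Read PassengerId as text so IDs such as '0013_01' are kept verbatim.
    loaded = pd.read_csv(path, dtype={'PassengerId': str})
    assert list(loaded.columns) == ['PassengerId', 'Transported']
    assert loaded.shape == (expected_rows, 2)
    assert loaded['PassengerId'].is_unique
    assert loaded['Transported'].notna().all()
    return loaded


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_submission.csv')
    demo_submission.to_csv(demo_path, index=False)
    loaded_submission = check_submission(demo_path, expected_rows=3)
```

For the real file, the equivalent call would pass the saved path and `expected_rows=2000`, matching the shape checked earlier.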

Submit Predictions

Once you have saved your predictions to a .csv file, you can submit them on the Kaggle competition page.
