The kaggle from heikelanmao

titanic

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
train=pd.read_csv(r'D:\xunleixiazai\train.csv')
test=pd.read_csv(r'D:\xunleixiazai\test.csv')
train.info()
print('-'*40)
test.info()
train.head(10)
train.describe()

PassengerId 样本编号对分析无用
Survived的获救率为38%
Pclass共有三个取值 50%的等级是3
Age有缺失值
SibSp最多有8个取值
Parch最多有6个取值可以进行因子化
fare最大是512 与Pclass相关
name重复值少对分析无用
sex有两个取值 male有577个
ticket 重复值少对分析无用
cabin 缺失值多舍弃
Embarked 有缺失值有三个取值 S最多有644
所有暂时舍弃name，ticket，cabin，fare 对age，Embarked进行缺失值填充
age进行算数平均填充，Embarked进行最多取值填充


> train['Age']=train['Age'].fillna(train['Age'].mean())
> train['Embarked']=train['Embarked'].fillna(train['Embarked'].max())
> train=train.drop(['Name','Ticket','Cabin'],axis=1)

分析Pclass与Survived的关系


pmean=train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean()
pcount=train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).count()
pd.merge(pmean,pcount,on='Pclass',suffixes=('_mean','_count'))

分析Sex与Survived的关系


smean=train[['Sex','Survived']].groupby(['Sex'],as_index=False).mean()
scount=train[['Sex','Survived']].groupby(['Sex'],as_index=False).count()
pd.merge(smean,scount,on='Sex',suffixes=('_mean','_count'))

由于Age取值过大进行划分


train['Age']=train['Age'].astype(np.int32)
pd.cut(train['Age'],[0,18,35,60])
train['Age']=np.where(train['Age']<18,0,np.where(train['Age']<35,1,np.where(train['Age']<60,2,3)))
对Sex也进行01划分frmale=0.male=1
train['Sex']=np.where(train['Sex']=='female',0,1)
对 Embarked进行划分s=0,c=1,q=2
train['Embarked']=np.where(train['Embarked']=='S',0,np.where(train['Embarked']=='C',1,2))

SibSp：乘客在船上的兄弟姐妹和配偶的数量 Parch：乘客在船上的父母以及小孩的数量
改为是否单独一人


train['family']=train['SibSp']+train['Parch']+1
train['isalone']=np.where(train['family']==1,0,1)
train.drop('family',axis=1)

把多余的列删除 'SibSp','Parch','family','Fare','PassengerId'


train=train.drop(['SibSp','Parch'],axis=1)
train=train.drop('family',axis=1)
train=train.drop(['Fare','PassengerId'],axis=1)

对test也做相应的处理


test['Age']=test['Age'].fillna(test['Age'].mean())
test['Embarked']=test['Embarked'].fillna(test['Age'].max())
test=test.drop(['Name','Ticket','Cabin'],axis=1)
test['Age']=test['Age'].astype(np.int32)
pd.cut(test['Age'],[0,18,35,60])
test['Age']=np.where(test['Age']<18,0,np.where(test['Age']<35,1,np.where(test['Age']<60,2,3)))
test['Sex']=np.where(test['Sex']=='female',0,1)
test['Embarked']=np.where(test['Embarked']=='S',0,np.where(test['Embarked']=='C',1,2))
test['family']=test['SibSp']+test['Parch']+1
test['isalone']=np.where(test['family']==1,0,1)
test.drop('family',axis=1)
test=test.drop(['SibSp','Parch'],axis=1)
test=test.drop('family',axis=1)
test=test.drop('Fare',axis=1)

训练


X_train=train.drop('Survived',axis=1)
y_train=train['Survived']
X_test=test.drop('PassengerId',axis=1).copy()
X_train.shape,y_train.shape,X_test.shape

用LR分析

from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train)
clf
x_pred=clf.predict(X_test)
x_pred
result=pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':x_pred.astype(np.int32)})
result.to_csv(r"D:\xunleixiazai\titanic.csv", index=False)

#第一次用lr 得分0.75598 很开心，虽然分数不高
用rf分许


from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
x_pred=rf.predict(X_test)
random_forest.score(X_train, y_train)
result=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':x_pred.astype(np.int32)})
result.to_csv(r"D:\xunleixiazai\titanic1.csv", index=False)

#第二次用rf 得分0.79462 实际排名1321

heikelanmao / kaggle Goto Github PK

kaggle's Introduction

kaggle

kaggle's People

Contributors

Watchers

kaggle's Issues

titanic

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent