kaggle's Introduction
titanic
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
train=pd.read_csv(r'D:\xunleixiazai\train.csv')
test=pd.read_csv(r'D:\xunleixiazai\test.csv')
train.info()
print('-'*40)
test.info()
train.head(10)
train.describe()
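PassengerId: sample ID, not useful for analysis
Survived: the survival rate is 38%
Pclass: three distinct values; the median (50% quantile) is 3
Age: has missing values
SibSp: values range up to 8
Parch: values range up to 6; can be treated as a categorical factor
Fare: maximum is 512; correlated with Pclass
Name: few repeated values, not useful for analysis
Sex: two distinct values; 577 passengers are male
Ticket: few repeated values, not useful for analysis
Cabin: mostly missing, so it is dropped
Embarked: has missing values; three distinct values, S is the most common (644)
So for now Name, Ticket, Cabin and Fare are dropped, and the missing values in Age and Embarked are filled.
Age is filled with its arithmetic mean; Embarked is filled with its most frequent value.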
train['Age']=train['Age'].fillna(train['Age'].mean())
# mode()[0] is the most frequent value; max() would return the alphabetically largest string
train['Embarked']=train['Embarked'].fillna(train['Embarked'].mode()[0])
train=train.drop(['Name','Ticket','Cabin'],axis=1)
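The filling rule described in the notes (mean for Age, most frequent value for Embarked) can be sketched on a small made-up frame. The data and the names `demo` and `most_common` are assumptions for illustration only, not part of the Kaggle files.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the real Titanic file
demo = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['C', 'C', np.nan],
})

# Numeric column: fill with the arithmetic mean of the observed values
demo['Age'] = demo['Age'].fillna(demo['Age'].mean())

# Categorical column: fill with the most frequent value (the mode)
most_common = demo['Embarked'].mode()[0]
demo['Embarked'] = demo['Embarked'].fillna(most_common)

print(demo)
```

`mode()` returns a Series because ties are possible, so `[0]` picks the first of the most frequent values.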
Analyze the relationship between Pclass and Survived
pmean=train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean()
pcount=train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).count()
pd.merge(pmean,pcount,on='Pclass',suffixes=('_mean','_count'))
Analyze the relationship between Sex and Survived
smean=train[['Sex','Survived']].groupby(['Sex'],as_index=False).mean()
scount=train[['Sex','Survived']].groupby(['Sex'],as_index=False).count()
pd.merge(smean,scount,on='Sex',suffixes=('_mean','_count'))
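The mean and count tables are built here with two `groupby` calls and a merge. A more compact alternative, shown as a sketch on made-up data, is a single `agg` call. The frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical), not the real Titanic file
demo = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3, 3],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# Survival rate (mean) and group size (count) per class in one pass
summary = demo.groupby('Pclass')['Survived'].agg(['mean', 'count'])
print(summary)
```

The result is indexed by `Pclass`, with one column per aggregate, so no suffixes are needed.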
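Age takes too many distinct values, so it is binned into groups (the pd.cut call only previews a binning and is not assigned; the np.where line does the encoding: <18 is 0, <35 is 1, <60 is 2, otherwise 3)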
train['Age']=train['Age'].astype(np.int32)
pd.cut(train['Age'],[0,18,35,60])
train['Age']=np.where(train['Age']<18,0,np.where(train['Age']<35,1,np.where(train['Age']<60,2,3)))
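The nested `np.where` can also be written as one `pd.cut` call with labels. The sketch below checks on made-up ages that both give the same codes. The bin edges with infinite ends and `right=False` are choices made here to match the `<` comparisons; they are not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on and around each boundary
ages = pd.Series([0, 17, 18, 34, 35, 59, 60, 80])

# Nested np.where, as in the text: <18 -> 0, <35 -> 1, <60 -> 2, else 3
by_where = np.where(ages < 18, 0,
                    np.where(ages < 35, 1, np.where(ages < 60, 2, 3)))

# Same rule with pd.cut: left-closed bins [lo, hi) and open-ended extremes
bins = [-np.inf, 18, 35, 60, np.inf]
by_cut = pd.cut(ages, bins=bins, right=False, labels=[0, 1, 2, 3]).astype(int)

print(list(by_where))
print(by_cut.tolist())
```

With the default `right=True` and bins `[0,18,35,60]`, an age of 0 or anything above 60 falls outside every interval and becomes NaN, so that form is not equivalent to the `np.where` encoding.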
Encode Sex as 0/1 as well: female=0, male=1
train['Sex']=np.where(train['Sex']=='female',0,1)
Encode Embarked: S=0, C=1, Q=2
train['Embarked']=np.where(train['Embarked']=='S',0,np.where(train['Embarked']=='C',1,2))
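An explicit dictionary with `Series.map` is another way to write these encodings. This is a sketch on made-up rows; the names `sex_codes` and `port_codes` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical), not the real Titanic file
demo = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Same codes as in the text
sex_codes = {'female': 0, 'male': 1}
port_codes = {'S': 0, 'C': 1, 'Q': 2}

demo['Sex'] = demo['Sex'].map(sex_codes)
demo['Embarked'] = demo['Embarked'].map(port_codes)
print(demo)
```

One difference in behavior: `map` turns any value missing from the dictionary into NaN, while the nested `np.where` silently sends every unmatched value to its last branch.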
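SibSp: number of siblings and spouses the passenger has aboard. Parch: number of parents and children the passenger has aboard
Replace both with a flag for whether the passenger is travelling alone (as coded below, isalone is 0 for a passenger travelling alone and 1 otherwise)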
train['family']=train['SibSp']+train['Parch']+1
train['isalone']=np.where(train['family']==1,0,1)
train.drop('family',axis=1)
Drop the columns that are no longer needed: 'SibSp','Parch','family','Fare','PassengerId'
train=train.drop(['SibSp','Parch'],axis=1)
train=train.drop('family',axis=1)
train=train.drop(['Fare','PassengerId'],axis=1)
Apply the same processing to test
test['Age']=test['Age'].fillna(test['Age'].mean())
test['Embarked']=test['Embarked'].fillna(test['Embarked'].mode()[0])
test=test.drop(['Name','Ticket','Cabin'],axis=1)
test['Age']=test['Age'].astype(np.int32)
pd.cut(test['Age'],[0,18,35,60])
test['Age']=np.where(test['Age']<18,0,np.where(test['Age']<35,1,np.where(test['Age']<60,2,3)))
test['Sex']=np.where(test['Sex']=='female',0,1)
test['Embarked']=np.where(test['Embarked']=='S',0,np.where(test['Embarked']=='C',1,2))
test['family']=test['SibSp']+test['Parch']+1
test['isalone']=np.where(test['family']==1,0,1)
test.drop('family',axis=1)
test=test.drop(['SibSp','Parch'],axis=1)
test=test.drop('family',axis=1)
test=test.drop('Fare',axis=1)
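Since test repeats every step applied to train, one option is to wrap the steps in a single function and call it on both frames. The sketch below assumes the Titanic column names. The helper `preprocess` and the two-row frame `demo` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Apply the steps from the text to a copy of df (hypothetical helper)."""
    out = df.copy()
    # Fill missing values: mean for Age, most frequent value for Embarked
    out['Age'] = out['Age'].fillna(out['Age'].mean())
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    # Bin Age: <18 -> 0, <35 -> 1, <60 -> 2, else 3
    age = out['Age'].astype(np.int32)
    out['Age'] = np.where(age < 18, 0,
                          np.where(age < 35, 1, np.where(age < 60, 2, 3)))
    # Encode the categorical columns
    out['Sex'] = np.where(out['Sex'] == 'female', 0, 1)
    out['Embarked'] = np.where(out['Embarked'] == 'S', 0,
                               np.where(out['Embarked'] == 'C', 1, 2))
    # Same encoding as the text: 0 = travelling alone, 1 = with family
    family = out['SibSp'] + out['Parch'] + 1
    out['isalone'] = np.where(family == 1, 0, 1)
    return out.drop(['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'Fare'], axis=1)

# Two made-up rows with the Titanic column layout
demo = pd.DataFrame({
    'PassengerId': [1, 2], 'Pclass': [3, 1], 'Name': ['A', 'B'],
    'Sex': ['male', 'female'], 'Age': [22.0, np.nan],
    'SibSp': [1, 0], 'Parch': [0, 0], 'Ticket': ['t1', 't2'],
    'Fare': [7.25, 71.28], 'Cabin': [np.nan, 'C85'], 'Embarked': ['S', np.nan],
})
cleaned = preprocess(demo)
print(cleaned)
```

The helper keeps PassengerId, because the test frame needs it for the submission file; for train it can be dropped afterwards.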
Training
X_train=train.drop('Survived',axis=1)
y_train=train['Survived']
X_test=test.drop('PassengerId',axis=1).copy()
X_train.shape,y_train.shape,X_test.shape
Analysis with LR (logistic regression)
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train)
clf
x_pred=clf.predict(X_test)
x_pred
result=pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':x_pred.astype(np.int32)})
result.to_csv(r"D:\xunleixiazai\titanic.csv", index=False)
# First attempt, using LR: score 0.75598. Happy with it, even though the score is not high
Analysis with RF (random forest)
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
x_pred=rf.predict(X_test)
rf.score(X_train, y_train)
result=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':x_pred.astype(np.int32)})
result.to_csv(r"D:\xunleixiazai\titanic1.csv", index=False)
# Second attempt, using RF: score 0.79462, actual rank 1321
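A score computed on the training data is optimistic, because the model has already seen those rows. A rough estimate closer to the leaderboard number can be obtained locally with k-fold cross-validation. The sketch below uses synthetic features (`X_demo`, `y_demo` are made up) only to show the call; with the real data one would pass `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded features (hypothetical data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 4))
y_demo = (X_demo[:, 0] + rng.integers(0, 2, size=200) > 1).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: each fold is scored on rows the model did not train on
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

The mean of the fold scores is the estimate; the spread across folds gives a sense of how much it can vary.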