kaggle's Introduction

kaggle

kaggle's People

Contributors

heikelanmao

kaggle's Issues

titanic

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
train=pd.read_csv(r'D:\xunleixiazai\train.csv')
test=pd.read_csv(r'D:\xunleixiazai\test.csv')
train.info()
print('-'*40)
test.info()
train.head(10)
train.describe()

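PassengerId: sample ID, not useful for the analysis
Survived: the survival rate is 38%
Pclass: three distinct values; the median (50th percentile) is 3
Age: has missing values
SibSp: values range up to 8
Parch: values range up to 6; can be treated as a categorical factor
Fare: maximum is 512; correlated with Pclass
Name: few repeated values, not useful for the analysis
Sex: two distinct values; 577 passengers are male
Ticket: few repeated values, not useful for the analysis
Cabin: mostly missing, so drop it
Embarked: has missing values; three distinct values, with S the most common (644)
So for now, drop Name, Ticket, Cabin and Fare, and fill the missing values in Age and Embarked.
Fill Age with the arithmetic mean and Embarked with its most frequent value.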
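The per-column observations above (missing values, number of distinct values, most common value) can be collected in one pass. The following is a minimal sketch on made-up toy data, not the Titanic file; `summarize_columns` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few Titanic columns (values are illustrative only).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, 26.0],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', None, 'S'],
})

def summarize_columns(df):
    """Return missing-value and distinct-value counts for each column."""
    return pd.DataFrame({
        'missing': df.isnull().sum(),
        'unique': df.nunique(),  # NaN is not counted as a distinct value
    })

summary = summarize_columns(toy)
print(summary)
print(toy['Embarked'].value_counts())  # frequency of each non-missing value
```

Running the same helper on `train` should give the counts that the notes above were read from `info()` and `describe()`.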


train['Age']=train['Age'].fillna(train['Age'].mean())
# Use mode() for the most frequent value; max() does not compute it
train['Embarked']=train['Embarked'].fillna(train['Embarked'].mode()[0])
train=train.drop(['Name','Ticket','Cabin'],axis=1)
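As a small illustration of the two fill strategies described above (arithmetic mean for a numeric column, most frequent value for a categorical one), here is a sketch on toy data. The values are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['C', 'C', None],
})

# Numeric column: fill with the arithmetic mean of the observed values.
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: fill with the most frequent value.
# Series.mode() returns a Series (ties are possible), so take the first entry.
most_common = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common)

print(toy)
```

`mode()` ignores missing values by default, so the fill value always comes from the observed entries.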

Analyze the relationship between Pclass and Survived


pmean=train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean()
pcount=train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).count()
pd.merge(pmean,pcount,on='Pclass',suffixes=('_mean','_count'))

Analyze the relationship between Sex and Survived


smean=train[['Sex','Survived']].groupby(['Sex'],as_index=False).mean()
scount=train[['Sex','Survived']].groupby(['Sex'],as_index=False).count()
pd.merge(smean,scount,on='Sex',suffixes=('_mean','_count'))
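The mean-plus-count tables built above with two `groupby` calls and a `merge` can probably be produced in a single aggregation. A minimal sketch on toy data, where `survival_by` and the output column names `survival_rate` and `n` are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Survived': [0, 1, 1, 0, 1],
})

def survival_by(df, column):
    """Survival rate and group size for each value of `column`."""
    return (df.groupby(column)['Survived']
              .agg(['mean', 'count'])
              .rename(columns={'mean': 'survival_rate', 'count': 'n'})
              .reset_index())

table = survival_by(toy, 'Sex')
print(table)
```

Because `Survived` is 0 or 1, its group mean is the survival rate, and the count shows how many passengers that rate is based on.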

Age takes too many distinct values, so bin it into groups


train['Age']=train['Age'].astype(np.int32)
# Age groups: under 18 -> 0, 18-34 -> 1, 35-59 -> 2, 60 and over -> 3
train['Age']=np.where(train['Age']<18,0,np.where(train['Age']<35,1,np.where(train['Age']<60,2,3)))
Encode Sex as 0/1 as well: female=0, male=1
train['Sex']=np.where(train['Sex']=='female',0,1)
Encode Embarked: S=0, C=1, Q=2
train['Embarked']=np.where(train['Embarked']=='S',0,np.where(train['Embarked']=='C',1,2))
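The nested `np.where` binning above can also be expressed with `pd.cut`, and the category encodings with a dictionary passed to `Series.map`. The sketch below, on invented ages, checks that the two binning approaches agree at the bin edges. The open-ended last bin (`np.inf`) is an assumption added here so that ages of 60 and over get a bin.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [0, 17, 18, 34, 35, 59, 60, 80],
    'Sex': ['male', 'female'] * 4,
})

# Nested np.where, as in the steps above: <18 -> 0, <35 -> 1, <60 -> 2, else 3.
by_where = np.where(toy['Age'] < 18, 0,
           np.where(toy['Age'] < 35, 1,
           np.where(toy['Age'] < 60, 2, 3)))

# Same bins with pd.cut: right=False makes intervals [0,18), [18,35), ...
# and labels=False returns the integer bin index instead of Interval labels.
by_cut = pd.cut(toy['Age'], bins=[0, 18, 35, 60, np.inf],
                right=False, labels=False)

# Dictionary-based encoding; any value missing from the dict becomes NaN.
sex_code = toy['Sex'].map({'female': 0, 'male': 1})

print(list(by_where), by_cut.tolist(), sex_code.tolist())
```

Note that `pd.cut` returns a new Series, so its result has to be assigned for it to have any effect.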

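SibSp: number of the passenger's siblings and spouses aboard. Parch: number of the passenger's parents and children aboard.
Replace both with a single flag for whether the passenger is traveling alone


train['family']=train['SibSp']+train['Parch']+1
# Note the encoding: isalone is 0 for a passenger traveling alone and 1 otherwise
train['isalone']=np.where(train['family']==1,0,1)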
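For comparison, here is a sketch of the same feature with the opposite polarity, where 1 means the passenger has no relatives aboard. The column name `travels_alone` is hypothetical and the data is invented. Which polarity is used should not matter much to the models, as long as train and test use the same one.

```python
import pandas as pd

toy = pd.DataFrame({'SibSp': [0, 1, 0, 3], 'Parch': [0, 0, 2, 1]})

# Family size counts the passenger too, hence the +1.
family_size = toy['SibSp'] + toy['Parch'] + 1

# 1 when the passenger has no relatives aboard, 0 otherwise.
# (Hypothetical column; opposite polarity to np.where(family==1, 0, 1).)
toy['travels_alone'] = (family_size == 1).astype(int)

print(toy)
```

Keeping `family_size` as a local Series avoids creating a temporary column that has to be dropped later.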

Drop the columns that are no longer needed: 'SibSp','Parch','family','Fare','PassengerId'


train=train.drop(['SibSp','Parch'],axis=1)
train=train.drop('family',axis=1)
train=train.drop(['Fare','PassengerId'],axis=1)

Apply the same processing to test


test['Age']=test['Age'].fillna(test['Age'].mean())
# Fill Embarked with its most frequent value
test['Embarked']=test['Embarked'].fillna(test['Embarked'].mode()[0])
test=test.drop(['Name','Ticket','Cabin'],axis=1)
test['Age']=test['Age'].astype(np.int32)
test['Age']=np.where(test['Age']<18,0,np.where(test['Age']<35,1,np.where(test['Age']<60,2,3)))
test['Sex']=np.where(test['Sex']=='female',0,1)
test['Embarked']=np.where(test['Embarked']=='S',0,np.where(test['Embarked']=='C',1,2))
test['family']=test['SibSp']+test['Parch']+1
test['isalone']=np.where(test['family']==1,0,1)
# PassengerId is kept here because the submission file needs it
test=test.drop(['SibSp','Parch'],axis=1)
test=test.drop('family',axis=1)
test=test.drop('Fare',axis=1)
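Since train and test go through the same steps, one option is to wrap them in a single function so the two copies cannot drift apart. The following is a sketch under that assumption. `preprocess` is a hypothetical helper, the toy rows are invented, and like the steps above it fills missing values from the frame it is given.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Hypothetical helper bundling the steps above; returns a new frame."""
    out = df.copy()
    out['Age'] = out['Age'].fillna(out['Age'].mean()).astype(np.int32)
    out['Age'] = np.where(out['Age'] < 18, 0,
                 np.where(out['Age'] < 35, 1,
                 np.where(out['Age'] < 60, 2, 3)))
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    out['Embarked'] = np.where(out['Embarked'] == 'S', 0,
                      np.where(out['Embarked'] == 'C', 1, 2))
    out['Sex'] = np.where(out['Sex'] == 'female', 0, 1)
    # Same polarity as above: 0 = traveling alone, 1 = with family.
    out['isalone'] = np.where(out['SibSp'] + out['Parch'] + 1 == 1, 0, 1)
    return out.drop(['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'Fare'],
                    axis=1)

toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['A', 'B', 'C'], 'Ticket': ['t1', 't2', 't3'],
    'Cabin': [None, 'C85', None], 'Fare': [7.25, 71.28, 8.05],
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, np.nan, 64.0],
    'SibSp': [1, 0, 0], 'Parch': [0, 0, 0],
    'Embarked': ['S', None, 'S'],
})
clean = preprocess(toy)
print(clean)
```

With such a helper, the train side would only need the extra `PassengerId` drop, and the test side would keep `PassengerId` for the submission file.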

Training


X_train=train.drop('Survived',axis=1)
y_train=train['Survived']
X_test=test.drop('PassengerId',axis=1).copy()
X_train.shape,y_train.shape,X_test.shape

Fit with logistic regression (LR)

from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train)
clf
x_pred=clf.predict(X_test)
x_pred
result=pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':x_pred.astype(np.int32)})
result.to_csv(r"D:\xunleixiazai\titanic.csv", index=False)
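Before submitting, it can help to estimate accuracy locally with cross-validation instead of relying only on the leaderboard score. The sketch below uses synthetic data, not the Titanic features, and assumes scikit-learn's `cross_val_score` with 5 folds. In the notebook, `X_train` and `y_train` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded features (not Titanic data).
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(200, 4))
# Label depends mostly on the first feature, plus some noise.
y = ((X[:, 0] + rng.rand(200)) > 1.5).astype(int)

clf = LogisticRegression()
# 5-fold cross-validation: each fold is held out once for scoring.
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

The spread across folds gives a rough idea of how much a single score can vary by chance.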

# First submission, with LR: score 0.75598. Very happy, even though the score is not high

Fit with a random forest (RF)


from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
x_pred=rf.predict(X_test)
rf.score(X_train, y_train)  # accuracy on the training set
result=pd.DataFrame({'PassengerId':test['PassengerId'],'Survived':x_pred.astype(np.int32)})
result.to_csv(r"D:\xunleixiazai\titanic1.csv", index=False)
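A random forest usually scores very high on the data it was trained on, so the training accuracy says little about leaderboard performance. One hedged alternative is the out-of-bag (OOB) estimate, sketched here on synthetic data. `oob_score=True` and `random_state=0` are choices made for this example, not settings from the run above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded features (not Titanic data).
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(300, 4))
y = ((X[:, 0] + rng.rand(300)) > 1.5).astype(int)

# oob_score=True scores each sample using only trees that did not see it.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

train_acc = rf.score(X, y)   # accuracy on the data the forest was fit on
oob_acc = rf.oob_score_      # out-of-bag estimate of held-out accuracy
print(train_acc, oob_acc)
```

The OOB value comes for free from the bootstrap sampling, so no separate validation split is needed.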

# Second submission, with RF: score 0.79462, actual rank 1321
