10. Titanic Survivor Prediction [Comparing Decision Tree and Random Forest]
import pandas as pd
import numpy as np
import sys
Problem definition
Using data on the passengers of the Titanic, we try to predict survival based on the characteristics of each passenger.
Data definition
Data explanation
survival : Survival [0 = No, 1 = Yes]
pclass : Ticket class [1 = 1st(Upper), 2 = 2nd(Middle), 3 = 3rd(Lower)]
sex : Sex
age : Age in years
sibsp : # of siblings / spouses aboard the Titanic
parch : # of parents / children aboard the Titanic
ticket : Ticket number
fare : Passenger fare
cabin : Cabin number
embarked : Port of Embarkation [C = Cherbourg, Q = Queenstown, S = Southampton]
train_f = pd.read_csv("../data/train.csv")
test_f = pd.read_csv("../data/test.csv")
test_a = pd.read_csv("../data/gender_submission.csv")
train_f.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
test_f.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
test_a.head()
PassengerId | Survived | |
---|---|---|
0 | 892 | 0 |
1 | 893 | 1 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
# Attach the Survived labels from gender_submission to the test features,
# select and reorder the columns to match train_f (dropping the duplicated
# PassengerId), then stack train and test vertically.
test_td = pd.concat([test_f, test_a], axis=1)
test_td = test_td.iloc[:,[0,12,1,2,3,4,5,6,7,8,9,10]]
titanic_df = pd.concat([train_f, test_td], axis=0)
titanic_df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic_df.to_csv("../data/titanic_df.csv")
titanic_df = pd.read_csv("../data/titanic_df.csv")
# Check the summary of your current data
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
Unnamed: 0 1309 non-null int64
PassengerId 1309 non-null int64
Survived 1309 non-null int64
Pclass 1309 non-null int64
Name 1309 non-null object
Sex 1309 non-null object
Age 1046 non-null float64
SibSp 1309 non-null int64
Parch 1309 non-null int64
Ticket 1309 non-null object
Fare 1308 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
dtypes: float64(2), int64(6), object(5)
memory usage: 133.0+ KB
Data preprocessing
It is said that women and the elderly were put on the lifeboats first during the accident. Based on this, we remove features that are unlikely to have a significant impact on the survival prediction, delete columns with too many null values, and encode females as 0 and males as 1.
del titanic_df["Name"]
del titanic_df["Ticket"]
del titanic_df["Cabin"]
del titanic_df["Embarked"]
del titanic_df["Unnamed: 0"]
titanic_df.loc[titanic_df["Sex"] == "female",["Sex"]] = 0
titanic_df.loc[titanic_df["Sex"] == "male",["Sex"]] = 1
titanic_df = titanic_df.dropna()
titanic_df.head()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 |
1 | 2 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 |
2 | 3 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 |
3 | 4 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 |
4 | 5 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 |
titanic_df.describe()
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|---|
count | 1045.000000 | 1045.000000 | 1045.000000 | 1045.000000 | 1045.000000 | 1045.000000 | 1045.000000 | 1045.000000 |
mean | 654.990431 | 0.399043 | 2.206699 | 0.628708 | 29.851837 | 0.503349 | 0.421053 | 36.686080 |
std | 377.650551 | 0.489936 | 0.841542 | 0.483382 | 14.389194 | 0.912471 | 0.840052 | 55.732533 |
min | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
25% | 326.000000 | 0.000000 | 1.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 8.050000 |
50% | 662.000000 | 0.000000 | 2.000000 | 1.000000 | 28.000000 | 0.000000 | 0.000000 | 15.750000 |
75% | 973.000000 | 1.000000 | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 1.000000 | 35.500000 |
max | 1307.000000 | 1.000000 | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
- The summary of titanic_df shows that the survival rate is low, at about 40% (mean of Survived ≈ 0.40).
- The Sex column shows that there are more males than females (mean ≈ 0.63), and the average age, at about 30, is lower than expected.
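For reference, these figures can also be read off programmatically; a small check on the preprocessed titanic_df (Sex already encoded as 0 = female, 1 = male):
# Quick numeric check of the summary above.
print(titanic_df['Survived'].mean())      # ~0.40 -> roughly 40% of passengers survived
print(titanic_df['Sex'].value_counts())   # more males (1) than females (0)
print(titanic_df['Age'].mean())           # ~29.9 -> average age around 30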
Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
f,ax=plt.subplots(1,2,figsize=(12,6))
titanic_df['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.2f%%',ax=ax[0])
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot('Survived',data=titanic_df,ax=ax[1])
ax[1].set_title('Survived')
plt.show()
- Blue indicates deaths and orange indicates survivors.
plt.figure(figsize=[12,4])
plt.subplot(121)
sns.barplot('Pclass', 'Survived', data=titanic_df)
plt.subplot(122)
sns.countplot('Sex',hue='Survived',data=titanic_df)
plt.show()
Comparing survival rates by ticket class, the more expensive classes have higher survival rates.
The graph on the right shows, from left to right, female deaths, female survivors, male deaths, and male survivors; the survival rate of females is clearly much higher than that of males.
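The same comparison can be made numerically; a minimal sketch using groupby on the preprocessed titanic_df:
# Mean survival rate per ticket class and per sex
# (Sex is encoded as 0 = female, 1 = male at this point).
print(titanic_df.groupby('Pclass')['Survived'].mean())
print(titanic_df.groupby('Sex')['Survived'].mean())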
ax = titanic_df["Age"].hist(bins=15, color='teal', alpha=0.8)
ax.set(xlabel='Age', ylabel='Count')
plt.show()
- The age distribution of passengers shows that they are concentrated in their 20s and 30s.
f,ax = plt.subplots(figsize=(12,6))
g = sns.kdeplot(titanic_df["Age"][(titanic_df["Survived"] == 0) & (titanic_df["Age"].notnull())],
                ax=ax, color="Blue", shade=True)
g = sns.kdeplot(titanic_df["Age"][(titanic_df["Survived"] == 1) & (titanic_df["Age"].notnull())],
                ax=g, color="Green", shade=True)
g.set_xlabel("Age")
g.set_ylabel("Frequency")
g = g.legend(["Not Survived","Survived"])
The most important parts of this graph comparing age and survival are the green region, where the proportion of survivors is high, and the blue region, where the proportion of deaths is high.
The under-17 age group has a very high survival rate compared to other age groups.
The 17–35 age group has a very high death rate compared to other age groups.
Analyzing the data overall, younger passengers, female passengers, and those in more expensive classes had higher survival rates. Setting ticket class aside, it appears that people tried to protect the most vulnerable first in the accident.
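To check the age effect numerically, survival rates can also be computed per age band; a short sketch, where the cut points 17 and 35 come from the observations above and 60/80 are arbitrary upper bands for illustration:
# Survival rate by age band, mirroring the KDE comparison above.
age_bands = pd.cut(titanic_df['Age'], bins=[0, 17, 35, 60, 80])
print(titanic_df.groupby(age_bands)['Survived'].mean())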
Data Division
# sklearn's train_test_split makes it easy to split the data in a single line.
from sklearn.model_selection import train_test_split
train, test = train_test_split(titanic_df, test_size=0.2)
import pickle
with open('../data/fifa_train.pkl', 'wb') as train_df:
    pickle.dump(train, train_df)
with open('../data/fifa_test.pkl', 'wb') as test_df:
    pickle.dump(test, test_df)
with open('../data/fifa_train.pkl', 'rb') as train_df:
    train = pickle.load(train_df)
with open('../data/fifa_test.pkl', 'rb') as test_df:
    test = pickle.load(test_df)
Setting Optimal Parameters for a Model
We can find these values (max_depth, min_samples_split, min_samples_leaf, random_state) through repetition: fit a classifier for every combination and keep the best one.
from sklearn import tree
from sklearn import preprocessing
X_train = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y_train = train[['Survived']]
X_test = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y_test = test[['Survived']]
depth = list(range(1,100,5))
sample_split = list(range(2,100,5))
sample_leaf = list(range(2,100,5))
random_state = list(range(1,100,5))
clf_list = []
for i in depth:
    for j in sample_split:
        for k in sample_leaf:
            for l in random_state:
                clf_list.append(tree.DecisionTreeClassifier(max_depth=i,
                                                            min_samples_split=j,
                                                            min_samples_leaf=k,
                                                            random_state=l).fit(X_train, y_train))
clf_list[-5:]
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=96,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=97, min_samples_split=97,
min_weight_fraction_leaf=0.0, presort=False, random_state=96,
splitter='best')]
from sklearn.metrics import accuracy_score
accuracy_score_list = {}
for clf in clf_list:
    pred = clf.predict(X_test)
    accuracy_score_list[clf] = accuracy_score(y_test, pred)
len(accuracy_score_list)
160000
Choose the optimal clf with the highest accuracy among the 160,000 clf values.
max_accuracy = max(list(accuracy_score_list.values()))
best_clf = list(accuracy_score_list.keys())[list(accuracy_score_list.values()).index(max_accuracy)]
best_clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=2, min_samples_split=52,
min_weight_fraction_leaf=0.0, presort=False, random_state=1,
splitter='best')
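The same exhaustive search can also be written with sklearn's GridSearchCV; a minimal sketch over the same parameter ranges (an alternative to the manual loops above, not the code used in this report). Unlike the loop above, GridSearchCV scores each combination by cross-validation on the training data rather than on the held-out test set, which avoids tuning against the test data; random_state is fixed here instead of being searched.
from sklearn.model_selection import GridSearchCV

# Exhaustive search over the same ranges as the manual nested loops.
param_grid = {
    'max_depth': list(range(1, 100, 5)),
    'min_samples_split': list(range(2, 100, 5)),
    'min_samples_leaf': list(range(2, 100, 5)),
}
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=1),
                      param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)
print(search.best_params_)   # best parameter combination
print(search.best_score_)    # its mean cross-validated accuracy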
Model test
pred = best_clf.predict(X_test)
Check the accuracy of the model prediction.
print("accuracy : " + str( accuracy_score(y_test, pred)) )
accuracy : 0.8325358851674641
Compare the actual and predicted values.
comparison = pd.DataFrame({'prediction':pred, 'ground_truth':y_test["Survived"]})
comparison
prediction | ground_truth | |
---|---|---|
167 | 0 | 0 |
900 | 0 | 0 |
345 | 1 | 1 |
369 | 1 | 1 |
645 | 0 | 1 |
69 | 0 | 0 |
463 | 0 | 0 |
943 | 1 | 1 |
1084 | 0 | 0 |
684 | 0 | 0 |
521 | 0 | 0 |
220 | 0 | 1 |
209 rows × 2 columns
Decision tree Visualization
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
import graphviz
dot_data = tree.export_graphviz(best_clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("titanic survived")
dot_data = tree.export_graphviz(best_clf, out_file=None,
feature_names=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare'],
class_names=['Not Survived', 'Survived'],
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph
Setting Optimal Parameters for Randomforest
from sklearn import datasets
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
estimators = list(range(1,200))
clf_list2 = []
for i in estimators:
    clf_list2.append(RandomForestClassifier(n_estimators=i).fit(X_train, y_train))
clf_list2[-5:]
[RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=199, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)]
from sklearn.metrics import accuracy_score
accuracy_score_list2 = {}
for clf in clf_list2:
    pred2 = clf.predict(X_test)
    accuracy_score_list2[clf] = accuracy_score(y_test, pred2)
len(accuracy_score_list2)
199
Choose the optimal clf with the highest accuracy among the 199 clf values.
max_accuracy2 = max(list(accuracy_score_list2.values()))
best_clf2 = list(accuracy_score_list2.keys())[list(accuracy_score_list2.values()).index(max_accuracy2)]
best_clf2
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=29, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Randomforest Test
pred2 = best_clf2.predict(X_test)
Check the accuracy of the model prediction.
print("accuracy : " + str( accuracy_score(y_test, pred2)) )
accuracy : 0.8373205741626795
Compare the actual and predicted values.
comparison2 = pd.DataFrame({'prediction':pred2, 'ground_truth':y_test["Survived"]})
comparison2
prediction | ground_truth | |
---|---|---|
167 | 0 | 0 |
900 | 0 | 0 |
345 | 1 | 1 |
369 | 1 | 1 |
645 | 1 | 1 |
69 | 0 | 0 |
463 | 0 | 0 |
943 | 1 | 1 |
1084 | 0 | 0 |
684 | 0 | 0 |
521 | 0 | 0 |
220 | 0 | 1 |
209 rows × 2 columns
Cross_Validation
def cross_validation(classifier, features, labels):
    cv_scores = []
    for i in range(10):
        scores = cross_val_score(classifier, features, labels, cv=10, scoring='accuracy')
        cv_scores.append(scores.mean())
    return cv_scores
dt_cv_scores = cross_validation(best_clf, X_test, y_test)
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
rf_cv_scores = cross_validation(best_clf2, X_test, y_test)
Randomforest VS Decision tree
cv_list = [
['random_forest',rf_cv_scores],
['decision_tree',dt_cv_scores],
]
df = pd.DataFrame.from_items(cv_list)
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1ad431ac240>
Accuracy of Decision Tree
np.mean(dt_cv_scores)
0.7894372294372294
Accuracy of Randomforest
np.mean(rf_cv_scores)
0.7577835497835498
Conclusion
In general, the random forest model should produce better results than a single decision tree.
However, in this report both models were tuned by repeatedly adjusting every detailed parameter to maximize accuracy on the test data.
In short, the results above are likely to be overfit to the test set.
Therefore, if the detailed parameters are not adjusted and are left at their defaults, the results are as shown in the graph below.
cv_list = [
['random_forest',cross_validation(RandomForestClassifier(), X_test, y_test)],
['decision_tree',cross_validation(tree.DecisionTreeClassifier(), X_test, y_test)],
]
df = pd.DataFrame.from_items(cv_list)
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1ad5d9bf860>
Etc.
The autosklearn package is said to provide a way to obtain appropriate parameter values for different models automatically. I would like to try autosklearn in the next report.
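For reference, a rough sketch of how autosklearn is typically used (not run in this report; treat the class and parameter values below as assumptions based on the package's documentation):
# Hypothetical usage sketch of autosklearn (not executed in this report).
# AutoSklearnClassifier searches over models and hyperparameters on its own
# within the given time budget.
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # total search budget in seconds (assumed value)
    per_run_time_limit=30,        # time limit per candidate model (assumed value)
)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))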