[Data Mining]

09 Jun 2021 in Study log / Study / Blog Etc on Python, Blog, Jekyll, Jupyter

Dataset description
- Information
  - x_test dataset description
Supervised learning models
Model accuracy / F1

Packages


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

%matplotlib inline
# %matplotlib notebook

import sklearn

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

dataset = pd.read_csv('C:/Users/KimJaeHyuk/데이터마이닝 코드자료 - 복사본/dataset.csv')

dataset

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Dataset description

PassengerId : a simple serial number
Survived : survival (0: died, 1: survived) ◀ response variable (y)
Pclass : ticket class (1st, 2nd, 3rd)
Name : name
Sex : sex
Age : age
SibSp : number of siblings and spouses aboard
Parch : number of parents and children aboard
Ticket : ticket number
Fare : fare
Cabin : cabin number
Embarked : port of embarkation
(C: Cherbourg, Q: Queenstown, S: Southampton)

Information

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Age, Cabin, and Embarked have missing values.
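The info() output above implies 891 - 714 = 177 missing Age values, 891 - 204 = 687 missing Cabin values, and 2 missing Embarked values. A minimal sketch of reading the same counts directly, assuming a small hypothetical sample in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample, not the real dataset
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

# Count missing values per column and show only the columns that have any
missing = sample.isnull().sum()
print(missing[missing > 0])
```

On the real data the same two lines would be applied to dataset instead of sample.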

dataset.shape

(891, 12)

description = dataset.describe()

pd.set_option('display.width', 100)    # display option for readable output
pd.set_option('display.precision', 3)  # full option name; the short 'precision' key can be ambiguous in newer pandas
description

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000	891.000	891.000	714.000	891.000	891.000	891.000
mean	446.000	0.384	2.309	29.699	0.523	0.382	32.204
std	257.354	0.487	0.836	14.526	1.103	0.806	49.693
min	1.000	0.000	1.000	0.420	0.000	0.000	0.000
25%	223.500	0.000	2.000	20.125	0.000	0.000	7.910
50%	446.000	0.000	3.000	28.000	0.000	0.000	14.454
75%	668.500	1.000	3.000	38.000	1.000	0.000	31.000
max	891.000	1.000	3.000	80.000	8.000	6.000	512.329

PassengerId does not look useful for the analysis, so it will be used as the index.

Fare takes much larger values than the other features.

dataset['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

Of the 891 passengers, 549 died and 342 survived.
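The counts can also be expressed as proportions. The sketch below rebuilds the two counts from the output above by hand; on the real data, dataset['Survived'].value_counts(normalize=True) should give the same figures.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 549, 1: 342}, name="Survived")

# Divide by the total to get the share of each class
rates = (counts / counts.sum()).round(3)
print(rates)
```

The survival share of about 0.384 agrees with the mean of Survived in the describe() table.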

plt.style.use('ggplot')
sns.set()
sns.set_palette("Set2")

plt.rc('font', size=15)
plt.rc('axes', titlesize=15)
plt.rc('axes', labelsize=15)
plt.rc('xtick', labelsize=15) 
plt.rc('ytick', labelsize=15) 
plt.rc('legend', fontsize=9)
plt.rc('figure', titlesize=15)

plt.rcParams['figure.figsize'] = (10, 10)
plt.rcParams['font.size'] = 10

dataset.hist()
plt.show()

dataset.plot(kind = 'density', subplots = True, layout = (3,3), sharex = False)
plt.show()

plt.rcParams['font.size'] = 15
dataset.plot(kind = 'box', subplots = True, layout = (3,3), sharex = False, sharey=False)
plt.show()

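PassengerId carries no meaning as data.
Survived shows that more passengers died than survived.
Pclass shows that the lower the class, the more passengers there are; the number of deaths per class also looks worth checking.
Age suggests that most passengers were young.
SibSp and Parch show that few passengers travelled with many companions.
Fare is concentrated at the cheap end, probably because third-class passengers are by far the largest group.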

def bar_chart(dataset, feature):
    survived = dataset[dataset['Survived'] == 1][feature].value_counts()
    dead = dataset[dataset['Survived'] == 0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True)
    plt.title("{}".format(feature))
    plt.xticks(rotation = 0)

plt.rc('font', size=20)
plt.rc('axes', titlesize=20)
plt.rc('axes', labelsize=20)
plt.rc('xtick', labelsize=20) 
plt.rc('ytick', labelsize=20) 
plt.rc('legend', fontsize=20)
plt.rc('figure', titlesize=20)

bar_chart(dataset, 'Pclass')

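Among first-class passengers, more survived than died.

In second class the survival rate looks to be roughly 50%.

In third class, by contrast, far more passengers died than survived.

The chart indicates that Pclass has a large influence on predicting whether a passenger survived.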
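The bar chart shows counts; the survival rate per class can be read off more directly with a group mean. A minimal sketch, assuming a hypothetical six-passenger sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical six-passenger sample, not the real dataset
sample = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Survived is 0/1, so the mean within each group is the survival rate
rate_by_class = sample.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

Replacing sample with dataset would give the actual rates for the three classes.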

bar_chart(dataset, 'Sex')

This chart is broken down by sex.

Women have a very high survival rate, while far more men died than survived.

Women appear to have been given priority over men in the rescue.

bar_chart(dataset, 'SibSp')

This chart is broken down by the number of siblings and spouses aboard.

SibSp values of 3 or more are hard to see, so a closer look is needed.

dataset_Sibsp_2 = dataset[(dataset['SibSp'] > 2)]
bar_chart(dataset_Sibsp_2, 'SibSp')

Passengers with a SibSp value of 3 or more do not have a high survival rate.

bar_chart(dataset, 'Embarked')

The ports may differ in how wealthy their passengers were.

The numbers of first-, second-, and third-class passengers should be compared per port of embarkation.

S = dataset[dataset['Embarked'] == 'S']['Pclass'].value_counts()
C = dataset[dataset['Embarked'] == 'C']['Pclass'].value_counts()
Q = dataset[dataset['Embarked'] == 'Q']['Pclass'].value_counts()
df = pd.DataFrame([S, C, Q])
df.index = ['S', 'C', 'Q']
df.plot(kind='bar', stacked=True)
plt.title("Embarked")
plt.xticks(rotation = 0)
# C: Cherbourg, Q: Queenstown, S: Southampton

(array([0, 1, 2]), [Text(0, 0, 'S'), Text(1, 0, 'C'), Text(2, 0, 'Q')])

The share of first-class passengers differs by port of embarkation.

Among passengers who embarked at C, close to half were in first class.
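The class mix per port can also be tabulated as proportions instead of stacked counts. A minimal sketch with a row-normalized crosstab, assuming a hypothetical eight-passenger sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical sample, not the real dataset
sample = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Pclass":   [1,   3,   3,   3,   1,   3,   3,   3],
})

# normalize="index" makes each row (port) sum to 1
share = pd.crosstab(sample["Embarked"], sample["Pclass"], normalize="index")
print(share)
```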

This probably contributes strongly to the survival rate of close to 50 percent seen for port C in the previous chart.

Convert the string values of Sex (male, female) to numbers and store the result in dataset_2.

dataset_2 = dataset.copy()
dic = {"male": 0, "female": 1}
dataset_2['Sex'] = dataset_2['Sex'].map(dic)
dataset_2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	0	A/5 21171	7.250	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	0	PC 17599	71.283	C85	C
2	3	1	3	Heikkinen, Miss. Laina	1	26.0	0	0	STON/O2. 3101282	7.925	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	0	113803	53.100	C123	S
4	5	0	3	Allen, Mr. William Henry	0	35.0	0	0	373450	8.050	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	0	27.0	0	0	211536	13.000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	1	19.0	0	0	112053	30.000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	1	NaN	1	2	W./C. 6607	23.450	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	0	26.0	0	0	111369	30.000	C148	C
890	891	0	3	Dooley, Mr. Patrick	0	32.0	0	0	370376	7.750	NaN	Q

891 rows × 12 columns

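Replace the missing Age values in dataset_2 with the median of Age and store the result in dataset_2.

(The mean and the median differ little, so the median was used rather than the mean, which would introduce fractional ages.)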
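The comparison and the fill can be sketched on a small series. The ages below are hypothetical and only illustrate the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not the real column
age = pd.Series([22.0, 38.0, np.nan, 35.0, 4.0])

# Compare the two candidates before choosing one; both skip NaN
print(age.mean(), age.median())

# Fill the gaps with the median and assign the result back
filled = age.fillna(age.median())
print(filled.tolist())
```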

dataset_2['Age'].describe()

count    714.000
mean      29.699
std       14.526
min        0.420
25%       20.125
50%       28.000
75%       38.000
max       80.000
Name: Age, dtype: float64

dataset_2["Age"] = dataset_2["Age"].fillna(dataset_2["Age"].median())  # assign back instead of inplace=True on a column selection
dataset_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(6), object(4)
memory usage: 83.7+ KB

dataset_2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	0	A/5 21171	7.250	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	0	PC 17599	71.283	C85	C
2	3	1	3	Heikkinen, Miss. Laina	1	26.0	0	0	STON/O2. 3101282	7.925	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	0	113803	53.100	C123	S
4	5	0	3	Allen, Mr. William Henry	0	35.0	0	0	373450	8.050	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	0	27.0	0	0	211536	13.000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	1	19.0	0	0	112053	30.000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	1	28.0	1	2	W./C. 6607	23.450	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	0	26.0	0	0	111369	30.000	C148	C
890	891	0	3	Dooley, Mr. Patrick	0	32.0	0	0	370376	7.750	NaN	Q

891 rows × 12 columns

Add the Age's feature (age group)

0: minor (15 or under)
1: young adult (16 to 30)
2: middle-aged (31 to 45)
3: older adult (46 to 60)
4: senior (61 or over)

dataset_2.loc[dataset_2["Age"]<=15,"Age's"] = 0
dataset_2.loc[(dataset_2["Age"]>15) & (dataset_2["Age"] <= 30),"Age's"] = 1
dataset_2.loc[(dataset_2["Age"]>30) & (dataset_2["Age"] <= 45),"Age's"] = 2
dataset_2.loc[(dataset_2["Age"]>45) & (dataset_2["Age"] <= 60),"Age's"] = 3
dataset_2.loc[dataset_2["Age"]>60,"Age's"] = 4
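The five .loc assignments above use right-closed intervals, so a fractional age such as 15.5 lands in group 1 even though the list describes that group as "16 to 30". An equivalent and more compact sketch with pd.cut, assuming a few hand-picked ages at the bin edges:

```python
import numpy as np
import pandas as pd

# Hand-picked ages around the bin edges, not the real column
age = pd.Series([0.42, 15.0, 15.5, 30.0, 45.0, 60.0, 61.0, 80.0])

# Right-closed bins: (-inf, 15], (15, 30], (30, 45], (45, 60], (60, inf)
bins = [-np.inf, 15, 30, 45, 60, np.inf]
age_group = pd.cut(age, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_group.tolist())
```

Unlike the .loc version, which produces floats such as 1.0, this variant yields integer codes.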

dataset_2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age's
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	0	A/5 21171	7.250	NaN	S	1.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	0	PC 17599	71.283	C85	C	2.0
2	3	1	3	Heikkinen, Miss. Laina	1	26.0	0	0	STON/O2. 3101282	7.925	NaN	S	1.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	0	113803	53.100	C123	S	2.0
4	5	0	3	Allen, Mr. William Henry	0	35.0	0	0	373450	8.050	NaN	S	2.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	0	27.0	0	0	211536	13.000	NaN	S	1.0
887	888	1	1	Graham, Miss. Margaret Edith	1	19.0	0	0	112053	30.000	B42	S	1.0
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	1	28.0	1	2	W./C. 6607	23.450	NaN	S	1.0
889	890	1	1	Behr, Mr. Karl Howell	0	26.0	0	0	111369	30.000	C148	C	1.0
890	891	0	3	Dooley, Mr. Patrick	0	32.0	0	0	370376	7.750	NaN	Q	2.0

891 rows × 13 columns

np.unique(dataset_2["Age's"],return_counts=True)

(array([0., 1., 2., 3., 4.]), array([ 83, 503, 202,  81,  22], dtype=int64))

Bin Fare into categories at the quartiles

[0, 7.910]
(7.910, 14.454]
(14.454, 31]
above 31

dataset_2['Fare'].describe()

count    891.000
mean      32.204
std       49.693
min        0.000
25%        7.910
50%       14.454
75%       31.000
max      512.329
Name: Fare, dtype: float64

dataset_2["Fare"] = pd.qcut(dataset_2['Fare'],4,labels=[0,1,2,3])
dataset_2

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Age's
0	1	0	3	Braund, Mr. Owen Harris	0	22.0	1	0	A/5 21171	0	NaN	S	1.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	0	PC 17599	3	C85	C	2.0
2	3	1	3	Heikkinen, Miss. Laina	1	26.0	0	0	STON/O2. 3101282	1	NaN	S	1.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	0	113803	3	C123	S	2.0
4	5	0	3	Allen, Mr. William Henry	0	35.0	0	0	373450	1	NaN	S	2.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	0	27.0	0	0	211536	1	NaN	S	1.0
887	888	1	1	Graham, Miss. Margaret Edith	1	19.0	0	0	112053	2	B42	S	1.0
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	1	28.0	1	2	W./C. 6607	2	NaN	S	1.0
889	890	1	1	Behr, Mr. Karl Howell	0	26.0	0	0	111369	2	C148	C	1.0
890	891	0	3	Dooley, Mr. Patrick	0	32.0	0	0	370376	0	NaN	Q	2.0

891 rows × 13 columns
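pd.qcut picks the bin edges from the data so that each bin holds roughly the same number of rows. A minimal sketch that also returns the edges, assuming eight hypothetical fares rather than the real column:

```python
import pandas as pd

# Hypothetical fares, not the real column
fare = pd.Series([0.0, 7.25, 7.925, 8.05, 13.0, 23.45, 30.0, 53.1])

# retbins=True also returns the bin edges that qcut computed
codes, edges = pd.qcut(fare, 4, labels=[0, 1, 2, 3], retbins=True)
print(codes.value_counts().sort_index())
print(edges)
```

The result has the category dtype, which is why a later info() call lists Fare as category.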

Ticket, Cabin, and Name are judged to have no value for the analysis and are dropped.

dataset_2.drop('Ticket', axis=1, inplace=True)
dataset_2.drop('Cabin', axis=1, inplace=True)
dataset_2.drop('Name', axis=1, inplace=True)

dataset_2.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       2
Age's          0
dtype: int64

Embarked still has 2 missing values.

dataset_2["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

Fill the missing values with S, the mode of Embarked.

dataset_2["Embarked"] = dataset_2["Embarked"].fillna("S")  # assign back instead of inplace=True on a column selection
dataset_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int64   
 1   Survived     891 non-null    int64   
 2   Pclass       891 non-null    int64   
 3   Sex          891 non-null    int64   
 4   Age          891 non-null    float64 
 5   SibSp        891 non-null    int64   
 6   Parch        891 non-null    int64   
 7   Fare         891 non-null    category
 8   Embarked     891 non-null    object  
 9   Age's        891 non-null    float64 
dtypes: category(1), float64(2), int64(6), object(1)
memory usage: 63.8+ KB

All missing values have now been handled.
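The fill value "S" was read off the value counts by hand. As an alternative sketch, the mode can be computed instead of hard-coded; the series below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sample, not the real column
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores NaN by default; [0] takes the first (most frequent) value
most_common = embarked.mode()[0]
embarked = embarked.fillna(most_common)
print(most_common, embarked.isnull().sum())
```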

dataset_2["Embarked"].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

Convert the string values of Embarked to numbers and store the result in dataset_2.

dic = {"S": 0, "C": 1, "Q": 2}
dataset_2['Embarked'] = dataset_2['Embarked'].map(dic)

dataset_2

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	Age's
0	1	0	3	0	22.0	1	0	0	0	1.0
1	2	1	1	1	38.0	1	0	3	1	2.0
2	3	1	3	1	26.0	0	0	1	0	1.0
3	4	1	1	1	35.0	1	0	3	0	2.0
4	5	0	3	0	35.0	0	0	1	0	2.0
...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	0	27.0	0	0	1	0	1.0
887	888	1	1	1	19.0	0	0	2	0	1.0
888	889	0	3	1	28.0	1	2	2	0	1.0
889	890	1	1	0	26.0	0	0	2	1	1.0
890	891	0	3	0	32.0	0	0	0	2	2.0

891 rows × 10 columns

Take "Survived" as the response variable and separate it into y_test.

y_test=dataset['Survived']
dataset_2.drop('Survived',axis=1,inplace=True)

Use PassengerId as the index of dataset_2.

dataset_2.set_index("PassengerId",inplace=True)

dataset_2

	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	Age's
PassengerId
1	3	0	22.0	1	0	0	0	1.0
2	1	1	38.0	1	0	3	1	2.0
3	3	1	26.0	0	0	1	0	1.0
4	1	1	35.0	1	0	3	0	2.0
5	3	0	35.0	0	0	1	0	2.0
...	...	...	...	...	...	...	...	...
887	2	0	27.0	0	0	1	0	1.0
888	1	1	19.0	0	0	2	0	1.0
889	3	1	28.0	1	2	2	0	1.0
890	1	0	26.0	0	0	2	1	1.0
891	3	0	32.0	0	0	0	2	2.0

891 rows × 8 columns

x_test=dataset_2.copy()

x_test

	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	Age's
PassengerId
1	3	0	22.0	1	0	0	0	1.0
2	1	1	38.0	1	0	3	1	2.0
3	3	1	26.0	0	0	1	0	1.0
4	1	1	35.0	1	0	3	0	2.0
5	3	0	35.0	0	0	1	0	2.0
...	...	...	...	...	...	...	...	...
887	2	0	27.0	0	0	1	0	1.0
888	1	1	19.0	0	0	2	0	1.0
889	3	1	28.0	1	2	2	0	1.0
890	1	0	26.0	0	0	2	1	1.0
891	3	0	32.0	0	0	0	2	2.0

891 rows × 8 columns

x_test dataset description

Pclass : ticket class (1st, 2nd, 3rd)
Sex : sex (0: male, 1: female)
Age : age
Age's : age group
0: minor (15 or under)
1: young adult (16 to 30)
2: middle-aged (31 to 45)
3: older adult (46 to 60)
4: senior (61 or over)
SibSp : number of siblings and spouses aboard
Parch : number of parents and children aboard
Fare : fare
0: [0, 7.910]
1: (7.910, 14.454]
2: (14.454, 31]
3: above 31
Embarked : port of embarkation
(C: Cherbourg, Q: Queenstown, S: Southampton)
0: S
1: C
2: Q

Supervised learning models

model1 : decision tree
model2 : k-nearest neighbors
model3 : logistic regression
model4 : ensemble
model5 : random forest
model6 : artificial neural network

02_1. Decision Tree

model1 = DecisionTreeClassifier()

model1.fit(x_test, y_test)

y_pred1 = model1.predict(x_test)

confusion_matrix(y_test,y_pred1, labels=[1,0])

array([[299,  43],
       [  3, 546]], dtype=int64)

np.round(accuracy_score(y_test,y_pred1),4)

0.9484

np.round(f1_score(y_test,y_pred1),4)

0.9286
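Both scores follow from the confusion matrix above. With labels=[1, 0] the first row and column refer to class 1 (survived), so the four cells can be unpacked and the metrics recomputed by hand:

```python
import numpy as np

# Confusion matrix from above with labels=[1, 0]:
# rows are true classes, columns are predicted classes
cm = np.array([[299, 43],
               [3, 546]])
tp, fn = cm[0]
fp, tn = cm[1]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(float(accuracy), 4), round(float(f1), 4))
```

The rounded values reproduce the 0.9484 and 0.9286 returned by accuracy_score and f1_score.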

f=plt.subplots(figsize=(10,5))
pd.Series(model1.feature_importances_,x_test.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.title('DecisionTreeClassifier')

Text(0.5, 1.0, 'DecisionTreeClassifier')

model1.feature_importances_

array([0.13732302, 0.34262405, 0.28184117, 0.06581069, 0.04331226,
       0.0767628 , 0.0389664 , 0.01335962])

x_test.columns

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Age's'], dtype='object')

values1 = model1.feature_importances_

pd1 = pd.Series(values1, index=x_test.columns[:]).sort_values(ascending=False)
pd1

Sex         0.343
Age         0.282
Pclass      0.137
Fare        0.077
SibSp       0.066
Parch       0.043
Embarked    0.039
Age's       0.013
dtype: float64
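The importances of a scikit-learn tree are normalized, so they should sum to about 1. A quick check on the numbers printed above, copied by hand together with the column order:

```python
import pandas as pd

# Values and column order copied from the outputs above
importances = [0.13732302, 0.34262405, 0.28184117, 0.06581069,
               0.04331226, 0.0767628, 0.0389664, 0.01335962]
columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', "Age's"]

# Rank the features and confirm that the shares add up to 1
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked.index[:3].tolist())
print(round(ranked.sum(), 6))
```

Age and Age's are derived from the same variable, so it seems plausible that they share importance between them; the low value for Age's should be read with that in mind.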

model1 = DecisionTreeClassifier(max_depth=4)
model1.fit(x_test, y_test)
y_pred1 = model1.predict(x_test)
plt.figure(figsize=(22,22))  # set plot size (denoted in inches)
sklearn.tree.plot_tree(model1, fontsize = 10)
plt.show()

survived = dataset[dataset['Survived'] == 1]["Sex"].value_counts()
dead = dataset[dataset['Survived'] == 0]["Sex"].value_counts()
# Index by label rather than by position so the counts cannot be mixed up
print('Male died:', dead['male'], 'Male survived:', survived['male'])
print('Female died:', dead['female'], 'Female survived:', survived['female'])

Male died: 468 Male survived: 109
Female died: 81 Female survived: 233
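These four counts give the survival rate per sex directly. A short sketch using the printed numbers:

```python
import pandas as pd

# Counts copied from the printed output above
counts = pd.DataFrame(
    {"died": [468, 81], "survived": [109, 233]},
    index=["male", "female"],
)

# Survivors divided by the row total (577 men, 314 women)
survival_rate = counts["survived"] / counts.sum(axis=1)
print(survival_rate.round(3))
```

Roughly 19% of men and 74% of women survived, which is consistent with Sex being the first split in the tree.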

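Decision tree accuracy: 0.9484, F1: 0.9286 (both measured on the same data the model was fitted on)

Feature importance is highest for sex, followed by age and then ticket class.

The tree was plotted with the depth limited to 4.

(Without a depth limit the plotted structure is too complex to analyze.)

The plotted tree first splits on sex into male (577) and female (314).

The male branch then splits on age into 6 or under (24) and 7 or over (553).

The branch for males aged 6 or under splits on whether the number of siblings and spouses aboard is 2 or fewer or 3 or more; all 15 boys aged 6 or under with 2 or fewer siblings and spouses aboard survived.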

02_2. K-Nearest Neighbor

model2 = KNeighborsClassifier()
model2.fit(x_test, y_test)
y_pred2 = model2.predict(x_test)

confusion_matrix(y_test,y_pred2, labels=[1,0])

array([[257,  85],
       [ 44, 505]], dtype=int64)

np.round(accuracy_score(y_test,y_pred2),4)

0.8552

np.round(f1_score(y_test,y_pred2),4)

0.7994

K-nearest neighbors accuracy: 0.8552, F1: 0.7994

The model performs worse than the decision tree.

02_3. Logistic Regression

model3 = LogisticRegression()
model3.fit(x_test,y_test)
y_pred3 = model3.predict(x_test)

D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
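The warning above suggests two remedies: scale the features or raise max_iter. A minimal sketch of both together, assuming synthetic stand-in data from make_classification rather than the Titanic features; applying it to the real data would likely change the scores reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scale the features and raise the iteration cap, as the warning suggests
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
n_iter = clf[-1].n_iter_[0]
print(n_iter)
```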

confusion_matrix(y_test,y_pred3,labels=[1,0])

array([[245,  97],
       [ 79, 470]], dtype=int64)

np.round(accuracy_score(y_test,y_pred3),4)

0.8025

np.round(f1_score(y_test,y_pred3),4)

0.7357

Logistic regression accuracy: 0.8025, F1: 0.7357

The model performs worse than the decision tree.

02_5. Ensemble - using a function

model4 = VotingClassifier(estimators=[('DT',DecisionTreeClassifier()),
                                      ('KNN',KNeighborsClassifier()),
                                      ('LR',LogisticRegression())],
                          voting='soft')

model4.fit(x_test,y_test)
y_pred4 = model4.predict(x_test)

D:\Anaconda\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

confusion_matrix(y_test,y_pred4, labels=[1,0])

array([[299,  43],
       [ 15, 534]], dtype=int64)

np.round(accuracy_score(y_test,y_pred4),4)

0.9349

np.round(f1_score(y_test,y_pred4),4)

0.9116

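Ensemble accuracy: 0.9349, F1: 0.9116

The ensemble was built from the decision tree, k-nearest neighbors, and logistic regression.

It performs better than k-nearest neighbors and logistic regression,

but worse than the decision tree.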
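With voting='soft', the ensemble averages the class probabilities of its members and predicts the class with the larger average. The sketch below illustrates the rule with hypothetical probabilities, not values taken from the fitted models:

```python
import numpy as np

# Hypothetical class probabilities [P(class 0), P(class 1)] from three
# models for two passengers; not taken from the fitted models
proba_dt = np.array([[1.0, 0.0], [0.0, 1.0]])
proba_knn = np.array([[0.4, 0.6], [0.6, 0.4]])
proba_lr = np.array([[0.3, 0.7], [0.7, 0.3]])

# Soft voting: average the probabilities, then pick the larger one
avg = np.mean([proba_dt, proba_knn, proba_lr], axis=0)
pred = avg.argmax(axis=1)
print(avg.round(3), pred)
```

In this example a single fully confident member outweighs two mildly disagreeing ones. That may be part of why the ensemble's result on the fitting data stays close to the decision tree's, though this is an interpretation rather than something the outputs above establish.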

03.Random Forests

model5 = RandomForestClassifier()
model5.fit(x_test,y_test)
y_pred5 = model5.predict(x_test)

confusion_matrix(y_test,y_pred5, labels=[1,0])

array([[308,  34],
       [ 12, 537]], dtype=int64)

np.round(accuracy_score(y_test,y_pred5),4)

0.9484

np.round(f1_score(y_test,y_pred5),4)

0.9305

f=plt.subplots(figsize=(10,5))
pd.Series(model5.feature_importances_,x_test.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.title('RandomForestClassifier')

Text(0.5, 1.0, 'RandomForestClassifier')

values2 = model5.feature_importances_

pd2 = pd.Series(values2, index=x_test.columns[:]).sort_values(ascending=False)
pd2

Age         0.288
Sex         0.287
Pclass      0.112
Fare        0.094
SibSp       0.064
Age's       0.057
Parch       0.050
Embarked    0.048
dtype: float64

Random forest accuracy: 0.9484, F1: 0.9305

In the random forest, feature importance is highest for age, followed by sex and then ticket class.

The model performs best among the models so far.

Neural network (multi-layer perceptron, MLP)

model6 = MLPClassifier(hidden_layer_sizes = (10,), activation = 'relu', solver = 'adam', 
                      batch_size = 'auto', learning_rate_init = 0.01, max_iter = 1000)

model6.fit(x_test, y_test)
y_pred6 = model6.predict(x_test)

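hidden_layer_sizes : number of hidden layers and nodes per layer / default (100,)
activation : activation function / default 'relu' (rectified linear unit); also 'identity', 'logistic', 'tanh'
solver : weight optimization algorithm / default 'adam'; also 'lbfgs', 'sgd'
batch_size : mini-batch size / default 'auto' = min(200, number of samples)
learning_rate_init : initial learning rate / default 0.001
max_iter (epochs) : maximum number of iterations / default 200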
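hidden_layer_sizes takes one entry per hidden layer, so a single layer of 10 nodes is written as the tuple (10,). A minimal sketch, assuming synthetic stand-in data and a fixed random_state that the original call does not set:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# (10) is just the integer 10; (10,) is a one-element tuple
print(type((10)), type((10,)))

# Synthetic stand-in data, not the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(10,), learning_rate_init=0.01,
                    max_iter=1000, random_state=0)
mlp.fit(X, y)

# One weight matrix per layer transition: input->hidden, hidden->output
shapes = [w.shape for w in mlp.coefs_]
print(shapes)
```

A second hidden layer would be added as another tuple entry, for example (10, 5).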

confusion_matrix(y_test,y_pred6, labels=[1,0])

array([[242, 100],
       [ 87, 462]], dtype=int64)

np.round(accuracy_score(y_test,y_pred6),4)

0.7901

np.round(f1_score(y_test,y_pred6),4)

0.7213

Neural network accuracy: 0.7901, F1: 0.7213

The model does not perform very well.

Changing the number of hidden layers and the learning rate does not improve it much.

Model accuracy / F1

Decision tree : 0.9484 / 0.9286
K-nearest neighbors : 0.8552 / 0.7994
Logistic regression : 0.8025 / 0.7357
Ensemble : 0.9349 / 0.9116
Random forest : 0.9484 / 0.9305
Neural network : 0.7901 / 0.7213

The random forest, which performed best, is used to check the predictions on x_test.

prediction = model5.predict(x_test)
pred = pd.DataFrame({"PassengerId" : x_test.index, 
                    "Survived" : prediction})
pred.head(10)

	PassengerId	Survived
0	1	0
1	2	1
2	3	1
3	4	1
4	5	0
5	6	0
6	7	0
7	8	0
8	9	1
9	10	1

test = np.array(pred["Survived"]==dataset["Survived"])

np.unique(test, return_counts=True)

(array([False,  True]), array([ 46, 845], dtype=int64))

845/(46+845)

0.9483726150392817

The accuracy matches the random forest accuracy reported above.
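The match is expected, because the check predicts on the same rows the model was fitted on. For comparison, the sketch below scores a model on rows it has not seen. It assumes synthetic stand-in data and a hypothetical 80/20 split, neither of which appears in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the Titanic features
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # rows the model has seen
hold_acc = model.score(X_hold, y_hold)      # rows the model has not seen
print(train_acc, hold_acc)
```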
