์บ”์— ์ธ์‡„๋œ ์•Œ์ฝ”์˜ฌ, ๋„์ˆ˜, ๋‹น๋„, pH ๊ฐ’์œผ๋กœ ์™€์ธ ์ข…๋ฅ˜ ๊ตฌ๋ณ„

 



## Classifying wine with logistic regression

import pandas as pd
wine = pd.read_csv('https://bit.ly/wine-date')

 

์ƒ˜ํ”Œ ํ™•์ธ - head

* head() : ์ฒ˜์Œ n๊ฐœ์˜ ์ƒ˜ํ”Œ ํ™•์ธ (๊ธฐ๋ณธ๊ฐ’ 5)

wine.head()

์ฒ˜์Œ 3๊ฐœ ์—ด(alchol, sugar pH) - ๊ฐ๊ฐ ์•Œ์ฝ”์˜ฌ ๋„์ˆ˜, ๋‹น๋„, pH๊ฐ’ ๋‚˜ํƒ€๋ƒ„

๋„ค ๋ฒˆ์งธ ์—ด(class) ํƒ€๊นƒ๊ฐ’ - 0 ๋ ˆ๋“œ์™€์ธ, 1 ํ™”์ดํŠธ ์™€์ธ

-> ๋ ˆ๋“œ์™€์ธ vs ํ™”์ดํŠธ ์•„์ธ ๊ตฌ๋ถ„ํ•˜๋Š” ์ด์ง„ ๋ถ„๋ฅ˜

ํ™”์ดํŠธ ์™€์ธ์ด ์–‘์„ฑ ํด๋ž˜์Šค -> ์ „์ฒด ์™€์ธ ๋ฐ์ดํ„ฐ์—์„œ ํ™”์ดํŠธ ์™€์ธ ๊ณจ๋ผ๋‚ด๋Š” ๋ฌธ์ œ

 

 

์ƒ˜ํ”Œ ํ™•์ธ - info

* info() : ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๊ฐ ์—ด์˜ ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๋ˆ„๋ž๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธ

wine.info()

Samples - 6,497     /    columns - 4 (all floats)    /    non-null count is 6497 -> no missing values

 

 

์ƒ˜ํ”Œ ํ™•์ธ - describe()

* describe() : ์—ด์— ๋Œ€ํ•œ ๊ฐ„๋žตํ•œ ํ†ต๊ณ„ ์ถœ๋ ฅ. ์ตœ์†Œ, ์ตœ๋Œ€, ํ‰๊ท ๊ฐ’ ๋“ฑ์„ ๋ณผ ์ˆ˜ ์žˆ์Œ

wine.describe()

Shows the mean, standard deviation (std), min, max, median (50%), 1st quartile (25%), and 3rd quartile (75%)

-> alcohol content, sugar, and pH have different scales -> the features need to be standardized with the StandardScaler class

 

 

ํŒ๋‹ค์Šค ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ -> ๋„˜ํŒŒ์ด ๋ฐฐ์—ด ๋ณ€ํ™˜

wine ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ์ฒ˜์Œ 3๊ฐœ ์—ด ๋„˜ํŒŒ์ด ๋ฐฐ์—ด๋กœ ๋ฐ”๊ฟ”์„œ data ์ €์žฅ,

๋งˆ์ง€๋ง‰ class์—ด ๋„˜ํŒŒ์ด ๋ฐฐ์—ด๋กœ ๋ฐ”๊ฟ”์„œ target ๋ฐฐ์—ด์— ์ €์žฅ

data = wine[['alcohol','sugar','pH']].to_numpy()
target = wine['class'].to_numpy()

 

Splitting into a training set and a test set

train_test_split under sklearn.model_selection

* train_test_split() : splits the data into training and test sets (test_size sets the test-set fraction; if omitted, 25%)

from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(data, target, test_size=0.2, random_state=42)

 

Checking the sizes of the training and test sets

* shape : checks the size

print(train_input.shape, test_input.shape)

(5197, 3) (1300, 3)   -> 5,197 training samples  /  1,300 test samples

 

 

Preprocessing

* StandardScaler in sklearn.preprocessing : preprocessing (feature standardization)

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(train_input)
train_scaled = ss.transform(train_input)
test_scaled = ss.transform(test_input)

 

 

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ๋ชจ๋ธ ํ›ˆ๋ จ

* sklearn.linear_model  ์•ˆ์˜ LogisticRegression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(train_scaled, train_target)
print(lr.score(train_scaled, train_target))
print(lr.score(test_scaled, test_target))

0.7808350971714451

0.7776923076923077

-> both the training-set and test-set scores are low -> underfitting

 

 

Printing the coefficients and intercept

The coefficients and intercept learned by the logistic regression

* .coef_ : coefficients

* .intercept_ : intercept

print(lr.coef_, lr.intercept_)

[[ 0.51270274 1.6733911 -0.68767781]] [1.81777902]

 

์ด ๋ชจ๋ธ์€ ์•Œ์ฝ”์˜ฌ ๋„์ˆ˜์— 0.51270274๋ฅผ ๊ณฑํ•˜๊ณ , ๋‹น๋„์— 1.6733911์„ ๊ณฑํ•˜๊ณ , pH์— -0.68767781์„ ๊ณฑํ•œ ๋‹ค์Œ ๋ชจ๋‘ ๋”ํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ 1.81777902๋ฅผ ๋”ํ•œ๋‹ค. ์ด ๊ฐ’์ด 0๋ณด๋‹ค ํฌ๋ฉด ํ™”์ดํŠธ ์™€์ธ, ์ž‘์œผ๋ฉด ๋ ˆ๋“œ์™€์ธ์ด๋‹ค. ํ˜„์žฌ ์•ฝ 77% ์ •ํ™•๋„๋กœ ํ™”์ดํŠธ ์™€์ธ์„ ๋ถ„๋ฅ˜ํ–ˆ๋‹ค......  ~~~~~~~>  ๋ณด๊ณ ์„œ ์ดํ•ด ๋ถˆ๊ฐ€! ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋Š” ์„ค๋ช…์ด ์–ด๋ ค์›€!

 


 

05-1. Decision trees

A model whose reasoning is easy to explain

A decision tree asks questions one at a time and narrows down the answer (like twenty questions)

If you can keep finding questions that split the data well, adding more questions keeps raising the classification accuracy

 

๊ฒฐ์ •ํŠธ๋ฆฌ ๋ชจ๋ธ ํ›ˆ๋ จ

* sklearn.tree ์•ˆ์— DecisionTreeClassifier ํด๋ž˜์Šค ์‚ฌ์šฉ

* fit() : ๋ชจ๋ธ ํ›ˆ๋ จ

* score() : ์ •ํ™•๋„ ํ‰๊ฐ€

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(train_scaled, train_target)
print(dt.score(train_scaled, train_target)) # training set
print(dt.score(test_scaled, test_target))   # test set

0.996921300750433

0.8592307692307692

-> ํ›ˆ๋ จ ์„ธํŠธ ์ ์ˆ˜ ์—„์ฒญ ๋†’์Œ -> ๊ณผ๋Œ€์ ํ•ฉ

 

 

๊ฒฐ์ •ํŠธ๋ฆฌ ๋ชจ๋ธ ํ›ˆ๋ จ

* plot_tree() : ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์‰ฌ์šด ํŠธ๋ฆฌ ๊ทธ๋ฆผ์œผ๋กœ ์ถœ๋ ฅ

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(10, 7))
plot_tree(dt)
plt.show()

๊ฒฐ์ •ํŠธ๋ฆฌ๋Š” ์œ„์—์„œ๋ถ€ํ„ฐ ์•„๋ž˜๋กœ ๊ฑฐ๊พธ๋กœ ์ž๋ผ๋‚จ

๋…ธ๋“œ : ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ํ•ต์‹ฌ ์š”์†Œ, ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์— ๋Œ€ํ•œ ํ…Œ์ŠคํŠธ๋ฅผ ํ‘œํ˜„

๋ฃจํŠธ ๋…ธ๋“œ : ๋งจ ์œ„์˜ ๋…ธ๋“œ    /  ๋ฆฌํ”„ ๋…ธ๋“œ : ๋งจ ์•„๋ž˜ ๋์— ๋‹ฌ๋ฆฐ ๋…ธ๋“œ

๊ฐ€์ง€ : ํ…Œ์ŠคํŠธ์˜ ๊ฒฐ๊ณผ(Ture, False)

์ผ๋ฐ˜์ ์œผ๋กœ ํ•˜๋‚˜์˜ ๋…ธ๋“œ๋Š” 2๊ฐœ์˜ ๊ฐ€์ง€ ๊ฐ€์ง

 

 

ํŠธ๋ฆฌ ๊นŠ์ด ์ œํ•œ ์ถœ๋ ฅ

* max_depth : ๊นŠ์ด (1 - ๋ฃจํ”„ ๋…ธ๋“œ๋ฅผ ์ œ์™ธํ•˜๊ณ  ํ•˜๋‚˜์˜ ๋…ธ๋“œ๋ฅผ ๋” ํ™•์žฅ)

* filled : ํด๋ž˜์Šค์— ๋งž๊ฒŒ ๋…ธ๋“œ ์ƒ‰์น 

* feature_names : ํŠน์„ฑ์˜ ์ด๋ฆ„ ์ „๋‹ฌ

plt.figure(figsize=(10,7))
plot_tree(dt, max_depth=1, filled=True, feature_names=['alcohol', 'sugar', 'pH'])
plt.show()

 

> ๊ทธ๋ฆผ์ด ๋‹ด๊ณ  ์žˆ๋Š” ์ •๋ณด

 

๋ฃจํ”„๋…ธ๋“œ ๋‹น๋„ -0.239 ์ดํ•˜์ธ์ง€ ์งˆ๋ฌธ

-0.239์™€ ๊ฐ™๊ฑฐ๋‚˜ ์ž‘์œผ๋ฉด ์™ผ์ชฝ ๊ฐ€์ง€(yes)

๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ์˜ค๋ฅธ์ชฝ ๊ฐ€์ง€(no) ์ด๋™

 

๋ฃจํ”„๋…ธ๋“œ ์ƒ˜ํ”Œ์ˆ˜ = 5197

์ด ์ค‘์—์„œ ์Œ์„ฑํด๋ž˜์Šค(๋ ˆ๋“œ์™€์ธ) 1258๊ฐœ, ์–‘์„ฑํด๋ž˜์Šค(ํ™”์ดํŠธ์™€์ธ) 3939๊ฐœ

 

* plot_tree() ํ•จ์ˆ˜์—์„œ filled=True๋กœ ์ง€์ •ํ•˜๋ฉด ํด๋ž˜์Šค๋งˆ๋‹ค ์ƒ‰๊น”์„ ๋ถ€์—ฌ,

                 ์–ด๋–ค ํด๋ž˜์Šค์˜ ๋น„์œจ์ด ๋†’์•„์ง€๋ฉด ์ ์  ์ง„ํ•œ ์ƒ‰์œผ๋กœ ํ‘œ์‹œ

 

๋ฆฌํ”„ ๋…ธ๋“œ์—์„œ ๊ฐ€์žฅ ๋งŽ์€ ํด๋ž˜์Šค ->์˜ˆ์ธก ํด๋ž˜์Šค

 

๋ถˆ์ˆœ๋„

gini ์ง€๋‹ˆ ๋ถˆ์ˆœ๋„

 

* criterion : ๋…ธ๋“œ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„ํ• ํ•  ๊ธฐ์ค€ ์ •ํ•จ

                  DecisionTreeClassifier ํด๋ž˜์Šค์˜ criterion ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ธฐ๋ณธ๊ฐ’ 'gini'

 

์ง€๋‹ˆ ๋ถˆ์ˆœ๋„๋Š” ํด๋ž˜์Šค์˜ ๋น„์œจ์„ ์ œ๊ณฑํ•ด์„œ ๋”ํ•œ ๋‹ค์Œ 1์—์„œ ๋นผ๋ฉด ๋จ

์ง€๋‹ˆ ๋ถˆ์ˆœ๋„ = 1 - (์Œ์„ฑ ํด๋ž˜์Šค ๋น„์œจ^2 + ์–‘์„ฑ ํด๋ž˜์Šค ๋น„์œจ^2)

 

ex)  ๋ฃจํŠธ ๋…ธ๋“œ์˜ ์ง€๋‹ˆ ๋ถˆ์ˆœ๋„ ๊ณ„์‚ฐ

์ƒ˜ํ”Œ ์ด 5197๊ฐœ, ์Œ์„ฑ ํด๋ž˜์Šค 1258๊ฐœ, ์Œ์„ฑ ํด๋ž˜์Šค 3939๊ฐœ

๋ฃจํ”„๋…ธ๋“œ ์ง€๋‹ˆ ๋ถˆ์ˆœ๋„ = 1 - ((1258/5197)^2 + (3939/5197)^2) = 0.367

 

๋ถˆ์ˆœ๋„ 0.5 = ์ตœ์•…

์ˆœ์ˆ˜ ๋…ธ๋“œ : ๋ถˆ์ˆœ๋„ 0 
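
The root-node calculation written out as a small sketch (the counts are the ones quoted above from the tree plot):

import numpy as np

counts = np.array([1258, 3939])      # red (negative) / white (positive) samples in the root node
ratios = counts / counts.sum()       # class proportions out of 5197
gini = 1 - np.sum(ratios ** 2)
print(round(gini, 3))                # 0.367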

 

์ •๋ณด์ด๋“

๋ถ€๋ชจ์™€ ์ž์‹ ๋…ธ๋“œ ์‚ฌ์ด์˜ ๋ถˆ์ˆœ๋„ ์ฐจ์ด

๊ฒฐ์ • ํŠธ๋ฆฌ ๋ชจ๋ธ์€ ๋ถ€๋ชจ ๋…ธ๋“œ์™€ ์ž์‹ ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„ ์ฐจ์ด๊ฐ€ ๊ฐ€๋Šฅํ•œ ํฌ๋„๋ก ํŠธ๋ฆฌ๋ฅผ ์„ฑ์žฅ์‹œํ‚ด

- > ๋ถ€๋ชจ ๋…ธ๋“œ์™€ ์ž์‹ ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„ ์ฐจ์ด ๊ณ„์‚ฐ 

 

์ž์‹ ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„๋ฅผ ์ƒ˜ํ”Œ ๊ฐœ์ˆ˜์— ๋น„๋ก€ํ•˜์—ฌ ๋ชจ๋‘ ๋”ํ•œ ๋‹ค์Œ ๋ถ€๋ชจ ๋…ธ๋“œ์˜ ๋ถˆ์ˆœ๋„์— ๋นผ๋ฉด ๋จ

์ •๋ณด ์ด๋“ = ๋ถ€๋ชจ์˜ ๋ถˆ์ˆœ๋„ -(์™ผ์ชฝ ๋…ธ๋“œ ์ƒ˜ํ”Œ ์ˆ˜ / ๋ถ€๋ชจ ์ƒ˜ํ”Œ ์ˆ˜) x ์™ผ์ชฝ ๋…ธ๋“œ ๋ถˆ์ˆœ๋„ - (์˜ค๋ฅธ์ชฝ ๋…ธ๋“œ ์ƒ˜ํ”Œ ์ˆ˜ / ๋ถ€๋ชจ ์ƒ˜ํ”Œ ์ˆ˜) x ์˜ค๋ฅธ์ชฝ ๋…ธํŠธ ๋ถˆ์ˆœ๋„

 

ex) the root node is the parent node, and the left and right nodes are its children

2,922 samples go to the left node and 2,275 samples to the right node

Information gain = 0.367 - (2922/5197) × 0.481 - (2275/5197) × 0.069 = 0.066

 

=> ๊ฒฐ์ • ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ถˆ์ˆœ๋„ ๊ธฐ์ค€์„ ์‚ฌ์šฉํ•ด ์ •๋ณด ์ด๋“์ด ์ตœ๋Œ€๊ฐ€ ๋˜๋„๋ก ๋…ธ๋“œ๋ฅผ ๋ถ„ํ• 

๋…ธ๋“œ๋ฅผ ์ˆœ์ˆ˜ํ•˜๊ฒŒ ๋‚˜๋ˆŒ์ˆ˜๋ก ์ •๋ณด ์ด๋“์ด ์ปค์ง

์ƒˆ๋กœ์šด ์ƒ˜ํ”„์— ๋Œ€ํ•ด ์˜ˆ์ธกํ•  ๋•Œ์—๋Š” ๋…ธ๋“œ์˜ ์งˆ๋ฌธ์— ๋”ฐ๋ผ ํŠธ๋ฆฌ ์ด๋™

๋งˆ์ง€๋ง‰์— ๋„๋‹ฌํ•œ ๋…ธ๋“œ์˜ ํด๋ž˜์Šค ๋น„์œจ์„ ๋ณด๊ณ  ์˜ˆ์ธก ๋งŒ๋“ฆ

 

 

Pruning

Without pruning, the tree just keeps growing until it cannot grow any further

It fits the training set very well, but the test-set score falls short -> overfitting -> poor generalization

* max_depth : sets the maximum depth the tree may grow to

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(train_scaled, train_target)
print(dt.score(train_scaled, train_target))
print(dt.score(test_scaled, test_target))

0.8454877814123533

0.8415384615384616

-> the training-set score dropped, but the test-set score stayed almost the same

 

Tree plot

* plot_tree() : tree plot

plt.figure(figsize=(20,15))
plot_tree(dt, filled=True, feature_names=['alcohol', 'sugar', 'pH'])
plt.show()

 

 

๊ทธ๋Ÿฐ๋ฐ -0.802๋ผ๋Š” ์Œ์ˆ˜๋กœ ๋œ ๋‹น๋„ ... ?

๋ถˆ์ˆœ๋„ ๊ธฐ์ค€์œผ๋กœ ์ƒ˜ํ”Œ์„ ๋‚˜๋ˆ”. ๋ถˆ์ˆœ๋„๋Š” ํด๋ž˜์Šค๋ณ„ ๋น„์œจ์„ ๊ฐ€์ง€๊ณ  ๊ณ„์‚ฐ

์ƒ˜ํ”Œ์„ ์–ด๋–ค ํด๋ž˜์Šค ๋น„์œจ๋กœ ๋‚˜๋ˆ„๋Š”์ง€ ๊ณ„์‚ฐํ•  ๋•Œ ํŠน์„ฑ๊ฐ’์˜ ์Šค์ผ€์ผ ์˜ํ–ฅ ๋ฏธ์น˜์ง€ ์•Š์Œ

=> ๊ฒฐ์ • ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํ‘œ์ค€ํ™” ์ „์ฒ˜๋ฆฌ ํ•  ํ•„์š” ์—†์Œ

 

 

Training a decision tree on the training and test sets before preprocessing

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(train_input, train_target)
print(dt.score(train_input, train_target))
print(dt.score(test_input, test_target))

0.8454877814123533

0.8415384615384616

 

Tree plot

plt.figure(figsize=(20,15))
plot_tree(dt, filled=True, feature_names=['alcohol', 'sugar', 'pH'])
plt.show()

๊ฒฐ๊ณผ๋Š” ๊ฐ™์€ ํŠธ๋ฆฌ์ง€๋งŒ, ํŠน์„ฑ๊ฐ’์„ ํ‘œ์ค€์ ์ˆ˜๋กœ ๋ฐ”๊พธ์ง€ ์•Š์•„ ์ดํ•ดํ•˜๊ธฐ ํ›จ์”ฌ ์‰ฌ์›€

๋‹น๋„๊ฐ€  1.625๋ณด๋‹ค ํฌ๊ณ  4.325๋ณด๋‹ค ์ž‘์€ ์™€์ธ ์ค‘์— ์•Œ์ฝ”์˜ฌ ๋„์ˆ˜๊ฐ€ 11.025์™€ ๊ฐ™๊ฑฐ๋‚˜ ์ž‘์€ ๊ฒƒ์ด ๋ ˆ๋“œ ์™€์ธ, ๊ทธ์™ธ ํ™”์ดํŠธ ์™€์ธ

 

ํŠน์„ฑ ์ค‘์š”๋„

* feature_importances_  : ์–ด๋–ค ํŠน์„ฑ์ด ๊ฐ€์žฅ ์œ ์šฉํ•œ์ง€ ํŠน์„ฑ ์ค‘์š”๋„

print(dt.feature_importances_)

[0.12345626 0.86862934 0.0079144 ]

-> the second feature, sugar, has the highest importance at about 0.87, followed by alcohol content and then pH

 

ํŠน์„ฑ ์ค‘์š”๋„ ๊ฐ’์„ ๋ชจ๋‘ ๋”ํ•˜๋ฉด 1

ํŠน์„ฑ ์ค‘์š”๋„๋Š” ๊ฐ ๋…ธ๋“œ์˜ ์ •๋ณด ์ด๋“๊ณผ ์ „์ฒด ์ƒ˜ํ”Œ์— ๋Œ€ํ•œ ๋น„์œจ์„ ๊ณฑํ•œ ํ›„ ํŠน์„ฑ๋ณ„๋กœ ๋”ํ•˜์—ฌ ๊ณ„์‚ฐ

ํŠน์„ฑ ์ค‘์š”๋„๋ฅผ ํ™œ์šฉํ•˜๋ฉด ๊ฒฐ์ • ํŠธ๋ฆฌ ๋ชจ๋ธ์„ ํŠน์„ฑ ์„ ํƒ์— ํ™œ์š”ํ•  ์ˆ˜ ์žˆ์Œ

 

 


 

05-2. ๊ต์ฐจ ๊ฒ€์ฆ๊ณผ ๊ทธ๋ฆฌ๋“œ ์„œ์น˜

์ง€๊ธˆ๊นŒ์ง€ ํ›ˆ๋ จ์„ธํŠธ์—์„œ ๋ชจ๋ธ ํ›ˆ๋ จ, ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ ๋ชจ๋ธ ํ‰๊ฐ€ํ•จ

ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ ์–ป์€ ์ ์ˆ˜๋ฅผ ๋ณด๊ณ  ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ๊ฐ€๋Š 

๊ทธ๋Ÿฐ๋ฐ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•ด ์ž๊พธ ์„ฑ๋Šฅ์„ ํ™•์ธํ•˜๋‹ค ๋ณด๋ฉด ์ ์  ํ…Œ์ŠคํŠธ ์„ธํŠธ์— ๋งž์ถ”๊ฒŒ ๋˜๋Š” ์…ˆ

ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์˜ˆ์ธกํ•˜๋ ค๋ฉด ๊ฐ€๋Šฅํ•œ ํ•œ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ๋ง์•„์•ผ ํ•จ

๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ๋‚˜์„œ ๋งˆ์ง€๋ง‰์— ๋”ฑ ํ•œ ๋ฒˆ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹์Œ

 

๊ฒ€์ฆ ์„ธํŠธ

ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์œผ๋ฉด ๋ชจ๋ธ์ด ๊ณผ๋Œ€/๊ณผ์†Œ์ ํ•ฉ์ธ์ง€ ํŒ๋‹จํ•˜๊ธฐ ์–ด๋ ค์›€

ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ์ด๋ฅผ ์ธก์ •ํ•˜๋Š” ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ๋˜ ๋‚˜๋ˆˆ ๊ฒƒ -> ์ด ๋ฐ์ดํ„ฐ๋ฅผ '๊ฒ€์ฆ ์„ธํŠธ'๋ผ๊ณ  ๋ถ€๋ฆ„

 

ํ›ˆ๋ จ ์„ธํŠธ - ๋ชจ๋ธ ํ›ˆ๋ จ   /   ๊ฒ€์ฆ ์„ธํŠธ - ๋ชจ๋ธ ํ‰๊ฐ€    /    ํ…Œ์ŠคํŠธ ์„ธํŠธ - ์ตœ์ข… ์ ์ˆ˜ ํ‰๊ฐ€

 

 

๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import pandas as pd

wine = pd.read_csv('https://bit.ly/wine-date')

 

Storing the target and feature arrays

class column - target  /  remaining columns - stored as the feature array

data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

 

 

Splitting off the training / test sets

Apply the train_test_split() function twice to also carve a validation set out of the training set

 

Store the training set's input data and target data in the train_input and train_target arrays

from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = train_test_split(data, target, test_size=0.2, random_state=42)

 

Pass train_input and train_target to train_test_split() again

to make the (sub-)training set sub_input, sub_target  /  and the validation set val_input, val_target

sub_input, val_input, sub_target, val_target = train_test_split(train_input, train_target, test_size=0.2, random_state=42)

 

 

ํ›ˆ๋ จ ์„ธํŠธ, ๊ฒ€์ฆ ์„ธํŠธ ํฌ๊ธฐ ํ™•์ธ

print(sub_input.shape, val_input.shape)

(4157, 3) (1040, 3)

-> the training set that originally had 5,197 samples shrinks to 4,157, and the validation set gets 1,040 samples

 

 

๋ชจ๋ธ ์ƒ์„ฑ ํ›„ ํ‰๊ฐ€

sub_input, sub_target, val_input, val_target ์‚ฌ์šฉํ•ด ๋ชจ๋ธ ๋งŒ๋“ค๊ณ  ํ‰๊ฐ€

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)
dt.fit(sub_input, sub_target)
print(dt.score(sub_input, sub_target))
print(dt.score(val_input, val_target))

 

 

๊ต์ฐจ ๊ฒ€์ฆ

๊ฒ€์ฆ ์„ธํŠธ ๋งŒ๋“œ๋А๋ผ ํ›ˆ๋ จ ์„ธํŠธ ์ค„์—ˆ์Œ - ๋ณดํ†ต ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ํ›ˆ๋ จ์— ์‚ฌ์šฉํ•  ์ˆ˜๋ก ์ข‹์€ ๋ชจ๋ธ์ด ๋งŒ๋“ค์–ด์ง

๊ฒ€์ฆ ์„ธํŠธ๋ฅผ ๋„ˆ๋ฌด ์กฐ๊ธˆ๋งŒ ๋–ผ์–ด ๋†“์œผ๋ฉด ๊ฒ€์ฆ ์ ์ˆ˜ ๋ถˆ์•ˆ์ •

=> ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ด์šฉ!

 

๊ต์ฐจ๊ฒ€์ฆ : ๊ฒ€์ฆ ์„ธํŠธ๋ฅผ ๋—ด์–ด ๋‚ด์—ฌ ํ‰๊ฐ€ํ•˜๋Š” ๊ณผ์ •์„ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณต

                  ๊ทธ ๋‹ค์Œ ์ด ์ ์ˆ˜๋ฅผ ํ‰๊ท ํ•˜์—ฌ ์ตœ์ • ๊ฒ€์ฆ ์ ์ˆ˜ ์–ป์Œ

 

k-ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ : ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ๋ช‡ ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋ˆ„๋ƒ์— ๋”ฐ๋ผ..

๊ฐ ํด๋“œ์—์„œ ๊ณ„์‚ฐํ•œ ๊ฒ€์ฆ ์ ์ˆ˜๋ฅผ ํ‰๊ท ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์•ˆ์ •์ ์ธ ์ ์ˆ˜ ์–ป์„ ์ˆ˜ ์žˆ์Œ

 

 

๊ต์ฐจ ๊ฒ€์ฆ

* cross_validate() : ๊ต์ฐจ ๊ฒ€์ฆ ํ•จ์ˆ˜ (๊ธฐ๋ณธ๊ฐ’ 5-ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ)

                             cross_validate(ํ‰๊ฐ€ํ•  ๋ชจ๋ธ ๊ฐ์ฒด, ํ›ˆ๋ จ์„ธํŠธ ์ „์ฒด)

from sklearn.model_selection import cross_validate
scores = cross_validate(dt, train_input, train_target)
print(scores)

{'fit_time': array([0.01800036, 0.00823307, 0.00744772, 0.00747132, 0.00703192]), 'score_time': array([0.0013876 , 0.00073624, 0.00069189, 0.00070477, 0.00066996]), 'test_score': array([0.86923077, 0.84615385, 0.87680462, 0.84889317, 0.83541867])}

 

-> returns a dictionary with fit_time, score_time, and test_score keys

* fit_time, score_time : the time spent training and the time spent validating the model in each fold

 

cross_validate() defaults to 5-fold cross-validation

* the number of folds can be changed with the cv parameter, as in the example below
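
For example (a hedged sketch reusing dt and the training set from above), 10-fold cross-validation can be requested like this:

scores = cross_validate(dt, train_input, train_target, cv=10)   # 10 folds instead of the default 5
print(len(scores['test_score']))                                # 10 validation scores, one per fold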

 

 

๊ต์ฐจ ๊ฒ€์ฆ ์ตœ์ข… ์ ์ˆ˜

* test_scroe : ๊ฒ€์ฆ ํด๋“œ์˜ ์ ์ˆ˜

๊ต์ฐจ ๊ฒ€์ฆ์˜ ์ตœ์ข… ์ ์ˆ˜๋Š” test_score ํ‚ค์— ๋‹ด๊ธด ์ ์ˆ˜ ํ‰๊ท 

import numpy as np
print(np.mean(scores['test_score']))

0.855300214703487

 

 

์ฃผ์˜. cross_validate() ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ์„ž์–ด ํด๋“œ ๋‚˜๋ˆ„์ง€ x

train_test_split() ํ•จ์ˆ˜๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ์„ž์€ ํ›„ ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ์ค€๋น„ -> ๋”ฐ๋กœ ์„ž์„ ํ•„์š” ์—†์Œ

๊ต์ฐจ ๊ฒ€์ฆ์„ ํ•  ๋•Œ ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ์„ž์œผ๋ ค๋ฉด ๋ถ„ํ• ๊ธฐ ์ง€์ •ํ•ด์•ผ ํ•จ

 

๋ถ„ํ• ๊ธฐ : ๊ต์ฐจ ๊ฒ€์ฆ์—์„œ ํด๋“œ๋ฅผ ์–ด๋–ป๊ฒŒ ๋‚˜๋ˆŒ์ง€ ๊ฒฐ์ •

* .cross_validate() ๋Š” ํšŒ๊ท€ ๋ชจ๋ธ - KFold ๋ถ„ํ• ๊ธฐ   /    ๋ถ„๋ฅ˜ ๋ชจ๋ธ - StratifiedKFold ์‚ฌ์šฉ

 

์•ž์—์„œ ์ˆ˜ํ–‰ํ•œ ์ฝ”๋“œ๋Š” ๋‹ค์Œ ์ฝ”๋“œ์™€ ๋™์ผ

from sklearn.model_selection import StratifiedKFold

scores = cross_validate(dt, train_input, train_target, cv=StratifiedKFold())
print(np.mean(scores['test_score']))

0.855300214703487

 

 

ํ›ˆ๋ จ ์„ธํŠธ ์„ž์€ ํ›„, 10-ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆ

* n_splits : ๋ช‡(k) ํด๋“œ ๊ต์ฐจ ๊ฒ€์ฆํ• ์ง€ ์ง€์ •

splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(dt, train_input, train_target, cv=splitter)
print(np.mean(scores['test_score']))

 

 

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹

๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ : ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์ด ํ•™์Šตํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ : ๋ชจ๋ธ์ด ํ•™์Šตํ•  ์ˆ˜ ์—†์–ด์„œ ์‚ฌ์šฉ์ž๊ฐ€ ์ง€์ •ํ•ด์•ผ๋งŒ ํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋ชจ๋‘ ํด๋ž˜์Šค๋‚˜ ๋ฉ”์„œ๋“œ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ํ‘œํ˜„

๊ธฐ๋ณธ๊ฐ’์„ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•ด ๋ชจ๋ธ์„ ํ›ˆ๋ จ -> ๊ฒ€์ฆ ์„ธํŠธ ์ ์ˆ˜๋‚˜ ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ†ตํ•ด์„œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์กฐ๊ธˆ์”ฉ ๋ณ€๊ฒฝ 

 

Grid search, GridSearchCV : performs hyperparameter search and cross-validation in one go

 

 

Finding the best value of the min_impurity_decrease parameter

Import the GridSearchCV class

Put the parameter to search and the list of values to try into a dictionary

from sklearn.model_selection import GridSearchCV

params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}

Tries 5 values, from 0.0001 to 0.0005 in steps of 0.0001

 

 

๊ทธ๋ฆฌ๋“œ ์„œ์น˜ ๊ฐ์ฒด ์ƒ์„ฑ

GridSearchCV ํด๋ž˜์Šค์— ํƒ์ƒ‰ ๋Œ€์ƒ ๋ชจ๋ธ๊ณผ params ๋ณ€์ˆ˜ ์ „๋‹ฌํ•˜์—ฌ ๊ทธ๋ฆฌ๋“œ ์„œ์น˜ ๊ฐ์ฒด ์ƒ์„ฑ

gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)

 

 

gs๊ฐ์ฒด์— fit() ๋ฉ”์„œ๋“œ ํ˜ธ์ถœ

์ด ๋ฉ”์„œ๋“œ๋ฅผ ํ˜ธ์ถœํ•˜๋ฉด ๊ทธ๋ฆฌ๋“œ ์„œ์น˜ ๊ฐ์ฒด๋Š” ๊ฒฐ์ • ํŠธ๋ฆฌ ๋ชจ๋ธ min_impurity_decrease ๊ฐ’์„ ๋ฐ”๊ฟ”๊ฐ€๋ฉฐ ์ด 5๋ฒˆ ์‹คํ–‰ํ•จ

GridSearchCV์˜ cv ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ธฐ๋ณธ๊ฐ’ 5

๋”ฐ๋ผ์„œ  min_impurity_decrease 5 x 5 = 25๊ฐœ์˜ ๋ชจ๋ธ ํ›ˆ๋ จ

* n_jobs() : ๋ณ‘๋ ฌ ์‹คํ–‰์— ์‚ฌ์šฉํ•  CPU ์ฝ”์–ด ์ˆ˜ ์ง€์ • (๊ธฐ๋ณธ๊ฐ’ 1 /  -1 ์ง€์ •์‹œ ์‹œ์Šคํ…œ์— ์žˆ๋Š” ๋ชจ๋“  ์ฝ”์–ด ์‚ฌ์šฉ)

gs.fit(train_input, train_target)

 

 

best_estimator_

Once the search is finished, grid search automatically retrains a model on the whole training set

using the parameter combination of the model with the highest validation score

That model is stored in the gs object's best_estimator_ attribute

* best_estimator_ : the model retrained on the whole training set with the best-scoring parameter combination

dt = gs.best_estimator_
print(dt.score(train_input, train_target))

0.9615162593804117

 

best_params_

* best_params_ : stores the best parameter values found by the grid search

print(gs.best_params_)

{'min_impurity_decrease': 0.0001}

-> 0.0001 was selected as the best value

 

mean_test_score

* mean_test_score in the cv_results_ attribute : stores the mean cross-validation score for each parameter value

print(gs.cv_results_['mean_test_score'])

 

 

์ตœ์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์กฐํ•ฉ

* argmax() : ๊ฐ€์žฅ ํฐ ๊ฐ’์˜ ์ธ๋ฑ์Šค ์ถ”์ถœ

์ธ๋ฑ์Šค ์‚ฌ์šฉํ•ด params ํ‚ค์— ์ €์žฅ๋œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ถœ๋ ฅ

=> ์ตœ์ƒ์˜ ๊ฒ€์ฆ ์ ์ˆ˜ ๋งŒ๋“  ๋งค๊ฐœ๋ณ€์ˆ˜ ์กฐํ•ฉ

best_index = np.argmax(gs.cv_results_['mean_test_score'])
print(gs.cv_results_['params'][best_index])

{'min_impurity_decrease': 0.0001}

 

Summary

1. First, specify the parameters to search

2. Run the grid search on the training set to find the parameter combination with the best mean validation score;

    this combination is stored in the grid search object

3. The grid search then trains the final model on the whole training set using the best parameters;

   this model is also stored in the grid search object

 

 

Searching a more complex combination of parameters

* min_impurity_decrease : the minimum impurity decrease required to split a node

* max_depth : limits the depth of the tree

* min_samples_split : the minimum number of samples required to split a node

params = {'min_impurity_decrease': np.arange(0.0001, 0.001, 0.0001),
          'max_depth': range(5, 20, 1), 
          'min_samples_split': range(2, 100, 10)}

The arange() function starts at 0.0001 and keeps adding 0.0001 until it reaches (but does not include) 0.001 -> 9 values

range() makes 15 values for max_depth, from 5 up to (but not including) 20 in steps of 1

min_samples_split goes from 2 up to (but not including) 100 in steps of 10 -> 10 values

 

=> the number of parameter combinations to cross-validate is 9 × 15 × 10 = 1,350

With the default 5-fold cross-validation -> 6,750 models are trained

 

 

Running the grid search

gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)
gs.fit(train_input, train_target)

 

์ตœ์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์กฐํ•ฉ ํ™•์ธ

print(gs.best_params_)

{'max_depth': 14, 'min_impurity_decrease': 0.0004, 'min_samples_split': 12}

 

 

์ตœ์ƒ์˜ ๊ต์ฐจ ๊ฒ€์ฆ ์ ์ˆ˜ ํ™•์ธ

print(np.max(gs.cv_results_['mean_test_score']))

0.8683865773302731

 

GridSearchCV ํด๋ž˜์Šค - ์›ํ•˜๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐ’์„ ๋‚˜์—ดํ•˜๋ฉด ์ž๋™์œผ๋กœ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•ด์„œ ์ตœ์ƒ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ฐพ์„ ์ˆ˜ ์žˆ์Œ

 

 

 

๋žœ๋ค ์„œ์น˜

๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐ’์˜ ๋ชฉ๋ก์„ ์ „๋‹ฌํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ƒ˜ํ”Œ๋งํ•  ์ˆ˜ ์žˆ๋Š” ํ™•๋ฅ  ๋ถ„ํฌ ๊ฐ์ฒด ์ „๋‹ฌ

 

์‹ธ์ดํŒŒ์ด - ํ™•๋ฅ  ๋ถ„ํฌ ํด๋ž˜์Šค ์ž„ํฌํŠธ

from scipy.stats import uniform, randint

์‹ธ์ดํŒŒ์ด stats ์„œ๋ธŒ ํŒจํ‚ค์ง€ uniform, randint ํด๋ž˜์Šค๋Š” ๋ชจ๋‘ ์ฃผ์–ด์ง„ ๋ฒ”์œ„์—์„œ ๊ณ ๋ฅด๊ฒŒ ๊ฐ’์„ ๋ฝ‘์Œ

์ด๋ฅผ ๊ท ๋“ฑ ๋ถ„ํฌ์—์„œ ์ƒ˜ํ”Œ๋งํ•œ๋‹ค๊ณ  ํ•จ

 

 

randint

* randint : draws integer values evenly from a given range

Make an object covering the range 0 to 10 and sample 10 numbers

rgen = randint(0, 10)
rgen.rvs(10)

 

uniform

* uniform : draws real values evenly from a given range

Draw 10 real numbers between 0 and 1

ugen = uniform(0,1)
ugen.rvs(10)

๋žœ๋ค์„œ์น˜์— randint, uniform ํด๋ž˜์Šค ๊ฐ์ฒด ๋„˜๊ฒจ์ฃผ๊ณ  ์ด ๋ช‡ ๋ฒˆ์„ ์ƒ˜ํ”Œ๋งํ•ด์„œ ์ตœ์ ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฐพ์œผ๋ผ๊ณ  ๋ช…๋ น ๊ฐ€๋Šฅ

 

 

min_samples_leaf

* min_samples_leaf : the minimum number of samples a leaf node must have;

                                    if splitting a node would create a child node with fewer samples than this, the split is not made

params = {'min_impurity_decrease': uniform(0.0001, 0.001),
          'max_depth': randint(20, 50),
          'min_samples_split': randint(2, 25),
          'min_samples_leaf': randint(1, 25)}

min_impurity_decrease samples real values between 0.0001 and 0.001

max_depth samples integers between 20 and 50

min_samples_split integers between 2 and 25

min_samples_leaf samples integers between 1 and 25

 

 

n_iter

* n_iter : the number of times to sample

The number of samples is set with the n_iter parameter of RandomizedSearchCV, scikit-learn's random search class

from sklearn.model_selection import RandomizedSearchCV

gs = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), 
                        params, n_iter=100, n_jobs=-1, random_state=42)
gs.fit(train_input, train_target)

params์— ์ •์˜๋œ ๋งค๊ฐœ๋ณ€์ˆ˜ ๋ฒ”์œ„์—์„œ ์ด 100๋ฒˆ์„ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์ตœ์ ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์กฐํ•ฉ ์ฐพ์Œ

 

 

์ตœ์ ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์กฐํ•ฉ 

print(gs.best_params_)

{'max_depth': 39, 'min_impurity_decrease': 0.00034102546602601173, 'min_samples_leaf': 7, 'min_samples_split': 13}

 

 

์ตœ๊ณ ์˜ ๊ต์ฐจ ๊ฒ€์ฆ ์ ์ˆ˜

print(np.max(gs.cv_results_['mean_test_score']))

 

์ตœ์ข… ๋ชจ๋ธ ํ…Œ์ŠคํŠธ ์„ธํŠธ ์„ฑ๋Šฅ ํ™•์ธ

์ตœ์ ์˜ ๋ชจ๋ธ์€ ์ด๋ฏธ ์ „์ฒด ํ›ˆ๋ จ ์„ธํŠธ(train_input, train_target)๋กœ ํ›ˆ๋ จ๋˜์–ด 

best_estimator ์†์„ฑ์— ์ €์žฅ  -> ์ด ๋ชจ๋ธ์„ ์ตœ์ข… ๋ชจ๋ธ๋กœ ๊ฒฐ์ •  -> ํ…Œ์ŠคํŠธ ์„ธํŠธ ์„ฑ๋Šฅ ํ™•์ธ

dt = gs.best_estimator_
print(dt.score(test_input, test_target))

0.86

 

= ์ˆ˜๋™์œผ๋กœ ๋งค๊ฐœ๋ณ€์ˆ˜ ๋ฐ”๊พธ๋Š” ๋Œ€์‹ ์— ๊ทธ๋ฆฌ๋“œ ์„œ์น˜, ๋žœ๋ค ์„œ์น˜ ์‚ฌ์šฉํ•˜์ž!

 

 


 

 

05-3.  ํŠธ๋ฆฌ์˜ ์•™์ƒ๋ธ”

๋Œ€์ฒด๋กœ ์„ฑ๋Šฅ์ด ์ข‹์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜ - ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ

 

์ •ํ˜• ๋ฐ์ดํ„ฐ์™€ ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ

์ •ํ˜• ๋ฐ์ดํ„ฐ : ์–ด๋–ค ๊ตฌ์กฐ๋กœ ๋˜์–ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ ex) csv, ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค, ์—‘์…€ 

๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ : ์ •ํ˜• ๋ฐ์ดํ„ฐ์™€ ๋ฐ˜๋Œ€ ex) ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ, ์‚ฌ์ง„, ์Œ์•…

 

์•™์ƒ๋ธ” ํ•™์Šต : ์ •ํ˜• ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๋ฐ ๊ฐ€์žฅ ๋›ฐ์–ด๋‚œ ์„ฑ๊ณผ๋ฅผ ๋‚ด๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜

์‹ ๊ฒฝ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ : ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ๋Š” ๊ทœ์น™์„ฑ์„ ์ฐพ๊ธฐ ์–ด๋ ค์›€. ์‹ ๊ฒฝ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ชจ๋ธ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Œ

 


๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ RandomForestClassifier

์•™์ƒ๋ธ” ํ•™์Šต์˜ ๋Œ€ํ‘œ ์ฃผ์ž ์ค‘ ํ•˜๋‚˜๋กœ ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ ๋•๋ถ„์— ๋„๋ฆฌ ์‚ฌ์šฉ

๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ์ˆฒ์„ ๋งŒ๋“ฆ

๊ทธ๋ฆฌ๊ณ  ๊ฐ ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ์˜ˆ์ธก์„ ์‚ฌ์šฉํ•ด ์ตœ์ข… ์˜ˆ์ธก์„ ๋งŒ๋“ฆ

๊ฐ ํŠธ๋ฆฌ๋ฅผ ํ›ˆ๋ จํ•˜๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ๋งŒ๋“ ๋‹ค

์ž…๋ ฅํ•œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—์„œ ๋žœ๋คํ•˜๊ฒŒ ์ƒ˜ํ”Œ์„ ์ถ”์ถœํ•˜์—ฌ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“ ๋‹ค - ์ด๋•Œ ํ•œ ์ƒ˜ํ”Œ์ด ์ค‘๋ณต๋˜์–ด ์ถ”์ถœ๋  ์ˆ˜ ์žˆ์Œ

์˜ˆ) 1000๊ฐœ ๊ฐ€๋ฐฉ์—์„œ 100๊ฐœ์”ฉ ์ƒ˜ํ”Œ์„ ๋ฝ‘๋Š”๋‹ค๋ฉด ๋จผ์ € 1๊ฐœ๋ฅผ ๋ฝ‘๊ณ , ๋ฝ‘์•˜๋˜ 1๊ฐœ๋ฅผ ๋‹ค์‹œ ๊ฐ€๋ฐฉ์— ๋„ฃ๋Š”๋‹ค

      ์ด๋Ÿฐ ์‹์œผ๋กœ ๊ณ„์†ํ•ด์„œ 100๊ฐœ๋ฅผ ๊ฐ€๋ฐฉ์—์„œ ๋ฝ‘์œผ๋ฉด ์ค‘๋ณต๋œ ์ƒ˜ํ”Œ ๋ฝ‘์„ ์ˆ˜ ์žˆ์Œ  -> ๋ถ€ํŠธ์ŠคํŠธ๋žฉ ์ƒ˜ํ”Œ

๋ถ€ํŠธ์ŠคํŠธ๋žฉ : ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ์ค‘๋ณต์„ ํ—ˆ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜๋Š” ๋ฐฉ์‹

๋ถ€ํŠธ์ŠคํŠธ๋žฉ ์ƒ˜ํ”Œ์€ ํ›ˆ๋ จ ์„ธํŠธ์˜ ํฌ๊ธฐ์™€ ๊ฐ™์Œ

 

๊ฐ ๋…ธ๋“œ๋ฅผ ๋ถ„ํ• ํ•  ๋•Œ ์ „์ฒด ํŠน์„ฑ ์ค‘์—์„œ ์ผ๋ถ€ ํŠน์„ฑ์„ ๋ฌด์ž‘์œ„๋กœ ๊ณ ๋ฅธ ๋‹ค์Œ ์ด ์ค‘์—์„œ ์ตœ์„ ์˜ ๋ถ„ํ• ์„ ์ฐพ์Œ

๋ถ„๋ฅ˜ ๋ชจ๋ธ RandomForestClassifier - ์ „์ฒด ํŠน์„ฑ ๊ฐœ์ˆ˜์˜ ์ œ๊ณฑ๊ทผ๋งŒํผ์˜ ํŠน์„ฑ์„ ์„ ํƒ

ํšŒ๊ท€ ๋ชจ๋ธ RandomForestRegressor - ์ „์ฒด ํŠน์„ฑ ์‚ฌ์šฉ

 

์‚ฌ์ดํ‚ท๋Ÿฐ ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๋Š” ๊ธฐ๋ณธ 100๊ฐœ์˜ ๊ฒฐ์ • ํŠธ๋ฆฌ ํ›ˆ๋ จ

๋ถ„๋ฅ˜ - ๊ฐ ํŠธ๋ฆฌ์˜ ํด๋ž˜์Šค๋ณ„ ํ™•๋ฅ ์„ ํ‰๊ท ํ•˜์—ฌ ๊ฐ€์žฅ ๋†’์€ ํ™•๋ฅ ์„ ๊ฐ€์ง„ ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธก

ํšŒ๊ท€ - ๋‹จ์ˆœํžˆ ๊ฐ ํŠธ๋ฆฌ์˜ ์˜ˆ์ธก ํ‰๊ท 

 

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๋Š” ๋žœ๋คํ•˜๊ฒŒ ์„ ํƒํ•œ ์ƒ˜ํ”Œ๊ณผ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— 

ํ›ˆ๋ จ ์„ธํŠธ์— ๊ณผ๋Œ€์ ํ•ฉ๋˜๋Š” ๊ฒƒ์„ ๋ง‰์•„์ฃผ๊ณ  ๊ฒ€์ฆ ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ ์–ป์Œ

 

 

๋ฐ์ดํ„ฐ ์ค€๋น„, ํ›ˆ๋ จ/ํ…Œ์ŠคํŠธ ์„ธํŠธ ๋ถ„ํ• 

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

wine = pd.read_csv('https://bit.ly/wine-date')

data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

train_input, test_input, train_target, test_target = train_test_split(data, 
                                                                      target, test_size=0.2, random_state=42)

 

 

๊ต์ฐจ ๊ฒ€์ฆ

* return_train_score : True ์ง€์ • - ๊ฒ€์ฆ ์ ์ˆ˜๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ํ›ˆ๋ จ ์„ธํŠธ์— ๋Œ€ํ•œ ์ ์ˆ˜๋„ ๊ฐ™์ด ๋ฐ˜ํ™˜ (๊ธฐ๋ณธ๊ฐ’ False)

from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(rf, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9973541965122431

0.8905151032797809

 

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๋Š” ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ์•™์ƒ๋ธ”์ด๊ธฐ ๋•Œ๋ฌธ์— DecisionTreeClassifier๊ฐ€ ์ œ๊ณตํ•˜๋Š” ์ค‘์š” ๋งค๊ฐœ๋ณ€์ˆ˜ ๋ชจ๋‘ ์ œ๊ณต

criterion, max_depth, max_features, min_samples_split, min_impurity_decrease, min_samples_leaf

 

 

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ ๋ชจ๋ธ ํ›ˆ๋ จ ํ›„ ํŠน์„ฑ ์ค‘์š”๋„ ์ถœ๋ ฅ

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ ํŠน์„ฑ ์ค‘์š”๋„ : ๊ฐ ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ํŠน์„ฑ ์ค‘์š”๋„๋ฅผ ์ทจํ•ฉํ•œ ๊ฒƒ

rf.fit(train_input, train_target)
print(rf.feature_importances_)

[0.23167441 0.50039841 0.26792718]

 

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๋Š” ํŠน์„ฑ์˜ ์ผ๋ถ€๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ์„ ํƒํ•˜์—ฌ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ํ›ˆ๋ จ -> ํ•˜๋‚˜์˜ ํŠน์„ฑ์— ๊ณผ๋„ํ•˜๊ฒŒ ์ง‘์ค‘ํ•˜์ง€ ์•Š๊ณ 

์ข€ ๋” ๋งŽ์€ ํŠน์„ฑ์ด ํ›ˆ๋ จ์— ๊ธฐ์—ฌํ•  ๊ธฐํšŒ ์–ป์Œ -> ๊ณผ๋Œ€์ ํ•ฉ ์ค„์ด๊ณ  ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ๋†’์ด๋Š”๋ฐ ๋„์›€ 

 

 

OOB ์ ์ˆ˜ ์ถœ๋ ฅ

์ž์ฒด์ ์œผ๋กœ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•˜๋Š” ์ ์ˆ˜ ์–ป์„ ์ˆ˜ ์žˆ์Œ

OBB ์ƒ˜ํ”Œ : ๋ถ€ํŠธ์ŠคํŠธ๋žฉ ์ƒ˜ํ”Œ์— ํฌํ•จ๋˜์ง€ ์•Š๊ณ  ๋‚จ๋Š” ์ƒ˜ํ”Œ

์ด ๋‚จ์€ ์ƒ˜ํ”Œ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ถ€ํŠธ์ŠคํŠธ๋žฉ ์ƒ˜ํ”Œ๋กœ ํ›ˆ๋ จํ•œ ๊ฒฐ์ • ํŠธ๋ฆฌ ํ‰๊ฐ€ ~ ๊ฒ€์ฆ ์„ธํŠธ ์—ญํ• !

rf = RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42)

rf.fit(train_input, train_target)
print(rf.oob_score_)

OOB ์ ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๊ต์ฐจ ๊ฒ€์ฆ์„ ๋Œ€์‹ ํ•  ์ˆ˜ ์žˆ์–ด์„œ

๊ฒฐ๊ณผ์ ์œผ๋กœ ํ›ˆ๋ จ ์„ธํŠธ์— ๋” ๋งŽ์€ ์ƒ˜ํ”Œ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ

 


์—‘์ŠคํŠธ๋ผ ํŠธ๋ฆฌ ExtraTreesClassifier

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ์™€ ์œ ์‚ฌ

์ฐจ์ด์  : ๋ถ€ํŠธ์ŠคํŠธ๋žฉ ์ƒ˜ํ”Œ ์‚ฌ์šฉ X

๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ๋งŒ๋“ค ๋•Œ ์ „์ฒด ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ์‚ฌ์šฉ

๋Œ€์‹ , ๋…ธ๋“œ๋ฅผ ๋ถ„ํ• ํ•  ๋•Œ ๊ฐ€์žฅ ์ข‹์€ ๋ถ„ํ• ์„ ์ฐพ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๋ฌด์ž‘์œ„๋กœ ๋ถ„ํ• 

์—‘์ŠคํŠธ๋ผ ํŠธ๋ฆฌ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฐ์ • ํŠธ๋ฆฌ splitter='random'

 

ํ•˜๋‚˜์˜ ๊ฒฐ์ • ํŠธ๋ฆฌ์—์„œ ํŠน์„ฑ์„ ๋ฌด์ž‘์œ„๋กœ ๋ถ„ํ• ํ•œ๋‹ค๋ฉด ์„ฑ๋Šฅ์ด ๋‚ฎ์•„์ง€๊ฒ ์ง€๋งŒ

๋งŽ์€ ํŠธ๋ฆฌ๋ฅผ ์•™์ƒ๋ธ” ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณผ๋Œ€์ ํ•ฉ์„ ๋ง‰๊ณ  ๊ฒ€์ฆ ์„ธํŠธ์˜ ์ ์ˆ˜ ๋†’์ด๋Š” ํšจ๊ณผ์žˆ์Œ

 

 

์—‘์ŠคํŠธ๋ผ ํŠธ๋ฆฌ ๊ต์ฐจ ๊ฒ€์ฆ ์ ์ˆ˜ ํ™•์ธ

from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(et, train_input, train_target, 
                        return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9974503966084433

0.8887848893166506

 

๋ณดํ†ต ์—‘์ŠคํŠธ๋ผ ํŠธ๋ฆฌ๊ฐ€ ๋ฌด์ž‘์œ„์„ฑ์ด ์ข€ ๋” ํฌ๊ธฐ ๋•Œ๋ฌธ์— ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๋ณด๋‹ค ๋” ๋งŽ์€ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ํ›ˆ๋ จํ•ด์•ผ ํ•จ

ํ•˜์ง€๋งŒ ๋žœ๋คํ•˜๊ฒŒ ๋…ธ๋“œ๋ฅผ ๋ถ„ํ• ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋น ๋ฅธ ๊ณ„์‚ฐ ์†๋„๊ฐ€ ์žฅ์ 

 

 

ํŠน์„ฑ ์ค‘์š”๋„ ์ถœ๋ ฅ

et.fit(train_input, train_target)
print(et.feature_importances_)

[0.20183568 0.52242907 0.27573525]

 

 


๊ทธ๋ ˆ์ด๋””์–ธํŠธ ๋ถ€์ŠคํŒ… GradientBoosintClassifier

๊นŠ์ด๊ฐ€ ์–•์€ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด์ „ ํŠธ๋ฆฌ์˜ ์˜ค์ฐจ๋ฅผ ๋ณด์™„ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์•™์ƒ๋ธ”

๊ธฐ๋ณธ์ ์œผ๋กœ ๊นŠ์ด๊ฐ€ 3์ธ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ 100๊ฐœ ์‚ฌ์šฉ

๊นŠ์ด๊ฐ€ ์–•์€ ๊ฒฐ์ •ํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณผ๋Œ€์ ํ•ฉ์— ๊ฐ•ํ•˜๊ณ  ์ผ๋ฐ˜์ ์œผ๋กœ ๋†’์€ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ ๊ธฐ๋Œ€

 

๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ํŠธ๋ฆฌ๋ฅผ ์•™์ƒ๋ธ”์— ์ถ”๊ฐ€

๋ถ„๋ฅ˜ - ๋กœ์ง€์Šคํ‹ฑ ์†์‹ค ํ•จ์ˆ˜

ํšŒ๊ท€ - ํ‰๊ท  ์ œ๊ณฑ ์˜ค์ฐจ ํ•จ์ˆ˜

 

๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ• : ์†์‹ค ํ•จ์ˆ˜๋ฅผ ์‚ฐ์œผ๋กœ ์ •์˜ํ•˜๊ณ  

                      ๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜์™€ ์ ˆํŽธ์„ ์กฐ๊ธˆ์”ฉ ๋ฐ”๊ฟ”์„œ ๊ฐ€์žฅ ๋‚ฎ์€ ๊ณณ์„ ์ฐพ์•„ ๋‚ด๋ ค์˜ค๋Š” ๊ณผ์ •

๊ทธ๋ ˆ์ด๋””์–ธํŠธ ๋ถ€์ŠคํŒ… - ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ๊ณ„์† ์ถ”๊ฐ€ํ•˜๋ฉด์„œ ๊ฐ€์žฅ ๋‚ฎ์€ ๊ณณ์„ ์ฐพ์•„ ์ด๋™

 

 

๊ทธ๋ ˆ์ด๋””์–ธํŠธ ๊ต์ฐจ ๊ฒ€์ฆ ์ ์ˆ˜

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=42)
scores = cross_validate(gb, train_input, train_target,
                        return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.8881086892152563

0.8720430147331015

 

๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ๊ฐœ์ˆ˜๋ฅผ ๋Š˜๋ ค๋„ ๊ณผ๋Œ€์ ํ•ฉ์— ๋งค์šฐ ๊ฐ•ํ•จ

ํ•™์Šต๋ฅ ์„ ์ฆ๊ฐ€์‹œํ‚ค๊ณ  ํŠธ๋ฆฌ์˜ ๊ฐœ์ˆ˜๋ฅผ ๋Š˜๋ฆฌ๋ฉด ์กฐ๊ธˆ ๋” ์„ฑ๋Šฅ ํ–ฅ์ƒ ๊ฐ€๋Šฅ

* learning_rate : ํ•™์Šต๋ฅ  (๊ธฐ๋ณธ๊ฐ’ 0.1)
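
For example, a hedged sketch that bumps the number of trees to 500 and the learning rate to 0.2 (illustrative values, not tuned recommendations; a separate variable is used so the gb model above stays unchanged):

gb_tuned = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)
scores = cross_validate(gb_tuned, train_input, train_target, return_train_score=True, n_jobs=-1)
print(np.mean(scores['train_score']), np.mean(scores['test_score']))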

 

ํŠน์„ฑ ์ค‘์š”๋„

gb.fit(train_input, train_target)
print(gb.feature_importances_)

[0.11949946 0.74871836 0.13178218]

 

* subsample : sets the fraction of the training set used to train each tree

                       (default 1.0, the whole training set / values below 1 use only part of the training set)

 

Gradient boosting can squeeze out a bit more performance than a random forest,

but because the trees are added one after another, training is slow;

accordingly, GradientBoostingClassifier has no n_jobs parameter

 


 

Histogram-based gradient boosting - HistGradientBoostingClassifier

The most popular machine learning algorithm for structured data

It bins the input features into 256 intervals - so the best split at each node can be found very quickly

One of the 256 bins is set aside and used for missing values

 

Instead of n_estimators, the number of trees is set with max_iter, the number of boosting iterations

 

 

Histogram-based gradient boosting validation score

from sklearn.experimental import enable_hist_gradient_boosting  # only needed for scikit-learn versions before 1.0
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(random_state=42)
scores = cross_validate(hgb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9321723946453317

0.8801241948619236

 

๊ณผ๋Œ€์ ํ•ฉ์„ ์ž˜ ์–ต์ œํ•˜๋ฉด์„œ, ๊ทธ๋ ˆ์ด๋””์–ธํŠธ ๋ถ€์ŠคํŒ…๋ณด๋‹ค ์ข€ ๋” ๋†’์€ ์„ฑ๋Šฅ

 

 

ํ›ˆ๋ จ ์„ธํŠธ ํŠน์„ฑ ์ค‘์š”๋„

* permutation_importance()  : ๊ทธ๋ ˆ์ด๋””์–ธํŠธ ๋ถ€์ŠคํŒ…์˜ ํŠน์„ฑ ์ค‘์š”๋„

                                   ํŠน์„ฑ์„ ํ•˜๋‚˜์”ฉ ๋žœ๋คํ•˜๊ฒŒ ์„ž์–ด์„œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋ณ€ํ™”ํ•˜๋Š”์ง€๋ฅผ ๊ด€์ฐฐํ•˜์—ฌ ์–ด๋–ค ํŠน์„ฑ์ด ์ค‘์š”ํ•œ์ง€ ๊ณ„์‚ฐ

* n_repeats : ๋žœ๋คํ•˜๊ฒŒ ์„ž์„ ํšŸ์ˆ˜ (๊ธฐ๋ณธ๊ฐ’ 5)

from sklearn.inspection import permutation_importance

hgb.fit(train_input, train_target)
result = permutation_importance(hgb, train_input, train_target,
                                n_repeats=10, random_state=42, n_jobs=-1)
print(result.importances_mean)

[0.08876275 0.23438522 0.08027708]

 

The object returned by permutation_importance() holds the feature importances from each repeat along with their mean and standard deviation

 

 

Feature importance on the test set

result = permutation_importance(hgb, test_input, test_target,
                                n_repeats=10, random_state = 42, n_jobs=-1)
print(result.importances_mean)

[0.05969231 0.20238462 0.049 ]

 

 

Checking the final test-set performance

hgb.score(test_input, test_target)

0.8723076923076923

 

 

์‚ฌ์ดํ‚ท๋Ÿฐ ๋ง๊ณ  ๊ทธ๋ ˆ์ด๋””์–ธํŠธ ๋ถ€์ŠคํŒ… ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

XGBoost

from xgboost import XGBClassifier

xgb = XGBClassifier(tree_method='hist', random_state=42)
scores = cross_validate(xgb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

 

LightGBM

from lightgbm import LGBMClassifier

lgb = LGBMClassifier(random_state=42)
scores = cross_validate(lgb, train_input, train_target, return_train_score=True, n_jobs=1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

 

 

 

 

 


์ฐธ๊ณ ๋„์„œ : ํ˜ผ์ž๊ณต๋ถ€ํ•˜๋Š” ๋จธ์‹ ๋Ÿฌ๋‹ + ๋”ฅ๋Ÿฌ๋‹, ๋ฐ•ํ•ด์„ , ํ•œ๋น›๋ฏธ๋””์–ด, 2020๋…„

๋ฐ˜์‘ํ˜•

BELATED ARTICLES

more