[Text Mining] Korean Word Clustering

2023. 2. 13. 06:57

Korean Word Clustering

1. Loading Packages and the Data File

- Load packages

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.metrics.pairwise import cosine_similarity 

from matplotlib import font_manager
import matplotlib.pyplot as plt

 

- Load the data file

with open('ko_stopped_join.bin','rb') as fp :
    ko_word_joined = pickle.load(fp)


Load the preprocessed Korean document file that was saved earlier.
Each document in it must be a single string in which the words are listed like one sentence (for example, joined by spaces).

● with statement
Wraps the file I/O statements in a single with block, so the file is closed automatically when the block ends.
with expression as target : suite
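As a small illustration of the with statement, the sketch below writes a pickle file and reads it back. The toy documents and the file name example_join.bin are assumptions made for this example; only the loading pattern mirrors the code above.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the preprocessed corpus:
# one space-joined string per document.
docs = ["text mining word", "word cluster analysis"]

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example_join.bin")

# Write: the file is closed as soon as the with block ends.
with open(path, "wb") as fp:
    pickle.dump(docs, fp)

# Read it back the same way the post loads ko_stopped_join.bin.
with open(path, "rb") as fp:
    loaded = pickle.load(fp)

print(loaded == docs, fp.closed)
```

After the block ends, fp still exists as a name but the underlying file is already closed, which is the point of using with.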

2. Building the TF-IDF DTM / TDM

- Declare ko_tfidf_vectorizer as a TfidfVectorizer

ko_tfidf_vectorizer = TfidfVectorizer()

TF: how often a particular word (Term) appears (Frequency) in a single document
IDF: the inverse (Inverse) of the frequency (Frequency) of documents (Document) in which a particular word appears
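To make the two definitions concrete, here is a hand-rolled sketch on a toy corpus (the corpus is an assumption). It uses the smoothed form ln((1 + n) / (1 + df)) + 1 and L2 row normalization, which is how TfidfVectorizer behaves with its default settings. It is not the library's implementation, only the same arithmetic written out.

```python
import math

# Toy corpus (assumption): each document is a space-joined string.
docs = ["apple banana", "apple cherry", "banana cherry apple"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(doc):
    """TF * IDF for one tokenized document, scaled to unit L2 length."""
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


dtm = [tfidf_row(doc) for doc in tokenized]

for w in vocab:
    print(w, df[w], round(idf[w], 4))
```

"apple" appears in every document, so it gets the lowest IDF, while "banana" and "cherry" appear in fewer documents and are weighted higher.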

 

- Fit the vectorizer to the data and convert the result to TDM form

ko_tfidf_dtm = ko_tfidf_vectorizer.fit_transform(ko_word_joined)
ko_tfidf_tdm = ko_tfidf_dtm.T
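The transpose is what turns a document-by-word matrix (DTM) into a word-by-document matrix (TDM), so that each row describes one word. A minimal sketch with a plain list of lists, where the toy numbers and words are assumptions:

```python
# Toy DTM (assumption): 2 documents (rows) x 3 words (columns).
words = ["apple", "banana", "cherry"]
dtm = [
    [0.6, 0.8, 0.0],  # document 0
    [0.0, 0.6, 0.8],  # document 1
]

# Transposing swaps the axes: 3 words (rows) x 2 documents (columns).
tdm = [list(col) for col in zip(*dtm)]

for word, row in zip(words, tdm):
    print(word, row)
```

Because the later steps cluster rows, working on the TDM means the things being clustered are words, not documents.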

 

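- Get the word names

ko_tfidf_tdm_word = list(ko_tfidf_vectorizer.get_feature_names_out())


● get_feature_names_out(): returns the word (feature) name for each column index. It replaces get_feature_names(), which was removed in scikit-learn 1.2; list() keeps the result a plain list as before.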


- Compute the cosine distance

ko_tfidf_dist = 1-cosine_similarity(ko_tfidf_tdm)

● cosine_similarity(): measures cosine similarity
Same direction (0°): 1 / exactly opposite direction (180°): -1 / orthogonal, i.e. independent (90°): 0
-> The more similar two vectors are, the closer the value is to 1

● 1 - cosine_similarity(): computes cosine distance
Same direction (0°): 0 / exactly opposite direction (180°): 2 / orthogonal (90°): 1
-> Cluster analysis needs a notion of distance
Subtracting cosine_similarity from 1 gives a value that grows as two words differ and shrinks as they get closer
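The three reference angles can be checked by hand. The sketch below computes cosine similarity and cosine distance for a single pair of vectors in plain Python. The helper names are made up for this example and are not part of scikit-learn.

```python
import math


def cosine_similarity_pair(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance_pair(u, v):
    """1 - cosine similarity: small when the vectors point the same way."""
    return 1 - cosine_similarity_pair(u, v)


same = cosine_distance_pair([1, 2], [2, 4])         # 0 degrees
orthogonal = cosine_distance_pair([1, 0], [0, 3])   # 90 degrees
opposite = cosine_distance_pair([1, 2], [-1, -2])   # 180 degrees

print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

TF-IDF weights are never negative, so for the word vectors in this post the similarity stays between 0 and 1, and so does the distance.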

3. Running the Cluster Analysis

- Set the number of clusters

k = 5

 

- Apply K-means

ko_kmeans_model = KMeans(n_clusters=k, init='k-means++', max_iter=10, n_init=10, random_state=777).fit(ko_tfidf_tdm)


KMeans hyperparameters

  • n_clusters : sets the number of clusters (k)
  • init : initialization method (default: k-means++)
  • max_iter : maximum number of iterations in a single run
  • n_init : number of runs started from different initial centroid positions
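To show what max_iter and n_init control, here is a deliberately simplified k-means in plain Python. It is a sketch, not scikit-learn's implementation: the initial centers are picked at random rather than with k-means++, and the toy points and function names are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, k, max_iter, rng):
    """One k-means run from random initial centers; returns (inertia, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):  # max_iter caps the assign/update rounds
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged early
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return inertia, labels


def kmeans(points, n_clusters, max_iter=10, n_init=10, random_state=777):
    """Repeat the run n_init times and keep the result with the lowest inertia."""
    rng = random.Random(random_state)
    return min(lloyd(points, n_clusters, max_iter, rng) for _ in range(n_init))


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
inertia, labels = kmeans(points, n_clusters=2)
print(labels, round(inertia, 4))
```

Each run can end in a different local optimum depending on where it starts, which is why several starts (n_init) are tried and only the best is kept.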

 

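- Sort the clusters

order_centroids = ko_kmeans_model.cluster_centers_.argsort()[:, ::-1]  # descending order
  • .cluster_centers_ : the coordinates of each cluster center
  • argsort() : returns, as an array, the indices that would sort the values in ascending order; [:, ::-1] reverses each row, giving descending order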
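The effect of argsort() followed by [:, ::-1] can be traced on a tiny example. The sketch below imitates it with plain Python lists. The argsort helper is a stand-in written for this example, and the center values are made up.

```python
def argsort(row):
    """Indices that would sort row in ascending order (mimics numpy's argsort)."""
    return sorted(range(len(row)), key=lambda i: row[i])


# Toy cluster centers (assumption): 2 clusters x 4 coordinates.
centers = [
    [0.1, 0.7, 0.0, 0.3],
    [0.5, 0.2, 0.9, 0.4],
]

ascending = [argsort(row) for row in centers]
descending = [idx[::-1] for idx in ascending]  # same effect as [:, ::-1]

print(ascending)
print(descending)
```

In each row of descending, the first index points at the largest coordinate of that center. Since KMeans is fitted on the TDM in this post, those coordinates correspond to documents.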

 

- Attach labels to the words

ko_kmeans_model_word_label = ko_kmeans_model.labels_ 
word_dict = dict(zip(ko_tfidf_tdm_word, ko_kmeans_model_word_label))

Store the cluster label assigned to each word, then pair the labels with the word names taken from the TDM (zip) and turn the pairs into a dictionary.
This matches every word to the number of the cluster it belongs to; the label plays the role of an ID.

 

- Cluster the words

for i in range(k):
    # collect every word whose cluster label equals i
    word_cluster = [word for word, label in word_dict.items() if label == i]
    print('* cluster {}'.format(i))
    print('Words: {}\n'.format(' '.join(str(x) for x in word_cluster)))

Use a for loop to check which words fall in the i-th cluster.
For each cluster, print the cluster number followed by the words of that cluster joined into one line.
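The loop above scans the whole dictionary once per cluster. An alternative sketch groups the words in a single pass with collections.defaultdict. The toy word_dict below is an assumption shaped like the one built earlier.

```python
from collections import defaultdict

# Toy word -> label mapping (assumption), shaped like word_dict in the post.
word_dict = {"apple": 0, "banana": 1, "cherry": 0, "grape": 2, "melon": 1}

# Invert the mapping: label -> list of words, in one pass.
clusters = defaultdict(list)
for word, label in word_dict.items():
    clusters[label].append(word)

for label in sorted(clusters):
    print('* cluster {}'.format(label))
    print('Words: {}\n'.format(' '.join(clusters[label])))
```

With a few thousand words and k = 5 either version is fast enough; the grouped form is mainly convenient when the clusters are reused later.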

 

4. Visualizing the Dendrogram

- Use Ward linkage and build the linkage matrix

ko_linkage_matrix = ward(ko_tfidf_dist)
ko_linkage_matrix.shape
ko_linkage_matrix

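The distances computed with 1 - cosine_similarity are passed to ward, which applies Ward's method of linking clusters and returns a single linkage matrix. Note that SciPy's ward reads a 2-D array as a set of observation vectors, so each row of ko_tfidf_dist is treated here as a feature vector rather than as precomputed pairwise distances.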
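For the 2585 words, the linkage matrix has 4 columns and one row per merge: the two clusters being merged, the distance between them, and the number of words in the new cluster. The 4 is a column count, not a number of clusters.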

● Ward linkage
Considers every pair of clusters that could be merged, measures the error sum of squares (the sum of squared deviations within the cluster) for each candidate,
and merges the pair that yields the smallest sum of squares
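Ward's criterion can be checked on a toy example. The sketch below computes, for every pair of clusters, how much the within-cluster sum of squares would grow if the pair were merged, and compares it with the closed form |A||B| / (|A| + |B|) · ‖c_A − c_B‖². The three clusters are made-up data, and this is the criterion only, not SciPy's ward routine.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(x) / len(points) for x in zip(*points))


def sse(points):
    """Sum of squared deviations of the points from their centroid."""
    c = centroid(points)
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase in total SSE caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)


def ward_cost_closed_form(a, b):
    """Same quantity from cluster sizes and centroids only."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


# Toy clusters (assumption).
clusters = {
    "A": [(0.0, 0.0), (0.0, 2.0)],
    "B": [(1.0, 1.0)],
    "C": [(6.0, 6.0), (8.0, 6.0)],
}
names = sorted(clusters)
pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
costs = {pair: ward_cost(clusters[pair[0]], clusters[pair[1]]) for pair in pairs}

# Ward merges the pair whose merge adds the least to the total SSE.
best = min(costs, key=costs.get)
print(best, round(costs[best], 4))
```

Repeating this choice until one cluster remains produces the sequence of merges that the dendrogram draws.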

 

- Set the font path

ko_font_location = "C:/Windows/Fonts/malgun.ttf"
ko_font_name = font_manager.FontProperties(fname=ko_font_location).get_name()
plt.rcParams['font.family'] = ko_font_name

For Korean text, the font must be set to one that contains Hangul glyphs (here the Windows font Malgun Gothic) so that the labels do not show up as broken characters.
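The path above only exists on Windows. As a hedged sketch, the font file could be chosen per platform as below. Only the Windows path comes from this post; the macOS and Linux paths are assumptions about typical installs and may not exist on a given machine, and find_korean_font is a helper name invented for this example.

```python
import os
import platform

# Candidate font files per platform. Only the Windows path comes from the post;
# the macOS and Linux entries are assumptions and may not exist on your machine.
FONT_CANDIDATES = {
    "Windows": ["C:/Windows/Fonts/malgun.ttf"],
    "Darwin": ["/System/Library/Fonts/AppleSDGothicNeo.ttc"],
    "Linux": ["/usr/share/fonts/truetype/nanum/NanumGothic.ttf"],
}


def find_korean_font(candidates=FONT_CANDIDATES):
    """Return the first candidate path that exists on this machine, else None."""
    for path in candidates.get(platform.system(), []):
        if os.path.exists(path):
            return path
    return None


ko_font_location = find_korean_font()
print(ko_font_location)
```

If the result is None, no candidate was found and a Korean font has to be installed or its path added to the list before it is passed to FontProperties.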

 

- Draw the dendrogram

fig, ax = plt.subplots(figsize=(100,60))
plt.title('Clustering Dendrogram')
plt.ylabel('Distance')
plt.xlabel('Words')
dendrogram(ko_linkage_matrix, leaf_font_size=10, leaf_rotation=50, orientation='top', labels=ko_tfidf_tdm_word, ax=ax)  # draw on the axes created above
plt.show()

 

The dendrogram does not match the cluster analysis result above 100%, because it comes from hierarchical (Ward) clustering rather than from K-means.
It is a picture for seeing how the words group together overall; its purpose is to convey the general tendency.
With this many words, it is therefore not feasible to work out from the dendrogram, one by one, which clusters are grouped together.







Reference lecture: Dong-A University INSPIRE - Python Text Mining, Lecture 28: Korean Word Clustering
