์กฐ ๋ฐ”์ด๋“  ๋Œ€ํ†ต๋ น ์ทจ์ž„์‹ ์—ฐ์„ค๋ฌธ ์›Œ๋“œํด๋ผ์šฐ๋“œ

Python์„ ์ด์šฉํ•˜์—ฌ ์กฐ ๋ฐ”์ด๋“  ๋Œ€ํ†ต๋ น ์ทจ์ž„์‹ ์—ฐ์„ค๋ฌธ์„ ํ…์ŠคํŠธ ๋ถ„์„ํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค.

 

์•„๋ž˜ ํŒŒ์ผ์€ ์ œ๊ฐ€ ์‹ค์Šต์— ์‚ฌ์šฉํ•œ ์˜์–ด ์›๋ฌธ ํ…์ŠคํŠธ ์ž๋ฃŒ์ž…๋‹ˆ๋‹ค.

csv ํŒŒ์ผ์˜ ๊ฒฝ์šฐ ์—‘์…€์„ ์ด์šฉํ•ด ๋งˆ์นจํ‘œ ๊ธฐ์ค€์œผ๋กœ ํ…์ŠคํŠธ๋ฅผ ๋‚˜๋ˆ„์—ˆ๊ณ , ๊ทธ ์™ธ ์ „์ฒ˜๋ฆฌ๋Š” ๋ชจ๋‘ ํŒŒ์ด์ฌ์„ ํ™œ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

 

์กฐ๋ฐ”์ด๋“  ๋Œ€ํ†ต๋ น ์ทจ์ž„์‚ฌ.txt
0.01MB
์กฐ๋ฐ”์ด๋“  ๋Œ€ํ†ต๋ น ์ทจ์ž„์‚ฌ.csv
0.01MB

 


1. ๋ฐ์ดํ„ฐ ์ค€๋น„

- ํŒจํ‚ค์ง€ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import pandas as pd
import numpy as np
import sklearn  # feature extraction
import re  # regular expressions
import nltk
from nltk.tokenize import word_tokenize  # word tokenization
from nltk.corpus import stopwords  # stopword list
from nltk.stem import PorterStemmer  # stemming
from gensim import corpora  # gensim - topic modeling library; corpora - plural of corpus
from sklearn.feature_extraction.text import CountVectorizer  # count-based vectorization
from wordcloud import WordCloud, STOPWORDS  # word cloud, stopword handling
import matplotlib.pyplot as plt  # visualization

nltk.download('punkt')      # one-time download needed for word_tokenize
nltk.download('stopwords')  # one-time download needed for stopwords.words()

 

- ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

# csv ํŒŒ์ผ
from google.colab import drive
drive.mount('/content/drive')

data = pd.read_csv('ํŒŒ์ผ๊ฒฝ๋กœ', encoding='cp949').address

# ํ…์ŠคํŠธ ํŒŒ์ผ
from google.colab import drive
drive.mount('/content/drive')

f = open('ํŒŒ์ผ๊ฒฝ๋กœ',"r",encoding="UTF-8")  # r ์ฝ๊ธฐ๋ชจ๋“œ
text = f.read()
f.close()

 


2. ์ „์ฒ˜๋ฆฌ

csv ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์™€ ์ „์ฒ˜๋ฆฌ

 

- ๋ถˆ์šฉ์–ด, ์–ด๊ฐ„์ถ”์ถœ ์‚ฌ์ „ ์ •์˜

stopWords = set(stopwords.words("english"))
stemmer = PorterStemmer()
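
As a quick sanity check (toy words, not from the speech), these two objects behave like this:

print("the" in stopWords)         # True: common function words will be dropped
print(stemmer.stem("democracy"))  # 'democraci': Porter stemming can be aggressive
print(stemmer.stem("running"))    # 'run'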

 

- for ๊ฒฐ๊ณผ ๋‹ด์„ ๋นˆ words ๋ฆฌ์ŠคํŠธ ์ƒ์„ฑ

words = []

 

 

- ์†Œ๋ฌธ์žํ™”, ํ† ํฐํ™”, ๋ถˆ์šฉ์–ด ์ œ๊ฑฐ, ์–ด๊ฐ„์ถ”์ถœ

for doc in data:
    tokenizedWords = word_tokenize(doc.lower())  # lowercase, then tokenize
    print(tokenizedWords)  # check the lowercased, tokenized result
    stoppedWords = [w for w in tokenizedWords if w not in stopWords]  # keep each token only if it is not a stopword
    stemmedWords = [stemmer.stem(w) for w in stoppedWords]  # stem the remaining tokens
    words.append(stemmedWords)

data์— ์žˆ๋Š” ๋ฌธ์žฅ๋“ค์„ doc์— ํ•˜๋‚˜์”ฉ ๋„ฃ์–ด ์‚ดํŽด๋ด„

์†Œ๋ฌธ์ž ์ฒ˜๋ฆฌ ํ›„ ํ† ํฐํ™” ํ•˜์—ฌ tokenizedwords์— ๋„ฃ์Œ

tokizedwords์—์„œ ๋‹จ์–ด๋ฅผ stopwords์— ์žˆ๋Š” ๋‹จ์–ด๊ฐ€ ์•„๋‹ˆ๋ฉด stoppedwords์— ๋„ฃ์–ด๋ผ(๋ถˆ์šฉ์–ด ๋บ€ ๋‚˜๋จธ์ง€)

๋ถˆ์šฉ์–ด ์ฒ˜๋ฆฌ๋œ ๊ฒƒ์„ ์–ด๊ฐ„์ถ”์ถœํ•˜์—ฌ stemmedwords์— ๋„ฃ์Œ

๋งˆ์ง€๋ง‰์— ์ •์ œ๋œ ๊ฒƒ์€ words์— ๋ถ™์—ฌ ๋„ฃ์–ด๋ผ

 

 

- ๋ฌธ์ž์—ด์—์„œ ์•ŒํŒŒ๋ฒณ๋งŒ ๋‚จ๊ธฐ๊ธฐ

alp_words = re.sub(r"[^a-zA-Z]", " ", str(words))
alp_words

re.sub(pattern, replacement, target string)

[a-zA-Z] matches any letter; a leading ^ inside the brackets negates the set, so [^a-zA-Z] matches everything that is not a letter.

 

+) ์ •๊ทœ ํ‘œํ˜„์‹ ์ฐธ๊ณ 

 

Regular Expression HOWTO (A.M. Kuchling), docs.python.org: an introductory tutorial to using regular expressions in Python with the re module.

 

 


3. ๋นˆ๋„๋ถ„์„

- ํ† ํฐํ™”

cv = CountVectorizer(max_features=100, stop_words='english').fit([alp_words])  # the whole cleaned string is passed as a one-document list

CountVectorizer: builds word tokens from a collection of documents and counts each word, producing a bag-of-words (BOW) encoding vector.

  • max_features: how many words to keep at most (the most frequent ones first)

This uses sklearn's built-in stopword list; above we used nltk's, so any words the two lists don't share get filtered out here as well.

 

 

- Converting to DTM (document-term matrix) form

dtm = cv.fit_transform([alp_words])

fit_transform: learns the vocabulary and converts the data into the count matrix in one step (fit followed by transform).

The converted result is stored as dtm.

 

 

- Checking the list of feature words

alp_words = cv.get_feature_names()  # removed in scikit-learn 1.2+; use cv.get_feature_names_out() there
alp_words

get_feature_names: the list of feature words used in the DTM

 

- dtm ๋‹จ์–ด ํ•ฉ๊ณ„

count_mat = dtm.sum(axis=0) # ์„ธ๋กœ ํ•ฉ๊ณ„
count_mat

dtm์˜ ์ž๋ฃŒ๋ฅผ ํ•ฉ์ณ์„œ ์„ธ๋กœ๋กœ count_mat์— ๋„ฃ์Œ

 

 

- ๋‹จ์–ด, ๋นˆ๋„์ˆ˜๋ฅผ ํ•ฉ์ณ์„œ ๋ฆฌ์ŠคํŠธ๋กœ ๋งŒ๋“ฆ

count = np.squeeze(np.asarray(count_mat)) # arrayํ˜•ํƒœ๋กœ ์ฐจ์›์„ ์ค„์—ฌ๋ผ
word_count = list(zip(alp_words, count)) # ๋‹จ์–ด์™€ count ๋ฅผ list ํ˜•ํƒœ๋กœ ์Œ์„ ์ด๋ฃจ๋„๋ก ๋ฌถ์Œ
word_count

squeeze: reduces dimensionality (converts the 1 x N matrix into a flat array)

zip: a function that pairs up sequences of equal length

-> zip returns a zip object -> list(zip(list1, list2)) converts it to a list

 

 

- ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ

word_count = sorted(word_count, key=lambda x:x[1], reverse=True)

์›Œ๋“œํด๋ผ์šฐ๋“œ ์ž‘์„ฑ์œ„ํ•ด ํฐ ๊ฒƒ ๋จผ์ € ๋‚˜์˜ค๋„๋ก ์ •๋ ฌํ•˜์—ฌ ์ƒˆ๋กœ์šด ๋ฆฌ์ŠคํŠธ ๋ฐ˜ํ™˜

sorted(์ •๋ ฌํ•  ๋ฐ์ดํ„ฐ, key ์ •๋ ฌ ๊ธฐ์ค€, reverse ์˜ค๋ฆ„/๋‚ด๋ฆผ)

  • reverse=True ๋‚ด๋ฆผ์ฐจ์ˆœ / reverse=false,์ƒ๋žต : ์˜ค๋ฆ„์ฐจ์ˆœ

lamvda : ๋ฐ”๋กœ ์ •์˜ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ํ•จ์ˆ˜-

                ->  (key ์ธ์ž์— ํ•จ์ˆ˜๋ฅผ ๋„˜๊ฒจ์ฃผ๋ฉด ํ•ด๋‹น ํ•จ์ˆ˜์˜ ๋ฐ˜ํ™˜๊ฐ’์„ ๋น„๊ตํ•˜๋ฉฐ ์ˆœ์„œ๋Œ€๋กœ ์ •๋ ฌ)
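
For example, with toy pairs:

sorted([("a", 1), ("b", 3), ("c", 2)], key=lambda x: x[1], reverse=True)  # -> [('b', 3), ('c', 2), ('a', 1)]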

 


4. ์›Œ๋“œํด๋ผ์šฐ๋“œ

- ์›Œ๋“œํด๋ผ์šฐ๋“œ ์„ค์ •

wc = WordCloud(background_color='black', width=800, height=600)

๋ฐฐ๊ฒฝ์ƒ‰=๋ธ”๋ž™, ๋„“์ด=800, ๋†’์ด=600

 

- ๋‹จ์–ด ๋นˆ๋„ ์‚ฌ์ „

cloud = wc.generate_from_frequencies(dict(word_count)) # ๋นˆ๋„(์‚ฌ์ „ํ˜•ํƒœ())

 

- ์›Œ๋“œํด๋ผ์šฐ๋“œ ๊ทธ๋ฆฌ๊ธฐ

plt.figure(figsize=(12,9))  # plt ๊ทธ๋ฆผ ๊ทธ๋ฆฌ๊ธฐ (์‚ฌ์ด์ฆˆ)
plt.imshow(cloud) # ํด๋ผ์šฐ๋“œ ๊ทธ๋ฆฌ๊ธฐ
plt.axis('off') # ์ขŒํ‘œ๊ฐ’(์ถ•) ๋„๊ธฐ
plt.show() # ๋ณด์—ฌ์คŒ

 

ํ…์ŠคํŠธ๋ฅผ ๊ฐ„๋‹จํ•˜๊ฒŒ ์ „์ฒ˜๋ฆฌํ•œ ๋’ค ๋‹จ์–ด ๋นˆ๋„๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์›Œ๋“œํด๋ผ์šฐ๋“œ๋ฅผ ๊ทธ๋ ค๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ž์ฃผ ์–ธ๊ธ‰๋˜๋Š” ์ƒ์œ„ 100๊ฐœ์˜ ๋‹จ์–ด๋“ค์„ ๊ธฐ์ค€์œผ๋กœ ํ•˜์˜€์œผ๋‚˜ ๊ณผ์—ฐ ์ด ๋‹จ์–ด๋“ค ์ž์ฒด๋ฅผ ํŠน๋ณ„ํ•˜๊ณ  ์ค‘์š”ํ•˜๋‹ค ๋ณผ ์ˆ˜ ์žˆ๋Š”์ง€๋Š” ์˜๋ฌธ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ ์–ด๊ฐ„ ์ถ”์ถœ์ด ๋งค๋„๋Ÿฝ์ง€ ๋ชป ํ•˜๋ฉฐ american, america์™€ ๊ฐ™์€ ์œ ์‚ฌ์–ด์˜ ๊ฒฝ์šฐ์—๋Š” ์‚ฌ์ „์„ ๋”ฐ๋กœ ์ •์˜ํ•˜๋Š” ๊ฒƒ์ด ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

๋”ฐ๋ผ์„œ ์—ฌ๋Ÿฌ ๋‹จ์–ด๊ฐ€ ํŠน์ • ๋ฌธ์„œ ๋‚ด์—์„œ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ ๊ฒƒ์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” TF-IDF๋ฅผ ์ด์šฉํ•ด ๋ณด๋Š” ๊ฒƒ๋„ ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋‹จ์ˆœํžˆ '๋‹จ์–ด๊ฐ€ ํ”ํ•˜๊ฒŒ ๋“ฑ์žฅํ•˜๋Š” ๊ฒƒ'์ด ์•„๋‹Œ '๋ฌธ์„œ ๋‚ด์—์„œ ๋…ํŠนํ•˜๊ฒŒ ์‚ฌ์šฉ๋œ ๋‹จ์–ด'๋“ค์„ ํŒŒ์•…ํ•˜์—ฌ ํ…์ŠคํŠธ๋ฅผ ๋น ๋ฅด๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์ง€ ์•Š์„๊นŒ ๊ธฐ๋Œ€๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

 

 

 

 

 

 

๋ฐ˜์‘ํ˜•

BELATED ARTICLES

more