[ํ…์ŠคํŠธ๋งˆ์ด๋‹] Word2Vec Modeling ์‹ค์Šต

2023. 1. 30. 17:14

Word2Vec ์ด๋ก 

One-hot encoding represents a word as a vector whose length equals the total vocabulary size: every element is 0 except the one at the word's index, which is set to 1. If the vocabulary contains only the three words 'rabbit', 'library', and 'water', numbered 1 to 3 in order, then 'rabbit' is (1, 0, 0), 'library' is (0, 1, 0), and 'water' is (0, 0, 1). This representation ignores word meaning, and because the vector length equals the vocabulary size, it becomes sparse.

์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ์กฐ๋ฐ€ํ•œ ์ฐจ์›์— ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์„ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์ด๋ผ๊ณ  ํ•œ๋‹ค. ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์€ ๋‹จ์–ด์˜ ์˜๋ฏธ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ‘œํ˜„ํ•˜๊ธฐ ๋•Œ๋ฌธ์— one-hot encoding๋ณด๋‹ค ํ•™์Šต ์„ฑ๋Šฅ์„ ๋†’์ผ ์ˆ˜ ์žˆ๋‹ค. ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์˜ ์ข…๋ฅ˜์—๋Š” LSA, Word2Vec, GloVe, FastText ๋“ฑ์ด ์žˆ๋‹ค.

 

Reference: http://doc.mindscale.kr/km/unstructured/11.html ("11. Word Embedding", doc.mindscale.kr)

 

Word2Vec์€ ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•ด ์ค€๋‹ค. ์ €์ฐจ์› ๋ฒกํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋‹ค์ฐจ์› ๊ณต๊ฐ„์— ๋ฒกํ„ฐํ™”ํ•ด์„œ ์œ ์‚ฌ์„ฑ์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค. 

ํ•˜์ง€๋งŒ ๋ฌธ๋งฅ์˜ ํ‘œํ˜„์€ ์•ˆ ๋œ๋‹ค. ์ฆ‰, ๋‹จ์–ด(๋ชจ์–‘)์˜ ์ถœํ˜„์ด์ง€ ๋‹จ์–ด(์˜๋ฏธ)์˜ ์ถœํ˜„์€ ์•„๋‹ˆ๋‹ค.

 

Reference: "Word2Vec, Word to vector?" (luv-n-interest.tistory.com)

(์œ„์— ์ฒจ๋ถ€๋œ ๋ธ”๋กœ๊ทธ๋“ค์˜ ๊ธ€์„ ์ •๋ฆฌํ•˜์—ฌ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค)


Word2Vec Modeling Practice

1. ํŒจํ‚ค์ง€ ๋ฐ ๋ฐ์ดํ„ฐ ํŒŒ์ผ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

- ํŒจํ‚ค์ง€ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import pandas as pd
import numpy as np
import re

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# run nltk.download('punkt') and nltk.download('stopwords') once
# before word_tokenize() and stopwords.words() can be used

from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

 

- ํŒŒ์ผ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

en_data = pd.read_csv('wos_ai_.csv', encoding='euc-kr')
en_data_abstract = en_data['ABSTRACT']  # use only the abstract column

 

 

3. ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ 

- ๋นˆ ๋ฐ์ดํ„ฐ์…‹ ์ค€๋น„

en_doc = []          # abstracts with hyphens replaced
en_word_joined = []  # preprocessed abstracts re-joined as strings
en_word = []         # preprocessed abstracts as token lists

 

- ๋ฌธ์ž๋ฉด - ๋Œ€์‹  ๊ณต๋ฐฑ ๋„ฃ๊ธฐ

for doc in en_data_abstract:
    if type(doc) != float:  # skip missing abstracts (NaN is read as float)
        en_doc.append(doc.replace("-", " "))

 

- ๋ถˆ์šฉ์–ด, ์–ด๊ฐ„์ถ”์ถœ ์‚ฌ์ „ ์ •์˜

en_stopwords = set(stopwords.words("english"))
en_stemmer = PorterStemmer()

 

์•ŒํŒŒ๋ฒณ๋งŒ ๋‚จ๊ธฐ๊ธฐ์†Œ๋ฌธ์žํ™”, ํ† ํฐํ™”, ๋ถˆ์šฉ์–ด ์ œ๊ฑฐ, ์–ด๊ฐ„์ถ”์ถœ 

for doc in en_doc:
    en_alphabet = re.sub(r"[^a-zA-Z]+", " ", str(doc))               # keep only alphabetic characters
    en_tokenized = word_tokenize(en_alphabet.lower())                # lowercase and tokenize
    en_stopped = [w for w in en_tokenized if w not in en_stopwords]  # remove stopwords
    en_stemmed = [en_stemmer.stem(w) for w in en_stopped]            # stem each token
    en_word_joined.append(' '.join(en_stemmed))
    en_word.append(en_stemmed)
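To see what each step produces, here is a quick walk-through on a made-up sample sentence (not from the dataset):

sample = "Deep-learning models are studied in 2023!"
step1 = re.sub(r"[^a-zA-Z]+", " ", sample.replace("-", " "))  # 'Deep learning models are studied in '
step2 = word_tokenize(step1.lower())                          # ['deep', 'learning', 'models', 'are', 'studied', 'in']
step3 = [w for w in step2 if w not in en_stopwords]           # ['deep', 'learning', 'models', 'studied']
step4 = [en_stemmer.stem(w) for w in step3]                   # ['deep', 'learn', 'model', 'studi']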

 

 

3. Word2Vec Analysis and Checking Word Coordinates

- Running Word2Vec

en_w2v_model = Word2Vec(en_word, vector_size=300, window=20, min_count=10, workers=4)

 

Word2Vec ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

  • vector_size : size of the word vectors (dimensionality of the embedding) -> for Word2Vec and Doc2Vec, Google has suggested that 300-500 dimensions tend to give clean results
  • window : context window size (how many words before and after the target word are considered)
  • min_count : minimum word frequency; words that occur less often are ignored
  • workers : number of worker threads used for training (jobs processed in parallel)
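One parameter the call above leaves at its default is sg, which selects the training algorithm. A sketch on the same corpus (the variable name en_w2v_sg is just for illustration):

# sg=0 (default) trains CBOW; sg=1 trains skip-gram, which is slower
# but often works better for infrequent words
en_w2v_sg = Word2Vec(en_word, vector_size=300, window=20, min_count=10, workers=4, sg=1)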

 

- ๋ชจ๋ธ ๊ฒฐ๊ณผ๊ฐ’ ํ™•์ธ

print(en_w2v_model.wv['algorithm'])

๋ชจ๋ธ ๊ฒฐ๊ณผ์—์„œ ๋‹จ์–ด์˜ ์ƒ๋Œ€ ์ฐจ์› ๊ฐ’ ํ™•์ธ

(6 * 50 => 300์ฐจ์›)
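The dimensionality can also be checked directly; a quick sanity check on the model above:

print(en_w2v_model.wv['algorithm'].shape)  # (300,) -- matches vector_size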

 

 

print(en_w2v_model.wv.most_similar('learn'))

  • wv.most_similar() : retrieves the words most similar to the given word

 

print(en_w2v_model.wv.most_similar(['learn', 'deep']))

๋‘ ๋‹จ์–ด์˜ ์กฐํ•ฉ๊ณผ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด 300์ฐจ์›์—์„œ ์œ„์น˜๊ฐ€ ์œ ์‚ฌ๋„๋กœ ๊ณ„์‚ฐ๋œ๋‹ค.

 

print(en_w2v_model.wv.similarity('deep', 'learn'))

  • wv.similarity : compares the similarity of two word vectors (cosine similarity)
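Since this is cosine similarity, the same number can be computed by hand with numpy (imported above):

a = en_w2v_model.wv['deep']
b = en_w2v_model.wv['learn']
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # equals wv.similarity('deep', 'learn')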

 

4. Dimensionality Reduction with TSNE and Visualization

def tsne_plot(model):
    labels = []  # words
    tokens = []  # vector values

    # gensim 4.x: the vocabulary is in wv.index_to_key, vectors in wv[word]
    for word in model.wv.index_to_key:
        tokens.append(model.wv[word])
        labels.append(word)

    tsne_model = TSNE(perplexity=30, n_components=2, init='random',
                      n_iter=250, random_state=23)
    new_values = tsne_model.fit_transform(np.array(tokens))

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

 

TSNE

TSNE๋Š” ๊ณ ์ฐจ์› ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ฐํ™”ํ•˜๋Š” ๋„๊ตฌ์ด๋‹ค. ์‹œ๊ฐํ™”๊ฐ€ ํŽธ๋ฆฌํ•œ 2์ฐจ์›์ด๋‚˜ 3์ฐจ์›์œผ๋กœ ์ฐจ์› ์ถ•์†Œ๋ฅผ ์ง„ํ–‰ํ•œ ํ›„,  ์‹ค์ œ feature๊ฐ€ ์•„๋‹Œ ์ถ•์†Œ๋œ ์ฃผ์„ฑ๋ถ„์„ ๊ธฐ์ค€์œผ๋กœ ๋ถ„ํฌ๋ฅผ ๊ฐ„์ ‘์ ์œผ๋กœ ์‹œ๊ฐํ™”ํ•œ๋‹ค.

 

TSNE ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ

  • n_components : dimensionality of the embedded space
  • perplexity : controls how many neighboring points influence the learning
  • init : initialization method
  • n_iter : maximum number of optimization iterations
  • random_state : seed for the random number generator

 

- ๊ทธ๋ž˜ํ”„

tsne_plot(en_w2v_model)

 

300์ฐจ์›์˜ ๊ทธ๋ฆผ์„ 2์ฐจ์›์œผ๋กœ ์••์ถ•ํ•œ ๊ทธ๋ž˜ํ”„๋กœ ์ฃผ์ฐจ์›์— ์˜ํ•ด ๋‹จ์–ด๋“ค์ด ๋ชฐ๋ ค์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ธ๋‹ค.

area, defin, two ... ์•„์›ƒ๋ผ์ด์–ด, ์ฆ‰ ์ด์ƒ์น˜ ๊ฐ’์œผ๋กœ ๊ด€๊ณ„๊ฐ€ ์•ฝํ•œ ๋‹จ์–ด๋“ค์ด๋‹ค. ํŠน์ • ๋…ผ๋ฌธ์—๋งŒ ์“ฐ์˜€๋‹ค๊ณ  ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค.
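With a large vocabulary the plot gets crowded. One variation (a sketch, assuming gensim 4.x, where wv.index_to_key is sorted by descending frequency) is to plot only the most frequent words:

top_words = en_w2v_model.wv.index_to_key[:100]  # 100 most frequent words
top_vecs = np.array([en_w2v_model.wv[w] for w in top_words])
coords = TSNE(perplexity=30, n_components=2, init='random',
              n_iter=250, random_state=23).fit_transform(top_vecs)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, top_words):
    plt.annotate(w, xy=(x, y), xytext=(5, 2), textcoords='offset points')
plt.show()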


์ฐธ๊ณ ๊ฐ•์˜ : ๋™์•„๋Œ€ INSPIRE - python ํ…์ŠคํŠธ๋งˆ์ด๋‹ 22๊ฐ• Word2Vec Modeling ์‹ค์Šต

๋ฐ˜์‘ํ˜•

BELATED ARTICLES

more