TIL.23 Building N-grams

codermun 2020. 10. 31. 23:01

## Building N-grams

# An N-gram is a run of N consecutive elements extracted from a string.

# For example, extracting character-level 2-grams from the string 'hello' gives the following.

# he / el / ll / lo

## Printing N-grams with a loop

# Printing the 2-grams of hello

text = 'hello'

for i in range(len(text)-1): # 0 ~ 3
    print(text[i], text[i+1], sep = '') # print each 2-gram of hello
    
::
he
el
ll
lo

# Printing the 3-grams of hello

text = 'hello'

for i in range(len(text)-2): # 0 ~ 2
    print(text[i], text[i+1], text[i+2], sep='') # print each 3-gram of hello
    
::
hel
ell
llo
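
## The 2-gram and 3-gram loops differ only in how many characters they join, so the same idea can be written once for any n by slicing. This is a minimal sketch; char_ngrams is a hypothetical helper name for illustration, not something from the original notes.

```python
def char_ngrams(text, n):
    """Return the character-level n-grams of text as a list of strings."""
    # The last valid start index is len(text) - n, hence the + 1 in range().
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams('hello', 2))  # ['he', 'el', 'll', 'lo']
print(char_ngrams('hello', 3))  # ['hel', 'ell', 'llo']
```

## When n is larger than the string, range() is empty and the function returns an empty list.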

# Printing word-level 2-grams of a sentence

text = 'this is python script'
words = text.split() # split breaks the string on whitespace and returns a list

for i in range(len(words)-1): # 0 ~ 2; a 2-gram loop stops one element before the end of the list
    print(words[i], words[i+1], sep='')

::
thisis
ispython
pythonscript ## the words run together because sep=''
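
## If the run-together output is hard to read, joining each pair with a space keeps the words apart. A small sketch that uses str.join on a two-word slice; the variable name bigrams is only for illustration.

```python
text = 'this is python script'
words = text.split()

# Join each window of two adjacent words with a single space.
bigrams = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]

print(bigrams)  # ['this is', 'is python', 'python script']
```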

## Building 2-grams with zip

text = 'hello'

a = list(zip(text, text[1:])) # text[1:] is the slice from index 1 to the end

for i in a:
    print(i[0], i[1], sep='')
    
::
he
el
ll
lo

a = zip(text, text[1:]) # h e l l / e l l o

b = zip(text, text[2:]) # h e l / l l o

c = zip(text, text[3:]) # h e / l o

print(list(a)) 

# [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]

print(list(b))

# [('h', 'l'), ('e', 'l'), ('l', 'o')]

print(list(c))

# [('h', 'l'), ('e', 'o')]

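## Until now, we have used zip to build a dictionary from two lists.

## zip bundles the corresponding elements of its iterables into tuples, so zipping a string with a shifted slice of itself also produces 2-grams.

## Printing a zip object directly does not show its contents, so wrap it in list() before printing.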
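
## A related detail worth checking: a zip object is an iterator, so it can be consumed only once. The sketch below shows what printing it directly looks like and what happens on a second pass; the variable names are only for illustration.

```python
text = 'hello'
pairs = zip(text, text[1:])

print(pairs)  # prints something like <zip object at 0x...>, not the pairs

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing is left to yield

print(first_pass)   # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
print(second_pass)  # []
```

## This is why the examples here call list() once and keep the resulting list.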

## Building 2-grams with zip, part 2

text = 'this is python script'

words = text.split()

a = list(zip(words, words[1:]))
# zip pairs up the elements of the two lists into tuples.
# The result is a zip object, so we wrap it in list() to inspect it.
# words = ['this', 'is', 'python', 'script'], words[1:] = ['is', 'python', 'script']
# zip stops at the shorter input, so only the first three elements of words are used.

print(a)

::
[('this', 'is'), ('is', 'python'), ('python', 'script')]
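
## The reason no index bookkeeping is needed here is that zip stops as soon as its shortest input runs out. A short sketch to confirm this; shifted and pairs are illustrative names.

```python
words = ['this', 'is', 'python', 'script']
shifted = words[1:]

print(len(words), len(shifted))  # 4 3

pairs = list(zip(words, shifted))
print(len(pairs))  # 3, the length of the shorter input
print(pairs[-1])   # ('python', 'script')
```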


for i in range(len(words) - 1): # 0 ~ 2; stop one short so words[i+1] stays in range
    print(words[i], words[i+1])
    
::
this is
is python
python script

## For 3-grams, pass three lists, words, words[1:], and words[2:], as in zip(words, words[1:], words[2:]).
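
## As a quick check of the three-list form, here it is applied to the same sentence. The variable name trigrams is only for illustration.

```python
text = 'this is python script'
words = text.split()

# Three inputs, each shifted by one more word than the previous.
trigrams = list(zip(words, words[1:], words[2:]))

print(trigrams)
# [('this', 'is', 'python'), ('is', 'python', 'script')]
```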

## Using zip with a list comprehension

text = 'hello'

print([text[i:] for i in range(3)])
:: ['hello', 'ello', 'llo']
#  same as [text[0:], text[1:], text[2:]]

## These are the slices needed to build 3-grams.

a = list(zip(['hello', 'ello', 'llo']))
print(a)

:: [('hello',), ('ello',), ('llo',)]

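## This is not the 3-gram result we want: zip received a single argument (one list of three strings), so each tuple holds just one element.
## To pass the elements of the list as separate, comma-separated iterables, put * in front of the list.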

a = list(zip(*['hello', 'ello', 'llo']))
print(a)
:: [('h', 'e', 'l'), ('e', 'l', 'l'), ('l', 'l', 'o')]

b = list(zip(*[text[i:] for i in range(3)]))
print(b)
:: [('h', 'e', 'l'), ('e', 'l', 'l'), ('l', 'l', 'o')]

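## The list literal passed for a is exactly what the list comprehension for b produces, so both print the same result.

## Prefixing the list with * is called list unpacking; we will look at it in more detail in a later lesson.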
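
## Putting the list comprehension and * together, the approach can be wrapped in one function that works for any n and for both strings and lists. This is a sketch under that assumption; ngrams is a hypothetical helper name, not part of the original notes.

```python
def ngrams(seq, n):
    """Return the n-grams of seq (a string or a list) as a list of tuples."""
    # Build n copies of seq, each starting one position later.
    shifted = [seq[i:] for i in range(n)]
    # Unpack them into zip, which stops at the shortest copy.
    return list(zip(*shifted))


print(ngrams('hello', 3))
# [('h', 'e', 'l'), ('e', 'l', 'l'), ('l', 'l', 'o')]

print(ngrams('this is python script'.split(), 2))
# [('this', 'is'), ('is', 'python'), ('python', 'script')]
```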

License: Attribution-ShareAlike