- Natural Language Preprocessing (1)

NLTK (Natural Language Toolkit)

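NLTK is a Python package for natural language processing (NLP). It provides a range of algorithms for text processing and analysis, including tokenization, tagging, parsing, semantic analysis, and classification.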

Tokenization Practice

As we saw in the previous session, tokenization matters for the following reasons.

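• Accuracy: splitting the data into appropriate units improves the accuracy of the analysis. Conversely, incorrect tokenization can lead to errors in semantic analysis.

• Efficiency: by fixing a clear unit of processing and reducing the amount of data each step has to handle, tokenization makes the overall pipeline more efficient.

• Flexibility: applying different tokenization techniques enables analysis specialized for a particular language or domain. For example, sentence boundary rules, which differ from language to language, can be taken into account.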

The main types of tokenization are as follows.

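Word Tokenization:

• One of the most commonly used tokenization methods. It splits text into individual words using whitespace, punctuation, and similar delimiters.

• Example: "I love natural language processing." → ["I", "love", "natural", "language", "processing"] (some tokenizers, including NLTK's word_tokenize, also emit the final "." as its own token)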
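To make the role of whitespace and punctuation concrete, here is a minimal sketch that uses only the standard library. The regular expression is an illustrative assumption, not the rule NLTK applies internally.

```python
import re

text = "I love natural language processing."

# Naive approach: split on whitespace only; punctuation stays glued to the word.
whitespace_tokens = text.split()

# Regex approach: runs of word characters, or a single non-space punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['I', 'love', 'natural', 'language', 'processing.']
print(regex_tokens)       # ['I', 'love', 'natural', 'language', 'processing', '.']
```

The whitespace-only split leaves "processing." as one token, which is the kind of small tokenization error that can propagate into later analysis.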

Sentence Tokenization:

• Splits text into sentences, mainly by using sentence-ending punctuation such as periods, exclamation marks, and question marks.

• Example: "Hello world. Natural language processing is fascinating." → ["Hello world.", "Natural language processing is fascinating."]
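The punctuation-based rule described above can be sketched with a single regular expression. The helper name naive_sent_split is hypothetical, and the rule is deliberately simplistic; it is not how NLTK's sent_tokenize works.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    return re.split(r"(?<=[.!?])\s+", text)

example = naive_sent_split("Hello world. Natural language processing is fascinating.")
print(example)
# ['Hello world.', 'Natural language processing is fascinating.']

# The same rule over-splits on abbreviations such as "Dr."
abbrev = naive_sent_split("Dr. Kim arrived. She spoke.")
print(abbrev)
# ['Dr.', 'Kim arrived.', 'She spoke.']
```

The second call shows why a bare punctuation rule is not enough in practice and why sentence tokenizers need language-specific knowledge about abbreviations.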

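Subword Tokenization:

• Splits words into smaller meaningful units. This is especially useful in language models for capturing the internal structure of words and for handling out-of-vocabulary (OOV) words.

• Example: "language" → ["lan", "gu", "age"] (illustrative; the actual pieces depend on the tokenizer's learned vocabulary)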
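As a rough illustration of how a subword split like the one above could be produced, the sketch below applies greedy longest-match segmentation, in the spirit of WordPiece inference but heavily simplified. The function name and the toy vocabulary are assumptions made for this example; real subword tokenizers learn their vocabulary from a corpus and usually mark word-internal pieces.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts at this position: treat the whole word as unknown.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only to reproduce the example above.
toy_vocab = {"lan", "gu", "age", "pro", "cess", "ing"}

print(greedy_subword_tokenize("language", toy_vocab))    # ['lan', 'gu', 'age']
print(greedy_subword_tokenize("processing", toy_vocab))  # ['pro', 'cess', 'ing']
print(greedy_subword_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Because any word that can be assembled from known pieces gets a valid segmentation, a word never seen as a whole can still be represented, which is how subword methods cope with OOV words.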

Practice code

# Example data
sp1 = 'Deep learning is the subset of machine learning methods based on artificial neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.[2]'

import nltk
from nltk.tokenize import word_tokenize

# word_tokenize and sent_tokenize rely on the Punkt models
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

# Word-level tokens; punctuation marks become separate tokens
words = word_tokenize(sp1)

print("Word tokenization:", words)

from nltk.tokenize import sent_tokenize

# Sentence-level split
sentences = sent_tokenize(sp1)

print("Sentence tokenization:", sentences)

from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank-style word tokenizer (regex-based)
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sp1)

print("Treebank word tokenization:", tokens)

With nltk.tokenize.regexp_tokenize, text can be tokenized according to a user-defined regular expression pattern.

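from nltk.tokenize import regexp_tokenize

# Keep runs of word characters and double quotes; everything else acts as a separator.
# sp1 is the example text defined in the previous code block.
tokens = regexp_tokenize(sp1, r'[\w"]+')

print(tokens)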
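To see what the pattern [\w"]+ matches without installing NLTK, the sketch below runs the same pattern through the standard library's re.findall. It assumes that a pattern-matching tokenizer simply returns the matches in order; the sample sentence is a shortened, made-up variant of the example text.

```python
import re

# Shortened sample text (an assumption for this sketch, not the full sp1).
sample = 'The adjective "deep" refers to semi-supervised methods.[2]'

# One or more word characters or double quotes form a token.
pattern = r'[\w"]+'
tokens = re.findall(pattern, sample)

print(tokens)
# ['The', 'adjective', '"deep"', 'refers', 'to', 'semi', 'supervised', 'methods', '2']
```

Note that the hyphen in "semi-supervised" splits the word in two, while the quotes stay attached to "deep"; adjusting the character class changes both behaviors.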

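Stop Word Removal Practice

Stop words carry little meaning in a sentence and can act as noise during data analysis or processing. They are typically function words such as articles, prepositions, conjunctions, and auxiliary verbs, for example "is", "and", "the", and "a". Removing them is one of the important steps in text preprocessing because it lets the analysis focus on the key words that carry the meaning of the data.

The following is practice code that uses the nltk package.

First, the text is tokenized with the word_tokenize function. Next, the English stop word list is loaded with stopwords.words('english'), and only the tokens that do not appear in this list are kept, which removes the stop words. As a result, only the meaningful words are extracted, which can help improve the accuracy of the analysis.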

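import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords') # download the stop word dataset
nltk.download('punkt') # download the tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)

# Example text
text = "This is an example showing off stop word filtration."

# Tokenize the text
tokens = word_tokenize(text)

# English stop word list provided by NLTK (all entries are lowercase)
stop_words = set(stopwords.words('english'))

# Remove stop words; compare in lowercase so that "This" is also filtered out
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]

print("Original tokens:", tokens)
print("After stop word removal:", filtered_tokens)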
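The filtering step itself does not depend on NLTK. The sketch below repeats it with a small hand-written stop word set and a pre-tokenized list, both of which are assumptions for illustration, to show why the comparison is usually done in lowercase.

```python
# Hypothetical, hand-picked stop word set (NLTK's real list is much longer).
toy_stop_words = {"this", "is", "an", "off", "the", "a", "and"}

# Tokens written out by hand so the example runs without a tokenizer.
tokens = ["This", "is", "an", "example", "showing", "off",
          "stop", "word", "filtration", "."]

# Case-sensitive check: the capitalized "This" slips through.
case_sensitive = [t for t in tokens if t not in toy_stop_words]

# Case-insensitive check: lowercase each token before the lookup.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

print(case_sensitive)    # ['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
print(case_insensitive)  # ['example', 'showing', 'stop', 'word', 'filtration', '.']
```

Storing the stop words in a set keeps each membership test constant-time on average, which matters once the token list grows large.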