nltk

Word count: 423 · 2019-09-10

Install

pip install nltk

Interactive installer

import nltk
nltk.download()
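The interactive installer can be skipped by downloading resources by name. A minimal sketch, assuming the classic resource IDs `punkt`, `averaged_perceptron_tagger` and `stopwords`; recent NLTK releases may ask for `punkt_tab` and `averaged_perceptron_tagger_eng` instead. The helper names `missing_resources` and `download_missing` are made up for this note.

```python
import nltk

# Resource IDs mapped to the paths nltk.data.find() looks them up under.
# Assumption: classic names; newer NLTK versions may use different IDs.
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "stopwords": "corpora/stopwords",
}

def missing_resources(resources=RESOURCES):
    """Return the IDs of resources that are not installed locally."""
    missing = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(name)
    return missing

def download_missing(resources=RESOURCES):
    """Download only what is missing (needs network access)."""
    for name in missing_resources(resources):
        nltk.download(name, quiet=True)
```

Calling `download_missing()` once before the examples further down should be enough for them to run.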

Part-of-speech tags

Tag	Description
CC	Coordinating conjunction
CD	Cardinal number
DT	Determiner
IN	Preposition or subordinating conjunction
JJ	Adjective
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
POS	Possessive ending
PRP	Personal pronoun
RB	Adverb
VB	Verb, base form
WP	Wh-pronoun
TO	to

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

https://www.nltk.org/
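A tagger returns (word, tag) pairs, so the table can be used as a lookup. A small sketch: the dictionary only holds the tags listed in the table, the helper `describe` is hypothetical, and the sample is hand-tagged for illustration rather than real tagger output.

```python
# Descriptions copied from the tag table (a subset of the Penn Treebank set).
TAG_DESCRIPTIONS = {
    "CC": "Coordinating conjunction",
    "CD": "Cardinal number",
    "DT": "Determiner",
    "IN": "Preposition or subordinating conjunction",
    "JJ": "Adjective",
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "POS": "Possessive ending",
    "PRP": "Personal pronoun",
    "RB": "Adverb",
    "VB": "Verb, base form",
    "WP": "Wh-pronoun",
    "TO": "to",
}

def describe(tagged):
    """Attach a description to each (word, tag) pair; unknown tags get '?'."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, "?")) for word, tag in tagged]

# Hand-tagged for illustration, not actual tagger output.
sample = [("She", "PRP"), ("reads", "VBZ"), ("two", "CD"), ("books", "NNS")]
for row in describe(sample):
    print(row)
```

`VBZ` comes back as `?` because the table is only a subset; the full list is on the Penn Treebank page linked above.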

Hello World

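import nltk
from collections import Counter

# Needs the tokenizer and tagger data (install them with nltk.download()).
with open("Harry Potter and the Sorcerer's Stone.txt") as f:
  hp = f.read()

tokens = nltk.word_tokenize(hp)
tagged = nltk.pos_tag(tokens)  # list of (word, tag) pairs

# Count proper nouns: NNP (singular) and NNPS (plural).
c = Counter(t for t in tagged if t[1] in ('NNP', 'NNPS'))

# The 20 most frequent proper nouns, i.e. mostly character names.
c.most_common(20)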
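The counting step can be tried without the book or any NLTK data. A sketch, assuming the input has the same shape as the output of `nltk.pos_tag`; the function name `top_proper_nouns` and the hand-tagged sample are made up for illustration. Unlike the snippet above, it counts by word only, so the same name tagged NNP in one place and NNPS in another is merged.

```python
from collections import Counter

def top_proper_nouns(tagged, n=20):
    """Return the n most common words tagged NNP or NNPS."""
    counts = Counter(word for word, tag in tagged if tag in ("NNP", "NNPS"))
    return counts.most_common(n)

# Hand-tagged sample in the (word, tag) shape that nltk.pos_tag returns.
tagged = [
    ("Harry", "NNP"), ("looked", "VBD"), ("at", "IN"), ("Ron", "NNP"),
    ("and", "CC"), ("Harry", "NNP"), ("smiled", "VBD"),
]
print(top_proper_nouns(tagged, 1))
```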

tokenizing - splitting text into units; word tokenizers split it into words, sentence tokenizers into sentences
corpus (plural: corpora) - a body of text, e.g. medical journals, presidential speeches
lexicon - words and their meanings
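To see what a word tokenizer produces, `wordpunct_tokenize` is handy: it is purely regex-based and needs no downloaded data, whereas `word_tokenize` and `sent_tokenize` rely on the punkt model. A short sketch with a made-up sentence:

```python
from nltk.tokenize import wordpunct_tokenize

text = "Mr. Dursley wasn't home."

# Splits into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(text)
print(tokens)
# ['Mr', '.', 'Dursley', 'wasn', "'", 't', 'home', '.']
```

This tokenizer breaks "wasn't" into three tokens; `word_tokenize` handles contractions differently, so results are not interchangeable.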

Stop Words

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
# The stop word list is lowercase, so compare lowercased tokens.
words = [w for w in tokens if w.lower() not in stop_words]
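The filter works the same with any set of stop words. A sketch with a tiny hand-written set standing in for the NLTK list (an assumption, so that it runs without the stopwords corpus); the function name `remove_stop_words` is made up.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in stop_words."""
    return [w for w in tokens if w.lower() not in stop_words]

# Tiny illustrative set; in practice pass set(stopwords.words('english')).
demo_stop_words = {"the", "a", "was", "in"}

tokens = ["The", "boy", "was", "in", "a", "cupboard"]
print(remove_stop_words(tokens, demo_stop_words))
# ['boy', 'cupboard']
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive.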

Stemming

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = [ps.stem(w) for w in tokens]
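A quick way to see what the stemmer does, with a few made-up words. `PorterStemmer` is rule-based and needs no downloaded data; the helper `stem_all` is hypothetical.

```python
from nltk.stem import PorterStemmer

def stem_all(words):
    """Stem every word with a single PorterStemmer instance."""
    ps = PorterStemmer()
    return [ps.stem(w) for w in words]

print(stem_all(["running", "jumped", "cats"]))
# ['run', 'jump', 'cat']
```

Stemming strips suffixes by rule, so a stem is not guaranteed to be a dictionary word.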
