nltk

字数 423 · 2019-09-10

#python

Install

1
pip install nltk

Interactive installer

1
2
import nltk
nltk.download()

part-of-speech tags

Tag Desc
CC Coordinating conjunction
CD Cardinal number
DT Determiner
IN Preposition or subordinating conjunction
JJ Adjective
NN Noun, singular or mass
NNS Noun, plural
POS Possessive ending
PRP Personal pronoun
RB Adverb
VB Verb, base form
WP Wh-pronoun
TO to

https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

https://www.nltk.org/

Hello World

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import nltk
from collections import Counter

with open('Harry Potter and the Sorcerer\'s Stone.txt') as f:
  hp = f.read()

tokens = nltk.word_tokenize(hp)
tagged = nltk.pos_tag(tokens)

c = Counter()

nnps = list(filter(lambda t:t[1] in ['NNP', 'NNPS'], tagged))

for t in nnps:
  c[t] += 1

c.most_common(20)
  • tokenizing - word tokenizers, sentence tokenizers
  • corpora - body of text. ex: medical journals, presidential speeches
  • lexicon - words and their means

Stop Words

1
2
3
4
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
words = [w for w in tokens if not w in stop_words]

Stemming

1
2
3
4
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = list(map(lambda w: ps.stem(w), tokens))

Tagging