
Stemming the words and word lemmatization: Python Data Science Cookbook

2017-02-11 00:08
English grammar dictates how certain words are used in sentences. For example, perform, performing, and performs indicate the same action; they appear in
different sentences based on the grammar rules. 

"The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form." (Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze)

(Chinese words have no such morphological inflection, but Chinese word segmentation is a hard problem in its own right.)
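
As a rough illustration of why a common base form is useful, the sketch below counts tokens before and after mapping each one to a base form. The BASE_FORMS table, the normalize helper, and the sample sentence are hand-written assumptions for this example; they stand in for a real stemmer or lemmatizer and are not part of NLTK.

```python
from collections import Counter

# Hypothetical hand-written lookup; a real stemmer or lemmatizer would replace it.
BASE_FORMS = {
    "performs": "perform",
    "performing": "perform",
}

def normalize(token):
    """Return the known base form, or the lowercased token if none is known."""
    token = token.lower()
    return BASE_FORMS.get(token, token)

# Made-up sample sentence containing the three forms mentioned above.
tokens = "She performs daily while he is performing today and they perform too".split()

raw_counts = Counter(t.lower() for t in tokens)
normalized_counts = Counter(normalize(t) for t in tokens)

# Without normalization only the exact surface form "perform" is counted;
# with it, all three inflected forms are counted together.
print("raw count: %d" % raw_counts["perform"])
print("normalized count: %d" % normalized_counts["perform"])
```

With the forms merged, a search or a word-frequency table treats the three variants as one term, which is the effect stemming and lemmatization aim for.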

Stemming the words

Let’s look into how we can perform word stemming using Python NLTK. NLTK provides us with a rich set of functions that can help us do the stemming pretty easily:

>>> import nltk.stem
>>> dir(nltk.stem)
['ISRIStemmer', 'LancasterStemmer', 'PorterStemmer', 'RSLPStemmer',
'RegexpStemmer', 'SnowballStemmer', 'StemmerI', 'WordNetLemmatizer',
'__builtins__', '__doc__', '__file__', '__name__', '__package__',
'__path__', 'api', 'isri', 'lancaster', 'porter', 'regexp', 'rslp',
'snowball', 'wordnet']
Among these, we will look at the following stemmers:
Porter – Porter stemmer
Lancaster – Lancaster stemmer
Snowball – Snowball stemmer

Porter is the most commonly used stemmer. The algorithm is not very aggressive when moving words to their root form.

Snowball is an improvement over Porter. It is also faster than Porter in terms of computation time.
Lancaster is the most aggressive stemmer. With Porter and Snowball, the final word tokens are still readable by humans, but with Lancaster they may not be. It is the fastest of the trio.

There’s more…

All three algorithms are fairly involved; going into their details is beyond the scope of this book. I recommend looking on the web for more details on these algorithms. For details of the Porter and Snowball stemmers, refer to
the following link: http://snowball.tartarus.org/algorithms/porter/stemmer.html
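
To give a flavour of what a rule-based stemmer does, here is a deliberately tiny suffix stripper. It is only a sketch: the rules in TOY_RULES, the toy_stem function, and its min_stem_len parameter are made up for illustration and are not the Porter, Snowball, or Lancaster rules.

```python
# Toy suffix-stripping rules, tried in order; the first usable match wins.
# These rules are invented for illustration and are NOT the Porter rules.
TOY_RULES = [
    ("ies", "i"),  # flies -> fli
    ("ing", ""),   # running -> runn
    ("ed", ""),    # planted -> plant
    ("s", ""),     # dogs -> dog
]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    # No rule applied (or every candidate stem was too short): leave the word alone.
    return word

for w in ["dogs", "flies", "planted", "running", "fry"]:
    print("%s -> %s" % (w, toy_stem(w)))
```

Note that this toy version turns running into runn, because it has no rule for doubled consonants. Handling cases like that is part of what makes the real algorithms involved.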

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# @author: snaildove
# Load libraries
from nltk import stem

# 1. A small input to figure out how the three stemmers perform.
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries',
               'fry', 'weeks', 'planted', 'running', 'throttle']

# Let's jump into the different stemming algorithms, as follows:
# 2. Porter stemming
porter = stem.porter.PorterStemmer()
p_words = [porter.stem(w) for w in input_words]
print(p_words)

# 3. Lancaster stemming
lancaster = stem.lancaster.LancasterStemmer()
l_words = [lancaster.stem(w) for w in input_words]
print(l_words)

# 4. Snowball stemming
snowball = stem.snowball.EnglishStemmer()
s_words = [snowball.stem(w) for w in input_words]
print(s_words)
output :

[u'movi', u'dog', u'plane', u'flower', u'fli', u'fri', u'fri', u'week', u'plant', u'run', u'throttl']
[u'movy', 'dog', 'plan', 'flow', 'fli', 'fri', 'fry', 'week', 'plant', 'run', 'throttle']
[u'movi', u'dog', u'plane', u'flower', u'fli', u'fri', u'fri', u'week', u'plant', u'run', u'throttl']
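
One crude way to compare how hard each stemmer cuts is to count the characters it removed and to list the words on which two stemmers disagree. The sketch below does this on the output lists copied from the run above, so it does not need NLTK. The helper names chars_removed and disagreements are assumptions introduced for this example, and total length difference is only a rough proxy for aggressiveness.

```python
# Input words and stemmer outputs copied from the run above (u'' prefixes dropped).
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries',
               'fry', 'weeks', 'planted', 'running', 'throttle']
outputs = {
    'porter': ['movi', 'dog', 'plane', 'flower', 'fli', 'fri', 'fri',
               'week', 'plant', 'run', 'throttl'],
    'lancaster': ['movy', 'dog', 'plan', 'flow', 'fli', 'fri', 'fry',
                  'week', 'plant', 'run', 'throttle'],
    'snowball': ['movi', 'dog', 'plane', 'flower', 'fli', 'fri', 'fri',
                 'week', 'plant', 'run', 'throttl'],
}

def chars_removed(words, stems):
    """Total length difference between the words and their stems."""
    return sum(len(w) - len(s) for w, s in zip(words, stems))

def disagreements(words, stems_a, stems_b):
    """Return (word, stem_a, stem_b) for every word the two stemmers treat differently."""
    return [(w, a, b) for w, a, b in zip(words, stems_a, stems_b) if a != b]

for name in sorted(outputs):
    print("%s removed %d characters" % (name, chars_removed(input_words, outputs[name])))

for row in disagreements(input_words, outputs['porter'], outputs['lancaster']):
    print("%s: porter=%s lancaster=%s" % row)
```

On this small sample Porter and Snowball give identical results, while Lancaster differs on five words: it cuts planes and flowers further (plan, flow) but leaves fry and throttle untouched.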

Word lemmatization

Stemming is a heuristic process that chops off word suffixes in order to get to the root form of the word. In the previous recipe, we saw that it may end up chopping even correct words, that is, removing
the derivational affixes. See Wikipedia for the derivational patterns.

On the other hand, lemmatization uses morphological analysis and a vocabulary to get the lemma of a word. It tries to change only the inflectional endings and return the base word from a dictionary. See Wikipedia for more information
on inflection.

We will use NLTK’s WordNetLemmatizer.

# Load libraries
from nltk import stem

# 1. A small input to figure out how the lemmatizer performs.
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries',
               'fry', 'weeks', 'planted', 'running', 'throttle']

# 2. Perform lemmatization (every word is treated as a noun by default).
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print(wn_words)

output :

[u'movie', u'dog', u'plane', u'flower', u'fly', u'fry', 'fry', u'week', 'planted', 'running', 'throttle']

The word running should ideally be run and our lemmatizer should have gotten it right. We can see that it has not made any changes to running. However, our heuristic-based stemmers have got it right! 

>>> wordnet_lemm.lemmatize('running')
'running'
>>> porter.stem('running')
u'run'
>>> lancaster.stem('running')
'run'
>>> snowball.stem('running')
u'run'


By default, the lemmatizer assumes that the input is a noun; this can be rectified by passing the POS tag of the word to our lemmatizer, as follows:
>>> wordnet_lemm.lemmatize('running','v')
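
To see why the part of speech matters for a dictionary-based approach, consider the toy lookup below. It is a minimal sketch under stated assumptions: TOY_LEMMAS and toy_lemmatize are hypothetical names with a few hand-picked entries, not WordNet data or NLTK's implementation. The 'v' tag follows the call above, and 'n' is assumed here to stand for noun.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
# 'v' = verb as in the call above; 'n' = noun is an assumption of this sketch.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("planted", "v"): "plant",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("movies", "n"): "movie",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)

# Treated as a noun there is no entry for "running", so it comes back unchanged.
print(toy_lemmatize("running"))
# Treated as a verb the lookup succeeds.
print(toy_lemmatize("running", "v"))
```

The lookup can only succeed when the word is filed under the part of speech that was asked for, which mirrors why the default noun assumption left running and planted unchanged in the output above.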