Stemming the words and word lemmatization —— Python Data Science CookBook
2017-02-11 00:08
English grammar dictates how certain words are used in sentences. For example, perform, performing, and performs indicate the same action; they appear in
different sentences based on the grammar rules.
"The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form." — Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. (Chinese has no inflectional word forms, but Chinese word segmentation is itself a hard problem.)
Stemming the words
Let’s look into how we can perform word stemming using Python NLTK. NLTK provides us with a rich set of functions that can help us do the stemming pretty easily:

>>> import nltk.stem
>>> dir(nltk.stem)
['ISRIStemmer', 'LancasterStemmer', 'PorterStemmer', 'RSLPStemmer', 'RegexpStemmer', 'SnowballStemmer', 'StemmerI', 'WordNetLemmatizer', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__path__', 'api', 'isri', 'lancaster', 'porter', 'regexp', 'rslp', 'snowball', 'wordnet']

We have the following stemmers:
Porter – Porter stemmer
Lancaster – Lancaster stemmer
Snowball – Snowball stemmer
Porter is the most commonly used stemmer. The algorithm is not very aggressive when reducing words to their root form.
Snowball is an improvement over Porter, and it is also faster than Porter in terms of computation time.
Lancaster is the most aggressive stemmer. With Porter and Snowball, the final word tokens are still readable by humans, but with Lancaster they often are not. It is the fastest of the trio.
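The difference in aggressiveness shows up even on a single word from the example input list. A minimal sketch, assuming NLTK is installed (the expected stems are the ones shown in the output later in this recipe):

```python
# Compare the three stemmers on one word from the example input.
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

word = 'flowers'
print(PorterStemmer().stem(word))             # flower
print(LancasterStemmer().stem(word))          # flow  (most aggressive)
print(SnowballStemmer('english').stem(word))  # flower
```

Porter and Snowball keep a readable stem, while Lancaster chops all the way down to 'flow'.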
There’s more…
All three algorithms are pretty involved; going into their details is beyond the scope of this book. I recommend looking to the web for more details on these algorithms. For details of the Porter and Snowball stemmers, refer to the following link: http://snowball.tartarus.org/algorithms/porter/stemmer.html
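To get a feel for how suffix-stripping rules work without wading through the full Porter algorithm, NLTK's RegexpStemmer (visible in the dir() listing earlier) lets you define a toy rule set of your own. The rules below are made up for illustration; they are not the actual Porter rules:

```python
# A toy suffix-stripping stemmer; the regex rules here are illustrative only.
from nltk.stem import RegexpStemmer

# Strip a trailing 'ing', 'ed', or 's', but only from words of length >= 4.
toy = RegexpStemmer('ing$|ed$|s$', min=4)
print(toy.stem('planted'))  # plant
print(toy.stem('running'))  # runn  -- naive rules overshoot; Porter handles this case
```

This shows why the real algorithms need many context-sensitive rules rather than a flat suffix list.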
Example:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load Libraries
from nltk import stem

# 1. Small input to figure out how the three stemmers perform.
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries', 'fry', 'weeks', 'planted', 'running', 'throttle']

# Let's jump into the different stemming algorithms, as follows:
# 2. Porter stemming
porter = stem.porter.PorterStemmer()
p_words = [porter.stem(w) for w in input_words]
print p_words

# 3. Lancaster stemming
lancaster = stem.lancaster.LancasterStemmer()
l_words = [lancaster.stem(w) for w in input_words]
print l_words

# 4. Snowball stemming
snowball = stem.snowball.EnglishStemmer()
s_words = [snowball.stem(w) for w in input_words]
print s_words

Output:
[u'movi', u'dog', u'plane', u'flower', u'fli', u'fri', u'fri', u'week', u'plant', u'run', u'throttl']
[u'movy', 'dog', 'plan', 'flow', 'fli', 'fri', 'fry', 'week', 'plant', 'run', 'throttle']
[u'movi', u'dog', u'plane', u'flower', u'fli', u'fri', u'fri', u'week', u'plant', u'run', u'throttl']
Word lemmatization
Stemming is a heuristic process that chops off word suffixes in order to get to the root form of the word. In the previous recipe, we saw that it may end up chopping even the right words, that is, chopping the derivational affixes. See the following Wikipedia link for the derivational patterns:
http://en.wikipedia.org/wiki/Morphological_derivation#Derivational_patterns
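A classic illustration of this over-chopping is the Porter stemmer conflating the derivationally related but semantically distinct words university and universe. A quick check, assuming NLTK is installed:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
# Both words collapse to the same stem, losing the distinction between them.
print(porter.stem('university'))  # univers
print(porter.stem('universe'))    # univers
```

A lemmatizer, working from a dictionary, would keep these two words apart.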
On the other hand, lemmatization uses morphological analysis and a vocabulary to get the lemma of a word. It tries to change only the inflectional endings and return the base word from a dictionary. See Wikipedia for more information on inflection at http://en.wikipedia.org/wiki/Inflection.
We will use NLTK’s WordNetLemmatizer.
# Load Libraries
from nltk import stem

# 1. Small input to figure out how the lemmatizer performs.
input_words = ['movies', 'dogs', 'planes', 'flowers', 'flies', 'fries', 'fry', 'weeks', 'planted', 'running', 'throttle']

# 2. Perform lemmatization.
wordnet_lemm = stem.WordNetLemmatizer()
wn_words = [wordnet_lemm.lemmatize(w) for w in input_words]
print wn_words
Output:
[u'movie', u'dog', u'plane', u'flower', u'fly', u'fry', 'fry', u'week', 'planted', 'running', 'throttle']
The word running should ideally be run and our lemmatizer should have gotten it right. We can see that it has not made any changes to running. However, our heuristic-based stemmers have got it right!
>>> wordnet_lemm.lemmatize('running')
'running'
>>> porter.stem('running')
u'run'
>>> lancaster.stem('running')
'run'
>>> snowball.stem('running')
u'run'
Tip
By default, the lemmatizer assumes that the input is a noun; this can be rectified by passing the POS tag of the word to our lemmatizer, as follows:

>>> wordnet_lemm.lemmatize('running', 'v')
u'run'