The bag of words representation —— Python Data Science Cookbook
2017-02-11 00:17
In order to do machine learning on text, we will need to convert the text to numerical feature vectors. In the bag of words representation, the text is converted to numerical vectors whose column names are the underlying words, and the values can be any of the following:
Binary, which indicates whether the word is present/absent in the given document
Frequency, which indicates the count of the word in the given document
TFIDF, which is a score that we will cover subsequently
Bag of words is the most common way of representing text. As the name suggests, the order of words is ignored; only the presence/absence of words is key to this representation. It is a two-step process, as follows:
1. For every word present in the documents of the training set, we assign an integer ID and store this word-to-ID mapping as a dictionary.
2. For every document, we create a vector. The columns of the vector are the words themselves; they form the features. The value of each cell is binary, a frequency, or a TFIDF score. (A minimal hand-rolled sketch of these two steps follows this list.)
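To make these two steps concrete, here is a minimal hand-rolled sketch (illustrative only; the recipe below relies on scikit-learn's CountVectorizer to do the same work):
# A minimal hand-rolled bag of words (frequency variant), for illustration only.
docs = ["text mining is fun", "text analytics and text mining"]

# Step 1: assign an integer id to every word seen in the corpus.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: one vector per document; cell j holds the count of word j.
vectors = []
for doc in docs:
    vec = [0] * len(vocabulary)
    for word in doc.split():
        vec[vocabulary[word]] += 1
    vectors.append(vec)

print vocabulary  # e.g. {'text': 0, 'mining': 1, 'is': 2, 'fun': 3, ...}
print vectors     # the second row counts 'text' twice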
Tip
Depending on your application, the notion of a document can change. In this case, each sentence is considered as a document. In some cases, we can also treat a paragraph as a document. In web page mining, a single web page can be treated as a document, or parts of the web page separated by <p> tags can also be treated as documents. In our case, the sentences are our documents.
Example
In step 3 of the source code, we import CountVectorizer from the sklearn.feature_extraction.text package. It converts a collection of documents—in this case, a list of sentences—to a matrix, where the rows are sentences and the columns are the words in these sentences. The counts of these words are inserted as the values of these cells. count_v is a CountVectorizer object.
We mentioned in the introduction that we need to build a dictionary of all the words in the given text. The vocabulary_ attribute of the CountVectorizer object provides us with the list of words and their associated IDs or feature indices.
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# 1. Our input text; we use the same input as in the stop word removal recipe.
text = "Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. Highquality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted."

# Let's jump into how to transform the text into a bag of words representation.
# 2. Divide the given text into sentences.
sentences = sent_tokenize(text)
print len(sentences)

# 3. Generate the feature vectors.
count_v = CountVectorizer()
tdm = count_v.fit_transform(sentences)
print "num of features/vocabulary :"
print len(count_v.vocabulary_)
print "vocabulary: "
print count_v.vocabulary_
print "tdm : "
print tdm
print "type of tdm: "
print type(tdm)
print "params of CountVectorizer class: "
print count_v._get_param_names()
output :
6
num of features/vocabulary :
123
vocabulary:
{u'nlp': 66, u'named': 64, u'concept': 16, u'interpretation': 50, u'features': 33, u'classification': 13, u'text': 108, u'into': 51, u'within': 120, u'entity': 27, u'structuring': 99, u'via': 117, u'through': 110, u'statistical': 97, u'such': 102, u'quality': 82, u'linguistic': 57, u'clustering': 14, u'visualization': 118, u'categorization': 12, u'from': 37, u'to': 111, u'addition': 0, u'structured': 98, u'relations': 87, ............................................................................................................. , u'usually': 116, u'model': 62, u'typically': 115, u'or': 69, u'relation': 86, u'typical': 114}
tdm :
(0, 37) 1
(0, 46) 1
: :
(0, 108) 4
(1, 55) 1
: :
(5, 111) 2
(5, 108) 1
type of tdm:
<class 'scipy.sparse.csr.csr_matrix'>
params of CountVectorizer class:
['analyzer', 'binary', 'decode_error', 'dtype', 'encoding', 'input', 'lowercase', 'max_df', 'max_features', 'min_df', 'ngram_range', 'preprocessor', 'stop_words', 'strip_accents', 'token_pattern', 'tokenizer', 'vocabulary']
The vocabulary_ attribute of the CountVectorizer object is a map of the terms to feature indices. We can also use the following function to get the list of words (features):
count_v.get_feature_names()
output :
[u'addition', u'along', u'also', u'analysis', u'analytical', u'analytics', u'and', u'annotation', u'application', u'as', u'association', u'between', u'categorization', u'classification', u'clustering', u'combination', u'concept', u'data', u'database', u'derived', u'deriving', u'devising', u'distributions', u'document', u'documents', u'either', u'entities', u'entity', u'equivalent', u'essentially', u'evaluation', u'extracted', u'extraction', u'features', u'finally', u'for', u'frequency', u'from', u'goal', u'granular', u'high', u'highquality', u'in', u'include', u'including', u'index', u'information', u'input', u'insertion', u'interestingness', u'interpretation', u'into', u'involves', u'is', u'language', u'learning', u'lexical', u'linguistic', u'link', u'means', u'methods', u'mining', u'model', u'modeling', u'named', u'natural', u'nlp', u'novelty', u'of', u'or', u'others', u'output', u'overarching', u'parsing', u'pattern', u'patterns', u'populate', u'predictive', u'process', u'processing', u'production', u'purposes', u'quality', u'recognition', u'referred', u'refers', u'relation', u'relations', u'relevance', u'removal', u'retrieval', u'roughly', u'scan', u'search', u'sentiment', u'set', u'some', u'statistical', u'structured', u'structuring', u'study', u'subsequent', u'such', u'summarization', u'tagging', u'tasks', u'taxonomies', u'techniques', u'text', u'the', u'through', u'to', u'trends', u'turn', u'typical', u'typically', u'usually', u'via', u'visualization', u'with', u'within', u'word', u'written']
The type of tdm is <class 'scipy.sparse.csr.csr_matrix'>; refer to: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
CSR is short for Compressed Sparse Row matrix.
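As a small aside (not part of the original recipe), the following sketch shows how a CSR matrix stores only the non-zero entries, which is why it is memory-efficient for term-document matrices that are mostly zeros:
from scipy.sparse import csr_matrix
import numpy as np

# A tiny dense matrix with mostly zeros, converted to CSR form.
dense = np.array([[0, 2, 0],
                  [1, 0, 3]])
sparse = csr_matrix(dense)
print sparse.data       # [2 1 3]  the non-zero values, row by row
print sparse.indices    # [1 0 2]  the column index of each stored value
print sparse.indptr     # [0 1 3]  where each row starts in data/indices
print sparse.toarray()  # recover the dense form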
sklearn.feature_extraction.text.CountVectorizer
The CountVectorizer class offers many other features/parameters for transforming the text into feature vectors. Let's look at some of them:
binary : boolean, default=False
If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
lowercase : boolean, True by default
Convert all characters to lowercase before tokenizing.
stop_words : string {‘english’}, list, or None (default)
If ‘english’, a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
More about sklearn.feature_extraction.text.CountVectorizer; refer to: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Apply some parameters:
# While creating a mapping from words to feature indices, we can ignore
# some words by providing a stop word list.
stop_words = stopwords.words('english')
count_v_sw = CountVectorizer(stop_words=stop_words)
sw_tdm = count_v_sw.fit_transform(sentences)
print "num of features/vocabulary :"
print len(count_v_sw.get_feature_names())
print "new tdm which removed stop_words : "
print sw_tdm

# Use ngrams
count_v_ngram = CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))
ngram_tdm = count_v_ngram.fit_transform(sentences)
print "num of features/vocabulary :"
print len(count_v_ngram.get_feature_names())
print "ngram tdm which removed stop_words : "
print ngram_tdm

output :
num of features/vocabulary :
107
new tdm which removed stop_words :
(0, 40) 1
(0, 72) 1
: :
(5, 99) 1
(5, 15) 1
(5, 40) 1
(5, 14) 1
(5, 96) 1
num of features/vocabulary :
250
ngram tdm which removed stop_words :
(0, 96) 1
(0, 169) 1
: :
(5, 92) 1
(5, 33) 1
(5, 219) 1
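Similarly, the binary flag described above can be demonstrated with a small sketch (not part of the original recipe; it reuses the sentences list from the earlier step):
# Sketch: with binary=True, cells record presence/absence instead of counts.
count_v_bin = CountVectorizer(binary=True)
bin_tdm = count_v_bin.fit_transform(sentences)
print bin_tdm.max()  # prints 1: every non-zero count is capped at 1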
Term frequencies and inverse document frequencies
Occurrences and counts are good as feature values, but they suffer from some problems. Let's say that we have four documents of unequal length. This gives a higher weightage to the terms in the longer documents than to those in the shorter ones. So, instead of using the plain vanilla occurrence count, we will normalize it: we divide the number of occurrences of a word in a document by the total number of words in the document. This metric is called term frequency. Term frequency is also not without problems. There are words that occur in many documents. These words would dominate the feature vector, but they are not informative enough to distinguish the documents in the corpus. Before we look into a new metric that avoids this problem, let's define document frequency. Similar to term frequency, which is local with respect to a document, we can calculate a score called document frequency: the number of documents in the corpus in which the word occurs, divided by the total number of documents in the corpus.
The final metric that we will use for the words is the product of the term frequency and the inverse of the document frequency. This is called the TFIDF score (wiki: https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
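To make the arithmetic concrete, here is a hand-computed sketch of the plain textbook formulas (note that scikit-learn's TfidfTransformer uses a smoothed idf and l2 normalization by default, so its scores will differ from these):
import math

# Toy corpus: three tokenized documents.
docs = [["text", "mining", "is", "fun"],
        ["text", "analytics"],
        ["deep", "learning"]]

def tf(word, doc):
    # term frequency: occurrences of word / total words in the document
    return doc.count(word) / float(len(doc))

def idf(word, docs):
    # inverse document frequency: log(total docs / docs containing word)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / float(df))

# 'text' occurs in 2 of 3 documents, 'mining' in only 1, so 'mining'
# earns the higher score within the first document.
print tf("text", docs[0]) * idf("text", docs)      # 0.25 * log(1.5) ~ 0.101
print tf("mining", docs[0]) * idf("mining", docs)  # 0.25 * log(3.0) ~ 0.275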
example :
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: snaildove
"""
# Load libraries
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# 1. We create an input document as in the previous recipe.
text = "Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. Highquality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP) and analytical methods.A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted."

# Let's see how to find the term frequency and inverse document frequency.
# 2. Extract the sentences.
sentences = sent_tokenize(text)
print "num of sentences :"
print len(sentences)

# 3. Create a matrix of term document frequency.
stop_words = stopwords.words('english')
count_v = CountVectorizer(stop_words=stop_words)
tdm = count_v.fit_transform(sentences)
print "vocabulary: "
print count_v.vocabulary_
print "tdm : "
print tdm

# 4. Calculate the TFIDF score.
tfidf = TfidfTransformer()
tdm_tfidf = tfidf.fit_transform(tdm)
print "tf-idf :"
print tdm_tfidf.data

output :
num of sentences :
6
vocabulary:
{u'nlp': 58, u'named': 56, u'concept': 13, u'interpretation': 44, u'features': 30, u'classification': 10, u'text': 96, .............................................................................................. , u'usually': 101, u'model': 54, u'typically': 100, u'retrieval': 80, u'involves': 45, u'typical': 99}
tdm :
(0, 40) 1
(0, 72) 1
: :
(0, 96) 4
(1, 47) 1
: :
(5, 14) 1
(5, 96) 1
tf-idf :
[ 0.54105639 0.31326362 0.26401921 ..., 0.15746858 0.15746858 0.15746858]
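As a usage note, scikit-learn also ships a TfidfVectorizer that combines CountVectorizer and TfidfTransformer into one step; here is a minimal sketch on the same sentences (a convenience alternative, not what the original recipe uses):
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: one-step alternative to CountVectorizer + TfidfTransformer.
tfidf_v = TfidfVectorizer(stop_words=stop_words)
tdm_tfidf_direct = tfidf_v.fit_transform(sentences)
print tdm_tfidf_direct.shape  # same (sentences x vocabulary) shape as tdm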