Naive Bayes algorithm for spam classification (Matlab implementation)
2014-07-10 02:35
The materials, data, and algorithms come from Andrew Ng's Stanford Machine Learning course,
Problem set 2 (Q3).
1. Preprocessing
(1) Keep only the subject and body of each email in the dataset.
(2) Convert all words to lowercase.
(3) Replace every email address with the word EMAILADDR, and similarly web addresses with HTTPADDR, currency with DOLLAR, and numbers with NUMBER.
(4) Set the vocabulary. Apply a standard stemming algorithm, then keep only the medium-frequency tokens in the vocabulary (tokens that occur very often or very rarely are both dropped).
(5) Build the document-word matrices. The i-th row represents the i-th document/email, and the j-th column represents the j-th distinct token. Thus, the (i, j)-entry of this matrix represents the number of occurrences of the j-th token in the i-th document.
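To make steps (2), (3), and (5) concrete, here is a minimal Matlab sketch. It is not the course's preprocessing script: the two emails, the vocabulary, and the regular expressions are all made up for illustration, and stemming and frequency filtering from step (4) are skipped.

```matlab
% Hypothetical toy emails; the real data set is not reproduced here.
emails = {'Win $100 now! Visit http://spam.example.com', ...
          'Meeting at 10, mail bob@example.com'};

% Assumed vocabulary (in practice it comes from step (4)).
vocab = {'DOLLAR', 'EMAILADDR', 'HTTPADDR', 'meeting', 'NUMBER', 'win'};
docWord = zeros(numel(emails), numel(vocab));

for d = 1:numel(emails)
    s = lower(emails{d});                               % step (2): lowercase
    s = regexprep(s, 'https?://\S+', ' HTTPADDR ');     % step (3): web addresses
    s = regexprep(s, '\S+@\S+', ' EMAILADDR ');         % step (3): email addresses
    s = regexprep(s, '\$', ' DOLLAR ');                 % step (3): currency sign
    s = regexprep(s, '\d+', ' NUMBER ');                % step (3): numbers
    tokens = regexp(s, '[A-Za-z]+', 'match');           % split into alphabetic tokens
    for t = 1:numel(vocab)
        docWord(d, t) = sum(strcmp(tokens, vocab{t}));  % step (5): count occurrences
    end
end

disp(docWord)   % row i = email i, column j = count of vocab{j}
```

Words that are not in the vocabulary (such as "now" or "visit" here) simply do not get a column, which is why the classifier later only ever counts dictionary words.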
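With the data in this form, the classifier can be implemented in Matlab. (Note: the programs below use the second method from another blog post, Naive Bayes Classifier.)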
nb_train.m
clear; clc;

[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN');

trainMatrix = full(spmatrix);  % rows are documents, columns are tokens; each entry is the number of times the token occurs in the document
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of row i represents the number of times the j-th
% token appeared in email i.

% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (1 x numTrainDocs) vector containing the true
% classifications for the documents just read in. The i-th entry gives the
% correct class for the i-th email (which corresponds to the i-th row in
% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.

%-----------------------------
V = size(trainMatrix, 2);  % total number of tokens (vocabulary size)

neg = trainMatrix(trainCategory == 0, :);  % non-spam examples
pos = trainMatrix(trainCategory == 1, :);  % spam examples

% Total occurrences of vocabulary tokens in the non-spam documents. This is
% not the total word count of the non-spam documents used in the tutorial.
neg_words = sum(sum(neg));
pos_words = sum(sum(pos));

neg_log_prior = log(size(neg,1) / numTrainDocs);  % prior = number of non-spam examples / total number of examples
pos_log_prior = log(size(pos,1) / numTrainDocs);  % prior = number of spam examples / total number of examples

neg_log_phi = zeros(1, V);
pos_log_phi = zeros(1, V);
for k=1:V
  % Column k holds the count of token k in every document, so sum down the column.
  % Numerator: total occurrences of token k in all non-spam documents, plus 1.
  % Denominator neg_words + V: the word total of the non-spam documents counts
  % only words that are in the dictionary, not every word.
  neg_log_phi(k) = log((sum(neg(:,k)) + 1) / (neg_words + V));
  pos_log_phi(k) = log((sum(pos(:,k)) + 1) / (pos_words + V));
end
%----------------------------

% Try to get an informal sense of how indicative token i is for the SPAM class.
compare_log = pos_log_phi - neg_log_phi;  % log(phi_pos ./ phi_neg)
[i,j] = sort(compare_log);  % i holds the values in ascending order, j the original position of each value
j(end-4:end)  % the last 5 entries of j are the positions of the largest compare_log values;
              % looking them up in the token list gives the 5 tokens most indicative of spam
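The per-token loop in nb_train.m can also be written without a loop. The sketch below uses a made-up 3 x 4 document-word matrix (not the course data) to show the Laplace-smoothed estimate (count of token k in the class + 1) / (total token count in the class + V), and checks that each smoothed distribution sums to 1 over the vocabulary.

```matlab
% Toy document-word matrix (3 emails, 4 tokens) and labels; values are made up.
trainMatrix   = [3 0 1 0;
                 0 1 0 3;
                 1 0 2 0];
trainCategory = [1 0 1];                 % 1 = spam, 0 = non-spam
V = size(trainMatrix, 2);                % vocabulary size

pos = trainMatrix(trainCategory == 1, :);
neg = trainMatrix(trainCategory == 0, :);

% Laplace-smoothed log-likelihoods, one entry per token.
% sum(..., 1) sums down the columns even when a class has a single row.
pos_log_phi = log((sum(pos, 1) + 1) / (sum(pos(:)) + V));
neg_log_phi = log((sum(neg, 1) + 1) / (sum(neg(:)) + V));

% Each smoothed distribution sums to 1 over the vocabulary.
disp(sum(exp(pos_log_phi)));
disp(sum(exp(neg_log_phi)));

% Rank tokens by log-ratio; the last index is the most spam-indicative token.
[~, order] = sort(pos_log_phi - neg_log_phi);
disp(order);
```

Because of the +1 in the numerator, a token that never appears in one class (token 2 in the spam rows here) still gets a small nonzero probability, so its log stays finite.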
nb_test.m (run immediately after nb_train.m)
[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);

% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data
% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry
% of this vector is the predicted class (1/0) for the i-th email (i-th row
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);

%---------------
for k=1:numTestDocs
  [i,j,v] = find(testMatrix(k,:));  % nonzero entries: (i,j) are their positions, v their values
  % p(y=1|x) and p(y=0|x) share the same denominator, so only the numerators need comparing.
  % The probabilities were stored as logs during training, so they are summed here.
  neg_posterior = sum(v .* neg_log_phi(j)) + neg_log_prior;
  pos_posterior = sum(v .* pos_log_phi(j)) + pos_log_prior;
  if (neg_posterior > pos_posterior)
    output(k) = 0;
  else
    output(k) = 1;
  end
end
%---------------

% Compute the error on the test set
numErrors = 0;  % named numErrors to avoid shadowing the built-in error function
for i=1:numTestDocs
  if (category(i) ~= output(i))
    numErrors = numErrors + 1;
  end
end

% Print out the classification error on the test set
numErrors/numTestDocs
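The classification loop in nb_test.m computes, for every email, the sum of count times log-probability plus the log prior. That is a matrix-vector product, so all emails can be scored at once. The sketch below uses made-up parameters for a 3-token vocabulary (assumptions for illustration, not values trained on the course data).

```matlab
% Made-up parameters for a 3-token vocabulary (not trained values).
pos_log_phi   = log([0.6 0.3 0.1]);
neg_log_phi   = log([0.1 0.3 0.6]);
pos_log_prior = log(0.5);
neg_log_prior = log(0.5);

testMatrix = [3 1 0;     % mostly token 1, which is likelier under spam
              0 1 4;     % mostly token 3, which is likelier under non-spam
              0 0 0];    % empty email: both scores equal the log prior
category   = [1; 0; 0];  % hypothetical true labels

% Log-posterior scores (up to a shared constant) for all documents at once.
pos_score = testMatrix * pos_log_phi' + pos_log_prior;
neg_score = testMatrix * neg_log_phi' + neg_log_prior;

% Ties go to spam, matching the if/else in nb_test.m.
output = double(pos_score >= neg_score);

% Fraction of misclassified emails.
testError = mean(output ~= category);
disp(output');
disp(testError);
```

The third row shows a side effect of the tie rule: an email containing no vocabulary tokens is labeled spam whenever the two priors are equal, which is worth keeping in mind when reading the error rate.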