Andrew Ng Machine Learning EX6, Part 2: Support Vector Machines, Spam Classification
2019-04-10 17:11
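2 Spam classification
Many email services today provide spam filters that classify incoming email as spam or non-spam with high accuracy. In this part of the exercise you will build your own spam filter using SVMs.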
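2.1 Importing modules
Load the required modules:
import importlib

import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio
from sklearn import svm

import processEmail as pe   # extracts the keywords from an email
import emailFeatures as ef  # builds the feature vector for an email

# Reload the module after editing it. Jupyter is convenient for debugging,
# but an already-imported module is not re-read automatically, so changes
# only take effect after an explicit reload.
importlib.reload(ef)

plt.ion()
np.set_printoptions(formatter={'float': '{: 0.6f}'.format})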
2.2 The processEmail function
This function extracts the keywords from an email. It first normalizes the text (lowercasing, stripping HTML, and replacing numbers, URLs, email addresses and dollar signs with placeholder words), then stems each word and maps it to its index in the vocabulary list.
import re

import nltk.stem.porter
import numpy as np


def process_email(email_contents):
    vocab_list = get_vocab_list()
    # Invert the {index: word} dict once so each token lookup is O(1)
    word_to_index = {word: index for index, word in vocab_list.items()}
    word_indices = np.array([], dtype=np.int64)

    # ===================== Preprocess Email =====================
    # Convert the whole email to lower case
    email_contents = email_contents.lower()
    # Strip HTML tags
    email_contents = re.sub(r'<[^<>]+>', ' ', email_contents)
    # Any number is replaced with the string 'number'
    email_contents = re.sub(r'[0-9]+', 'number', email_contents)
    # Anything starting with http:// or https:// is replaced with 'httpaddr'
    email_contents = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_contents)
    # Strings with "@" in the middle are treated as email addresses --> 'emailaddr'
    email_contents = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email_contents)
    # The '$' sign is replaced with 'dollar'
    email_contents = re.sub(r'[$]+', 'dollar', email_contents)

    # ===================== Tokenize Email =====================
    print('==== Processed Email ====')
    stemmer = nltk.stem.porter.PorterStemmer()
    # Split on punctuation and whitespace. The hyphen is escaped so that
    # '.-:' is not read as a character range.
    tokens = re.split(r'[@$/#.\-:&*+=\[\]?!(){},\'">_<;%\s]', email_contents)

    for token in tokens:
        # Remove any non-alphanumeric characters
        token = re.sub(r'[^a-zA-Z0-9]', '', token)
        # Reduce the word to its stem
        token = stemmer.stem(token)
        if len(token) < 1:
            continue
        print(token)
        # Keep the index only if the word is in the vocabulary
        if token in word_to_index:
            word_indices = np.append(word_indices, word_to_index[token])

    print('==================')
    return word_indices


def get_vocab_list():
    vocab_dict = {}
    with open('vocab.txt') as f:
        for line in f:
            (val, key) = line.split()
            vocab_dict[int(val)] = key
    return vocab_dict
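To see what the normalization step does on its own, here is a minimal sketch that applies the same substitutions in the same order to a made-up sentence. The helper name normalize_email and the sample text are illustrative assumptions, not part of the exercise files.

```python
import re


def normalize_email(text):
    """Apply the same regex substitutions as process_email, in the same order."""
    text = text.lower()
    text = re.sub(r'<[^<>]+>', ' ', text)                      # HTML tags -> space
    text = re.sub(r'[0-9]+', 'number', text)                   # digits -> 'number'
    text = re.sub(r'(http|https)://[^\s]*', 'httpaddr', text)  # URLs -> 'httpaddr'
    text = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', text)         # emails -> 'emailaddr'
    text = re.sub(r'[$]+', 'dollar', text)                     # '$' -> 'dollar'
    return text


# Hypothetical sample text, for illustration only
sample = 'Visit http://example.com NOW, pay $100 to bob@example.com'
print(normalize_email(sample))
# visit httpaddr now, pay dollarnumber to emailaddr
```

Note that '$100' becomes 'dollarnumber', which the stemmer later reduces to the 'dollarnumb' token seen in the processed email.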
Call processEmail to extract the keywords of the sample email:
# ===================== Part 1: Email Preprocessing =====================
print('Preprocessing sample email (emailSample1.txt) ...')

with open('emailSample1.txt', 'r') as f:
    file_contents = f.read()
word_indices = pe.process_email(file_contents)
Preprocessing sample email (emailSample1.txt) ...
==== Processed Email ====
anyon know how much it cost to host a web portal well it depend on how mani visitor you re expect thi can be anywher from less than number buck a month to a coupl of dollarnumb you should checkout httpaddr or perhap amazon ecnumb if your run someth big to unsubscrib yourself from thi mail list send an email to emailaddr
==================
Show the vocabulary keys of the words that were found in the vocabulary:
# Print stats
print('Word Indices: ')
print(word_indices)
Word Indices:
[ 86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794 1002 1893 1364 592 1676 238 162 89 688 945 1663 1120 1062 1699 375 1162 479 1893 1510 799 1182 1237 810 1895 1440 1547 181 1699 1758 1896 688 1676 992 961 1477 71 530 1699 531]
Convert the extracted word indices into a feature vector:
# ===================== Part 2: Feature Extraction =====================
print('Extracting Features from sample email (emailSample1.txt) ... ')

# Extract features
features = ef.email_features(word_indices)

# Print stats
print('Length of feature vector: {}'.format(features.size))
print('Number of non-zero entries: {}'.format(np.flatnonzero(features).size))  # np.sum(features)
Extracting Features from sample email (emailSample1.txt) ...
Length of feature vector: 1900
Number of non-zero entries: 45
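The emailFeatures module is imported above but its source is not listed in this post. The following is only a minimal sketch of what such a function might look like, assuming a binary bag-of-words vector, 1-based vocabulary keys, and a length of 1899 to match the columns of the training matrix X. The author's own version reports a length of 1900, so it evidently sizes the vector differently.

```python
import numpy as np


def email_features(word_indices, n_words=1899):
    """Binary bag-of-words vector: entry j is 1 if vocabulary word j + 1 occurs."""
    features = np.zeros(n_words)
    # Assumption: word_indices are 1-based vocabulary keys, so shift by one
    features[np.asarray(word_indices, dtype=np.int64) - 1] = 1
    return features


# Hypothetical indices; 86 appears twice but only sets one entry
demo = email_features(np.array([86, 916, 86, 1899]))
print(demo.size, np.flatnonzero(demo).size)
```

Because the vector is binary, repeated words count once, which is why 53 word indices in the sample email yield only 45 non-zero entries.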
2.3 Training a linear SVM spam classifier
With C = 0.1, the accuracy on the training set reaches 99.825%.
# ===================== Part 3: Train Linear SVM for Spam Classification =====================
# Load the Spam Email dataset
# You will have X, y in your environment
data = scio.loadmat('spamTrain.mat')
X = data['X']
y = data['y'].flatten()

print('X.shape: ', X.shape, '\ny.shape: ', y.shape)
print('Training Linear SVM (Spam Classification)')
print('(this may take 1 to 2 minutes)')

c = 0.1
# Pass C by keyword; recent scikit-learn versions do not accept it positionally
clf = svm.SVC(C=c, kernel='linear')
clf.fit(X, y)

p = clf.predict(X)
print('Training Accuracy: {}'.format(np.mean(p == y) * 100))
X.shape: (4000, 1899)
y.shape: (4000,)
Training Linear SVM (Spam Classification)
(this may take 1 to 2 minutes)
Training Accuracy: 99.825
2.4 Evaluating the trained linear SVM on the test set
Accuracy on the test set reaches 98.9%, which is a good result.
# ===================== Part 4: Test Spam Classification =====================
# After training the classifier, we can evaluate it on a test set. We have
# included a test set in spamTest.mat

# Load the test dataset
data = scio.loadmat('spamTest.mat')
Xtest = data['Xtest']
ytest = data['ytest'].flatten()

print('Xtest.shape: ', Xtest.shape, '\nytest.shape: ', ytest.shape)
print('Evaluating the trained linear SVM on a test set ...')

p = clf.predict(Xtest)
print('Test Accuracy: {}'.format(np.mean(p == ytest) * 100))
Xtest.shape: (1000, 1899)
ytest.shape: (1000,)
Evaluating the trained linear SVM on a test set ...
Test Accuracy: 98.9
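Accuracy alone does not show whether the errors are missed spam or legitimate mail flagged as spam. As a possible complement, the sketch below computes precision and recall alongside accuracy. It assumes label 1 means spam and 0 means non-spam; the function name spam_metrics and the toy labels are hypothetical.

```python
import numpy as np


def spam_metrics(y_true, y_pred):
    """Accuracy, precision and recall, treating label 1 as spam."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # spam correctly caught
    fp = np.sum((y_pred == 1) & (y_true == 0))  # non-spam flagged as spam
    fn = np.sum((y_pred == 0) & (y_true == 1))  # spam that was missed
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return accuracy, precision, recall


# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = spam_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

With the real data this could be called as spam_metrics(ytest, p) after the prediction step.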
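2.5 Which words most strongly indicate spam
Because the trained model is a linear SVM, we can inspect the learned weights w to better understand how it decides whether an email is spam. The code below finds the words with the largest weights in the classifier. Informally, the classifier "thinks" these words are the most likely indicators of spam.
vocab_list = pe.get_vocab_list()
# Column positions sorted by weight, largest first
indices = np.argsort(clf.coef_).flatten()[::-1]
print(indices)

for i in range(15):
    # vocab.txt numbers words from 1, while column positions start at 0
    print('{} ({:0.6f})'.format(vocab_list[indices[i] + 1], clf.coef_.flatten()[indices[i]]))
The output below was captured with the lookup vocab_list[indices[i]], that is, without the + 1 shift, so each printed word is the vocabulary entry just before the word that actually carries the weight. The weights themselves are unaffected.
[1190 297 1397 ... 1764 1665 1560]
otherwis (0.500614)
clearli (0.465916)
remot (0.422869)
gt (0.383622)
visa (0.367710)
base (0.345064)
doesn (0.323632)
wife (0.269724)
previous (0.267298)
player (0.261169)
mortgag (0.257298)
natur (0.253941)
ll (0.253467)
futur (0.248297)
hot (0.246404)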
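Since the vocabulary dictionary is keyed from 1 while NumPy column positions start at 0, the lookup from a weight position to a word needs a shift of one. The toy sketch below illustrates the mapping with a hypothetical four-word vocabulary and a made-up weight row shaped like clf.coef_; the helper top_words is not part of the exercise code.

```python
import numpy as np

# Hypothetical 1-based vocabulary and weight row, shaped like clf.coef_
vocab = {1: 'aa', 2: 'click', 3: 'our', 4: 'zip'}
coef = np.array([[-0.1, 0.47, 0.50, 0.02]])


def top_words(coef, vocab, k):
    """Return the k (word, weight) pairs with the largest weights."""
    weights = coef.flatten()
    order = np.argsort(weights)[::-1][:k]  # 0-based column positions
    # Shift by one to reach the 1-based vocabulary key
    return [(vocab[int(j) + 1], float(weights[j])) for j in order]


print(top_words(coef, vocab, 2))
# [('our', 0.5), ('click', 0.47)]
```

Dropping the + 1 here would report 'click' and 'aa' instead, which is the same kind of shift described above.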