TF-IDF, Logistic Regression, and SVM on Spark
2017-02-22 17:08
1. TF-IDF
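For reference, one common way to write the weighting discussed in this section is the smoothed form given in the Spark MLlib documentation. Here t is a term, d a document, D the corpus, and |D| the total number of documents; the post itself does not say which IDF variant it has in mind, so treat this as an assumption rather than the author's exact formula.

```latex
\[
  \mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1},
  \qquad
  \mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)
\]
```

The +1 terms avoid division by zero for terms that appear in no document. A term that appears in every document gets IDF = log 1 = 0, so its TF-IDF weight is zero regardless of how often it occurs.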
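The main idea behind IDF is this: the fewer documents contain a term t (that is, the smaller n is), the larger the IDF, which indicates that t is good at distinguishing between categories. Suppose m documents in some class C contain t, and k documents in the other classes contain t. The total number of documents containing t is then n = m + k. When m is large, n is also large, so the IDF formula yields a small value, suggesting that t has weak discriminating power. In practice, however, a term that appears frequently in the documents of one class represents that class well. Such a term should receive a higher weight and be selected as a feature word that separates this class from the others. This is the shortcoming of IDF.

Within a given document, term frequency (TF) is how often a given term appears in that document. This number is often normalized by the document's term count so that it does not favor long documents. (The same term may have a higher raw count in a long document than in a short one, regardless of how important it is.)

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect how important a term is to a document in a corpus. Let t denote a term, d a document, and D the corpus. In the Spark definition, the term frequency TF(t, d) is the number of times t appears in d, and the document frequency DF(t, D) is the number of documents that contain t. If we measure importance by term frequency alone, we tend to over-emphasize terms that appear very often but carry little information, such as "a", "the", and "of". A term that appears very frequently across the corpus carries no special information about any particular document. Inverse document frequency is a numerical measure of how much information a term carries, and TF-IDF expresses how strongly a term is associated with a particular document. Once the term-frequency vectors are built, you can use IDF to compute the inverse document frequencies and multiply them by the term frequencies to obtain TF-IDF.

Example:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Assumes an existing SparkSession named `spark` (as in spark-shell).
val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (0, "I wish Java could use case classes"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash the words into term-frequency vectors with 20 features
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

// Fit the IDF model and rescale the term frequencies
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").take(3).foreach(println)

2. Logistic regression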
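The example in this section turns each review into a fixed-length vector with HashingTF before training. The idea behind that step (the "hashing trick") can be sketched in plain Scala without Spark. This is a simplified illustration only: `HashingSketch`, `bucket`, and `termFrequencies` are hypothetical names, and the sketch uses Scala's built-in `hashCode`, which is not necessarily the hash function Spark's HashingTF uses.

```scala
// Simplified illustration of feature hashing; NOT Spark's HashingTF implementation.
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures) using a non-negative modulo.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many terms fall into each bucket; buckets that are absent
  // from the map are implicitly zero (a sparse term-frequency vector).
  def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms
      .groupBy(t => bucket(t, numFeatures))
      .map { case (index, ts) => index -> ts.size.toDouble }

  def main(args: Array[String]): Unit = {
    val words = "spark is fast and spark is simple".split(" ").toSeq
    val tf = termFrequencies(words, 10000)
    // Print (bucket index, count) pairs in index order.
    println(tf.toSeq.sortBy(_._1).mkString(", "))
  }
}
```

Because distinct words can hash to the same bucket, a small number of features causes collisions; a larger value such as the 10000 used in the example below makes them less likely at the cost of longer (but still sparse) vectors.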
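Sentiment classification example:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes an existing SparkContext named `sc` (as in spark-shell).
val negative = sc.textFile("lvyou_comment_negitive.txt")
val normal = sc.textFile("lvyou_comment_passive.txt")

// Create a HashingTF instance that maps review text to vectors of 10000 features
val tf = new HashingTF(numFeatures = 10000)
// Split each review into words; each word is mapped to one feature
val negativeFeatures = negative.map(comment => tf.transform(comment.split(" ")))
val normalFeatures = normal.map(comment => tf.transform(comment.split(" ")))

// Build the LabeledPoint datasets: negative reviews are the target class
// (label 1), normal reviews get label 0
val positiveExamples = negativeFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // logistic regression is iterative, so cache the training data

// Run logistic regression using SGD
// (LogisticRegressionWithSGD is deprecated as of Spark 2.0)
val model = new LogisticRegressionWithSGD().run(trainingData)

// Test on one example of each class
val posTest = tf.transform("0 M G GET cheap stuff...".split(" "))
val negTest = tf.transform("".split(" "))
model.predict(posTest)
model.predict(negTest)

3. SVM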
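As background for the program in this section, the linear SVM objective can be written as follows. This is the standard textbook form with L2 regularization, using labels y_i in {-1, +1} in the formulas (the Spark API itself takes labels 0 and 1); the symbols w, λ, and n are introduced here for illustration and do not appear in the original post.

```latex
\[
  \min_{w} \; \frac{\lambda}{2} \lVert w \rVert_2^2
  + \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \, w^{\top} x_i \bigr)
\]
```

The trained model scores a new point x by the margin w^T x. By default the margin is compared against a threshold to produce a 0/1 prediction; the program below calls `clearThreshold()` so that `predict` returns the raw margin instead, which is what the ROC evaluation needs.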
package classification

import com.huaban.analysis.jieba.JiebaSegmenter
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import scala.collection.JavaConversions._

object SVMWithSGDForComment {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SVMWithSGDExample")
    val sc = new SparkContext(conf)
    if (args.length < 4) {
      println("Please input 4 args: datafile numIterations train_percent(0.6) model_file!")
      System.exit(1)
    }
    val datafile = args.head.toString
    val numIterations = Integer.parseInt(args(1))
    val train_percent = args(2).toDouble
    val test_percent = 1.0 - train_percent
    val model_file = args(3) // parsed but not used below

    // Data preprocessing
    // Load the data into Spark as an RDD
    val originData = sc.textFile(datafile)
    // Remove duplicate lines
    val originDistinctData = originData.distinct()
    // Split each line on tabs and keep only rows with more than 2 fields
    val rateDocument = originDistinctData.map(line => line.split('\t')).filter(line => line.length > 2)
    // A 5-star rating is clearly positive. Because people rate differently,
    // we cannot tell whether 4-star and 3-star reviews are positive or negative.
    // Ratings below 3 stars are negative.
    val fiveRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("5"))
    System.out.println("************************5 score num:" + fiveRateDocument.count())
    val fourRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("4"))
    val threeRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("3"))
    val twoRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("2"))
    val oneRateDocument = rateDocument.filter(arrline => arrline(0).equalsIgnoreCase("1"))

    // Merge the negative samples: 1-star and 2-star reviews
    val negRateDocument = oneRateDocument.union(twoRateDocument)
    // Note: repartition returns a new RDD; the result is discarded here,
    // so this call has no effect on negRateDocument.
    negRateDocument.repartition(1)

    // Build the training dataset: take as many positive samples as there are negative ones
    val posRateDocument = sc.parallelize(fiveRateDocument.take(negRateDocument.count().toInt)).repartition(1)
    val allRateDocument = negRateDocument.union(posRateDocument)
    allRateDocument.repartition(1) // result discarded, as above
    val rate = allRateDocument.map(s => ReduceRate(s(0)))
    val document = allRateDocument.map(s => s(1))

    // Vector representation and feature extraction: segment each review into words
    val words = document.map(sentence => cut_for_calc(sentence)).map(line => line.split("/").toSeq)
    // Builds a space-joined line per review but never uses it,
    // so this block does not affect the results.
    words.foreach(seq => {
      val arr = seq.toList
      val line = new StringBuilder
      arr.foreach(item => {
        line ++= (item + ' ')
      })
    })
    // Compute the term-frequency matrix
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(words)
    tf.cache()
    // Compute the TF-IDF matrix
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)
    // Build the training and test sets
    val zipped = rate.zip(tfidf)
    val data = zipped.map(tuple => LabeledPoint(tuple._1, tuple._2))
    val splits = data.randomSplit(Array(train_percent, test_percent), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)
    val model = SVMWithSGD.train(training, numIterations)
    // Clear the threshold so that predict returns raw scores
    model.clearThreshold()
    // Compute raw scores on the test set.
    val topicsArray = new mutable.MutableList[String]
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    scoreAndLabels.coalesce(1).saveAsTextFile("file:///data/1/usr/local/services/spark/helh/comment_test_predic/")
    // Get evaluation metrics.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()
    println("Area under ROC = " + auROC)
  }

  // Map a star rating to a binary label: 5 stars -> 1, anything else -> 0
  def ReduceRate(rate_str: String): Int = {
    if (rate_str.toInt > 4) 1 else 0
  }

  // Segment a sentence with jieba and join the words with "/"
  def cut_for_calc(str: String): String = {
    val jieba = new JiebaSegmenter()
    val lword_info = jieba.process(str, SegMode.SEARCH)
    lword_info.map(item => item.word).mkString("/")
  }
}
// scalastyle:on println
class SVMWithSGDForComment
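The program above reports the area under the ROC curve computed from (score, label) pairs. To make that number easier to interpret, here is a minimal plain-Scala sketch of one equivalent way to read it: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. `AucSketch` and `auc` are hypothetical names, the sample data is invented for illustration, and this is not how Spark's BinaryClassificationMetrics computes the value internally.

```scala
// Pairwise interpretation of ROC AUC; O(P * N), for illustration only.
object AucSketch {
  // scoreAndLabels: (raw score, label) pairs, with label 1.0 for positive
  // and 0.0 for negative, mirroring the pairs built in the program above.
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    // Score each (positive, negative) pair: 1 if ranked correctly, 0.5 on a tie.
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size.toLong * neg.size)
  }

  def main(args: Array[String]): Unit = {
    // Invented sample: two positives and two negatives.
    val sample = Seq((0.9, 1.0), (0.4, 1.0), (0.6, 0.0), (-0.3, 0.0))
    println("AUC = " + auc(sample)) // 3 of the 4 pairs are ranked correctly: 0.75
  }
}
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes, which is why the raw scores (rather than thresholded 0/1 predictions) are passed to the evaluation.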