
Extracting TF-IDF Features from the 20 Newsgroups Data

2017-05-16 16:20
To practice feature extraction, we will use the well-known 20 Newsgroups dataset, which is commonly used for text classification.

1. Exploring the data

First, look at the directory structure and load the raw documents:

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "TF-IDF")
val path = "data/20news-bydate-train/*"
// wholeTextFiles yields (fileName, fileContent) pairs
val rdd = sc.wholeTextFiles(path)
val text = rdd.map { case (file, text) => text }
println(text.count())


2. Applying basic tokenization

Split each document's raw content into a collection of words; the simplest approach is to split on whitespace. Even for a relatively small text collection, the number of distinct words (that is, the dimensionality of the feature vectors) can be very high. The code below first counts how many documents each newsgroup contains:

// the newsgroup name is the second-to-last component of each file path
val newsgroups = rdd.map { case (file, text) => file.split("/").takeRight(2).head }
// count documents per newsgroup, sorted by descending count
val countByGroup = newsgroups.map(n => (n, 1)).reduceByKey(_ + _).collect.sortBy(-_._2).mkString("\n")
println(countByGroup)
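The whitespace split itself can be sketched without Spark on a toy sentence (illustration only; on the RDD it becomes `text.flatMap(t => t.split(" "))`):

```scala
// naive whitespace tokenization of a single toy document
val doc = "The quick brown fox jumps over the lazy dog"
val whiteSpaceSplit = doc.split(" ").toSeq
println(whiteSpaceSplit.size)  // 9 tokens
```

Note that this naive split keeps case and punctuation, which is exactly what the following sections clean up.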


3. Improving the tokenization

The output above contains many tokens that are not words (punctuation, numbers, and so on). We can remove them with a regular expression:

// split on non-word characters and lowercase everything
val nonWordSplit = text.flatMap(t => t.split("""\W+""").map(_.toLowerCase))
// keep only tokens that contain no digits
val regex = """[^0-9]*""".r
val filterNumbers = nonWordSplit.filter(token => regex.pattern.matcher(token).matches())
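To see what the `[^0-9]*` pattern keeps and drops, here is a small standalone check (no Spark required; `isWordLike` is a hypothetical helper name for illustration):

```scala
// the pattern matches only tokens containing no digit characters
val digitFree = """[^0-9]*""".r
def isWordLike(token: String): Boolean =
  digitFree.pattern.matcher(token).matches()

println(isWordLike("apple"))     // letters only: kept
println(isWordLike("year2017"))  // contains digits: dropped
```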


4. Removing stop words

// count each token and inspect the 20 most frequent ones
val tokenCounts = filterNumbers.map(t => (t, 1)).reduceByKey(_ + _)
val oreringDesc = Ordering.by[(String, Int), Int](_._2)
println(tokenCounts.top(20)(oreringDesc).mkString("\n"))

val stopWords = Set("the","a","an","of","in","or","for","by","on","but","is","not","with",
"as","was","if","they","are","this","that","and","it","have","from","at","my","be","to")
// drop the stop words
val tokenCountsFilteredStopWords = tokenCounts.filter { case (k, v) => !stopWords.contains(k) }
println(tokenCountsFilteredStopWords.top(20)(oreringDesc).mkString("\n"))

// drop single-character tokens
val tokenCountsFilteredSize = tokenCountsFilteredStopWords.filter { case (k, v) => k.size >= 2 }
println(tokenCountsFilteredSize.top(20)(oreringDesc).mkString("\n"))
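The stop-word and length filters can be verified on a plain Scala collection before applying them to the RDD (toy counts, illustration only):

```scala
val demoStopWords = Set("the", "a", "of")
val demoCounts = Seq(("the", 10), ("spark", 4), ("x", 3), ("tfidf", 2))

// drop stop words, then drop single-character tokens
val demoFiltered = demoCounts
  .filter { case (k, _) => !demoStopWords.contains(k) }
  .filter { case (k, _) => k.size >= 2 }

println(demoFiltered.mkString(", "))  // (spark,4), (tfidf,2)
```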


5. Removing terms based on frequency

Very low-frequency words should also be removed: there is not enough training data for them, so they carry little value for the model.


// collect terms that occur fewer than twice across the whole corpus
val rareTokens = tokenCounts.filter { case (k, v) => v < 2 }.map { case (k, v) => k }.collect.toSet
val tokenCountsFilteredAll = tokenCountsFilteredSize.filter { case (k, v) => !rareTokens.contains(k) }
println(tokenCountsFilteredAll.top(20)(oreringDesc).mkString("\n"))


6. Stemming

Stemming can be done with standard NLP tooling, for example NLTK, OpenNLP, or Lucene.
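As a purely illustrative sketch of what stemming does (a deliberately naive suffix-stripper, nothing like the Porter algorithm those libraries implement):

```scala
// strip a few common English suffixes; real stemmers are far more careful
def naiveStem(word: String): String = {
  val suffixes = Seq("ing", "ed", "es", "s")
  suffixes.find(s => word.endsWith(s) && word.length > s.length + 2) match {
    case Some(s) => word.dropRight(s.length)
    case None    => word
  }
}

println(naiveStem("jumping"))  // jump
println(naiveStem("dogs"))     // dog
```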

7. Training a TF-IDF model
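The code below calls a `tokenize` helper that is not defined in the post; a sketch combining the cleanup steps from sections 2–5 (it assumes the `stopWords` and `rareTokens` sets built earlier are in scope):

```scala
// one tokenizer combining: non-word split, lowercasing, digit filter,
// stop-word removal, rare-token removal, and a minimum length of 2
def tokenize(line: String): Seq[String] = {
  line.split("""\W+""")
    .map(_.toLowerCase)
    .filter(_.matches("[^0-9]*"))
    .filterNot(stopWords.contains)
    .filterNot(rareTokens.contains)
    .filter(_.size >= 2)
    .toSeq
}
```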

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.SparseVector

val tokens = text.map(doc => tokenize(doc))
// use 2^18 hash buckets as the feature-vector dimension
val dim = math.pow(2, 18).toInt
val hashingTF = new HashingTF(dim)
val tf = hashingTF.transform(tokens)
tf.cache()

val v = tf.first.asInstanceOf[SparseVector]
println(v.size)
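What HashingTF does under the hood is the hashing trick: each token indexes into a fixed-size array via a non-negative hash modulo. A Spark-free sketch (this mirrors the idea, not MLlib's exact hash function):

```scala
// map tokens into a fixed-dimension term-frequency vector by hashing
val hashDim = 16
def bucket(token: String): Int = {
  val raw = token.hashCode % hashDim
  if (raw < 0) raw + hashDim else raw  // hashCode may be negative
}

val vec = new Array[Int](hashDim)
Seq("spark", "tfidf", "spark").foreach(t => vec(bucket(t)) += 1)
println(vec.sum)  // 3: every token landed in some bucket
```

Collisions (two tokens sharing a bucket) are the price of the fixed dimension; a large dimension such as 2^18 makes them rare.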


8. Analyzing the TF-IDF weightings
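In Spark MLlib this step fits an `IDF` model on `tf` (`new IDF().fit(tf)`) and transforms it into TF-IDF vectors. The underlying arithmetic can be sketched on a toy corpus without Spark (assuming MLlib's smoothed formula idf = log((m+1)/(df+1)); the numbers here are made up for illustration):

```scala
import scala.math.log

// toy corpus: term-frequency maps for three documents
val corpus = Seq(
  Map("spark" -> 2, "tfidf" -> 1),
  Map("spark" -> 1),
  Map("news"  -> 3)
)
val m = corpus.size  // number of documents

// document frequency: in how many documents each term appears
val df = corpus.flatMap(_.keys).groupBy(identity).map { case (t, ts) => (t, ts.size) }

// smoothed inverse document frequency
def idf(term: String): Double = log((m + 1.0) / (df(term) + 1.0))

// tf-idf weight of "spark" in the first document
val weight = corpus.head("spark") * idf("spark")
println(weight)
```

"news" occurs in only one document, so idf("news") = log(2) is larger than idf("spark") = log(4/3): rarer terms are boosted, which is exactly the effect the weights analysis looks for.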