R Language Study Notes: Mining Frequent Words in Text with the tm Package

2016-05-11 15:16

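Data:

The public DBLP dataset, http://dblp.uni-trier.de/ . DBLP holds bibliographic records for a large body of literature. Here we take the proceedings of the ICCS (International Conference on Computational Science) conference as the subject of the analysis.

[Figure: sample records from the dataset]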

Method:

Use the tm text mining package in R to find the frequent words in the titles of these proceedings.

Code and comments:

# load tm package

library(tm)

# load RODBC package to extract data from the MySQL database

library(RODBC)

# build a connection to DBLP database

channel <- odbcConnect("dblp", uid = "root", pwd = "admin")

# copy data from the "paper" table, then close the connection

paper <- sqlFetch(channel, "paper")

odbcClose(channel)

# select the subset of papers from the ICCS conference

tmsample <- subset(paper, conference == "ICCS")

# view the subset, which contains 2041 papers from the ICCS conference

View(tmsample)

# save title data for text mining

title <- as.vector(tmsample$title)

# establish corpus for title data

tm <- VCorpus(VectorSource(title))

# data cleaning: lower-case the titles and remove English stop words

tm <- tm_map(tm, content_transformer(tolower))

tm <- tm_map(tm, removeWords, stopwords("english"))

# create document term matrix

dtm <- DocumentTermMatrix(tm, control = list(removePunctuation = TRUE, stopwords = TRUE))

# a smaller sparse value keeps fewer terms; sparse = 0.98 keeps only the terms that appear in more than ( 1 - 0.98 ) = 2% of the titles and drops all the others from the document term matrix

dtm2 <- removeSparseTerms(dtm, sparse = 0.98)

# check the frequent words in dtm2 ( 35 frequent words in all )

dtm2$dimnames$Terms
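
The listing above needs the DBLP MySQL database to run. As a rough illustration of how the sparse threshold decides which terms survive, here is a minimal sketch on five made-up titles. The titles, the variable names and the threshold of 0.5 are assumptions chosen for the example and are not taken from the DBLP data.

```r
library(tm)

# Hypothetical stand-in for the ICCS titles (invented, not from DBLP)
titles <- c(
  "A Parallel Algorithm for Grid Computing",
  "Modeling of Fluid Flow",
  "An Algorithm for Sparse Matrix Optimization",
  "Parallel Modeling and Simulation",
  "Optimization of a Parallel Algorithm"
)

# Same cleaning steps as in the listing above
corpus <- VCorpus(VectorSource(titles))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE))

# Number of titles that contain each term
doc_freq <- colSums(as.matrix(dtm) > 0)
sort(doc_freq, decreasing = TRUE)

# With 5 titles, sparse = 0.5 keeps terms found in more than
# (1 - 0.5) * 5 = 2.5 titles, i.e. in at least 3 of them
dtm_small <- removeSparseTerms(dtm, sparse = 0.5)
Terms(dtm_small)

# findFreqTerms() filters on total occurrences instead of on sparsity
findFreqTerms(dtm, lowfreq = 3)
```

In this toy corpus only "algorithm" and "parallel" occur in three titles, so they are the only terms left after removeSparseTerms. With the 2041 ICCS titles and sparse = 0.98, the same rule keeps terms that occur in more than about 40 titles.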

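From the result above, we found some frequent words in the ICCS conference papers, for example algorithm, modeling and optimization, which in practical terms match the themes of the conference fairly well. However, the frequent words also include some redundant plural forms, for example model and models. I would be grateful if anyone could explain how to remove these plural forms when mining frequent words with the tm package.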
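
One common way to merge singular and plural forms is stemming, which tm exposes as stemDocument (it relies on the SnowballC package being installed). The sketch below is only one possible approach, not necessarily the best one for this dataset, and the four titles are invented for the example.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package

# Hypothetical titles containing singular and plural forms (not from DBLP)
titles <- c(
  "A Model of Heat Transfer",
  "Models for Grid Computing",
  "Network Models and Algorithms",
  "An Algorithm for Networks"
)

corpus <- VCorpus(VectorSource(titles))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Reduce every word to its stem, so "model" and "models" become one term
corpus <- tm_map(corpus, stemDocument)

dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE))
Terms(dtm)

# Total occurrences of each stem across all titles
term_counts <- colSums(as.matrix(dtm))
sort(term_counts, decreasing = TRUE)
```

The trade-off is that the terms in the matrix are now stems, not dictionary words, so "computing" shows up as "comput". tm also provides stemCompletion, which can map stems back to full words from a dictionary, if readable terms are needed in the final list.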
