Spark Scala: a practical application of Word2Vec and the multilayer perceptron classifier to sentiment analysis

2017-07-24 11:23
/**
 * Created by lkl on 2017/7/21.
 */
// import com.ibm.spark.exercise.util.LogUtils
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, Word2Vec}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object mllib {

  // Dimensionality of the Word2Vec vectors; also the size of the MLP input layer.
  final val VECTOR_SIZE = 1000

  // def main(args: Array[String]) {
  //   if (args.length < 1) {
  //     println("Usage: SMSClassifier SMSTextFile")
  //     sys.exit(1)
  //   }
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Disabled: copy the raw file into a MySQL table over JDBC.
    // val role = "jdbc:mysql://192.168.0.37:3306/emotional?user=root&password=123456&useUnicode=true&characterEncoding=utf8&autoReconnect=true&failOverReadOnly=false"
    // import sqlContext.implicits._
    // val df = sc.textFile("hdfs://192.168.0.211:9000/user/hadoop/emotion/SMS.txt").map(line => (line.split(" ")(0), line.split(" ")(1), line.split(" ")(2), line.split(" ")(3))).toDF("id", "innserSessionid", "words", "value")
    // df.printSchema()
    // df.insertIntoJDBC(role, "SMS", true)
    val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    // Read the data source from HDFS. Each line holds four space-separated fields:
    // id, session id, the comma-separated words of a headline, and a numeric label.
    // The label was assigned by hand after reading the headline and expresses the
    // strength of the sentiment; it is one of -1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1.
    //
    // 10090 C779C882AA39436A89C463BCB406B838 涨停板,复盘,全,靠,新,股,撑,门面,万科,A,尾盘,封板 0.75
    // 10091 519A9C6AD0A845298B0B3924117C0B4F 一,行业,再现,重大,利好,板块,反弹,仍,将,继续 0.75
    // 10092 C86CEC7DB9794311AF386C3D7B0B7CBD 藁城区,3,大,项目,新,获,规划证,开发,房企,系,同,一家 0
    // 10093 FCEA2FFC1C2F4D6C808F2CBC2FF18A8C 完善,对,境外,企业,和,对外,投资,统计,监测 0.5
    // 10094 204A77847F03404986331810E039DFC2 财联社,电报 0
    // 10095 E571B9EF451F4D5F8426A1FA06CD9EE6 审计署,部分,央企,业绩,不,实 -0.5
    // 10096 605264A2F6684CC4BB4B2A0B6A8FA078 厨卫,品牌,新,媒体,榜,看看,谁家,的,官微,最,爱,卖萌 0.25

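    // Turn each line into (label, words). Fields are separated by a single space.
    val parsedRDD = sc.textFile("hdfs://192.168.0.211:9000/user/hadoop/emotion/SMS.txt").map(line => {
      val a = line.split(" ")
      if (a.length == 4) {
        (a(3), a(2).split(","))
      } else {
        // Malformed lines fall back to an empty label and a single empty token.
        ("", "".split(","))
      }
    })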

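    val msgDF = sqlCtx.createDataFrame(parsedRDD).toDF("label", "message")

    // Map the string labels to numeric indices.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(msgDF)

    // Turn each word list into a fixed-length feature vector.
    val word2Vec = new Word2Vec()
      .setInputCol("message")
      .setOutputCol("features")
      .setVectorSize(VECTOR_SIZE)
      .setMinCount(1)

    // Layer sizes: input (VECTOR_SIZE), two hidden layers, output.
    // The output size normally equals the number of label classes;
    // 200 is the value used for the run shown in this post.
    val layers = Array[Int](VECTOR_SIZE, 250, 500, 200)
    val mlpc = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(512)
      .setSeed(1234L)
      .setMaxIter(128)
      .setFeaturesCol("features")
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")

    // Convert predicted indices back to the original label strings.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)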

    // 80% of the data for training, 20% for testing.
    val Array(trainingData, testData) = msgDF.randomSplit(Array(0.8, 0.2))
    val pipeline = new Pipeline().setStages(Array(labelIndexer, word2Vec, mlpc, labelConverter))
    val model = pipeline.fit(trainingData)
    val predictionResultDF = model.transform(testData)
    // below 2 lines are for debug use
    predictionResultDF.printSchema
    predictionResultDF.select("message", "label", "predictedLabel").show(30)
    // "precision" is accepted as a metric name in Spark 1.x; Spark 2.x and later use "accuracy".
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("precision")
    val predictionAccuracy = evaluator.evaluate(predictionResultDF)
    println("Testing Accuracy is %2.4f".format(predictionAccuracy * 100) + "%")
    // sc.stop

  }
}
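
The parsing step in the code above keeps malformed lines as an empty label, which then becomes an extra class for the classifier. As a minimal sketch in plain Scala (no Spark needed), the variant below returns `None` for such lines so they could be dropped, for example with `flatMap` on the RDD. The names `SmsLineParser`, `validScores` and `parse` are hypothetical and only serve the illustration; the set of allowed scores is taken from the data-format comment in the code.

```scala
object SmsLineParser {
  // Allowed sentiment scores, as listed in the data-format comment (assumption:
  // the file may spell them as "0.5" or "0.50", so they are compared as numbers).
  val validScores: Set[Double] =
    Set(-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0)

  /** Parses "id sessionId word1,word2,... label" into (label, words); None if malformed. */
  def parse(line: String): Option[(String, Seq[String])] =
    line.trim.split(" ") match {
      case Array(_, _, words, label)
          if scala.util.Try(label.toDouble).toOption.exists(validScores.contains) =>
        Some((label, words.split(",").toList))
      case _ =>
        None
    }

  def main(args: Array[String]): Unit = {
    // Sample line taken from the data-format comment.
    println(parse("10094 204A77847F03404986331810E039DFC2 财联社,电报 0"))
    // A line without four fields is rejected.
    println(parse("broken line"))
  }
}
```

With this helper, `sc.textFile(...).flatMap(line => SmsLineParser.parse(line))` would yield only well-formed `(label, words)` pairs, assuming the helper is available on the executors.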


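The code sets the layers to `VECTOR_SIZE, 250, 500, 200`, while the data has only nine label values. In Spark's multilayer perceptron the last layer has one unit per class, so the output size is usually derived from the labels rather than fixed by hand. The sketch below shows one way to do this in plain Scala. `MlpLayers` and `build` are hypothetical names, and in the pipeline the class count would presumably come from `labelIndexer.labels.length`.

```scala
object MlpLayers {
  /** Builds the layer-size array: input size, hidden sizes, then one output unit per class. */
  def build(inputSize: Int, hidden: Seq[Int], numClasses: Int): Array[Int] = {
    require(inputSize > 0, "input size must be positive")
    require(hidden.forall(_ > 0), "hidden layer sizes must be positive")
    require(numClasses > 1, "a classifier needs at least two classes")
    (Seq(inputSize) ++ hidden ++ Seq(numClasses)).toArray
  }

  def main(args: Array[String]): Unit = {
    // The nine label values named in the data-format comment.
    val labels = Seq("-1", "-0.75", "-0.5", "-0.25", "0", "0.25", "0.5", "0.75", "1")
    val layers = build(inputSize = 1000, hidden = Seq(250, 500), numClasses = labels.size)
    println(layers.mkString(", ")) // prints "1000, 250, 500, 9"
  }
}
```

The hidden sizes here are simply the ones from the original code. Whether they suit this data set is a separate question that the sketch does not answer.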

The results are as follows:

+--------------------+-----+--------------+
|             message|label|predictedLabel|
+--------------------+-----+--------------+
|[价格会, 一飞, 冲天, 神秘,...|  0.5|           0.5|
|[审计署, 部分, 央企, 业绩,...| -0.5|           0.5|
|[广电, 总局, 新浪, 微博, ...| -0.5|           0.5|
|[叶檀, 若, 粤, 港澳湾区, ...| 0.25|           0.5|
|      [万达, 崩, 万科, 起]|    0|           0.5|
|[外汇, 小白, 必, 看, 视频...| 0.25|           0.5|
|[乐视, 回, 应发, 不, 出,...|-0.75|           0.5|
|[万达, 电影, 高开, 1.69...|  0.5|           0.5|
|[万科, A, 股, 6月, 23...| 0.75|           0.5|
|[金价, 周一, 反弹, 扭转, ...|  0.5|           0.5|
|[收评, 两, 市, 震荡, 沪指...| 0.25|           0.5|
|[点睛, 军工, 混改, 加速, ...|  0.5|           0.5|
|[棉花, 日报, 棉花, 短期, ...| 0.25|           0.5|
|[探秘, 巴铁, 试验线, 部分,...|-0.75|           0.5|
|[万达, 复星, 股价, 暴跌, ...|-0.75|           0.5|
|[油价, 迎, 年内, 最, 大,...|-0.25|           0.5|
|[2017年, IPO, 被, 否...|-0.75|           0.5|
|[股, 转, 监事长, 邓映翎, ...| -0.5|           0.5|
|[发改委, 国内, 汽, 柴油, ...|-0.25|           0.5|
|[周报, 明晟, MSCI, 宣布...|  0.5|           0.5|
|[夏季, 达沃斯, 共识, 中国,...|  0.5|           0.5|
|[重磅, 又, 一, 家, 公司,...|-0.75|           0.5|
|[麦格里, 重磅, 警告, OPE...| -0.5|           0.5|
|[韩国, 娱乐, 公司, TO-W...|  0.5|           0.5|
|       [新, 三, 板, 周报]|    0|           0.5|
|[分享, 华尔街, 对, 美国, ...|  0.5|           0.5|
|[盛和, 资源, 2015年, 公...|    0|           0.5|
|[交易, 实况, 黄金, 两, 连...| -0.5|           0.5|
|[徽商, 银行, 内斗戏, 第二,...| -0.5|           0.5|
|[2017, 夏季, 达沃斯, 论...| 0.25|           0.5|
+--------------------+-----+--------------+
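
In the 30 rows printed above, every `predictedLabel` is 0.5, which suggests the model may be predicting a single class. A quick way to check is to count how the predictions are distributed and how many match the label. The sketch below does this in plain Scala on the label column copied from those 30 rows. `PredictionCheck` and its members are hypothetical names, and the figures it prints describe only these rows, not the whole test set.

```scala
object PredictionCheck {
  // Labels copied in order from the 30 printed rows; each was predicted as "0.5".
  val labels: Seq[String] = Seq(
    "0.5", "-0.5", "-0.5", "0.25", "0", "0.25", "-0.75", "0.5", "0.75", "0.5",
    "0.25", "0.5", "0.25", "-0.75", "-0.75", "-0.25", "-0.75", "-0.5", "-0.25", "0.5",
    "0.5", "-0.75", "-0.5", "0.5", "0", "0.5", "0", "-0.5", "-0.5", "0.25")

  // (label, predictedLabel) pairs.
  val rows: Seq[(String, String)] = labels.map(label => (label, "0.5"))

  /** Number of rows whose prediction equals the label. */
  def correct(rs: Seq[(String, String)]): Int =
    rs.count { case (label, predicted) => label == predicted }

  /** How often each predicted label occurs. */
  def predictedCounts(rs: Seq[(String, String)]): Map[String, Int] =
    rs.groupBy(_._2).map { case (predicted, group) => (predicted, group.size) }

  def main(args: Array[String]): Unit = {
    println(predictedCounts(rows))               // a single entry: 0.5 -> 30
    println(s"${correct(rows)} of ${rows.size}") // prints "8 of 30"
  }
}
```

On the full test set, the same distribution could be inspected directly in Spark with `predictionResultDF.groupBy("predictedLabel").count().show()`.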