Spark: Using MLlib in Scala/Java/Python
2015-09-06 17:43
[b]Using MLlib in Scala[/b]
The following code snippets can be executed in spark-shell.
[b]Binary Classification[/b]
The following code snippet illustrates how to load a sample dataset, execute a training algorithm on this training data using a static method in the algorithm object, and make predictions with the resulting model to compute the training error.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Load and parse the data file
val data = sc.textFile("mllib/data/sample_svm_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
}

// Run training algorithm to build the model
val numIterations = 20
val model = SVMWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("Training Error = " + trainErr)
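As a minimal, Spark-free sketch of the training-error calculation above: the error is simply the fraction of (label, prediction) pairs that disagree. The function name `training_error` and the sample pairs are assumptions made for illustration; they are not part of MLlib.

```python
def training_error(label_and_preds):
    """Fraction of (label, prediction) pairs whose two values differ."""
    pairs = list(label_and_preds)
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / float(len(pairs))


# Hypothetical labels and predictions, for illustration only
pairs = [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
print("Training Error = " + str(training_error(pairs)))
```

Because this is measured on the same data the model was trained on, it is a training error, not an estimate of error on unseen data.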
By default, the SVMWithSGD.train() method performs L2 regularization with the regularization parameter set to 1.0. To configure the algorithm further, create a new SVMWithSGD object directly and call its setter methods. All other MLlib algorithms support customization in this way as well. For example, the following code produces an L1-regularized variant of SVMs with the regularization parameter set to 0.1, and runs the training algorithm for 200 iterations.
import org.apache.spark.mllib.optimization.L1Updater

val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(parsedData)
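To give a feel for why the choice between L2 and L1 regularization matters, here is a small sketch of one regularized gradient step in each style. It uses the general textbook forms (weight shrinkage for L2, soft-thresholding for L1) and is not a transcription of MLlib's updater code; the function names `l2_step` and `l1_step` and the sample numbers are assumptions.

```python
def l2_step(weights, gradient, step_size, reg_param):
    """One gradient step on loss + (reg_param / 2) * ||w||^2."""
    return [w - step_size * (g + reg_param * w)
            for w, g in zip(weights, gradient)]


def l1_step(weights, gradient, step_size, reg_param):
    """One gradient step on the loss, then soft-thresholding for reg_param * ||w||_1."""
    shrink = step_size * reg_param
    result = []
    for w, g in zip(weights, gradient):
        moved = w - step_size * g
        # Pull the weight toward zero by `shrink`, clamping at zero
        magnitude = max(abs(moved) - shrink, 0.0)
        result.append(magnitude if moved >= 0 else -magnitude)
    return result


# Hypothetical weights and loss gradient, for illustration only
weights = [0.5, -0.02, 0.0]
gradient = [0.1, 0.0, 0.0]
print(l2_step(weights, gradient, step_size=1.0, reg_param=0.1))
print(l1_step(weights, gradient, step_size=1.0, reg_param=0.1))
```

In this sketch the L2 step only scales the small weight down, while the L1 step sets it to exactly zero, which is why L1 regularization tends to produce sparse models.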
[b]Linear Regression[/b]
The following example demonstrates how to load training data and parse it as an RDD of LabeledPoint. It then uses LinearRegressionWithSGD to build a simple linear model to predict label values. We compute the Mean Squared Error at the end to evaluate goodness of fit.
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Load and parse the data
val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
}

// Build the model
val numIterations = 20
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.reduce(_ + _) / valuesAndPreds.count
println("training Mean Squared Error = " + MSE)
Similarly, you can use RidgeRegressionWithSGD and LassoWithSGD and compare their training Mean Squared Errors.
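Comparing models by training Mean Squared Error comes down to computing the same statistic for each model's predictions. The sketch below does this in plain Python; the two prediction lists stand in for the output of two hypothetical models and are invented for illustration.

```python
def mean_squared_error(values_and_preds):
    """Average of (value - prediction)^2 over all pairs."""
    pairs = list(values_and_preds)
    if not pairs:
        raise ValueError("need at least one (value, prediction) pair")
    return sum((v - p) ** 2 for v, p in pairs) / float(len(pairs))


# Hypothetical true labels and predictions from two models
labels = [1.0, 2.0, 3.0]
preds_by_model = {
    "model_a": [1.5, 2.0, 2.0],
    "model_b": [1.0, 2.5, 3.0],
}

mse_by_model = {
    name: mean_squared_error(zip(labels, preds))
    for name, preds in preds_by_model.items()
}
best = min(mse_by_model, key=mse_by_model.get)
print(mse_by_model)
print("Lowest training MSE: " + best)
```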
[b]Clustering[/b]
In the following example, after loading and parsing the data, we use the KMeans object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE). You can reduce this error measure by increasing k; in practice, the optimal k is usually the one where there is an “elbow” in the WSSSE graph.
import org.apache.spark.mllib.clustering.KMeans

// Load and parse the data
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
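The "elbow" idea can be seen on a toy dataset without Spark. The sketch below runs a plain Lloyd's-style k-means for several values of k and prints the WSSSE for each. Seeding the centers with the first k points is an assumption made only to keep the run deterministic, and the six sample points are invented for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centers = [points[i] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[closest(p, centers)].append(p)
        # New center = mean of its group; keep the old center if the group is empty
        centers = [
            tuple(sum(coord) / float(len(g)) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(squared_distance(p, centers[closest(p, centers)]) for p in points)


# Two well-separated groups of three points each (illustrative data)
points = [(0.0, 0.0), (9.0, 9.0), (0.1, 0.1), (9.1, 9.1), (0.2, 0.2), (9.2, 9.2)]
for k in range(1, 5):
    print(k, wssse(points, kmeans(points, k)))
```

On this data the WSSSE drops sharply from k = 1 to k = 2 and only marginally after that, so k = 2 is the elbow.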
[b]Collaborative Filtering[/b]
In the following example, we load rating data. Each row consists of a user, a product, and a rating. We use the default ALS.train() method, which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction.
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS (rank 1, regularization parameter 0.01)
val numIterations = 20
val model = ALS.train(ratings, 1, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map {
  case ((user, product), (r1, r2)) => math.pow((r1 - r2), 2)
}.reduce(_ + _) / ratesAndPreds.count
println("Mean Squared Error = " + MSE)
If the rating matrix is derived from another source of information (i.e., it is inferred from other signals), you can use the trainImplicit method to get better results.
val model = ALS.trainImplicit(ratings, 1, 20, 0.01)
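To illustrate what "implicit" means here, the sketch below follows the commonly cited implicit-feedback formulation, in which a raw signal such as a click count is split into a binary preference and a confidence weight. This is an assumption about the general technique, not a description of MLlib's internals; the function name, the `alpha` parameter, and the sample observations are all hypothetical, and the counts are assumed to be non-negative.

```python
def to_preference_and_confidence(raw_value, alpha=1.0):
    """Split a non-negative implicit signal into (preference, confidence)."""
    preference = 1.0 if raw_value > 0 else 0.0
    # More observations mean more confidence in the preference
    confidence = 1.0 + alpha * raw_value
    return preference, confidence


# Hypothetical (user, product, count) observations
observations = [(1, 10, 0.0), (1, 11, 3.0), (2, 10, 12.0)]
for user, product, count in observations:
    print(user, product, to_preference_and_confidence(count, alpha=0.5))
```

The point of the split is that a zero count is treated as a weakly held "no preference" rather than as an explicit low rating.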
[b]Using MLlib in Java[/b]
All of MLlib’s methods use Java-friendly types, so you can import and call them from Java the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object.
[b]Using MLlib in Python[/b]
The following examples can be tested in the PySpark shell.
[b]Binary Classification[/b]
The following example shows how to load a sample dataset, build a logistic regression model, and make predictions with the resulting model to compute the training error.
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array

# Load and parse the data
data = sc.textFile("mllib/data/sample_svm_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model
model = LogisticRegressionWithSGD.train(parsedData)

# Predict a label for each training point (element 0 is the label, the rest are features)
labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
        model.predict(point.take(range(1, point.size)))))

# Evaluate the model on training data
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))
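For intuition about what a logistic regression prediction involves, here is a minimal sketch that turns a weight vector and intercept into a 0/1 label by thresholding the logistic function at 0.5. The weights, intercept, and function names are hypothetical values chosen for illustration, not output from the model trained above.

```python
import math


def sigmoid(margin):
    """Numerically stable logistic function."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def predict_label(weights, intercept, features, threshold=0.5):
    """Return 1 if the estimated probability exceeds the threshold, else 0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1 if sigmoid(margin) > threshold else 0


# Hypothetical model parameters and feature vectors
weights, intercept = [2.0, -1.0], 0.0
print(predict_label(weights, intercept, [1.0, 0.5]))
print(predict_label(weights, intercept, [0.0, 1.0]))
```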
[b]Linear Regression[/b]
The following example demonstrates how to load training data and parse it as an RDD of NumPy arrays in which the first element is the label and the remaining elements are the features. It then uses LinearRegressionWithSGD to build a simple linear model to predict label values. We compute the Mean Squared Error at the end to evaluate goodness of fit.
from pyspark.mllib.regression import LinearRegressionWithSGD
from numpy import array

# Load and parse the data
data = sc.textFile("mllib/data/ridge-data/lpsa.data")
parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))

# Build the model
model = LinearRegressionWithSGD.train(parsedData)

# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda point: (point.item(0),
        model.predict(point.take(range(1, point.size)))))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
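The parsing step above relies on each input line having the form "label,feature feature ...". The following sketch shows that step in isolation on a made-up line; the helper name `parse_point` and the sample values are assumptions, and the sketch expects single spaces between features.

```python
def parse_point(line):
    """Split 'label,f1 f2 ...' into (label, [features])."""
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return values[0], values[1:]


# Hypothetical input line in the "label,features" layout
label, features = parse_point("1.5,0.1 0.2 -0.3")
print(label, features)
```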
[b]Clustering[/b]
In the following example, after loading and parsing the data, we use the KMeans object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE). You can reduce this error measure by increasing k; in practice, the optimal k is usually the one where there is an “elbow” in the WSSSE graph.
from pyspark.mllib.clustering import KMeans
from numpy import array

# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
        runs=30, initialization_mode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    # Squared Euclidean distance from the point to its assigned cluster center
    center = clusters.centers[clusters.predict(point)]
    return sum([x**2 for x in (point - center)])

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
As in the Scala section, the linear regression example earlier in this section can also be run with RidgeRegressionWithSGD or LassoWithSGD to compare training Mean Squared Errors.
[b]Collaborative Filtering[/b]
In the following example, we load rating data. Each row consists of a user, a product, and a rating. We use the default ALS.train() method, which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction.
from pyspark.mllib.recommendation import ALS
from numpy import array

# Load and parse the data
data = sc.textFile("mllib/data/als/test.data")
ratings = data.map(lambda line: array([float(x) for x in line.split(',')]))

# Build the recommendation model using Alternating Least Squares
model = ALS.train(ratings, 1, 20)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).reduce(lambda x, y: x + y) / ratesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
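The evaluation above hinges on joining actual and predicted ratings on the (user, product) key before computing the error. Here is an in-memory sketch of that join; `join_by_key` is a hypothetical helper that mimics an inner join on key, and the ratings and predictions are invented for illustration.

```python
def join_by_key(left, right):
    """Inner join of two lists of (key, value) pairs, done in memory."""
    right_index = {}
    for key, value in right:
        right_index.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_index.get(key, [])]


# Keys are (user, product); values are ratings (illustrative data)
ratings = [((1, 1), 5.0), ((1, 2), 1.0), ((2, 1), 4.0)]
predictions = [((1, 1), 4.5), ((2, 1), 4.0), ((3, 3), 2.0)]

rates_and_preds = join_by_key(ratings, predictions)
mse = sum((r - p) ** 2 for _, (r, p) in rates_and_preds) / float(len(rates_and_preds))
print("Mean Squared Error = " + str(mse))
```

Only keys present on both sides survive the join, so ratings without a prediction (and predictions without a rating) do not contribute to the error.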
If the rating matrix is derived from another source of information (i.e., it is inferred from other signals), you can use the trainImplicit method to get better results.
# Build the recommendation model using Alternating Least Squares based on implicit ratings
model = ALS.trainImplicit(ratings, 1, 20)