
Spark Shell With Python

2016-06-15 17:39
Setting up the Spark environment

   1) Download Spark: http://spark.apache.org/downloads.html

   2) cd into spark-1.6.1-bin-hadoop2.4, which is the working directory for the commands below


Start the Python Spark shell:

[root@Master spark-1.6.1-bin-hadoop2.4]# ./bin/pyspark

Read a file to create an RDD. The shell starts with a ready-made SparkContext available as the variable sc:

>>> textFile = sc.textFile("README.md")

Print some basic information about this RDD:

>>> textFile.count() # Number of items in this RDD

126

>>> textFile.first() # First item in this RDD

u'# Apache Spark'
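Besides count() and first(), the RDD API also provides take(n), which returns the first n elements; a minimal sketch against the same RDD (output not shown):

>>> textFile.take(3) # First three items (lines) in this RDD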

Use the filter transformation to select a subset of the lines:


>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

>>> textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?

15
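The linesWithSpark RDD defined above can be reused; as a sketch of that, you can cache it in memory before running another action on it (cache() also echoes the RDD object in the shell, not shown here):

>>> linesWithSpark.cache() # Keep this subset in memory for repeated use
>>> linesWithSpark.count()

15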

Find the largest word count of any single line:

>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

15
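The lambda above simply picks the larger of two numbers, so the same result can be obtained by passing Python's built-in max to reduce:

>>> textFile.map(lambda line: len(line.split())).reduce(max)

15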

Spark can express MapReduce data flows very easily; here is a word count:

>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

>>> wordCounts.collect()

[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
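To inspect only the most frequent words instead of the whole list, one possible follow-up is takeOrdered with a key that sorts by descending count (output not shown):

>>> wordCounts.takeOrdered(10, key=lambda pair: -pair[1]) # Ten most frequent words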

Writing a standalone application with the Python API

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
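Not part of the listing above, but it is good practice to stop the SparkContext explicitly once the job is done:

sc.stop() # Release the application's resources

Run the application with spark-submit: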
[root@Master bin]# ./spark-submit --master local[4] ../SimpleApp.py

Lines with a: 46, lines with b: 23

Congratulations on running your first Spark application!