
Spark Shell With Python

2016-06-15 17:39
Setting up the Spark environment

   1) Download Spark: http://spark.apache.org/downloads.html

   2) cd into spark-1.6.1-bin-hadoop2.4, which is the working directory for the commands below


Start the Python Spark shell:

[root@Master spark-1.6.1-bin-hadoop2.4]# ./bin/pyspark

Read a file to create an RDD. The shell starts with a ready-made SparkContext available as the variable sc:

>>> textFile = sc.textFile("README.md")

Print some basic information about this RDD:

>>> textFile.count() # Number of items in this RDD

126

>>> textFile.first() # First item in this RDD

u'# Apache Spark'
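Besides count() and first(), the RDD API also provides take(n), which returns the first n elements; a minimal sketch against the same RDD (output not shown):

>>> textFile.take(3) # First three items (lines) in this RDD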

Use the filter transformation to select a subset of the lines:


>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

>>> textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?

15
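The linesWithSpark RDD defined above can be reused; as a sketch of that, you can cache it in memory before running another action on it (cache() also echoes the RDD object in the shell, not shown here):

>>> linesWithSpark.cache() # Keep this subset in memory for repeated use
>>> linesWithSpark.count()

15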

Find the largest word count of any single line:

>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

15
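The lambda above simply picks the larger of two numbers, so the same result can be obtained by passing Python's built-in max to reduce:

>>> textFile.map(lambda line: len(line.split())).reduce(max)

15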

Spark can express MapReduce data flows very easily; here is a word count:

>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

>>> wordCounts.collect()

[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
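To inspect only the most frequent words instead of the whole list, one possible follow-up is takeOrdered with a key that sorts by descending count (output not shown):

>>> wordCounts.takeOrdered(10, key=lambda pair: -pair[1]) # Ten most frequent words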

Writing a standalone application with the Python API

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
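Not part of the listing above, but it is good practice to stop the SparkContext explicitly once the job is done:

sc.stop() # Release the application's resources

Run the application with spark-submit: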
[root@Master bin]# ./spark-submit --master local[4] ../SimpleApp.py

Lines with a: 46, lines with b: 23

Congratulations on running your first Spark application!