Spark Shell With Python
2016-06-15 17:39
Setting up the Spark environment:
1) Download Spark: http://spark.apache.org/downloads.html
2) Extract it and change into the spark-1.6.1-bin-hadoop2.4 directory.
Launch the Python Spark shell:
[root@Master spark-1.6.1-bin-hadoop2.4]# ./bin/pyspark
Read a file into an RDD (the shell starts with a SparkContext already available as sc):
>>> textFile = sc.textFile("README.md")
Inspect the RDD with a few simple actions:
>>> textFile.count() # Number of items in this RDD
126
>>> textFile.first() # First item in this RDD
u'# Apache Spark'
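take() is handy for peeking at the first few elements; a minimal sketch (the exact lines depend on your copy of README.md):
>>> textFile.take(3)  # First three items in this RDD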
Use filter to select a subset of the lines:
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
>>> textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?
15
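Transformations such as filter are lazy, so linesWithSpark is not computed until an action runs. If you plan to reuse it, you can pin it in memory; a small sketch:
>>> linesWithSpark.cache()  # Keep the filtered RDD in memory
>>> linesWithSpark.count()  # Later actions reuse the cached data
15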
Find the largest number of words in a single line:
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
15
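The lambda above is just a two-argument maximum, so Python's built-in max gives the same result; an equivalent sketch:
>>> textFile.map(lambda line: len(line.split())).reduce(max)
15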
Spark also makes it easy to express MapReduce flows:
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
>>> wordCounts.collect()
[(u'and', 9), (u'A', 1), (u'webpage', 1), (u'README', 1), (u'Note', 1), (u'"local"', 1), (u'variable', 1), ...]
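collect() ships every pair back to the driver. For a quick look at only the most frequent words, takeOrdered with a negated-count key is lighter-weight; a sketch (the exact words depend on your README.md):
>>> wordCounts.takeOrdered(5, key=lambda x: -x[1])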
You can also write a standalone application with the Python API:
"""SimpleApp.py"""
from pyspark import SparkContext
logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
Submit the application with spark-submit:
[root@Master bin]# ./spark-submit --master local[4] ../SimpleApp.py
Lines with a: 46, lines with b: 23
Congratulations on running your first Spark application!