
Spark English-Chinese Parallel Translation (PySpark Quick Start for Beginners): Guide and Tutorial (Python Version), 20161115

2016-11-15 13:26, 1381 views

[Source: http://spark.apache.org/docs/latest/quick-start.html]

[Translator: 李文, 清华园]

Quick Start

Interactive Analysis with the Spark Shell
  Basics
  More on RDD Operations
  Caching
Self-Contained Applications
Where to Go from Here

This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python. See the programming guide for a more complete reference.

To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won't be using HDFS, you can download a package for any version of Hadoop.

Interactive Analysis with the Spark Shell

Basics

Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:

./bin/pyspark

Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's make a new RDD from the text of the README file in the Spark source directory:

>>> textFile = sc.textFile("README.md")

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let's start with a few actions:

>>> textFile.count()  # Number of items in this RDD
>>> textFile.first()  # First item in this RDD
u'# Apache Spark'

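As a rough local analogue (plain Python, no Spark required), the RDD returned by `textFile` can be pictured as a collection with one item per line of the file, so `count()` corresponds to the number of lines and `first()` to the first line. The file contents and the names `sample_text` and `sample_path` are invented for this sketch.

```python
import os
import tempfile

# Hypothetical stand-in for README.md; the contents are invented for this sketch.
sample_text = "# Apache Spark\n\nSpark is a fast and general cluster computing system.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "README.md")
    with open(sample_path, "w") as handle:
        handle.write(sample_text)

    # One item per line, mirroring how textFile splits its input.
    with open(sample_path) as handle:
        lines = handle.read().splitlines()

line_count = len(lines)  # analogue of textFile.count()
first_line = lines[0]    # analogue of textFile.first()
```

Note that the empty second line counts as an item too, which is why the count here is the number of lines rather than the number of non-empty lines.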
Now let's use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.

>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

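To see what the predicate keeps, here is a small plain-Python sketch that applies the same test to an in-memory list. The sample lines and the name `contains_spark` are invented for illustration; a list comprehension stands in for the filter transformation.

```python
def contains_spark(line):
    """Named equivalent of `lambda line: "Spark" in line`."""
    return "Spark" in line

# Invented sample lines standing in for the README contents.
sample_lines = ["# Apache Spark", "", "Spark is a fast engine", "See the online docs"]

# The comprehension builds a new collection and leaves the input untouched,
# much as filter returns a new RDD instead of modifying textFile.
lines_with_spark = [line for line in sample_lines if contains_spark(line)]
```

In the pyspark shell the same named function could be passed as `textFile.filter(contains_spark)`.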
We can chain together transformations and actions:

>>> textFile.filter(lambda line: "Spark" in line).count()  # How many lines contain "Spark"?
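Spark evaluates transformations lazily: defining a filter only records what to do, and nothing is computed until an action such as `count()` runs. The generator below is only a loose plain-Python analogy of that behaviour, not how Spark is implemented; the sample lines and the `evaluated` log are invented for the sketch.

```python
evaluated = []  # records which lines the predicate has actually inspected

def contains_spark(line):
    evaluated.append(line)
    return "Spark" in line

# Invented sample lines standing in for the README contents.
sample_lines = ["# Apache Spark", "Building Spark", "See the docs"]

# "Transformation": creating the generator does not call the predicate yet.
pipeline = (line for line in sample_lines if contains_spark(line))
seen_before_action = len(evaluated)

# "Action": consuming the pipeline forces every line to be checked.
matches = sum(1 for _ in pipeline)
seen_after_action = len(evaluated)
```

Here `seen_before_action` stays at zero, and the predicate only runs once the count is requested.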

More on RDD Operations

RDD actions and transformations can be used for more complex computations. Let's say we want to find the line with the most words:

>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

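The map and reduce steps can be tried on a small in-memory list with Python's own `functools.reduce`. This is a local sketch with invented sample lines; it folds the values left to right, whereas Spark combines partial results across partitions, which is why the function passed to an RDD's `reduce` should be commutative and associative.

```python
from functools import reduce

# Invented sample lines standing in for the README contents.
sample_lines = ["# Apache Spark", "Spark is a fast and general engine", "See the docs"]

# map step: one integer (the word count) per line.
words_per_line = [len(line.split()) for line in sample_lines]

# reduce step: keep the larger of each pair until a single value remains.
most_words = reduce(lambda a, b: a if (a > b) else b, words_per_line)
```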
This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest per-line word count. The arguments to map and reduce are Python anonymous functions (lambdas), but we can also pass any top-level Python function we want. For example, we'll define a max function to make this code easier to understand:

>>> def max(a, b):
...     if a > b:
...         return a
...     else:
...         return b
...
>>> textFile.map(lambda line: len(line.split())).reduce(max)

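One thing worth keeping in mind: a function named `max` defined in the shell replaces Python's built-in `max` for the rest of that session. A possible alternative, sketched locally below, is to give the helper another name (`larger` is a hypothetical choice) or to rely on the built-in, which already accepts two arguments. The numbers are invented, and `functools.reduce` stands in for the RDD's `reduce`.

```python
from functools import reduce

def larger(a, b):
    """Same logic as the tutorial's max(a, b), under a name that does not shadow the built-in."""
    if a > b:
        return a
    else:
        return b

words_per_line = [3, 7, 3]  # invented per-line word counts

with_helper = reduce(larger, words_per_line)
with_builtin = reduce(max, words_per_line)  # the built-in max also works on two arguments
```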
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)

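The three stages of this word count can be followed step by step on a tiny input. The sketch below is a plain-Python analogue with invented sample lines; a dictionary stands in for the grouping that `reduceByKey` performs, and nothing here is distributed.

```python
from collections import defaultdict

# Invented sample lines standing in for the README contents.
sample_lines = ["to be or not to be", "to see or not to see"]

# flatMap: each line yields several words, flattened into one sequence.
words = [word for line in sample_lines for word in line.split()]

# map: pair every word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: merge the values of equal keys with a + b.
totals = defaultdict(int)
for word, count in pairs:
    totals[word] = totals[word] + count

word_counts = dict(totals)
```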
Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the collect action:

>>> wordCounts.collect()
[(u'and', ...), (u'A', ...), (u'webpage', ...), (u'README', ...), (u'Note', ...), (u'"local"', ...), (u'variable', ...), ...]
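Since `collect()` returns an ordinary Python list of (word, count) pairs, the result can be post-processed with regular list tools. As a sketch, the snippet below sorts such a list by count; the pairs are invented and do not reflect the real README counts. Keep in mind that `collect()` brings the whole result to the driver, so this approach suits small results only.

```python
# Invented (word, count) pairs, shaped like the list that collect() returns.
collected = [("and", 9), ("A", 1), ("Spark", 14), ("the", 21), ("README", 1)]

# Sort by count, largest first; ties keep their original relative order.
by_count = sorted(collected, key=lambda pair: pair[1], reverse=True)
top_three = by_count[:3]
```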

Caching

Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small "hot" dataset or when running an iterative algorithm like PageRank. As a simple example, let's mark our linesWithSpark dataset to be cached:

>>> linesWithSpark.cache()
>>> linesWithSpark.count()
>>> linesWithSpark.count()

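One way to observe the effect of caching is to time the same action twice. The helper below is a hypothetical sketch (the name `timed` is not part of Spark); it wraps any zero-argument callable, and a plain-Python computation stands in for the Spark action so the snippet runs without a cluster.

```python
import time

def timed(action):
    """Call a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start

# Local stand-in so the sketch runs without Spark.
result, seconds = timed(lambda: sum(range(1000)))
```

In the pyspark shell this could be used as `timed(linesWithSpark.count)`, called twice after `cache()`: the first call computes the data and stores it, and later calls can read it from memory. On a file as small as the README the difference may well be too small to notice.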
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/pyspark to a cluster, as described in the programming guide.

Self-Contained Applications

Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python.


Now we will show how to write an application using the Python API (PySpark). As an example, we'll create a simple Spark application, SimpleApp.py:

"""SimpleApp.py"""
from pyspark import SparkContext

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

# Count the lines that contain the letter 'a' and the letter 'b'
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

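Before submitting the application, the counting and reporting logic can be checked locally on a handful of strings. The helpers `count_containing` and `format_report` below are hypothetical names introduced for this sketch, and the sample lines are invented; only the report format is taken from SimpleApp.py.

```python
def count_containing(lines, needle):
    """Count the lines that contain the substring `needle`."""
    return sum(1 for s in lines if needle in s)

def format_report(num_as, num_bs):
    """Build the same report line that SimpleApp.py prints."""
    return "Lines with a: %i, lines with b: %i" % (num_as, num_bs)

# Invented sample lines standing in for the README contents.
sample_lines = ["alpha", "beta", "gamma", "bob", ""]
report = format_report(count_containing(sample_lines, "a"),
                       count_containing(sample_lines, "b"))
```

The predicates `needle in s` are the same tests that the two `filter` calls apply, just evaluated on a local list instead of an RDD.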
This program just counts the number of lines containing 'a' and the number containing 'b' in a text file. Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed. As with the Scala and Java examples, we use a SparkContext to create RDDs. We can pass Python functions to Spark, which are automatically serialized along with any variables that they reference. For applications that use custom classes or third-party libraries, we can also add code dependencies to spark-submit through its --py-files argument by packaging them into a .zip file (see spark-submit --help for details). SimpleApp is simple enough that we do not need to specify any code dependencies.

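For an application that does have dependencies, the packaging step can be done with Python's standard `zipfile` module. The sketch below builds an archive from a hypothetical package named `mylib` (its contents and the archive name `deps.zip` are invented for illustration) inside a temporary directory.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as work_dir:
    # Hypothetical dependency package: mylib/__init__.py and mylib/helpers.py.
    package_dir = os.path.join(work_dir, "mylib")
    os.makedirs(package_dir)
    for file_name, source in [("__init__.py", ""),
                              ("helpers.py", "def shout(s):\n    return s.upper()\n")]:
        with open(os.path.join(package_dir, file_name), "w") as handle:
            handle.write(source)

    # Store files under paths relative to work_dir so that `import mylib`
    # resolves once the archive is on the Python path.
    archive_path = os.path.join(work_dir, "deps.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        for file_name in sorted(os.listdir(package_dir)):
            archive.write(os.path.join(package_dir, file_name),
                          arcname="mylib/" + file_name)

    # Read the archive back to confirm what it contains.
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = sorted(archive.namelist())
```

An archive built this way could then be passed along on the command line, for example `YOUR_SPARK_HOME/bin/spark-submit --py-files deps.zip SimpleApp.py`.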
We can run this application using the bin/spark-submit script:

# Use spark-submit to run your application
YOUR_SPARK_HOME/bin/spark-submit \
  --master local \
  SimpleApp.py
...
Lines with a: 46, lines with b: 23

Where to Go from Here

Congratulations on running your first Spark application!

For an in-depth overview of the API, start with the Spark programming guide, or see the "Programming Guides" menu for other components.

For running applications on a cluster, head to the deployment overview.

Finally, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run them as follows:

# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R
