How to chain multiple MapReduce jobs in Hadoop
2013-12-19 16:10
When running MapReduce jobs it is common to have several MapReduce steps in an overall workflow, where the output of one reduce phase is used as the input for the next map phase.
Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3...
While searching for an answer for my MapReduce job, I stumbled upon several ways to achieve my objective. Here are some of them:
Using the MapReduce JobClient.runJob() API to chain jobs:
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and waits for it to complete. When that job has completed, call the next driver method, which creates a new JobConf object referring to different Mapper and Reducer instances, and so on. The first job in the chain should write its output to a path, which is then used as the input path for the second job.
This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem.
Method 1:
First create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job: JobClient.runJob(job1).
Immediately below it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Finally execute the second job: JobClient.runJob(job2).
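Method 1 can be sketched as a single driver class using the old org.apache.hadoop.mapred API. This is a minimal sketch, not the original author's code: IdentityMapper/IdentityReducer stand in for real map and reduce classes, and the key/value types and path names are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        // First job: reads "input", writes to the intermediate directory "temp".
        JobConf job1 = new JobConf(ChainDriver.class);
        job1.setJobName("job1");
        job1.setMapperClass(IdentityMapper.class);   // replace with your Mapper1
        job1.setReducerClass(IdentityReducer.class); // replace with your Reducer1
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("temp"));
        JobClient.runJob(job1); // blocks until job1 completes

        // Second job: reads job1's output from "temp", writes to "output".
        JobConf job2 = new JobConf(ChainDriver.class);
        job2.setJobName("job2");
        job2.setMapperClass(IdentityMapper.class);   // replace with your Mapper2
        job2.setReducerClass(IdentityReducer.class); // replace with your Reducer2
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job2, new Path("temp"));
        FileOutputFormat.setOutputPath(job2, new Path("output"));
        JobClient.runJob(job2);
    }
}
```

Because JobClient.runJob() blocks, the second job cannot start until the first has finished, which is exactly the sequencing Method 1 relies on.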
Method 2:
Create two JobConf objects and set all their parameters just as in Method 1, except that you don't call JobClient.runJob().
Then create two Job objects (org.apache.hadoop.mapred.jobcontrol.Job) with the JobConfs as parameters: Job job1 = new Job(jobconf1); Job job2 = new Job(jobconf2);
Using a JobControl object, you specify the job dependencies and then run the jobs: JobControl jbcntrl = new JobControl("jbcntrl"); jbcntrl.addJob(job1); jbcntrl.addJob(job2); job2.addDependingJob(job1); jbcntrl.run();
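The Method 2 steps above can be sketched as below. Note that Job here is org.apache.hadoop.mapred.jobcontrol.Job, and that JobControl's run() loop keeps polling until all jobs finish, so it is usually started in its own thread. The configureJob1()/configureJob2() helpers are hypothetical stand-ins for the JobConf setup shown in Method 1.

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobControlDriver {
    public static void main(String[] args) throws Exception {
        JobConf jobconf1 = configureJob1(); // input -> temp
        JobConf jobconf2 = configureJob2(); // temp  -> output

        Job job1 = new Job(jobconf1);
        Job job2 = new Job(jobconf2);

        JobControl jbcntrl = new JobControl("jbcntrl");
        jbcntrl.addJob(job1);
        jbcntrl.addJob(job2);
        job2.addDependingJob(job1); // job2 starts only after job1 succeeds

        // JobControl implements Runnable; its run() polls job states in a
        // loop, so start it on a separate thread and wait for completion.
        Thread controller = new Thread(jbcntrl);
        controller.start();
        while (!jbcntrl.allFinished()) {
            Thread.sleep(500);
        }
        jbcntrl.stop();
    }

    // Hypothetical helpers: fill in mapper/reducer/paths as in Method 1.
    private static JobConf configureJob1() { return new JobConf(); }
    private static JobConf configureJob2() { return new JobConf(); }
}
```

The advantage over Method 1 is that JobControl tracks the dependency graph for you: independent jobs in the group can run concurrently, while dependent ones wait for their prerequisites.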
Using Oozie, the Hadoop Workflow Service, described below:
https://issues.apache.org/jira/secure/attachment/12400686/hws-v1_0_2009FEB22.pdf
3.1.5 Fork and Join Control Nodes
A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node arrives at it. Fork and join nodes must be used in pairs. The join node assumes that the concurrent execution paths are children of the same fork node.
The name attribute in the fork node is the name of the workflow fork node. The to attribute in the transition elements of the fork node indicates the name of the workflow node that will be part of the concurrent execution. The name attribute in the join node is the name of the workflow join node. The to attribute in the transition element of the join node indicates the name of the workflow node that will be executed after all concurrent execution paths of the corresponding fork arrive at the join node.
Example:
<hadoop-workflow name="sample-wf">
...
<fork name="forking">
<transition to="firstparalleljob"/>
<transition to="secondparalleljob"/>
</fork>
<hadoop name="firstparalleljob">
<job-xml>job1.xml</job-xml>
<transition name="OK" to="joining"/>
<transition name="ERROR" to="fail"/>
</hadoop>
<hadoop name="secondparalleljob">
<job-xml>job2.xml</job-xml>
<transition name="OK" to="joining"/>
<transition name="ERROR" to="fail"/>
</hadoop>
<join name="joining">
<transition to="nextaction"/>
</join>
...
</hadoop-workflow>
* Note: a chain of consecutive jobs produces intermediate files; after the jobs complete, these intermediate files can be deleted from the shell.
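For example, assuming the intermediate directory from the chain is "temp" on HDFS (the path is a placeholder), it can be removed once the final job completes:

```shell
# Delete the intermediate directory left behind by the first job.
hadoop fs -rm -r temp
# On older Hadoop releases the equivalent command is:
# hadoop fs -rmr temp
```

Doing this in the driver script right after the last job succeeds keeps HDFS from accumulating stale intermediate data.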