Apache Hadoop MapReduce Job Execution, Prelude: Preparation Before Task Execution (Part 1)
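I. Creating a Job
1. Obtaining the Job object
Option 1:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
public static Job getInstance(Configuration conf) throws IOException {
  JobConf jobConf = new JobConf(conf);
  return new Job(jobConf);
}
This option lets you customize the configuration: through the Configuration object you can point the job at a file system other than the local one. (You can also do this by providing a configuration file that is loaded automatically, which is the usual way to select HDFS.) The code above does not set anything on conf, so the local file system is used by default. The following setting switches to HDFS:
conf.set("fs.defaultFS", "hdfs://mycat01:9000");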
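Option 2:
Job job = Job.getInstance();
public static Job getInstance() throws IOException {
  // create with a null Cluster
  return getInstance(new Configuration());
}
This behaves exactly like option 1 when no set calls are made, because it delegates to option 1's factory method. In both cases the configuration comes from new Configuration(), which loads the default resources as soon as it is created, for example core-default.xml. Those defaults select the local file system, file:///, and that is what you get when running on Windows. You can of course place an override file such as core-site.xml under the project's src directory (that is, on the classpath) so that it is loaded automatically; either option then picks up your custom configuration. Option 1 is simply a little more flexible.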
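The layering described above (built-in defaults, then a site file, then explicit set calls) can be sketched with plain java.util.Properties. This is a simplified model, not Hadoop's Configuration class; the class and method names are made up for illustration.

```java
import java.util.Properties;

// Simplified model of layered configuration; NOT Hadoop's Configuration class.
final class LayeredConfigDemo {
    // Later layers win: defaults < site file < values set explicitly in code.
    static String resolve(String key, Properties defaults, Properties site, Properties explicit) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(site);
        merged.putAll(explicit);
        return merged.getProperty(key);
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();   // stands in for core-default.xml
        defaults.setProperty("fs.defaultFS", "file:///");
        Properties site = new Properties();       // stands in for core-site.xml
        Properties explicit = new Properties();   // stands in for conf.set(...)

        // Nothing overridden: the local file system is selected.
        System.out.println(resolve("fs.defaultFS", defaults, site, explicit));

        // An explicit set call overrides the default.
        explicit.setProperty("fs.defaultFS", "hdfs://mycat01:9000");
        System.out.println(resolve("fs.defaultFS", defaults, site, explicit));
    }
}
```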
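2. Job instance actions (1): ordinary methods and constructors
Job.getInstance(conf) creates a Job from a custom configuration. During creation the job makes its own copy of conf, so that internal modifications the framework needs to make are not reflected in the object you passed in. A Cluster is created from the configuration only when it is actually needed.
public static Job getInstance(Configuration conf) throws IOException {
  // create with a null Cluster
  JobConf jobConf = new JobConf(conf);
  return new Job(jobConf);
}
Let us look at the first statement above, new JobConf(conf):
public JobConf(Configuration conf) {
  super(conf);
  if (conf instanceof JobConf) {
    // reuse the credentials (secret keys and tokens) of the source JobConf
    JobConf that = (JobConf)conf;
    credentials = that.credentials;
  }
  checkAndWarnDeprecation(); // log a WARN for deprecated properties in the configuration
}
The important part is the first statement, which calls the copy constructor of the parent class Configuration: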
public Configuration(Configuration other) {
  this.resources = (ArrayList<Resource>) other.resources.clone();
  synchronized(other) {
    if (other.properties != null) {
      this.properties = (Properties)other.properties.clone();
    }
    if (other.overlay!=null) {
      this.overlay = (Properties)other.overlay.clone();
    }
    this.restrictSystemProps = other.restrictSystemProps;
    this.updatingResource = new ConcurrentHashMap<String, String[]>(
        other.updatingResource);
    this.finalParameters = Collections.newSetFromMap(
        new ConcurrentHashMap<String, Boolean>());
    this.finalParameters.addAll(other.finalParameters);
  }
  synchronized(Configuration.class) {
    REGISTRY.put(this, null);
  }
  this.classLoader = other.classLoader;
  this.loadDefaults = other.loadDefaults;
  setQuietMode(other.getQuietMode());
}
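The three clone() calls make the intent clear: the conf you passed to Job.getInstance is cloned, and the clones are stored in the new object's resources, properties and overlay fields.
So far, then, essentially one thing has happened: the user's conf object has been copied into the fields of a new Configuration.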
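The effect of such a copy constructor can be shown with a tiny stand-in class. MiniConf below is hypothetical and only mirrors the clone-on-construct idea; it is not Hadoop code.

```java
import java.util.Properties;

final class ConfCopyDemo {
    // Hypothetical stand-in for a configuration object with a copy constructor.
    static final class MiniConf {
        private final Properties props;

        MiniConf() {
            this.props = new Properties();
        }

        // Copy constructor: clone the other object's properties under its lock.
        MiniConf(MiniConf other) {
            synchronized (other) {
                this.props = (Properties) other.props.clone();
            }
        }

        synchronized void set(String key, String value) {
            props.setProperty(key, value);
        }

        synchronized String get(String key) {
            return props.getProperty(key);
        }
    }

    public static void main(String[] args) {
        MiniConf original = new MiniConf();
        original.set("fs.defaultFS", "file:///");

        MiniConf copy = new MiniConf(original);

        // Changing the original after the copy was taken does not affect the copy.
        original.set("fs.defaultFS", "hdfs://mycat01:9000");
        System.out.println("original: " + original.get("fs.defaultFS"));
        System.out.println("copy:     " + copy.get("fs.defaultFS"));
    }
}
```

The same reasoning suggests configuring conf before calling Job.getInstance(conf), or making later changes through job.getConfiguration().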
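3. Job instance actions (2): static initializer blocks
First, the static block of the Job class:
static {
  ConfigUtil.loadResources();
}
One level further in:
public static void loadResources() {
  addDeprecatedKeys(); // register the deprecated configuration keys
  Configuration.addDefaultResource("mapred-default.xml");
  Configuration.addDefaultResource("mapred-site.xml");
  Configuration.addDefaultResource("yarn-default.xml");
  Configuration.addDefaultResource("yarn-site.xml");
}
The remaining four calls register the corresponding XML files as default resources. This happens when the class is loaded, before any Job object is created.
The static block in JobConf does the same thing as the one in Job, so it is not repeated here.
Now the static block of the Configuration class: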
static{
  //print deprecation warning if hadoop-site.xml is found in classpath
  ClassLoader cL = Thread.currentThread().getContextClassLoader();
  if (cL == null) {
    cL = Configuration.class.getClassLoader();
  }
  if(cL.getResource("hadoop-site.xml")!=null) {
    LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
        "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
        + "mapred-site.xml and hdfs-site.xml to override properties of " +
        "core-default.xml, mapred-default.xml and hdfs-default.xml " +
        "respectively");
  }
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}
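This code obtains the context class loader of the current thread, falling back to the class loader of Configuration itself if there is none. hadoop-site.xml was the default configuration file in earlier versions, so the block checks for it and logs a deprecation warning when it is found. Finally, core-default.xml and core-site.xml from the classpath are registered as default resources.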
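Since JobConf extends Configuration, the JVM runs the parent's static block before the child's. The sketch below illustrates that ordering with two hypothetical classes, BaseConf and ChildConf, which only record the resource names mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

final class StaticInitOrderDemo {
    // Records the order in which "default resources" are registered.
    static final List<String> LOADED = new ArrayList<>();

    // Hypothetical stand-in for Configuration.
    static class BaseConf {
        static {
            LOADED.add("core-default.xml");
            LOADED.add("core-site.xml");
        }
    }

    // Hypothetical stand-in for JobConf.
    static class ChildConf extends BaseConf {
        static {
            LOADED.add("mapred-default.xml");
            LOADED.add("mapred-site.xml");
        }
    }

    // Creating the first instance triggers class initialization, parent first.
    static List<String> createChild() {
        new ChildConf();
        return new ArrayList<>(LOADED);
    }

    public static void main(String[] args) {
        System.out.println(createChild());
        // Static blocks run only once, so a second instance adds nothing.
        System.out.println(createChild());
    }
}
```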
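4. Summary of what happens while a Job instance is created
1. The three core classes Job, JobConf and Configuration are loaded.
2. The default configuration files are loaded (mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, core-default.xml and core-site.xml).
3. The Configuration object passed to the Job factory method is copied into the fields of a new Configuration. The job works on this stored copy rather than on the object the user passed in.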
II. Configuring the Mapper and Reducer classes
1. Setting the Mapper class
In the driver:
job.setMapperClass(WordCountMapper.class);
The setMapperClass method of Job:
public void setMapperClass(Class<? extends Mapper> cls
    ) throws IllegalStateException {
  ensureState(JobState.DEFINE); // verify the job state
  conf.setClass(MAP_CLASS_ATTR, cls, Mapper.class); // record the Mapper class
}
Here ensureState(JobState.DEFINE) is:
private void ensureState(JobState state) throws IllegalStateException {
  if (state != this.state) {
    throw new IllegalStateException("Job in state "+ this.state +
        " instead of " + state);
  }
  if (state == JobState.RUNNING && cluster == null) {
    throw new IllegalStateException
        ("Job in state " + this.state
        + ", but it isn't attached to any job tracker!");
  }
}
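As the code shows, ensureState verifies that the job is currently in the expected state; here that means not yet submitted (not running). MapperClass or ReducerClass can only be set while the job is in the DEFINE state.
State of a job that has not been submitted: DEFINE
State of a job that has been submitted: RUNNING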
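A minimal sketch of this guard, using a hypothetical MiniJob class with the same two states, shows why a setter called after submission fails:

```java
final class StateGuardDemo {
    enum State { DEFINE, RUNNING }

    // Hypothetical, heavily reduced stand-in for a job object.
    static final class MiniJob {
        private State state = State.DEFINE;
        private String mapperClassName;

        // Throws if the job is not in the expected state.
        private void ensureState(State expected) {
            if (state != expected) {
                throw new IllegalStateException(
                    "Job in state " + state + " instead of " + expected);
            }
        }

        void setMapperClassName(String name) {
            ensureState(State.DEFINE);   // configuration is only allowed before submission
            this.mapperClassName = name;
        }

        void submit() {
            ensureState(State.DEFINE);
            state = State.RUNNING;
        }

        State getState() { return state; }
        String getMapperClassName() { return mapperClassName; }
    }

    public static void main(String[] args) {
        MiniJob job = new MiniJob();
        job.setMapperClassName("WordCountMapper");
        job.submit();
        try {
            job.setMapperClassName("OtherMapper");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```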
Next, conf.setClass(MAP_CLASS_ATTR, cls, Mapper.class):
MAP_CLASS_ATTR: the key mapreduce.job.map.class
cls: the Mapper type you passed in
Mapper.class: the framework's Mapper base class, which cls must extend
public void setClass(String name, Class<?> theClass, Class<?> xface) {
  if (!xface.isAssignableFrom(theClass))
    throw new RuntimeException(theClass+" not "+xface.getName());
  set(name, theClass.getName());
}
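In other words, if the class you pass in is not a subclass of Mapper, a RuntimeException is thrown; otherwise the class name is stored in the Configuration under the key MAP_CLASS_ATTR, that is mapreduce.job.map.class, and this is the Mapper class used at run time.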
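The isAssignableFrom check can be tried out in isolation. The sketch below stores class names in a plain HashMap; the key demo.list.class and the use of List/ArrayList in place of Mapper are assumptions made only so that the example runs without Hadoop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SetClassDemo {
    private final Map<String, String> props = new HashMap<>();

    // Same shape as the setClass method above: validate, then store the class name.
    void setClass(String name, Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
        props.put(name, theClass.getName());
    }

    String get(String name) {
        return props.get(name);
    }

    public static void main(String[] args) {
        SetClassDemo conf = new SetClassDemo();

        // ArrayList implements List, so the class name is stored.
        conf.setClass("demo.list.class", ArrayList.class, List.class);
        System.out.println(conf.get("demo.list.class"));

        // String does not implement List, so the call is rejected.
        try {
            conf.setClass("demo.list.class", String.class, List.class);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```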
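2. Setting the Reducer class
job.setReducerClass(WordCountReducer.class);
Job receives a different class here, but both setters end up in the same Configuration method, setClass. If no Reducer class is set, the framework's default Reducer class is used.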
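3. Configuring the Map and Reduce output types
// output key/value types of the custom mapper
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// output key/value types of the custom reducer
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Because Java generics are erased at run time, these types have to be specified by hand. Like the settings above, each call simply stores a value under a key in the Configuration.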
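Type erasure is easy to observe with the standard collections: two lists with different type arguments share one runtime class, so the type arguments cannot be read back from an instance. A small stand-alone illustration (not Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;

final class TypeErasureDemo {
    // Returns true because List<String> and List<Integer> have the same runtime class.
    static boolean sameRuntimeClass() {
        List<String> words = new ArrayList<>();
        List<Integer> counts = new ArrayList<>();
        return words.getClass() == counts.getClass();
    }

    public static void main(String[] args) {
        System.out.println("same runtime class: " + sameRuntimeClass());
        System.out.println(new ArrayList<String>().getClass().getName());
    }
}
```

This is why the key and value classes are passed to the job as explicit Class objects instead of being inferred from the generic parameters of the Mapper and Reducer.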
4. Defining the MapReduce input and output
FileInputFormat.addInputPath(job, new Path("D:\\mktest\\wordCountdemo"));
FileSystem fs = FileSystem.get(conf);
Path path = new Path("D://mktest/wordcount");
// The output directory must not exist, otherwise the job fails.
// For convenience while testing, delete it recursively if it is there.
if (fs.exists(path)) {
  fs.delete(path, true);
}
FileOutputFormat.setOutputPath(job, path);
Normally two lines are enough:
FileInputFormat.addInputPath(job, new Path("D:\\mktest\\wordCountdemo"));
FileOutputFormat.setOutputPath(job, path);
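5. Submitting the job
Note that everything from creating the Job instance up to this point has only added configuration: the default configuration files, the Mapper class and its output key/value types, the Reducer output types, and so on. What actually sets the whole job running is:
job.waitForCompletion(true); // true: print progress to the user for monitoring
This is the core method that submits the job.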
public boolean waitForCompletion(boolean verbose
    ) throws IOException, InterruptedException, ClassNotFoundException {
  if (state == JobState.DEFINE) { // submit only while the job is still in the DEFINE state
    submit(); // submit the job
  }
  if (verbose) {
    // monitor the job and print its progress
    monitorAndPrintJob();
  } else {
    // get the completion poll interval from the client.
    int completionPollIntervalMillis =
        Job.getCompletionPollInterval(cluster.getConf());
    while (!isComplete()) {
      try {
        Thread.sleep(completionPollIntervalMillis);
      } catch (InterruptedException ie) {
      }
    }
  }
  return isSuccessful(); // true only if the job has finished successfully
}
The heart of it is the submit method, which performs the submission.
public void submit()
    throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);
  setUseNewAPI(); // use the new API unless old-API properties or methods are in use
  connect();
  final JobSubmitter submitter =
      getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
  status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
    public JobStatus run() throws IOException, InterruptedException,
        ClassNotFoundException {
      return submitter.submitJobInternal(Job.this, cluster);
    }
  });
  state = JobState.RUNNING;
  LOG.info("The url to track the job: " + getTrackingURL());
}
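First, the job must be in the DEFINE state before it can be submitted. setUseNewAPI() then works out whether your program uses the new or the old API and configures the job accordingly; the most visible difference is that the new API lives in the org.apache.hadoop.mapreduce package rather than in org.apache.hadoop.mapred.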
private synchronized void connect()
    throws IOException, InterruptedException, ClassNotFoundException {
  if (cluster == null) {
    cluster =
        ugi.doAs(new PrivilegedExceptionAction<Cluster>() {
          public Cluster run()
              throws IOException, InterruptedException,
              ClassNotFoundException {
            return new Cluster(getConfiguration());
          }
        });
  }
}
public Cluster(InetSocketAddress jobTrackAddr, Configuration conf)
    throws IOException {
  this.conf = conf;
  this.ugi = UserGroupInformation.getCurrentUser();
  initialize(jobTrackAddr, conf);
}
The connect method creates a Cluster object from the job's Configuration. The new object obtains the client protocol through which the client communicates with the job tracker.
final JobSubmitter submitter =
    getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
  public JobStatus run() throws IOException, InterruptedException,
      ClassNotFoundException {
    return submitter.submitJobInternal(Job.this, cluster);
  }
});
Once communication with the client is in place, a JobSubmitter instance wrapping that client is created.
ugi holds the current user and the groups that user belongs to:
public class UserGroupInformation {
  private static final Log LOG = LogFactory.getLog(UserGroupInformation.class);
  /**
   * Percentage of the ticket window to use before we renew ticket.
   */
  private static final float TICKET_RENEW_WINDOW = 0.80f;
  private static boolean shouldRenewImmediatelyForTests = false;
  static final String HADOOP_USER_NAME = "HADOOP_USER_NAME";
  static final String HADOOP_PROXY_USER = "HADOOP_PROXY_USER";
  .................
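One thing worth mentioning: suppose you run the MapReduce job locally on Windows while both its input and its output are on HDFS (a Hadoop cluster). The job is then subject to a permission check at this point.
HADOOP_USER_NAME is the name of the environment variable (or Java system property) that overrides the user name under simple authentication; when it is not set, your local operating system user name is used. So if the current user on your Windows machine is Admin, the job is submitted as Admin, and if Admin has no permission on the files in HDFS, your MapReduce program throws an exception.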
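The lookup order can be modelled as follows. This is a simplified sketch under the assumption of simple (non-Kerberos) authentication; the fallback to the user.name system property stands in for Hadoop's operating-system login and is not how Hadoop itself obtains the OS user.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class EffectiveUserDemo {
    static final String HADOOP_USER_NAME = "HADOOP_USER_NAME";

    // Environment variable first, then system property, then the OS user (modelled via user.name).
    static String resolveUser(Map<String, String> env, Properties sysProps) {
        String user = env.get(HADOOP_USER_NAME);
        if (user == null) {
            user = sysProps.getProperty(HADOOP_USER_NAME);
        }
        if (user == null) {
            user = sysProps.getProperty("user.name");
        }
        return user;
    }

    public static void main(String[] args) {
        // With the real environment of this JVM.
        System.out.println(resolveUser(System.getenv(), System.getProperties()));

        // With a made-up environment: the override wins over the OS user.
        Map<String, String> env = new HashMap<>();
        env.put(HADOOP_USER_NAME, "hadoop");   // "hadoop" is a hypothetical user name
        Properties props = new Properties();
        props.setProperty("user.name", "Admin");
        System.out.println(resolveUser(env, props));
    }
}
```

In practice this means a driver started from an IDE on Windows can usually avoid the permission error either by setting HADOOP_USER_NAME to a user that owns the HDFS directories, or by granting the local user access on the cluster.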
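The ugi.doAs method also logs the privileged action being performed on behalf of the submitting user. Underneath, it calls the doAs method of the Subject class: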
public static <T> T doAs(final Subject subject,
    final java.security.PrivilegedExceptionAction<T> action)
    throws java.security.PrivilegedActionException {
  // obtain the system security manager
  java.lang.SecurityManager sm = System.getSecurityManager();
  if (sm != null) {
    // check the permission
    sm.checkPermission(AuthPermissionHolder.DO_AS_PERMISSION);
  }
  if (action == null)
    throw new NullPointerException
        (ResourcesMgr.getString("invalid.null.action.provided"));
  // set up the new Subject-based AccessControlContext for doPrivileged
  final AccessControlContext currentAcc = AccessController.getContext();
  // call doPrivileged and push this new context on the stack
  return java.security.AccessController.doPrivileged
      (action, createContext(subject, currentAcc));
}
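The main steps are: obtain the system security manager and check the permission, then call getContext() on AccessController. AccessController is used for access control operations and decisions: deciding whether access to a critical system resource is allowed under the security policy currently in effect, marking code as privileged, and obtaining a snapshot of the current calling context. getContext() takes such a snapshot.
The snapshot includes the AccessControlContext inherited by the current thread and any limited privilege scope, and is placed in an AccessControlContext object. The context can then be checked at a later point, possibly in another thread.
After that come the native method calls for the privileged execution, which I will not follow any further here.
Let us go straight to the submitJobInternal method called from that code in Job: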
JobStatus submitJobInternal(Job job, Cluster cluster)
    throws ClassNotFoundException, InterruptedException, IOException {
  //validate the jobs output specs
  checkSpecs(job);
  Configuration conf = job.getConfiguration();
  addMRFrameworkToDistributedCache(conf);
  Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
  //configure the command line options correctly on the submitting dfs
  InetAddress ip = InetAddress.getLocalHost();
  if (ip != null) {
    submitHostAddress = ip.getHostAddress();
    submitHostName = ip.getHostName();
    conf.set(MRJobConfig.JOB_SUBMITHOST,submitHostName);
    conf.set(MRJobConfig.JOB_SUBMITHOSTADDR,submitHostAddress);
  }
  JobID jobId = submitClient.getNewJobID();
  job.setJobID(jobId);
  Path submitJobDir = new Path(jobStagingArea, jobId.toString());
  JobStatus status = null;
  try {
    conf.set(MRJobConfig.USER_NAME,
        UserGroupInformation.getCurrentUser().getShortUserName());
    conf.set("hadoop.http.filter.initializers",
        "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
    conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
    LOG.debug("Configuring job " + jobId + " with " + submitJobDir
        + " as the submit dir");
    // get delegation token for the dir
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
        new Path[] { submitJobDir }, conf);
    populateTokenCache(conf, job.getCredentials());
    // generate a secret to authenticate shuffle transfers
    if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
      KeyGenerator keyGen;
      try {
        keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
        keyGen.init(SHUFFLE_KEY_LENGTH);
      } catch (NoSuchAlgorithmException e) {
        throw new IOException("Error generating shuffle secret key", e);
      }
      SecretKey shuffleKey = keyGen.generateKey();
      TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
          job.getCredentials());
    }
    if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
      conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
      LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
          "data spill is enabled");
    }
    copyAndConfigureFiles(job, submitJobDir);
    Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
    // Create the splits for the job
    LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
    int maps = writeSplits(job, submitJobDir);
    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    LOG.info("number of splits:" + maps);
    // write "queue admins of the queue to which job is being submitted"
    // to job file.
    String queue = conf.get(MRJobConfig.QUEUE_NAME,
        JobConf.DEFAULT_QUEUE_NAME);
    AccessControlList acl = submitClient.getQueueAdmins(queue);
    conf.set(toFullPropertyName(queue,
        QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
    // removing jobtoken referrals before copying the jobconf to HDFS
    // as the tasks don't need this setting, actually they may break
    // because of it if present as the referral will point to a
    // different job.
    TokenCache.cleanUpTokenReferral(conf);
    if (conf.getBoolean(
        MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
        MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
      // Add HDFS tracking ids
      ArrayList<String> trackingIds = new ArrayList<String>();
      for (Token<? extends TokenIdentifier> t :
          job.getCredentials().getAllTokens()) {
        trackingIds.add(t.decodeIdentifier().getTrackingId());
      }
      conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
          trackingIds.toArray(new String[trackingIds.size()]));
    }
    // Set reservation info if it exists
    ReservationId reservationId = job.getReservationId();
    if (reservationId != null) {
      conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
    }
    // Write job file to submit dir
    writeConf(conf, submitJobFile);
    //
    // Now, actually submit the job (using the submit name)
    //
    printTokens(jobId, job.getCredentials());
    status = submitClient.submitJob(
        jobId, submitJobDir.toString(), job.getCredentials());
    if (status != null) {
      return status;
    } else {
      throw new IOException("Could not launch job");
    }
  } finally {
    if (status == null) {
      LOG.info("Cleaning up the staging area " + submitJobDir);
      if (jtFs != null && submitJobDir != null)
        jtFs.delete(submitJobDir, true);
    }
  }
}
The first call, checkSpecs(job), validates the job's output specification:
private void checkSpecs(Job job) throws ClassNotFoundException,
    InterruptedException, IOException {
  JobConf jConf = (JobConf)job.getConfiguration();
  // Check the output specification
  if (jConf.getNumReduceTasks() == 0 ?
      jConf.getUseNewMapper() : jConf.getUseNewReducer()) {
    org.apache.hadoop.mapreduce.OutputFormat<?, ?> output =
        ReflectionUtils.newInstance(job.getOutputFormatClass(),
            job.getConfiguration());
    output.checkOutputSpecs(job);
  } else {
    jConf.getOutputFormat().checkOutputSpecs(jtFs, jConf);
  }
}
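It first obtains the configuration object and then looks at the number of reduce tasks. If that number is zero, meaning there is no reduce phase, it checks whether the new Mapper API is in use; otherwise it checks whether the new Reducer API is in use. If the selected check is false, the old-API branch runs:
jConf.getOutputFormat().checkOutputSpecs(jtFs, jConf);
This validates the output directory; for example, an IOException is thrown if the output directory already exists.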
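If the new Mapper or Reducer API is in use:
org.apache.hadoop.mapreduce.OutputFormat<?, ?> output =
    ReflectionUtils.newInstance(job.getOutputFormatClass(),
        job.getConfiguration());
output.checkOutputSpecs(job);
Here the output format object is created by reflection. If no custom output format has been set, the default is TextOutputFormat, a subclass of FileOutputFormat, which writes the data through LineRecordWriter. This path also ends up in the checkOutputSpecs method of the parent class FileOutputFormat to validate the job's output. The version shown below has the old-API (org.apache.hadoop.mapred) signature; the new-API version performs similar checks: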
public void checkOutputSpecs(FileSystem ignored, JobConf job)
    throws FileAlreadyExistsException,
        InvalidJobConfException, IOException {
  // Ensure that the output directory is set and not already there
  Path outDir = getOutputPath(job);
  if (outDir == null && job.getNumReduceTasks() != 0) {
    throw new InvalidJobConfException("Output directory not set in JobConf.");
  }
  if (outDir != null) {
    FileSystem fs = outDir.getFileSystem(job);
    // normalize the output directory
    outDir = fs.makeQualified(outDir);
    setOutputPath(job, outDir);
    // get delegation token for the outDir's file system
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
        new Path[] {outDir}, job);
    // check its existence
    if (fs.exists(outDir)) {
      throw new FileAlreadyExistsException("Output directory " + outDir +
          " already exists");
    }
  }
}
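The checks are: the output directory must be set (otherwise an exception reports that it is not set; in this version the check only applies when there are reduce tasks), and the directory must not already exist (otherwise an exception reports that it already exists).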
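The two checks can be reproduced against the local file system with java.nio. This is only a model of the logic; it uses IllegalStateException in place of Hadoop's InvalidJobConfException and FileAlreadyExistsException, and the temporary directory name is arbitrary.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-file-system model of the output check; not Hadoop's FileOutputFormat.
final class OutputSpecCheckDemo {
    // Throws IllegalStateException when the output directory is unusable.
    static void checkOutputSpecs(Path outDir, int numReduceTasks) {
        if (outDir == null && numReduceTasks != 0) {
            throw new IllegalStateException("Output directory not set.");
        }
        if (outDir != null && Files.exists(outDir)) {
            throw new IllegalStateException("Output directory " + outDir + " already exists");
        }
    }

    // Returns true if the check rejects the given arguments.
    static boolean rejects(Path outDir, int numReduceTasks) {
        try {
            checkOutputSpecs(outDir, numReduceTasks);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        try {
            Path out = Files.createTempDirectory("wordcount-out");
            System.out.println("existing dir rejected: " + rejects(out, 1));
            Files.delete(out);
            System.out.println("missing dir rejected:  " + rejects(out, 1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```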
Once these checks have passed:
Configuration conf = job.getConfiguration();
addMRFrameworkToDistributedCache(conf);
private static void addMRFrameworkToDistributedCache(Configuration conf)
    throws IOException {
  String framework =
      conf.get(MRJobConfig.MAPREDUCE_APPLICATION_FRAMEWORK_PATH, "");
  if (!framework.isEmpty()) {
    URI uri;
    try {
      uri = new URI(framework);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Unable to parse '" + framework
          + "' as a URI, check the setting for "
          + MRJobConfig.MAPREDUCE_APPLICATION_FRAMEWORK_PATH, e);
    }
    String linkedName = uri.getFragment();
    // resolve any symlinks in the URI path so using a "current" symlink
    // to point to a specific version shows the specific version
    // in the distributed cache configuration
    FileSystem fs = FileSystem.get(uri, conf);
    Path frameworkPath = fs.makeQualified(
        new Path(uri.getScheme(), uri.getAuthority(), uri.getPath()));
    FileContext fc = FileContext.getFileContext(frameworkPath.toUri(), conf);
    frameworkPath = fc.resolvePath(frameworkPath);
    uri = frameworkPath.toUri();
    try {
      uri = new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(),
          null, linkedName);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
    DistributedCache.addCacheArchive(uri, conf);
  }
}
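Roughly: it reads the path of the MapReduce application framework from the job's Configuration, parses it as a URI (rejecting a value that is not a valid URI) and resolves it against the file system, then adds the resolved URI to the distributed cache as a cache archive. After this method returns, the submitting host's name and address are recorded in the configuration and a new job ID (jobId) is generated.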
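The fragment handling in that method can be tried with java.net.URI alone. In the sketch below the framework path and the link name mrframework are made-up values, and the file system resolution step is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class FrameworkUriDemo {
    // The part after '#' is the name under which the archive is linked.
    static String linkName(String framework) {
        return parse(framework).getFragment();
    }

    // Drops the fragment, as happens when the URI is turned into a path for resolution.
    static URI withoutFragment(String framework) {
        URI uri = parse(framework);
        return build(uri.getScheme(), uri.getAuthority(), uri.getPath(), null);
    }

    // Re-attaches the link name to a (resolved) URI.
    static URI withLinkName(URI resolved, String linkedName) {
        return build(resolved.getScheme(), resolved.getAuthority(), resolved.getPath(), linkedName);
    }

    private static URI parse(String s) {
        try {
            return new URI(s);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Unable to parse '" + s + "' as a URI", e);
        }
    }

    private static URI build(String scheme, String authority, String path, String fragment) {
        try {
            return new URI(scheme, authority, path, null, fragment);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical setting value.
        String framework = "hdfs://namenode:8020/apps/mapreduce/mr-framework.tar.gz#mrframework";
        String name = linkName(framework);
        URI bare = withoutFragment(framework);
        System.out.println(name);
        System.out.println(bare);
        System.out.println(withLinkName(bare, name));
    }
}
```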
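The submitting user is then recorded in the job configuration and the job's submit directory is set. Delegation tokens for the submit directory are obtained from the NameNodes (according to the Configuration), and the job's credentials are populated into the token cache.
Then the secret key that authenticates data transfers in the shuffle phase is generated: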
// generate a secret to authenticate shuffle transfers
if (TokenCache.getShuffleSecretKey(job.getCredentials()) == null) {
  KeyGenerator keyGen;
  try {
    keyGen = KeyGenerator.getInstance(SHUFFLE_KEYGEN_ALGORITHM);
    keyGen.init(SHUFFLE_KEY_LENGTH);
  } catch (NoSuchAlgorithmException e) {
    throw new IOException("Error generating shuffle secret key", e);
  }
  SecretKey shuffleKey = keyGen.generateKey();
  TokenCache.setShuffleSecretKey(shuffleKey.getEncoded(),
      job.getCredentials());
}
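The key generation itself uses only the standard javax.crypto API and can be run on its own. The algorithm name and key length below are assumptions standing in for SHUFFLE_KEYGEN_ALGORITHM and SHUFFLE_KEY_LENGTH, whose values are not shown in the snippet above.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

final class ShuffleSecretDemo {
    static final String ALGORITHM = "HmacSHA1";  // assumed value, for illustration
    static final int KEY_LENGTH_BITS = 64;       // assumed value, for illustration

    // Generates a fresh random secret and returns its raw bytes.
    static byte[] generateSecret() {
        KeyGenerator keyGen;
        try {
            keyGen = KeyGenerator.getInstance(ALGORITHM);
            keyGen.init(KEY_LENGTH_BITS);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Error generating shuffle secret key", e);
        }
        SecretKey key = keyGen.generateKey();
        return key.getEncoded();
    }

    public static void main(String[] args) {
        byte[] secret = generateSecret();
        System.out.println("secret length in bytes: " + secret.length);
    }
}
```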
Next comes a check on whether encrypted spill of intermediate data is enabled:
if (CryptoUtils.isEncryptedSpillEnabled(conf)) {
  conf.setInt(MRJobConfig.MR_AM_MAX_ATTEMPTS, 1);
  LOG.warn("Max job attempts set to 1 since encrypted intermediate" +
      "data spill is enabled");
}
Next:
copyAndConfigureFiles(job, submitJobDir);
This uploads the job resources given through command-line options such as -libjars, -files and -archives:
private void copyAndConfigureFiles(Job job, Path jobSubmitDir)
    throws IOException {
  JobResourceUploader rUploader = new JobResourceUploader(jtFs);
  rUploader.uploadFiles(job, jobSubmitDir);
  // Get the working directory. If not set, sets it to filesystem working dir
  // This code has been added so that working directory reset before running
  // the job. This is necessary for backward compatibility as other systems
  // might use the public API JobConf#setWorkingDirectory to reset the working
  // directory.
  job.getWorkingDirectory();
}
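The resource uploader JobResourceUploader uploads the user's configured resources: the job JAR, dependency JARs, library files and so on. The working directory is then fetched; if it has not been set, it is set to the file system's working directory, so that it is reset before the job runs.
Splits are then created in the submit directory and the number of maps (the number of map tasks) is written to the configuration. Note that the number of map tasks equals the number of splits.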
// Create the splits for the job
LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
int maps = writeSplits(job, submitJobDir);
conf.setInt(MRJobConfig.NUM_MAPS, maps);
LOG.info("number of splits:" + maps);
Let us look at how maps (in fact the number of logical splits) is computed:
private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
    Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  JobConf jConf = (JobConf)job.getConfiguration();
  int maps;
  if (jConf.getUseNewMapper()) { // the new API is in use
    maps = writeNewSplits(job, jobSubmitDir);
  } else { // the old API is in use
    maps = writeOldSplits(jConf, jobSubmitDir);
  }
  return maps;
}
Now the computation in the new API:
private <T extends InputSplit>
int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
    InterruptedException, ClassNotFoundException {
  Configuration conf = job.getConfiguration();
  InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
  List<InputSplit> splits = input.getSplits(job);
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);
  // sort the splits into order based on size, so that the biggest
  // go first
  Arrays.sort(array, new SplitComparator());
  JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
      jobSubmitDir.getFileSystem(conf), array);
  return array.length;
}
How the splits themselves are obtained will be covered in detail in a later article.
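The "biggest first" ordering and the resulting map count can be sketched with a hypothetical Split class that carries only a name and a length. The comparator here is an assumption about what a size-based ordering looks like, not the SplitComparator source.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SplitSortDemo {
    // Hypothetical stand-in for an input split: just a name and a length in bytes.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Returns a copy of the splits sorted by length, largest first.
    static Split[] sortBiggestFirst(Split[] splits) {
        Split[] copy = splits.clone();
        Arrays.sort(copy, new Comparator<Split>() {
            @Override
            public int compare(Split a, Split b) {
                return Long.compare(b.length, a.length);
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        Split[] splits = {
            new Split("part-b", 64L),
            new Split("part-a", 128L),
            new Split("part-c", 16L)
        };
        Split[] sorted = sortBiggestFirst(splits);
        for (Split s : sorted) {
            System.out.println(s.name + " " + s.length);
        }
        // One map task per split.
        System.out.println("maps = " + sorted.length);
    }
}
```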
Next, the administrators of the queue the job is being submitted to are written to the job file:
String queue = conf.get(MRJobConfig.QUEUE_NAME,
    JobConf.DEFAULT_QUEUE_NAME);
AccessControlList acl = submitClient.getQueueAdmins(queue);
conf.set(toFullPropertyName(queue,
    QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());
Before the jobconf is copied to HDFS, the job-token referral is removed. The tasks do not need this setting, and may actually break if it is present, because the referral would point to a different job.
// removing jobtoken referrals before copying the jobconf to HDFS
// as the tasks don't need this setting, actually they may break
// because of it if present as the referral will point to a
// different job.
TokenCache.cleanUpTokenReferral(conf);
Adding the token tracking IDs to the job:
if (conf.getBoolean(
    MRJobConfig.JOB_TOKEN_TRACKING_IDS_ENABLED,
    MRJobConfig.DEFAULT_JOB_TOKEN_TRACKING_IDS_ENABLED)) {
  // Add HDFS tracking ids
  ArrayList<String> trackingIds = new ArrayList<String>();
  for (Token<? extends TokenIdentifier> t :
      job.getCredentials().getAllTokens()) {
    trackingIds.add(t.decodeIdentifier().getTrackingId());
  }
  conf.setStrings(MRJobConfig.JOB_TOKEN_TRACKING_IDS,
      trackingIds.toArray(new String[trackingIds.size()]));
}
Setting the reservation info, if any:
ReservationId reservationId = job.getReservationId();
if (reservationId != null) {
  conf.set(MRJobConfig.RESERVATION_ID, reservationId.toString());
}
Writing the job configuration to the job file in the submit directory:
writeConf(conf, submitJobFile);
Now the job is actually submitted. The job's status is returned after submission (status here, not state):
printTokens(jobId, job.getCredentials()); // print the job's token information
// the client performs the submission
status = submitClient.submitJob(jobId, submitJobDir.toString(), job.getCredentials());
if (status != null) {
  return status;
} else {
  throw new IOException("Could not launch job");
}
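Here submitClient is an instance of the protocol through which the client, JobClient, communicates with the job tracker, JobTracker. JobClient uses the implementation's methods to submit jobs and run tasks. The declared type of submitClient is ClientProtocol; when the job runs locally, as in this walkthrough, the implementation actually invoked is LocalJobRunner.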
public org.apache.hadoop.mapreduce.JobStatus submitJob(
    org.apache.hadoop.mapreduce.JobID jobid, String jobSubmitDir,
    Credentials credentials) throws IOException {
  Job job = new Job(JobID.downgrade(jobid), jobSubmitDir);
  job.job.setCredentials(credentials);
  return job.status;
}
This Job class is an inner class of LocalJobRunner:
public Job(JobID jobid, String jobSubmitDir) throws IOException {
  this.systemJobDir = new Path(jobSubmitDir);
  this.systemJobFile = new Path(systemJobDir, "job.xml");
  this.id = jobid;
  JobConf conf = new JobConf(systemJobFile);
  this.localFs = FileSystem.getLocal(conf);
  String user = UserGroupInformation.getCurrentUser().getShortUserName();
  this.localJobDir = localFs.makeQualified(new Path(
      new Path(conf.getLocalPath(jobDir), user), jobid.toString()));
  this.localJobFile = new Path(this.localJobDir, id + ".xml");
  // Manage the distributed cache. If there are files to be copied,
  // this will trigger localFile to be re-written again.
  localDistributedCacheManager = new LocalDistributedCacheManager();
  localDistributedCacheManager.setup(conf);
  // Write out configuration file. Instead of copying it from
  // systemJobFile, we re-write it, since setup(), above, may have
  // updated it.
  OutputStream out = localFs.create(localJobFile);
  try {
    conf.writeXml(out);
  } finally {
    out.close();
  }
  this.job = new JobConf(localJobFile);
  // Job (the current object) is a Thread, so we wrap its class loader.
  if (localDistributedCacheManager.hasLocalClasspaths()) {
    setContextClassLoader(localDistributedCacheManager.makeClassLoader(
        getContextClassLoader()));
  }
  profile = new JobProfile(job.getUser(), id, systemJobFile.toString(),
      "http://localhost:8080/", job.getJobName());
  status = new JobStatus(id, 0.0f, 0.0f, JobStatus.RUNNING,
      profile.getUser(), profile.getJobName(), profile.getJobFile(),
      profile.getURL().toString());
  jobs.put(id, this);
  this.start(); // start a child thread
}
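If you look closely you will notice that this Job class extends Thread, so the last statement of the constructor starts a thread, and its run method carries out the actual work of the submitted job. The analysis of what really happens there will follow in a later update.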
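The pattern of an object that extends Thread and starts itself from its constructor can be reduced to a few lines. MiniJob below is hypothetical; its status strings merely echo the RUNNING status set in the constructor above.

```java
final class LocalSubmitDemo {
    // Hypothetical stand-in for the inner Job class: a thread that starts itself.
    static final class MiniJob extends Thread {
        private volatile String status = "RUNNING";

        MiniJob() {
            // As in the constructor above, creating the object also starts the work.
            start();
        }

        @Override
        public void run() {
            // The real class would run the map and reduce tasks here.
            status = "SUCCEEDED";
        }

        String getStatus() {
            return status;
        }
    }

    // "Submits" a job, waits for its thread to finish and returns the final status.
    static String submitAndWait() {
        MiniJob job = new MiniJob();
        try {
            job.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return job.getStatus();
    }

    public static void main(String[] args) {
        System.out.println(submitAndWait());
    }
}
```

This also explains why submitJob can return job.status immediately: the constructor has already handed the work to another thread, and the caller polls for completion afterwards.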