2012-04-13 16:59
Dealing with lots of small files in Hadoop MapReduce with CombineFileInputFormat
Input to a Hadoop MapReduce job is abstracted by InputFormat. FileInputFormat is the default implementation that deals with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, typically upper-bounded by the block size. This means the number of input splits is lower-bounded by the number of input files. This is not an ideal environment for a MapReduce job that deals with a large number of small files, because the overhead of coordinating the distributed processes is far greater than when there is a relatively small number of large files. Note also that when an input split spills over block boundaries, this can work against the general rule of "having the process close to the data", because the blocks of one split could be at different network locations.
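To put rough numbers on the "splits are lower-bounded by file count" point, here is a back-of-the-envelope comparison (plain Java; the 64 MB block size and the file counts are hypothetical illustration values, not from the original article):

```java
public class SplitCount {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;          // assumed 64 MB HDFS block size
        long totalBytes = 10L * 1024 * 1024 * 1024;  // 10 GB of input either way

        // Case 1: 10 GB stored as 64 MB files -> one split per block.
        long largeFileSplits = totalBytes / blockSize;

        // Case 2: the same 10 GB stored as 100,000 files of ~100 KB each.
        // FileInputFormat never combines files, so splits >= number of files.
        long smallFileSplits = 100_000;

        System.out.println(largeFileSplits + " vs " + smallFileSplits);
        // prints: 160 vs 100000
    }
}
```

So the same volume of data costs 160 map tasks in one layout and at least 100,000 in the other, which is where the coordination overhead comes from.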
Enter CombineFileInputFormat: it packs many files into each split so that each mapper has more to process. CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it doesn't suffer from the locality problem that simply using a big split size would cause.
```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class MyCombineFileInputFormat extends CombineFileInputFormat<Text, Text> {

  public static class MyKeyValueLineRecordReader implements RecordReader<Text, Text> {
    private final KeyValueLineRecordReader delegate;

    public MyKeyValueLineRecordReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer idx) throws IOException {
      // Carve the idx-th file out of the combined split and delegate to a
      // regular KeyValueLineRecordReader for it.
      FileSplit fileSplit = new FileSplit(split.getPath(idx), split.getOffset(idx),
          split.getLength(idx), split.getLocations());
      delegate = new KeyValueLineRecordReader(conf, fileSplit);
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
      return delegate.next(key, value);
    }

    @Override
    public Text createKey() {
      return delegate.createKey();
    }

    @Override
    public Text createValue() {
      return delegate.createValue();
    }

    @Override
    public long getPos() throws IOException {
      return delegate.getPos();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
  }

  @Override
  public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new CombineFileRecordReader(job, (CombineFileSplit) split, reporter,
        (Class) MyKeyValueLineRecordReader.class);
  }
}
```
CombineFileInputFormat is an abstract class that you need to extend, overriding the getRecordReader method. CombineFileRecordReader manages the multiple input splits inside a CombineFileSplit simply by constructing a new RecordReader for each one. MyKeyValueLineRecordReader creates a KeyValueLineRecordReader to delegate operations to.
Remember to set mapred.max.split.size to a small multiple of the block size, in bytes; otherwise the files are combined without limit and the whole input may end up in very few splits, with no splitting at all per file.
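Wiring this into a job driver might look roughly like the following sketch (not from the original article; the MyJob driver class name, the 64 MB block size, and the 4-block cap are assumptions):

```java
// Hypothetical driver class name; configuration is a sketch, not a prescription.
JobConf job = new JobConf(MyJob.class);

// Use the combining input format defined above.
job.setInputFormat(MyCombineFileInputFormat.class);

// Cap each combined split at e.g. 4 x 64 MB blocks (assumed block size),
// so the input is still divided into multiple splits.
job.setLong("mapred.max.split.size", 4L * 64 * 1024 * 1024);
```

With this in place, many small files are grouped into splits of up to 256 MB each instead of producing one map task per file.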