A MapReduce program that handles small files with CombineFileInputFormat

2012-04-13 16:59


Dealing with lots of small files in Hadoop MapReduce with CombineFileInputFormat

Input to a Hadoop MapReduce job is abstracted by InputFormat. FileInputFormat is the base class for implementations that deal with files in HDFS. With FileInputFormat, each file is split into one or more InputSplits, each typically bounded above by the block size. This means the number of input splits is bounded below by the number of input files. That is not ideal when a MapReduce job deals with a large number of small files, because the overhead of coordinating distributed processes is far greater than when there is a relatively small number of large files. Note that simply making splits larger is not the answer: when an input split spills over block boundaries, it can work against the general rule of 'keeping the process close to the data', because the blocks could be at different network locations.
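
To make the lower bound concrete, here is a minimal sketch of the split arithmetic. It assumes a simplified rule (one split per block-sized chunk, at least one split per file) and ignores the minimum split size and slack factor that the real FileInputFormat applies. The class name, method names, and the 64 MB block size are illustrative assumptions, not part of Hadoop.

```java
import java.util.Arrays;

public class SplitCountSketch {

    // Simplified rule (assumption): one split per block-sized chunk,
    // and never fewer than one split per file.
    public static long splitsForFile(long fileSize, long blockSize) {
        long chunks = (fileSize + blockSize - 1) / blockSize; // ceiling division
        return Math.max(1, chunks);
    }

    // Total number of splits, and therefore map tasks, for a set of files.
    public static long totalSplits(long[] fileSizes, long blockSize) {
        return Arrays.stream(fileSizes)
                     .map(size -> splitsForFile(size, blockSize))
                     .sum();
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;      // 64 MB block size (assumed)

        long[] manySmall = new long[1000];
        Arrays.fill(manySmall, 100L * 1024);     // 1000 files of 100 KB each

        long[] oneLarge = {1000L * 100 * 1024};  // the same bytes in a single file

        System.out.println(totalSplits(manySmall, blockSize)); // prints 1000
        System.out.println(totalSplits(oneLarge, blockSize));  // prints 2
    }
}
```

The same amount of data costs 1000 map tasks in one layout and 2 in the other, which is the coordination overhead described above.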

Enter CombineFileInputFormat. It packs many files into each split so that each mapper has more to process. CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it does not suffer from the locality problem that simply configuring a large split size would cause.

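import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueLineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small key/value text files into each split (old org.apache.hadoop.mapred API).
public class MyCombineFileInputFormat extends CombineFileInputFormat<Text, Text> {

  // Reads a single file chunk of a CombineFileSplit by delegating to KeyValueLineRecordReader.
  public static class MyKeyValueLineRecordReader implements RecordReader<Text, Text> {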
    private final KeyValueLineRecordReader delegate;

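    // CombineFileRecordReader instantiates this class reflectively, so the
    // constructor must take exactly these four parameter types.
    // idx is the index of the file chunk within the combined split.
    public MyKeyValueLineRecordReader(
      CombineFileSplit split, Configuration conf, Reporter reporter, Integer idx) throws IOException {
      FileSplit fileSplit = new FileSplit(
        split.getPath(idx), split.getOffset(idx), split.getLength(idx), split.getLocations());
      delegate = new KeyValueLineRecordReader(conf, fileSplit);
    }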

    @Override
    public boolean next(Text key, Text value) throws IOException {
      return delegate.next(key, value);
    }

    @Override
    public Text createKey() {
      return delegate.createKey();
    }

    @Override
    public Text createValue() {
      return delegate.createValue();
    }

    @Override
    public long getPos() throws IOException {
      return delegate.getPos();
    }

    @Override
    public void close() throws IOException {
      delegate.close();
    }

    @Override
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
  }

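  // The raw (Class) cast is needed because a class literal cannot carry
  // the generic type Class<RecordReader<Text, Text>>.
  @Override
  @SuppressWarnings({"unchecked", "rawtypes"})
  public RecordReader<Text, Text> getRecordReader(
    InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new CombineFileRecordReader<Text, Text>(
      job, (CombineFileSplit) split, reporter, (Class) MyKeyValueLineRecordReader.class);
  }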
}


CombineFileInputFormat is an abstract class: you extend it and override the getRecordReader method. CombineFileRecordReader handles the multiple file chunks in a CombineFileSplit by constructing a new RecordReader for each chunk in turn. MyKeyValueLineRecordReader creates a KeyValueLineRecordReader and delegates all operations to it.
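
The "one reader per chunk" idea can be shown without Hadoop on the classpath. The following is only a sketch of the chaining pattern, not CombineFileRecordReader itself: LineReader, ChainedReader, over, and readAll are hypothetical names, and records are plain strings with null marking the end of input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntFunction;

public class ChainedReaderSketch {

    /** Hypothetical stand-in for a record reader; next() returns null at end of input. */
    public interface LineReader {
        String next();
    }

    /** Reads chunk 0, then chunk 1, and so on, creating each chunk's reader lazily. */
    public static class ChainedReader implements LineReader {
        private final int chunkCount;
        private final IntFunction<LineReader> factory; // builds the reader for chunk i
        private int idx = 0;
        private LineReader current;

        public ChainedReader(int chunkCount, IntFunction<LineReader> factory) {
            this.chunkCount = chunkCount;
            this.factory = factory;
        }

        @Override
        public String next() {
            while (true) {
                if (current == null) {
                    if (idx >= chunkCount) {
                        return null;               // all chunks consumed
                    }
                    current = factory.apply(idx++); // open the next chunk
                }
                String record = current.next();
                if (record != null) {
                    return record;
                }
                current = null;                    // chunk exhausted, move on
            }
        }
    }

    /** Wraps an in-memory list as a reader; stands in for reading one file. */
    public static LineReader over(List<String> lines) {
        Iterator<String> it = lines.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Drains every chunk through a single chained reader. */
    public static List<String> readAll(List<List<String>> chunks) {
        LineReader reader = new ChainedReader(chunks.size(), i -> over(chunks.get(i)));
        List<String> out = new ArrayList<>();
        for (String r = reader.next(); r != null; r = reader.next()) {
            out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
            Arrays.asList("a", "b"),
            Collections.<String>emptyList(),   // an empty file is skipped transparently
            Arrays.asList("c"));
        System.out.println(readAll(chunks));   // prints [a, b, c]
    }
}
```

The mapper sees one continuous stream of records, which is why the per-chunk reader only has to know how to read a single file.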

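Remember to set mapred.max.split.size, in bytes, to a small multiple of the block size. Otherwise there is no upper limit on split size, and the input will not be divided up as intended (in the extreme, everything ends up in a single split).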
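
The effect of the size limit can be sketched as a greedy packing loop. This is a simplification under stated assumptions: the real CombineFileInputFormat first groups blocks by node and rack, which this sketch ignores entirely, and SplitPackingSketch and pack are hypothetical names. A limit of 0 is treated as "not set".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitPackingSketch {

    /**
     * Greedily packs block lengths into splits, closing a split once it
     * reaches maxSplitSize. A maxSplitSize of 0 means "no limit", so every
     * block lands in the same split. Locality is deliberately ignored.
     */
    public static List<List<Long>> pack(long[] blockLengths, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long len : blockLengths) {
            current.add(len);
            currentSize += len;
            if (maxSplitSize != 0 && currentSize >= maxSplitSize) {
                splits.add(current);           // split is full, start a new one
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);               // leftover blocks form the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;    // 64 MB block size (assumed)
        long maxSplitSize = 2 * blockSize;     // "a small multiple of the block size"

        long[] blocks = new long[10000];
        Arrays.fill(blocks, 100L * 1024);      // 10,000 files of 100 KB each

        System.out.println(pack(blocks, maxSplitSize).size()); // prints 8
        System.out.println(pack(blocks, 0).size());            // prints 1

        // In a real job the limit would be set on the JobConf, for example:
        //   job.setLong("mapred.max.split.size", maxSplitSize);
    }
}
```

With the limit set, 10,000 small files become 8 map tasks instead of 10,000; without it, this sketch produces a single split and the job loses its parallelism.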