Spark Source Code Reading (1): Analyzing Spark's HBase Read/Write Code

2017-08-01 22:36
1. Reading from HBase
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],  
   classOf[org.apache.hadoop.hbase.client.Result])  
Let's look at how newAPIHadoopRDD is implemented:
def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
    conf: Configuration = hadoopConfiguration,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V]): RDD[(K, V)] = withScope {
  assertNotStopped()

  // This is a hack to enforce loading hdfs-site.xml.
  // See SPARK-11227 for details.
  FileSystem.getLocal(conf)

  // Add necessary security credentials to the JobConf. Required to access secure HDFS.
  val jconf = new JobConf(conf)
  SparkHadoopUtil.get.addCredentials(jconf)
  new NewHadoopRDD(this, fClass, kClass, vClass, jconf)
}
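This uses the RDD interface and classes built on the new Hadoop API. The input format is TableInputFormat, so the partitions of the data read back are determined by the splits that TableInputFormat produces.
TableInputFormat extends TableInputFormatBase, so let's see how it computes those splits: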
/**
   * Calculates the splits that will serve as input for the map tasks. The
   * number of splits matches the number of regions in a table.
   *
   * @param context  The current job context.
   * @return The list of input splits.
   * @throws IOException When creating the list of splits fails.
   * @see org.apache.hadoop.mapreduce.InputFormat#getSplits(
   *   org.apache.hadoop.mapreduce.JobContext)
   */
  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    if (table == null) {
        throw new IOException("No table was provided.");
    }
    // Get the name server address and the default value is null.
    this.nameServer =
      context.getConfiguration().get("hbase.nameserver.address", null);
     
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    if (keys == null || keys.getFirst() == null ||
        keys.getFirst().length == 0) {
      HRegionLocation regLoc = table.getRegionLocation(HConstants.EMPTY_BYTE_ARRAY, false);
      if (null == regLoc) {
        throw new IOException("Expecting at least one region.");
      }
      List<InputSplit> splits = new ArrayList<InputSplit>(1);
      InputSplit split = new TableSplit(table.getTableName(),
          HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY, regLoc
              .getHostnamePort().split(Addressing.HOSTNAME_PORT_SEPARATOR)[0]);
      splits.add(split);
      return splits;
    }
    List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
    for (int i = 0; i < keys.getFirst().length; i++) {
      if ( !includeRegionInSplit(keys.getFirst()[i], keys.getSecond()[i])) {
        continue;
      }
      HRegionLocation location = table.getRegionLocation(keys.getFirst()[i], false);
      // The below InetSocketAddress creation does a name resolution.
      InetSocketAddress isa = new InetSocketAddress(location.getHostname(), location.getPort());
      if (isa.isUnresolved()) {
        LOG.warn("Failed resolve " + isa);
      }
      InetAddress regionAddress = isa.getAddress();
      String regionLocation;
      try {
        regionLocation = reverseDNS(regionAddress);
      } catch (NamingException e) {
        LOG.error("Cannot resolve the host name for " + regionAddress + " because of " + e);
        regionLocation = location.getHostname();
      }
 
      byte[] startRow = scan.getStartRow();
      byte[] stopRow = scan.getStopRow();
      // determine if the given start an stop key fall into the region
      if ((startRow.length == 0 || keys.getSecond()[i].length == 0 ||
          Bytes.compareTo(startRow, keys.getSecond()[i]) < 0) &&
          (stopRow.length == 0 ||
           Bytes.compareTo(stopRow, keys.getFirst()[i]) > 0)) {
        byte[] splitStart = startRow.length == 0 ||
          Bytes.compareTo(keys.getFirst()[i], startRow) >= 0 ?
            keys.getFirst()[i] : startRow;
        byte[] splitStop = (stopRow.length == 0 ||
          Bytes.compareTo(keys.getSecond()[i], stopRow) <= 0) &&
          keys.getSecond()[i].length > 0 ?
            keys.getSecond()[i] : stopRow;
        InputSplit split = new TableSplit(table.getTableName(),
          splitStart, splitStop, regionLocation);
        splits.add(split);
        if (LOG.isDebugEnabled()) {
          LOG.debug("getSplits: split -> " + i + " -> " + split);
        }
      }
    }
    return splits;
  }
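As the code shows, HBase data is partitioned by region: each split covers the scanned portion of one region, and the split's location is the host serving that region. When tasks are later assigned to executors, Spark can use this location to prefer an executor on the machine that hosts the region, so the data is read and processed locally.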
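The boundary handling in the loop above is easier to follow when isolated. The following is a minimal sketch that mirrors only the overlap check and the clipping of splitStart/splitStop; it is not the HBase class. The names RegionSplitSketch, Range, key, cmp and computeSplits are hypothetical, Arrays.compareUnsigned stands in for Bytes.compareTo, and region locations and includeRegionInSplit are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the boundary logic in getSplits().
// It ignores region locations and includeRegionInSplit().
class RegionSplitSketch {

    // One split: a [start, stop) row-key range; an empty array means "unbounded".
    static final class Range {
        final byte[] start;
        final byte[] stop;

        Range(byte[] start, byte[] stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + new String(start, StandardCharsets.UTF_8) + ", "
                    + new String(stop, StandardCharsets.UTF_8) + ")";
        }
    }

    // Helper for readable examples: turn a string into a row key.
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Unsigned lexicographic comparison, standing in for HBase's Bytes.compareTo.
    static int cmp(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    static List<Range> computeSplits(byte[][] regionStarts, byte[][] regionEnds,
                                     byte[] scanStart, byte[] scanStop) {
        List<Range> splits = new ArrayList<>();
        for (int i = 0; i < regionStarts.length; i++) {
            byte[] rStart = regionStarts[i];
            byte[] rEnd = regionEnds[i];
            // The region overlaps the scan if the scan starts before the region
            // ends and stops after the region starts (empty key = no bound).
            boolean startsBeforeRegionEnd =
                    scanStart.length == 0 || rEnd.length == 0 || cmp(scanStart, rEnd) < 0;
            boolean stopsAfterRegionStart =
                    scanStop.length == 0 || cmp(scanStop, rStart) > 0;
            if (!(startsBeforeRegionEnd && stopsAfterRegionStart)) {
                continue;
            }
            // Clip the region to the scan range, as the original ternaries do.
            byte[] splitStart =
                    (scanStart.length == 0 || cmp(rStart, scanStart) >= 0) ? rStart : scanStart;
            byte[] splitStop =
                    ((scanStop.length == 0 || cmp(rEnd, scanStop) <= 0) && rEnd.length > 0)
                            ? rEnd : scanStop;
            splits.add(new Range(splitStart, splitStop));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three regions: [ , g), [g, p), [p, ); scan rows from "c" up to "k".
        byte[][] starts = { key(""), key("g"), key("p") };
        byte[][] ends = { key("g"), key("p"), key("") };
        for (Range r : computeSplits(starts, ends, key("c"), key("k"))) {
            System.out.println(r);
        }
    }
}
```

With three regions and a scan from "c" to "k", this sketch yields two splits, [c, g) and [g, k): the first region is clipped at the scan's start row, the second at its stop row, and the third is skipped. A scan with empty start and stop rows yields one split per region, which is the case where the number of partitions equals the number of regions.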

2. Writing data to HBase
The call used is rdd.saveAsNewAPIHadoopDataset(jobConf). Let's look at its code:
def saveAsNewAPIHadoopDataset(conf: Configuration): Unit = self.withScope {
  val config = new HadoopMapReduceWriteConfigUtil[K, V](new SerializableConfiguration(conf))
  SparkHadoopWriter.write(
    rdd = self,
    config = config)
}
It simply calls the write interface; writes to local files and to HDFS files go through this same interface.