Hadoop HDFS / HDFS-12222

Document and test BlockLocation for erasure-coded files


Details

    Description

      HDFS applications query block location information to compute splits. One example of this is FileInputFormat:

      https://github.com/apache/hadoop/blob/d4015f8628dd973c7433639451a9acc3e741d2a2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L346

      Such code computes split boundaries from block offsets like this:

          long bytesInThisBlock = blkLocations[startIndex].getOffset() + 
                                blkLocations[startIndex].getLength() - offset;
      

      Erasure coding confuses this calculation, since the returned block locations also include parity block locations, which are not part of the logical file length. This throws off the offset arithmetic, and with it the topology and caching information.

      Applications can determine which blocks are parity blocks by reading the file's EC policy and parsing its schema, but it would be much better to expose this generically in BlockLocation instead.
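      A minimal standalone sketch of the problem (not the real HDFS API: `EcSplitSketch`, `BlockLoc`, the `parity` flag, and the block layout are all hypothetical stand-ins). It shows how summing the lengths of every reported location, the way FileInputFormat-style split code does, overshoots the logical file length once parity locations are included, and how a parity flag on the location would let callers filter them:

```java
// Hypothetical sketch: models how parity entries in a BlockLocation
// listing inflate split arithmetic for an erasure-coded file.
public class EcSplitSketch {

    // Minimal stand-in for org.apache.hadoop.fs.BlockLocation, plus the
    // parity flag this issue proposes to expose.
    static class BlockLoc {
        final long offset;
        final long length;
        final boolean parity;

        BlockLoc(long offset, long length, boolean parity) {
            this.offset = offset;
            this.length = length;
            this.parity = parity;
        }
    }

    // An RS-3-2-style group: 3 data blocks plus 2 parity blocks of 1 MB
    // each. Layout is illustrative only; real striped files interleave cells.
    static BlockLoc[] sampleLocations() {
        long mb = 1L << 20;
        return new BlockLoc[] {
            new BlockLoc(0,      mb, false),
            new BlockLoc(mb,     mb, false),
            new BlockLoc(2 * mb, mb, false),
            new BlockLoc(3 * mb, mb, true),   // parity: not logical file data
            new BlockLoc(4 * mb, mb, true),   // parity: not logical file data
        };
    }

    // Sum of reported block lengths, optionally skipping parity entries.
    static long totalLength(BlockLoc[] locs, boolean includeParity) {
        long total = 0;
        for (BlockLoc l : locs) {
            if (includeParity || !l.parity) {
                total += l.length;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        BlockLoc[] locs = sampleLocations();
        long logicalLength = 3L << 20; // only the data blocks count

        // Naive accounting over every location overshoots the logical
        // file, so the last splits would cover bytes that do not exist.
        System.out.println(totalLength(locs, true));   // 5 MB: too long
        System.out.println(totalLength(locs, false));  // 3 MB: matches file
    }
}
```

      With a flag like this on BlockLocation, split computation stays a single filter-and-sum, instead of each application re-deriving the data/parity layout from the EC policy and schema.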

      Attachments

        1. HDFS-12222.001.patch
          9 kB
          Huafeng Wang
        2. HDFS-12222.002.patch
          4 kB
          Huafeng Wang
        3. HDFS-12222.003.patch
          13 kB
          Huafeng Wang
        4. HDFS-12222.004.patch
          31 kB
          Huafeng Wang
        5. HDFS-12222.005.patch
          21 kB
          Huafeng Wang
        6. HDFS-12222.006.patch
          23 kB
          Huafeng Wang


          People

            Huafeng Wang
            Andrew Wang
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: