HBase / HBASE-24541

Add support to run LoadIncrementalHFiles in a distributed manner

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: mapreduce, Performance
    • Labels:
      None

      Description

      LoadIncrementalHFiles takes a very long time to complete when running HBase on top of S3 and attempting to bulkload 500K-700K files.

      The root cause is the combination of S3's higher latency (compared to HDFS) and the calls LoadIncrementalHFiles makes to the underlying filesystem: each file is opened, the stream seeks to the trailer offset at the end of the file, and then the trailer is read.
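
The per-file access pattern described above (open, seek to the trailer offset, read the trailer) can be sketched as follows. This is only an illustration on a local file with `java.io.RandomAccessFile`; it is not the HBase code path. The class name `TrailerReadSketch`, the method `readTail`, and the 7-byte "trailer" are hypothetical, and the real HFile trailer layout is not modeled here. On S3, each of these steps can translate into remote requests, which is where the per-file latency adds up.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the open / seek / read-trailer sequence.
class TrailerReadSketch {

    /** Returns the last {@code length} bytes of {@code file}. */
    static byte[] readTail(Path file, int length) throws IOException {
        // 1. Open the file.
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            long size = in.length();
            if (length < 0 || length > size) {
                throw new IllegalArgumentException("length out of range: " + length);
            }
            // 2. Seek to the (assumed) trailer offset at the end of the file.
            in.seek(size - length);
            // 3. Read exactly the trailer bytes.
            byte[] buf = new byte[length];
            in.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("hfile-sketch", ".bin");
        try {
            Files.write(tmp, "data-block-bytes|TRAILER".getBytes(StandardCharsets.US_ASCII));
            byte[] tail = readTail(tmp, 7);
            System.out.println(new String(tail, StandardCharsets.US_ASCII)); // prints TRAILER
        } finally {
            Files.delete(tmp);
        }
    }
}
```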

      Increasing the parallelism does not yield any significant improvement. This seems to stem from the fact that the stream is not consumed to the end once the trailer has been read, so the underlying HTTP connection is aborted and cannot be reused.
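
To illustrate what "consumed to the end" means, the sketch below drains the remaining bytes of a generic `java.io.InputStream` before the caller closes it, up to a byte budget. It is an assumption that draining would keep the connection reusable in this particular setup; the ticket does not say so, and the helper `drain` and class `DrainSketch` are hypothetical names, not part of HBase or of any S3 connector.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: discard what is left in a stream before closing it.
class DrainSketch {

    /**
     * Reads and discards up to {@code maxBytes} bytes.
     * Returns true if the end of the stream was reached, false if data is still left.
     * Note: when the budget runs out, one extra byte is read to probe for end of stream.
     */
    static boolean drain(InputStream in, long maxBytes) throws IOException {
        byte[] scratch = new byte[8192];
        long remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(scratch, 0, (int) Math.min(scratch.length, remaining));
            if (n < 0) {
                return true; // end of stream reached within the budget
            }
            remaining -= n;
        }
        return in.read() < 0; // budget exhausted: probe for end of stream
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[64]);
        in.read(new byte[16]);                  // caller reads only part of the stream
        System.out.println(drain(in, 1024));    // remaining 48 bytes fit in the budget
    }
}
```

The byte budget matters because draining a large remainder over a high-latency link may cost more than opening a new connection.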

      The proposed solution is to add support for running LoadIncrementalHFiles on multiple machines as a MapReduce job.
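
As a rough illustration of how the work could be spread across machines, the sketch below splits a list of HFile paths into a fixed number of groups in round-robin order, so that each group could be handed to one map task. This is not taken from the attached patch, which may divide the work differently; `SplitSketch`, `partition`, and the example paths are hypothetical, and no Hadoop or HBase APIs are used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: assign items to splits in round-robin order.
class SplitSketch {

    /** Distributes {@code items} over {@code numSplits} lists, item i going to split i % numSplits. */
    static <T> List<List<T>> partition(List<T> items, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive: " + numSplits);
        }
        List<List<T>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            splits.get(i % numSplits).add(items.get(i));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> hfiles = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            hfiles.add("s3://example-bucket/bulkload/cf/hfile-" + i); // made-up paths
        }
        List<List<String>> splits = partition(hfiles, 3);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + ": " + splits.get(i).size() + " files");
        }
    }
}
```

Round-robin assignment keeps the splits within one file of each other in size, which is a reasonable default when the per-file cost is dominated by request latency rather than file size.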

        Attachments

        1. HBASE_24541-1.4.0.patch
          22 kB
          Constantin-Catalin Luca

              People

              • Assignee:
                Constantin-Catalin Luca (catalin.luca)
              • Reporter:
                Constantin-Catalin Luca (catalin.luca)
              • Votes:
                1
              • Watchers:
                6
