HBase / HBASE-21286

Parallelize computeHDFSBlocksDistribution when getting splits of a HBaseSnapshot

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: snapshots
    • Labels: None

      Description

      Although this step is named computeHDFSBlocksDistribution, it runs regardless of the file system that holds the snapshot. For example, we observed a significant slowdown with a snapshot stored in S3 (~26k regions, 5 column families, 2 files per column family): getSplits takes ~40 min because of the S3 calls that list the files in order to compute the best locations.

      Parallelizing this operation can reduce the overall setup time. The thread pool size should be configurable; a good choice for the setting could be "hbase.snapshot.thread.pool.max", which is also used in RestoreSnapshotHelper.
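
      To illustrate the idea, here is a minimal sketch of running the per-region location computation on a bounded thread pool. It is not taken from the attached patches. The class ParallelLocations, the method computeAll, and the computeLocations function are hypothetical names, and the sketch uses only java.util.concurrent rather than HBase classes. In a real change the maxThreads value would presumably be read from "hbase.snapshot.thread.pool.max".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: not part of HBase or of the attached patches.
final class ParallelLocations {

  // Applies computeLocations to every region on a fixed-size pool and
  // returns the results in the same order as the input list.
  static <R, L> List<L> computeAll(List<R> regions,
                                   Function<R, L> computeLocations,
                                   int maxThreads) {
    if (regions.isEmpty()) {
      return new ArrayList<>();
    }
    // Never start more threads than there are regions, and at least one.
    int threads = Math.max(1, Math.min(maxThreads, regions.size()));
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<L>> futures = new ArrayList<>(regions.size());
      for (R region : regions) {
        Callable<L> task = () -> computeLocations.apply(region);
        futures.add(pool.submit(task));
      }
      List<L> results = new ArrayList<>(regions.size());
      for (Future<L> future : futures) {
        results.add(future.get()); // blocks until that region is done
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while computing locations", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Location computation failed", e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Stand-in for the slow per-region file listing (assumption).
    List<String> regions = Arrays.asList("region-1", "region-2", "region-3");
    List<String> locations = computeAll(regions, r -> "best-host-for-" + r, 2);
    System.out.println(locations);
  }
}
```

      Collecting the futures in submission order keeps the results aligned with the input regions, so the splits can be built exactly as in the sequential version.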

        Attachments

        1. HBASE-21286.branch-1.4.001.patch
          4 kB
          Lavinia-Stefania Sirbu
        2. HBASE-21286.branch-1.4.002.patch
          4 kB
          Lavinia-Stefania Sirbu
        3. HBASE-21286.branch-1.4.003.patch
          4 kB
          Lavinia-Stefania Sirbu

            People

            • Assignee: Lavinia-Stefania Sirbu (lavinia.sirbu)
            • Reporter: Lavinia-Stefania Sirbu (lavinia.sirbu)
            • Votes: 0
            • Watchers: 4
