HBase / HBASE-9815

Add Histogram representative of row key distribution inside a region.

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.89-fb
    • Fix Version/s: 0.89-fb
    • Component/s: HFile
    • Labels:
      None

      Description

      With histogram information, users can split a scan workload into equal-sized scans based on the sizes estimated from the histogram. This would let systems that run queries on top of HBase do cost-based optimization when scanning. The idea is to keep the histogram in the HFile trailer and populate it on compaction and flush.
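
As a rough illustration of the equal-sized split described above, the sketch below groups adjacent histogram buckets into scan ranges of similar estimated size. The bucket shape `(start_key, end_key, estimated_bytes)` and the function name `split_scan_ranges` are assumptions made for this example; they are not taken from the attached patch.

```python
from typing import List, Tuple

# Hypothetical bucket shape: (start_key, end_key, estimated_bytes).
# Buckets are assumed sorted by key, with each end_key equal to the
# next bucket's start_key.
Bucket = Tuple[str, str, int]


def split_scan_ranges(buckets: List[Bucket], num_scans: int) -> List[Tuple[str, str]]:
    """Group adjacent buckets into at most num_scans ranges of similar estimated size."""
    if not buckets or num_scans < 1:
        return []
    total = sum(size for _, _, size in buckets)
    target = total / num_scans  # ideal estimated size of one scan
    ranges: List[Tuple[str, str]] = []
    range_start = buckets[0][0]
    consumed = 0  # estimated bytes in all buckets seen so far
    for _, end, size in buckets:
        consumed += size
        # Close the current range once the running total reaches the next
        # multiple of the target, keeping one slot for the final range.
        if consumed >= target * (len(ranges) + 1) and len(ranges) < num_scans - 1:
            ranges.append((range_start, end))
            range_start = end
    # The final range takes whatever key space is left.
    if range_start != buckets[-1][1]:
        ranges.append((range_start, buckets[-1][1]))
    return ranges
```

Splits land only on bucket boundaries here, so the ranges are only as even as the histogram is fine-grained; interpolating inside a bucket would need an assumption about how keys are distributed within it.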

      The HRegionInterface can expose an API that returns the histogram for a region, generated by merging the histograms of all of the region's HFiles.
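
One way such a merge could work is the merge procedure from the Ben-Haim and Tom-Tov paper referenced in this issue: pool the bins of every per-file histogram, then repeatedly fuse the two closest bins until the bin budget is met. The sketch below assumes row keys have already been mapped to numeric values and that every bin has a positive count; the function name and the `(centroid, count)` bin layout are illustrative, not the patch's API.

```python
from typing import List, Tuple

Bin = Tuple[float, int]  # (centroid, count); counts are assumed positive


def merge_histograms(histograms: List[List[Bin]], max_bins: int) -> List[Bin]:
    """Merge several histograms into one with at most max_bins bins."""
    if max_bins < 1:
        raise ValueError("max_bins must be at least 1")
    # Pool all bins from all inputs and order them by centroid.
    bins = sorted(b for hist in histograms for b in hist)
    while len(bins) > max_bins:
        # Find the adjacent pair whose centroids are closest together.
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (p1, m1), (p2, m2) = bins[i], bins[i + 1]
        # Replace the pair with one bin at the count-weighted mean.
        bins[i:i + 2] = [((p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2)]
    return bins
```

The total count is preserved by every merge step, so the merged histogram still accounts for every key recorded by the inputs.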

      The histogram implementation is based on
      http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
      http://dl.acm.org/citation.cfm?id=1951376
      and NumericHistogram from Hive.
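
For readers unfamiliar with the first paper (Ben-Haim and Tom-Tov), its streaming update step can be sketched as follows: each new value becomes its own bin, and whenever the bin budget is exceeded the two closest bins are fused into their count-weighted mean. This is a minimal sketch under the assumption that keys are represented as numbers; the class and method names are hypothetical and are not taken from the attached patch or from Hive.

```python
import bisect
from typing import List


class StreamingHistogram:
    """Fixed-size histogram kept as sorted centroids with counts."""

    def __init__(self, max_bins: int) -> None:
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.centroids: List[float] = []  # kept sorted ascending
        self.counts: List[int] = []       # counts[i] belongs to centroids[i]

    def add(self, value: float) -> None:
        i = bisect.bisect_left(self.centroids, value)
        # An exact match only bumps the existing bin's count.
        if i < len(self.centroids) and self.centroids[i] == value:
            self.counts[i] += 1
            return
        # Otherwise the value starts a new bin of count 1.
        self.centroids.insert(i, value)
        self.counts.insert(i, 1)
        if len(self.centroids) > self.max_bins:
            self._merge_closest_pair()

    def _merge_closest_pair(self) -> None:
        c, m = self.centroids, self.counts
        # Adjacent pair with the smallest gap between centroids.
        i = min(range(len(c) - 1), key=lambda j: c[j + 1] - c[j])
        total = m[i] + m[i + 1]
        c[i] = (c[i] * m[i] + c[i + 1] * m[i + 1]) / total
        m[i] = total
        del c[i + 1]
        del m[i + 1]
```

The linear scan for the closest pair keeps the sketch short; with a small bin budget the cost per insert stays modest.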

      1. Histogram-9815.diff
        68 kB
        Manukranth Kolloju

        Activity

        Manukranth Kolloju created issue -
        Lars Hofhansl added a comment - Jesse Yates, FYI.
        Manukranth Kolloju made changes -
        Description: "The Idea is to keep this histogram information into the HFile in the trailer and populate this on compaction and/or flush." -> "The Idea is to keep this histogram information in the HFile in the trailer and populate this on compaction and flush." The rest of the description was unchanged.
        Manukranth Kolloju made changes -
        Status: Open [ 1 ] -> In Progress [ 3 ]
        Manukranth Kolloju added a comment - Attaching the implementation based on the above paper.
        Manukranth Kolloju made changes -
        Attachment: Histogram-9815.diff [ 12623053 ]
        Elliott Clark added a comment - 0.89-fb is no longer being actively maintained. If issues persist, open an issue against the current master or stable versions.
        Elliott Clark made changes -
        Status: In Progress [ 3 ] -> Resolved [ 5 ]
        Assignee: Manukranth Kolloju [ manukranthk ] -> Elliott Clark [ eclark ]
        Resolution: Not A Problem [ 8 ]
        Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date
        Open -> In Progress | 85d 3h 38m | 1 | Manukranth Kolloju | 15/Jan/14 03:46
        In Progress -> Resolved | 446d 16h 5m | 1 | Elliott Clark | 06/Apr/15 20:52

          People

          • Assignee: Elliott Clark
          • Reporter: Manukranth Kolloju
          • Votes: 0
          • Watchers: 12
