Pig / PIG-2934

HBaseStorage filter optimizations

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
    • Release Note:
      HBaseStorage filter performance improvements

      Description

      Our HBase pal/guru Gary Helmling was kind enough to do a code review of HBaseStorage. He suggested some good filter optimizations:

      • when using the "lt*" and "gt*" options, set the start/stop rows on the Scan instance, at least in addition to the RowFilters. Without this you're doing a full table scan, regardless of the RowFilters.
      • when selecting specific columns or entire families to return, it would be more efficient to set the family + columns on the Scan object (addFamily(), addColumn()), instead of using a FilterList. I'm not familiar with the family:prefix handling you mention, but that would still seem to require filters. But if that's not being used, it would be better to avoid the FilterList for columns. At minimum, we should probably call Scan.addFamily() with the distinct families, so we can skip entire column families that are not being used. In the case of a table with 4 CFs, if, say, only 1 is being used, this could be a big gain.
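
Both suggestions can be illustrated without an HBase cluster. The sketch below is a toy model in plain Java, not HBaseStorage code or the attached patch: a sorted TreeMap stands in for a table, and the names ScanSketch, filterOnly, bounded and distinctFamilies are hypothetical. It assumes column descriptors written as family:qualifier, and it compares row keys as Java strings, whereas HBase compares raw bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: a sorted in-memory map stands in for an HBase table.
// None of these names come from HBaseStorage or the PIG-2934 patch.
class ScanSketch {

    // Number of rows the most recent scan had to look at.
    static int rowsExamined = 0;

    // Filter-only approach: every row is visited and tested against the
    // bounds, like a full table scan with a row filter on each row.
    static List<String> filterOnly(NavigableMap<String, String> table, String gte, String lt) {
        rowsExamined = 0;
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            rowsExamined++;
            String key = row.getKey();
            if (key.compareTo(gte) >= 0 && key.compareTo(lt) < 0) {
                matches.add(key);
            }
        }
        return matches;
    }

    // Bounded approach: seek to the start row and stop at the stop row,
    // analogous to setting start/stop rows on the scan itself.
    static List<String> bounded(NavigableMap<String, String> table, String gte, String lt) {
        rowsExamined = 0;
        List<String> matches = new ArrayList<>();
        for (String key : table.subMap(gte, true, lt, false).keySet()) {
            rowsExamined++;
            matches.add(key);
        }
        return matches;
    }

    // Collect the distinct families from "family:qualifier" descriptors
    // (assumed format), so that unused families can be skipped entirely.
    static Set<String> distinctFamilies(List<String> columns) {
        Set<String> families = new LinkedHashSet<>();
        for (String column : columns) {
            int colon = column.indexOf(':');
            families.add(colon < 0 ? column : column.substring(0, colon));
        }
        return families;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put(String.format("row%03d", i), "value");
        }
        filterOnly(table, "row100", "row110");
        System.out.println("filter only: " + rowsExamined + " rows examined"); // 1000
        bounded(table, "row100", "row110");
        System.out.println("bounded: " + rowsExamined + " rows examined"); // 10
        System.out.println(distinctFamilies(Arrays.asList("cf1:a", "cf1:b", "cf2:c"))); // [cf1, cf2]
    }
}
```

In HBaseStorage terms, the bounded variant corresponds to setting the start/stop rows on the Scan, and each family returned by distinctFamilies is what would be passed to Scan.addFamily().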

        Activity

        Bill Graham created issue -
        Christoph Bauer added a comment -

        I'm starting on a patch for HBaseStorage here at my company.

        Regarding your first issue, you're totally right. It seems odd that it was implemented with filters at all.
        The second issue is different: in HBaseStorage.setLocation, those families are added to the Scan object. I don't understand why it's done there, though.

        Bill Graham added a comment -

        Initialization often happens in the setLocation method, since that's the first point at which the class has a conf object on the cluster. It could be moved elsewhere in the initialization process, though, if that works.

        Bill Graham made changes -
        Field     Original Value     New Value
        Status    Open [ 1 ]         In Progress [ 3 ]
        Bill Graham added a comment -

        Attaching patch that reduces the number of filters used and improves how range scans are done.

        Bill Graham made changes -
        Attachment PIG-2934.1.patch [ 12553619 ]
        Bill Graham added a comment -

        We uncovered a significant performance problem with HBaseStorage in Pig releases after 0.9 when it is used with a long list of columns on a tall table. The previous use of filters hits HBase too hard and pegs the HBase cluster's CPU. We should consider including this patch in Pig 0.11.

        Bill Graham made changes -
        Status In Progress [ 3 ] Patch Available [ 10002 ]
        Bill Graham made changes -
        Affects Version/s 0.10.0 [ 12316246 ]
        Dmitriy V. Ryaboy added a comment -

        +1 thanks Bill.

        Bill Graham made changes -
        Fix Version/s 0.11 [ 12318878 ]
        Bill Graham added a comment -

        Committed to both trunk and Pig 0.11 branch

        Bill Graham made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Release Note HBaseStorage filter performance improvements
        Resolution Fixed [ 1 ]
        Bill Graham made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                        Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> In Progress               44d 3h 4m               1                 Bill Graham     10/Nov/12 01:02
        In Progress -> Patch Available    5d 4h 17m               1                 Bill Graham     15/Nov/12 05:20
        Patch Available -> Resolved       5d 2h 26m               1                 Bill Graham     20/Nov/12 07:47
        Resolved -> Closed                93d 21h 6m              1                 Bill Graham     22/Feb/13 04:53

          People

          • Assignee:
            Bill Graham
            Reporter:
            Bill Graham
          • Votes:
            0
            Watchers:
            7
