Pig / PIG-1870

HBaseStorage doesn't project correctly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1
    • Fix Version/s: 0.8.1, 0.9.0
    • Component/s: None
    • Labels:
      None

      Description

      Projecting columns after LOAD via HBaseStorage produces unexpected results. This is related to the loadKey functionality and how the pushProjection method in HBaseStorage has to offset to build a column list that aligns with the tuple (the column list doesn't contain the row key).

      This shift appears to create an inconsistency with the FieldSchema for the tuple which results in the wrong tuple value being fetched for a given column. I'll attach a patch with unit tests that illustrate the problem.
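
      The offset described above can be sketched as follows. This is a simplified illustration, not the HBaseStorage source: the class name, the method projectColumns, and the column names are hypothetical. It assumes that when loadKey is on, tuple field 0 is the row key, so tuple field i lines up with entry i - 1 of the column list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProjectionOffsetSketch {

    /**
     * Maps required tuple field indexes onto the column list.
     * Assumption: with loadKey enabled, tuple field 0 is the row key,
     * which has no entry in the column list, so every other field is
     * shifted by one relative to the columns.
     */
    static List<String> projectColumns(List<String> columns,
                                       List<Integer> requiredFields,
                                       boolean loadKey) {
        int offset = loadKey ? 1 : 0;
        List<String> projected = new ArrayList<>();
        for (int fieldIndex : requiredFields) {
            int columnIndex = fieldIndex - offset;
            if (columnIndex >= 0) {
                // A negative index means the field is the row key; skip it.
                projected.add(columns.get(columnIndex));
            }
        }
        return projected;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("cf:a", "cf:b", "cf:c");
        // Tuple layout with loadKey: (rowkey, cf:a, cf:b, cf:c)
        System.out.println(projectColumns(columns, Arrays.asList(0, 2), true));
    }
}
```

      If the schema handed back to Pig is not shifted the same way as the column list, a field name can end up paired with a neighbouring column's value, which is the kind of mismatch the description reports.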

      1. PIG_1870.patch
        26 kB
        Dmitriy V. Ryaboy
      2. PIG_1870.4.patch
        27 kB
        Dmitriy V. Ryaboy
      3. PIG_1870.3.patch
        26 kB
        Dmitriy V. Ryaboy
      4. PIG_1870.2.patch
        26 kB
        Dmitriy V. Ryaboy
      5. PIG_1870_for0.8.patch
        27 kB
        Dmitriy V. Ryaboy
      6. PIG_1870_for0.8.final.patch
        26 kB
        Dmitriy V. Ryaboy
      7. PIG_1870_for0.8.2.patch
        27 kB
        Dmitriy V. Ryaboy
      8. PIG_1870_1.patch
        10 kB
        Bill Graham

        Activity

        Bill Graham added a comment -

        Here's a patch with 5 new projection tests added to TestHBaseStorage, of which 4 fail. It's intended to be applied over PIG_1680, and it was generated from the git branch shown here, FYI:

        https://github.com/billonahill/pig/commit/55561c23f209ca2f27e13ddd93146d7b2a2492e3

        Dmitriy V. Ryaboy added a comment -

        The problem isn't loadKey, it's that we set up a (static) TableInputFormat.SCAN in setLocation, which gets called multiple times, and not always after pushProjection; we wind up overwriting SCAN and "forgetting" we are pushing things. I am working on a patch.
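
        The failure mode described in this comment can be modelled in a few lines. This is a toy sketch under stated assumptions, not the HBaseStorage code and not necessarily what the attached patches do: the loader classes, the SCAN_KEY constant, and the plain Map standing in for the job configuration are all hypothetical, and the real setLocation takes different arguments. NaiveLoader rewrites the stored scan from the full column list on every setLocation call, so a projection pushed earlier is lost. RememberingLoader shows one possible remedy: keep the projection in the loader and reapply it each time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScanOverwriteSketch {
    // Hypothetical key standing in for the serialized scan in the job config.
    static final String SCAN_KEY = "scan";

    /** Rebuilds the scan from all columns on every setLocation call. */
    static class NaiveLoader {
        final List<String> allColumns;
        NaiveLoader(List<String> allColumns) { this.allColumns = allColumns; }

        void setLocation(Map<String, String> conf) {
            conf.put(SCAN_KEY, String.join(",", allColumns));
        }

        void pushProjection(Map<String, String> conf, List<String> required) {
            // Only the stored scan is narrowed; the loader itself forgets.
            conf.put(SCAN_KEY, String.join(",", required));
        }
    }

    /** Remembers the projection so that setLocation can reapply it. */
    static class RememberingLoader {
        final List<String> allColumns;
        List<String> projected; // null until a projection is pushed
        RememberingLoader(List<String> allColumns) { this.allColumns = allColumns; }

        void setLocation(Map<String, String> conf) {
            List<String> cols = (projected != null) ? projected : allColumns;
            conf.put(SCAN_KEY, String.join(",", cols));
        }

        void pushProjection(List<String> required) { projected = required; }
    }

    static String runNaive() {
        Map<String, String> conf = new HashMap<>();
        NaiveLoader loader = new NaiveLoader(Arrays.asList("cf:a", "cf:b", "cf:c"));
        loader.setLocation(conf);
        loader.pushProjection(conf, Arrays.asList("cf:b"));
        loader.setLocation(conf); // called again after the projection
        return conf.get(SCAN_KEY);
    }

    static String runRemembering() {
        Map<String, String> conf = new HashMap<>();
        RememberingLoader loader = new RememberingLoader(Arrays.asList("cf:a", "cf:b", "cf:c"));
        loader.setLocation(conf);
        loader.pushProjection(Arrays.asList("cf:b"));
        loader.setLocation(conf);
        return conf.get(SCAN_KEY);
    }

    public static void main(String[] args) {
        System.out.println(runNaive());       // cf:a,cf:b,cf:c (projection lost)
        System.out.println(runRemembering()); // cf:b
    }
}
```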

        Dmitriy V. Ryaboy added a comment -

        Attached patch for 0.8

        Dmitriy V. Ryaboy added a comment -

        Attaching patch for trunk.

        Dmitriy V. Ryaboy added a comment -

        This is ready for review.

        Bill Graham added a comment -

        Dmitriy, I was able to build with the new patches and the TestHBaseStorage test suite ran successfully with both trunk and 0.8.0. I'm getting failures when trying to run an HBase job against a distributed cluster though (version 0.90.0). This is similar to the issue I ran into in PIG-1782 that caused me to mess with how configs were initialized at one point.

        These are the only values that I've overridden in $HBASE_CONF_DIR/hbase-site.xml:

        hbase.rootdir
        hbase.cluster.distributed
        hbase.tmp.dir
        hbase.zookeeper.quorum
        hbase.zookeeper.property.dataDir
        

        And this is the error I get trying to run any Pig job against HBase:

        2011-04-11 13:36:58,659 [main] ERROR org.apache.hadoop.hbase.zookeeper.ZKConfig - no clientPort found in zoo.cfg
        2011-04-11 13:36:58,665 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
        

        I don't get this error running the PIG-1782 patch.

        Dmitriy V. Ryaboy added a comment -

        Thanks, Bill.
        Attaching a patch that fixes the config confusion.

        Dmitriy V. Ryaboy added a comment -

        Forgot to click the Apache license button – I hereby submit this patch under the Apache license, etc etc etc.

        Bill Graham added a comment -

        Dmitriy, can you check that latest patch? PIG_1870.patch and PIG_1870.2.patch are the same.

        Dmitriy V. Ryaboy added a comment -

        whoops

        Dmitriy V. Ryaboy added a comment -

        patch for 0.8

        Dmitriy V. Ryaboy added a comment -

        Patch for trunk

        Bill Graham added a comment -

        Either someone ran alias diff='echo -n ""' on my machine on April 1st, or all these patches are still the same. I think it's the latter.

        Dmitriy V. Ryaboy added a comment -

        Are not.

        Bill Graham added a comment -

        Ok, my bad (embarrassed). Lesson learned: don't wget a patch from JIRA, then manually change the patch name to get another. You'll get something totally unexpected.

        Verified that trunk and 0.8.0 branch tests pass for PIG_1870.3.patch and PIG_1870_for0.8.2.patch, respectively. Also verified ad-hoc Pig HBase jobs with projections now work against a cluster for both.

        Dmitriy V. Ryaboy added a comment -

        Massively sped up tests for HBaseStorage by using local mode instead of mapreduce. About 10 secs / test now on my laptop (down from 50).

        Daniel Dai added a comment -

        Hi, Dmitriy, do you still plan to commit this patch to 0.8?

        Dmitriy V. Ryaboy added a comment -

        Daniel, yeah, it's ready to go – just waiting on another committer to +1 it.

        Daniel Dai added a comment -

        +1. Please commit if the tests pass.

        Dmitriy V. Ryaboy added a comment -

        Committed to 0.8 branch and trunk.

        Harsh J added a comment -

        Could someone please add the appropriate 0.8.x fix version here? Or if that's done when a release is tagged, np. Just thought it might help those following tickets here.

        Dmitriy V. Ryaboy added a comment -

        It's in 0.8.1; I updated the fix version.


          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Bill Graham