HIVE-1803

Implement bitmap indexing in Hive

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: Indexing
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Implement bitmap index handler to complement compact indexing.
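
      For orientation, a minimal HiveQL sketch of the kind of end-user syntax this feature targets (the index and table names here are hypothetical, not from the patch):

          CREATE INDEX people_country_idx
          ON TABLE people (country)
          AS 'BITMAP'
          WITH DEFERRED REBUILD;

          ALTER INDEX people_country_idx ON people REBUILD;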

      Attachments

      1. bitmap_index_1.png
        11 kB
        Marquis Wang
      2. bitmap_index_2.png
        12 kB
        Marquis Wang
      3. HIVE-1803.1.patch
        51 kB
        Skye Berghel
      4. HIVE-1803.2.patch
        66 kB
        Marquis Wang
      5. HIVE-1803.3.patch
        503 kB
        Marquis Wang
      6. javaewah.jar
        20 kB
        Skye Berghel
      7. HIVE-1803.4.patch
        629 kB
        Skye Berghel
      8. javaewah.jar
        20 kB
        Skye Berghel
      9. HIVE-1803.5.patch
        795 kB
        Marquis Wang
      10. HIVE-1803.6.patch
        799 kB
        Marquis Wang
      11. JavaEWAH_20110304.zip
        15 kB
        John Sichi
      12. HIVE-1803.7.patch
        819 kB
        Marquis Wang
      13. HIVE-1803.8.patch
        816 kB
        Marquis Wang
      14. HIVE-1803.9.patch
        923 kB
        Marquis Wang
      15. HIVE-1803.10.patch
        924 kB
        Marquis Wang
      16. HIVE-1803.11.patch
        924 kB
        Marquis Wang
      17. unit-tests.patch
        2.08 MB
        Marquis Wang
      18. unit-tests.2.patch
        7.30 MB
        Marquis Wang
      19. unit-tests.3.patch
        7.30 MB
        Marquis Wang
      20. HIVE-1803.12.patch
        1.06 MB
        Marquis Wang
      21. HIVE-1803.13.patch
        928 kB
        Marquis Wang
      22. HIVE-1803.14.patch
        928 kB
        Marquis Wang
      23. HIVE-1803.14.patch
        928 kB
        Marquis Wang
      24. HIVE-1803.15.patch
        1.11 MB
        Marquis Wang
      25. HIVE-1803.15.patch
        1.11 MB
        Marquis Wang

        Issue Links

          Activity

          Marquis Wang added a comment -

          Added a proposed design document on Hive wiki at http://wiki.apache.org/hadoop/Hive/IndexDev/Bitmap

          Skye Berghel added a comment -

          Marquis and I have prepared a preliminary patch. BitmapCollectSet is currently not working---does anyone have any pointers?

          John Sichi added a comment -

          Could you describe the problem you're seeing and provide an example of how to reproduce it?

          Marquis Wang added a comment -

          We fixed the problem in BitmapCollectSet by looking at the PercentileApprox UDAF to figure out how to use an array as input to a UDAF.

          This new patch is a working implementation of bitmap indexing. The new test index_bitmap.q shows how to use the index. However, I am unable to add the test itself, and get errors when I run

          ant test -Dtestcase=TestCliDriver -Dqfile=index_bitmap.q -Doverwrite=true -Dtest.silent=false

          It says

          Exception: java.lang.RuntimeException: The table default_srcpart_srcpart_index_proj_ is an index table. Please do drop index instead.

          with regard to the ALTER INDEX REBUILD line in the test.

          We're not sure whether we're setting up the new test correctly and would appreciate any help.

          While we're working to get around that, we're also going to go ahead and work on a compressed bitmap, since this implementation does no compression.

          John Sichi added a comment -

          I tried it, and I did get that error, but not in the REBUILD, which worked fine. So you may be misreading the output.

          The error I'm getting is

          FAILED: Error in semantic analysis: line 2:65 Expression Not In Group By Key `_bucketname`
          

          This error is correct since the statement below has `_bucketname` in the SELECT list but not in the GROUP BY.

          INSERT OVERWRITE DIRECTORY "/tmp/index_test_index_result" SELECT `_bucketname`, COLLECT_BITMAP_SET(`_offset`, `_bitmaps`) as `_offsets` FROM default__srcpart_srcpart_index_proj__ x WHERE x.key=100 AND x.ds = '2008-04-08' GROUP BY x.key, x.ds;
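
          For reference, one way to make the statement valid (a sketch only; not necessarily the exact fix adopted in the patch) is to add `_bucketname` to the GROUP BY:

          INSERT OVERWRITE DIRECTORY "/tmp/index_test_index_result" SELECT `_bucketname`, COLLECT_BITMAP_SET(`_offset`, `_bitmaps`) as `_offsets` FROM default__srcpart_srcpart_index_proj__ x WHERE x.key=100 AND x.ds = '2008-04-08' GROUP BY x.key, x.ds, `_bucketname`;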
          

          As a result, the test bails out in the middle (without doing any of its DROP INDEX cleanup). After that, what happens is that the test framework tries to clean up by dropping all existing tables, but it isn't smart enough to know that it shouldn't try to drop index tables directly. (We could use a separate patch for that.)

              [junit] org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: The table default__srcpart_srcpart_index_proj__ is an index table. Please do drop index instead.
              [junit] 	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:739)
              [junit] 	at org.apache.hadoop.hive.ql.metadata.Hive.dropTable(Hive.java:716)
              [junit] 	at org.apache.hadoop.hive.ql.QTestUtil.clearTestSideEffects(QTestUtil.java:333)
              [junit] 	at org.apache.hadoop.hive.cli.TestCliDriver.setUp(TestCliDriver.java:55)
              [junit] 	at junit.framework.TestCase.runBare(TestCase.java:125)
          ...
          
          John Sichi added a comment -

          Some review comments:

          • Need to factor out all the code duplicated from the compact index handler; share it in package org.apache.hadoop.hive.ql.index. Use abstract classes in cases where behavior needs to be overridden; otherwise, just share concrete classes there.
          • If we're going to publish the new UDFs as visible out of the box (not just internal to the index implementation), then they need unit tests of their own, as well as some documentation about the representation on which they operate (maybe best to wait and see how compression shakes out first). Also, the ones that turn out not to be generally applicable need to be named more specifically.
          • For dense bitmaps, I think you can probably use java.util.BitSet instead of rolling so much of your own (at least for ones where you have control over the bit array representation)
          • The name attribute in the annotation for GenericUDAFCollectBitmapSet is incorrect.
          • In HiveIndex.java, the symbol should be just BITMAP_TABLE (not BITMAP_SUMMARY_TABLE) since the bitmap is actually quite detailed.
          Marquis Wang added a comment -

          New patch that includes tests and fixes problems John mentioned.

          Bitmap UDFs are "hidden" by giving them leading underscores, since for now we still need to expose them in order to use the index. When we get automatic usage working, we can hide them more effectively.

          Jeff Hammerbacher added a comment -

          Hey,

          I came across a Daniel Lemire project recently that may be of use here: http://code.google.com/p/javaewah.

          Later,
          Jeff

          Marquis Wang added a comment -

          Thanks Jeff. We've actually seen this and have a patch in the works (next couple days) that uses it.

          Skye Berghel added a comment -

          javaewah.jar file (a dependency for the compressed bitmap)

          Skye Berghel added a comment -

          A patch that uses a compressed bitmap to save space without losing much computing time for AND and OR operations.

          Skye Berghel added a comment -

          Attaching the javaewah.jar file that will go in lib.

          John Sichi added a comment -

          Review comments in https://reviews.apache.org/r/466/

          Marquis Wang added a comment -

          New patch that fixes issues raised. I added some comments/questions to the ReviewBoard.

          Are there one or more tables that are already used for testing that have two columns that I can use for testing bitmap_and and bitmap_or?

          John Sichi added a comment -

          Table src has two columns (key and value). The value is equivalent to the key. For srcbucket, the value is one plus the key, so at least they're not exactly the same.

          Marquis Wang added a comment -

          Another patch where the bitmap_and and bitmap_or tests use two columns instead of faking it with two indexes on one column.

          John Sichi added a comment -

          Uploading a .zip of the source for reference.

          John Sichi added a comment -

          New review board entry (I wasn't able to update the old one with the new patch):

          https://reviews.apache.org/r/481/

          Marquis Wang added a comment -

          New patch which I believe takes care of all the issues in the review for patch 6.

          He Yongqiang added a comment -

          I did not take a detailed look; overall the patch looks good. It follows the steps of the existing compact index.

          But can you do some tests on the javaewah stuff to see its performance? And which features of javaewah are used by this JIRA? For compression, if we just compressed the bitmap using simple RLE, how much difference would there be? What I mean (though I am not sure) is that javaewah may have a lot of features we do not need, but that add some overhead to the code.

          He Yongqiang added a comment -

          I had an offline discussion with Namit on this JIRA.

          The basic question is how to use this bitmap indexing. Given that there are millions of rows in one block, the block will contain all of the distinct values the column has, so the bitmap index by itself will not be very useful. A possible use case is a bitmap AND/OR, e.g. finding all records for Male in Japan, where Male and Japan are both bitmap indexed. What we can do today is first do a JOIN plus a BITMAP AND operation on the two index tables and then find all the matching blocks (see the sketch below), which is OK, but it requires a join operation. If we could support a bitmap index with more than one index column, it would help in this case. I mean that each index column in the index table would have its own bitmap, e.g. FILE_NAME, BLK_OFFSET, GENDER, bitmapForGender, COUNTRY, bitmapForCountry. bitmapForGender would have two bitmaps internally, one for Male and one for Female, and bitmapForCountry would have a bitmap for each country.

          And if Hive could support skipping rows, the bitmap index would be very useful. I mean that with bitmap indexing, block pruning may not be good enough. For example, if in a block only row1, row3, and lastRow satisfy the predicate, we could skip row2 and row4 through lastRow-1.

          What do you think?
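
          For concreteness, a rough HiveQL sketch of the JOIN-plus-bitmap-AND approach described above. The table, column, and index-table names are hypothetical; the `_bucketname`/`_offset`/`_bitmaps` layout follows the index table shown earlier in this thread, and the function name assumes the EWAHBitmapAnd UDF from this patch is registered as EWAH_BITMAP_AND. Blocks whose combined bitmap is non-empty are the ones worth scanning:

              SELECT a.`_bucketname`, a.`_offset`,
                     EWAH_BITMAP_AND(a.`_bitmaps`, b.`_bitmaps`) AS combined
              FROM default__people_gender_idx__ a
              JOIN default__people_country_idx__ b
                ON a.`_bucketname` = b.`_bucketname` AND a.`_offset` = b.`_offset`
              WHERE a.gender = 'Male' AND b.country = 'Japan';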

          John Sichi added a comment -

          Right, without row-level skipping, the main use case is AND/OR for block filtering.

          I'd suggest we get this committed without row-level skipping, and then create a followup for that. Besides AND/OR, having the bitmap index build/access code committed will be useful for others working on related issues such as automatic usage in the WHERE clause.

          Marquis Wang added a comment -

          New patch with minimal changes (got rid of some unused imports)

          Marquis Wang added a comment -

          John, I'm resubmitting the patch for inclusion and opened a new ticket for creating row-level indexing.

          John Sichi added a comment -

          Latest review comments are in:

          https://reviews.apache.org/r/530/

          Marquis Wang added a comment -

          Uploaded new patch that addresses John's comments on patch 8.

          Marquis Wang added a comment -

          Updated patch to include more missing javadocs.

          John Sichi added a comment -

          When I run ant test, I get a ton of failures with internal column names being different due to the introduction of the new ROWOFFSET virtual column. The same thing happened to Yongqiang with HIVE-417.

          So either you'll need to figure out a way to hide the number adjustment, or you'll have to submit one more (giant) patch with all of the testlog updates (which you can obtain by running ant test -Doverwrite=true and then verifying that all of the changes are expected).

          Example from auto_join1.q:

              [junit] 67c67
              [junit] <               outputColumnNames: _col0, _col6
              [junit] ---
              [junit] >               outputColumnNames: _col0, _col5
              [junit] 73c73
              [junit] <                       expr: _col6
              [junit] ---
              [junit] >                       expr: _col5
              [junit] 143c143
              [junit] <               outputColumnNames: _col0, _col6
              [junit] ---
              [junit] >               outputColumnNames: _col0, _col5
          

          Everything else in the patch was fine, except for a few nits:

          • Javadoc for GenericUDFEWAHBitmapOr/And has the old class name.
          • For BitmapAnd Javadoc, some spaces are missing in front of the stars.
          Marquis Wang added a comment -

          New patch that fixes the minor javadocs comments from patch 10.

          A unit-tests patch that updates all the unit tests that were affected by the virtual column change.

          John Sichi added a comment -

          I'm getting test failures still.

          • TestMinimrCliDriver:join1
          • TestMTQueries:testMTQueries1
          • TestParse: 44/45 tests failed

          These all need fixes before commit.

          Marquis Wang added a comment -

          New unit tests patch that should fix some more tests.

          John, I didn't see any failures in TestMTQueries even before adding this new patch. I'm not sure why that would be, but I definitely fixed some things in the other two tests.

          Also this patch only includes the unit tests, so you will need to include patch 11 as well.

          John Sichi added a comment -

          OK, maybe the TestMTQueries failure was a side-effect of the other failures...I'll retry with your latest patch.

          John Sichi added a comment -

          Some other stuff got committed in between which is causing conflicts when I try patch -p0 < unit-tests.2.patch

          Marquis Wang added a comment -

          I re-pulled from trunk and made a new patch and there was no difference between the two. If you have the original unit-tests.patch applied then this patch will fail. Can you try patching HIVE-1803.11.patch followed by unit-tests.2.patch on a clean checkout?

          John Sichi added a comment -

          That's what I did, and the conflicts match files which were in very recent commits.

          Are you sure you did svn update? If you're using git, there may be some lag in the replica.

          Marquis Wang added a comment -

          New patch for unit tests that hopefully shouldn't conflict this time.

          I looked into changing the code so that the outputColumnNames in explains are not affected by virtual columns, but didn't really get anywhere. Besides, wouldn't I have the same problem with commits since the unit tests were changed for the first two virtual columns added?

          I figured I'd go ahead and submit this patch again; if you think I should keep looking into that, you can hold off on accepting it.

          John Sichi added a comment -

          OK, let's see if this one can pass tests before the next conflict gets committed

          John Sichi added a comment -

          I'm already seeing at least one test failure: auto_join28.q, which recently got its log updated (Friday), so you probably rebased before that. It's really difficult to get something like this through without locking down svn for a few days...

          You have a point about the unit tests; we would have to special-case the new column and leave the existing ones as is. Ugly any way you slice it.

          John Sichi added a comment -

          OK, here's the simplest approach I can come up with given the way internal column names currently work:

          • introduce yet another config parameter hive.exec.rowoffset=true/false (default false)
          • only support the new virtual column for queries where this is true (get rid of static field VirtualColumn.registry and replace it with a getRegistry method which takes in a conf)
          • set this in the conf used for compiling the internal statement (but don't modify the top-level conf)

          (Ideally, we wouldn't need the config param; we would do this automatically based on the presence of a reference to the VC in the query. Also ideally, at execution time we would avoid any counter-increment overhead when this is false.)
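
          As a usage sketch of the proposal (assuming the proposed parameter name, and assuming the new virtual column follows the naming pattern of the existing INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE columns; the column name shown is hypothetical), a query that references the new virtual column would first enable it:

              set hive.exec.rowoffset=true;
              SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, ROW__OFFSET__INSIDE__BLOCK, key
              FROM src WHERE key = 100;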

          Marquis Wang added a comment -

          New patch that implements John's suggestions about adding the hive.exec.rowoffset configuration variable.

          This patch fixes the issues with column numbers in explains. John, I'm still seeing some test failures in tests such as combine2.q, bucketmapjoin1.q, and bucketmapjoin4.q. It looks like one of the numRows outputs in an explain in each of these tests now says zero rows instead of the non-zero number it showed before. I'm not really sure what could be causing this and don't see anything in this patch that could affect these tests. Do you have any ideas?

          John Sichi added a comment -

          I noticed that in one of my attempts with your previous patches. I think it's the exact same problem you were hitting very early on in the clinic in the Hudson setup, having to do with the stats temp database not getting deleted?

          Marquis Wang added a comment -

          I don't see anything that needs to be deleted in my checkout. Where is the stats temp database? Also, if you think it might just be something on our side, can you just run the tests and see if they pass for you? When I ran them, I don't think I saw any other issues besides those.

          John Sichi added a comment -

          Yeah, I can give it a try and see if it passes.

          John Sichi added a comment -

          Meh, I'm still getting numRows failures myself. I noticed that your patch includes some changes to existing test outputs (e.g. bucketmapjoin1.q.out) where it is setting the expected numRows to 0; you should have reverted those before generating the patch. But the failure I got was in another existing test (filter_join_breaktask). I'm trying again after reverting the ones you changed (in case the failure I saw was a side effect), but I'm pessimistic; I'm wondering if something innocuous about the change is somehow exposing some existing non-determinism.

          John Sichi added a comment -

          OK, I dug into it and found out that it was a problem with HADOOP_CLASSPATH preventing derby.jar getting loaded (so stats couldn't be written from Hadoop tasks, hence numRows=0).

          The existing HADOOP_CLASSPATH was already incorrect, but the problem was only exposed by the addition of the javaewah-0.2.jar. It was using commas for separators instead of colons (and it should not have been using file: at all!).

          Here's the correct format with which I was able to pass a few failing tests I tried individually:

                <env key="HADOOP_CLASSPATH" value="${test.src.data.dir}/conf:${build.dir.hive}/dist/lib/derby.jar:${build.dir.hive}/dist/lib/javaewah-0.2.jar"/>
          

          Can you give me another patch which fixes this and omits all .q.out updates for existing tests unless they need it? Fingers crossed that will be the last one.

          Marquis Wang added a comment -

          New patch that updates HADOOP_CLASSPATH and doesn't change tests except adding new tests and show_functions.q. Fingers crossed for this one passing. I'm optimistic.

          John Sichi added a comment -

          Everything passed...eeexcept TestParse, where I got 44 failures. Example diff below; somehow the output seems to have gotten reordered, and it's in the area of virtual columns. Best is to find out why and fix it, otherwise make sure it's deterministic and then update .q.out.

              [junit] diff -b -I'\(\(<java version=".*" class="java.beans.XMLDecoder">\)\|\(<string>.*/tmp/.*</string>\)\|\(<string>file:.*</string>\)\|\(<string>pfile:.*</string>\)\|\(<string>[0-9]\{10\}</string>\)\|\(<string>/.*/warehouse/.*</string>\)\)' /data/users/jsichi/open/test-trunk/build/ql/test/logs/positive/case_sensitivity.q.xml /data/users/jsichi/open/test-trunk/ql/src/test/results/compiler/plan/case_sensitivity.q.xml
              [junit] 1209c1209
              [junit] <                    <string>INPUT__FILE__NAME</string> 
              [junit] ---
              [junit] >                    <string>BLOCK__OFFSET__INSIDE__FILE</string> 
              [junit] 1215c1215,1219
              [junit] <                    <object idref="PrimitiveTypeInfo0"/> 
              [junit] ---
              [junit] >                    <object class="org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo"> 
              [junit] >                     <void property="typeName"> 
              [junit] >                      <string>bigint</string> 
              [junit] >                     </void> 
              [junit] >                    </object> 
              [junit] 1225c1229
              [junit] <                    <string>BLOCK__OFFSET__INSIDE__FILE</string> 
              [junit] ---
              [junit] >                    <string>INPUT__FILE__NAME</string> 
              [junit] 1231,1235c1235
              [junit] <                    <object class="org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo"> 
              [junit] <                     <void property="typeName"> 
              [junit] <                      <string>bigint</string> 
          
          Marquis Wang added a comment -

          The issue with the last patch was the order in which VirtualColumn.getRegistry().iterator() returned the columns. The old code stored the virtual column registry in a HashMap, so I added the columns to the registry in the order the HashMap would have returned them.

          This patch fixes that. I'm still seeing errors in groupby1.q through groupby6.q. It looks like various numbers are coming back wrong, but it doesn't appear to be related to the virtual columns. I can't really tell whether there is a pattern to it.

          Can you take a look?

          
          

          [junit] > <string>CNTR_NAME_GBY_28_NUM_INPUT_ROWS</string>
          [junit] 1345c1341
          [junit] < <string>CNTR_NAME_GBY_4_NUM_OUTPUT_ROWS</string>
          [junit] ---
          [junit] > <string>CNTR_NAME_GBY_28_NUM_OUTPUT_ROWS</string>
          [junit] 1348c1344
          [junit] < <string>CNTR_NAME_GBY_4_TIME_TAKEN</string>
          [junit] ---
          [junit] > <string>CNTR_NAME_GBY_28_TIME_TAKEN</string>
          [junit] 1351c1347
          [junit] < <string>CNTR_NAME_GBY_4_FATAL_ERROR</string>
          [junit] ---
          [junit] > <string>CNTR_NAME_GBY_28_FATAL_ERROR</string>

          John Sichi added a comment -

          The number comes from the operator ID; I'm not sure why those would have shifted, but if it's consistent, then give me a patch that just updates the .q.out files and if that passes for me, we'll call it a day.

          Marquis Wang added a comment -

          New patch that updates the groupby tests in TestParse.

          The number from the operator ID was not consistent; it gives different results when I run just one test at a time vs. all the tests at once, which is why I thought they needed to be updated. The results as they were before still work for those tests.

          Another thing needed to be changed for me though, for the groupby tests:

          @@ -521,7 +521,8 @@
                                  <string>sum</string> 
                                 </void> 
                                 <void property="mode"> 
          -                       <object class="org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$Mode" method="valueOf"> 
          +                       <object class="java.lang.Enum" method="valueOf"> 
          +                        <class>org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator$Mode</class> 
                                   <string>PARTIAL1</string> 
                                  </object> 
                                 </void>
          

          The new patch updates those tests.

          John Sichi added a comment -

          With patch 15, everything passed for me except the groupby* tests, so I think it's something about the Java version you're using (unrelated to your change). When I revert those .q.xml files, TestParse passes. I'm going to re-run the full ant test with those reverted, and assuming that passes, +1.

          John Sichi added a comment -

          Committed. Thanks Marquis!

          siddharth ramanan added a comment -

          Hi, I am trying to create an index on a string column in a table, and I am getting a NumberFormatException with respect to _offsets in the index table when I run the query. Can you give me any suggestions on how to do this?

          Thanks,
          Siddharth

          John Sichi added a comment -

          @Siddharth: please open a new JIRA issue, including exact steps to reproduce your problem.


            People

            • Assignee:
              Marquis Wang
              Reporter:
              Marquis Wang
            • Votes:
              0
              Watchers:
              15

              Dates

              • Created:
                Updated:
                Resolved:
