Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 0.11.0
    • Component/s: Query Processor, Types
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Add support for the DECIMAL data type. HIVE-2272 (TIMESTAMP) provides a nice template for how to do this.

      Attachments

      1. HIVE-2693-take4.patch
        143 kB
        Josh Wills
      2. HIVE-2693-take3.patch
        128 kB
        Josh Wills
      3. HIVE-2693-fix.patch
        103 kB
        Josh Wills
      4. HIVE-2693-all.patch
        102 kB
        Josh Wills
      5. HIVE-2693-23.patch
        273 kB
        Gunther Hagleitner
      6. HIVE-2693-22.patch
        274 kB
        Mark Grover
      7. HIVE-2693-21.patch
        274 kB
        Gunther Hagleitner
      8. HIVE-2693-20.patch
        267 kB
        Gunther Hagleitner
      9. HIVE-2693-19.patch
        267 kB
        Gunther Hagleitner
      10. HIVE-2693-18.patch
        278 kB
        Gunther Hagleitner
      11. HIVE-2693-17.patch
        276 kB
        Gunther Hagleitner
      12. HIVE-2693-16.patch
        183 kB
        Gunther Hagleitner
      13. HIVE-2693-15.patch
        193 kB
        Mark Grover
      14. HIVE-2693-14.patch
        184 kB
        Gunther Hagleitner
      15. HIVE-2693-13.patch
        164 kB
        Mark Grover
      16. HIVE-2693-12-SortableSerDe.patch
        233 kB
        Gunther Hagleitner
      17. HIVE-2693-11.patch
        149 kB
        Mark Grover
      18. HIVE-2693-10.patch
        149 kB
        Prasad Mujumdar
      19. HIVE-2693-1.patch.txt
        124 kB
        cdub
      20. HIVE-2693.patch
        90 kB
        Josh Wills
      21. HIVE-2693.D7683.1.patch
        171 kB
        Phabricator
      22. 2693_fix_all_tests1.patch
        124 kB
        Vikram Dixit K
      23. 2693_8.patch
        124 kB
        Vikram Dixit K
      24. 2693_7.patch
        125 kB
        Vikram Dixit K

          Activity

          Zhiqiang He added a comment -

          Great! The DATE type should also be supported.

          Carl Steinbach added a comment -

          Review request from a while ago: https://reviews.facebook.net/D1221

          Josh Wills added a comment -

          The old version of this change.

          Vikram Dixit K added a comment -

          Hi,

          I was trying out this patch and found that it is missing some classes: BigDecimalWritable, BigDecimalObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.primitive.BigDecimalObjectInspector), WritableBigDecimalObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableBigDecimalObjectInspector), and JavaBigDecimalObjectInspector. Would it be possible to update this patch? Is anyone working on this? It seems a local git commit is the likely cause, as Carl pointed out in the review.

          Thanks
          Vikram.

          Josh Wills added a comment -

          Hey Vikram-- sorry about that, I haven't paid much attention to this issue. How does this patch work for you?

          Vikram Dixit K added a comment -

          Hey Josh, thank you for your response. I still get errors when I apply this patch. The class BigDecimalWritable is missing in the repository, so the changes specified for this class do not apply.

          Josh Wills added a comment -

          One more try.

          Vikram Dixit K added a comment -

          Patch applied fine but when I try to compile the change, I see that the WritableBigDecimalObjectInspector class is missing.

          Josh Wills added a comment -

          This compiles and makes it to the test phase, where I started running into errors, but I can't tell whether it's a problem with the patch or my test harness. In any case, that's enough work on it for tonight.

          Vikram Dixit K added a comment -

          I tried the latest patch and it builds successfully. However, I cannot run any tests because there seems to be something broken in the parser. Posting the error below:

          [junit] FAILED: ParseException line 1:23 cannot recognize input near ''/grid/0/hive/BigD/data/files/kv1.txt'' '(' ''2008-04-08''
          [junit]
          [junit] java.lang.Exception: load command: LOAD DATA LOCAL INPATH '/grid/0/hive/BigD/data/files/kv1.txt' OVERWRITE INTO TABLE srcpart PARTITION (ds='2008-04-08',hr='11') failed with exit code= 40000
          [junit] at org.apache.hadoop.hive.ql.QTestUtil.runLoadCmd(QTestUtil.java:468)
          [junit] at org.apache.hadoop.hive.ql.QTestUtil.createSources(QTestUtil.java:512)
          [junit] at org.apache.hadoop.hive.cli.TestCliDriver.<clinit>(TestCliDriver.java:55)

          Josh Wills added a comment -

          Hrm, I'm not seeing that. I'm just doing 'ant package' followed by 'ant test'-- I assume you're doing the same?

          Vikram Dixit K added a comment -

          Yes. For the test, though, I am running ant test -Dtestcase=TestCliDriver -Dqfile=decimal_1.q. I see this error even when I run ant test -Dtestcase=TestCliDriver. I ran the compilation step after cleaning up my m2 and Ivy caches.

          Josh Wills added a comment -

          Latest and greatest.

          Josh Wills added a comment -

          Comments/feedback?

          Vikram Dixit K added a comment -

          Hi Josh, I have been able to successfully build and run the tests contained in this patch. However, I see a few failures (4 in total) related to some UDF tests - udf_div.q, udf_round.q, ql_rewrite_gbtoidx.q, udf7.q.

          Vikram Dixit K added a comment -

          org.apache.hadoop.hive.ql.parse.SemanticException: Line 3:36 Wrong arguments 'TOK_NULL': Ambiguous method for class org.apache.hadoop.hive.ql.udf.UDFRound with (int, void). Possible choices: FUNC(double) FUNC(double, int) FUNC(decimal, int)
          at org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:914)

          Similar errors in the others.

          Vikram Dixit K added a comment -

          Hi Josh, I found the issue. The getMethodInternal API in the FunctionRegistry class iterates over all the methods of a class (say, UDFRound) and picks the one with the lowest conversion cost for the subsequent operations. Since the addition of the BigDecimal evaluate method to UDFRound, there is a collision between the Double and BigDecimal evaluate methods, as both have the same conversion cost. One way we could fix this is to add an enum of conversion precedence in the FunctionRegistry class. The precedence order can be:

          BigDecimal > Double > Float > Long > Int > Boolean. If we see a collision, we can resolve it in favor of the broadest (most accepting) type, which in this case is BigDecimal. Let me know what you think, and whether you have time to work on it.
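
          A minimal standalone sketch of that tie-breaking idea (the names here are illustrative only, not the actual FunctionRegistry internals):

              // Hypothetical precedence-based tie-breaker; not Hive code.
              import java.util.Arrays;
              import java.util.Comparator;
              import java.util.List;

              public class NumericPrecedence {
                // Broadest (most accepting) type first, per the proposal above.
                // Assumes every candidate type appears in this list.
                private static final List<String> ORDER =
                    Arrays.asList("bigdecimal", "double", "float", "bigint", "int", "boolean");

                // Among candidate argument types with equal conversion cost,
                // pick the one ranked broadest in the precedence order.
                public static String resolveTie(List<String> candidates) {
                  return candidates.stream()
                      .min(Comparator.comparingInt(ORDER::indexOf))
                      .orElseThrow(IllegalStateException::new);
                }

                public static void main(String[] args) {
                  // round(int, void) matches both evaluate(double, int) and
                  // evaluate(decimal, int) at the same cost; precedence breaks the tie.
                  System.out.println(resolveTie(Arrays.asList("double", "bigdecimal"))); // bigdecimal
                }
              }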

          Josh Wills added a comment -

          Hey Vikram-- I'm a bit swamped at the moment and we're pushing the limits of my knowledge of Hive internals. Would you or someone else like to take a crack at it?

          Vikram Dixit K added a comment -

          Ok. I will look into this error.

          Vikram Dixit K added a comment -

          Hi Josh,

          Can you tell me why, in UDFRound.java, the MathContext is initialized with the formula bd.precision() - (bd.scale() - i.get())? This causes issues with calls such as round(55555, -6). The BigDecimal constructor sets the precision by default to the number of digits of the integer passed in; for 55555 that is 5, and the scale defaults to 0. So for this case the formula above yields a negative number, which the MathContext constructor cannot take. I am not sure what the behavior should be.

          Thanks
          Vikram.
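
          The arithmetic in question can be checked standalone (assuming the formula quoted from UDFRound.java above):

              import java.math.BigDecimal;
              import java.math.MathContext;

              public class RoundPrecisionDemo {
                public static void main(String[] args) {
                  BigDecimal bd = new BigDecimal("55555"); // precision 5, scale 0
                  int i = -6;                              // as in round(55555, -6)
                  int precision = bd.precision() - (bd.scale() - i); // 5 - (0 - (-6)) = -1
                  System.out.println("computed precision: " + precision);
                  new MathContext(precision); // throws IllegalArgumentException for negative precision
                }
              }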

          Josh Wills added a comment -

          For the life of me, I do not remember-- if there's something that makes sense, by all means, change it up.

          Vikram Dixit K added a comment -

          Updated patch with fixes for the failed unit-tests.

          Mark Grover added a comment -

          Can someone update the review please?

          Vikram Dixit K added a comment -

          With this latest patch, I see all the failed tests passing. I have just started a run of the entire test suite to be sure I did not break anything. There is one issue, however: the MySQL round function's behavior does not seem to be reproducible via BigDecimal's round function. BigDecimal's round takes a MathContext whose precision can only be a non-negative integer, and this yields some inconsistent results. The existing double-based method produces the expected results, matching the MySQL behavior. I am not sure which way the coin should drop here. BigDecimal's primary use case is higher precision, but its round operation is not entirely compatible with the current MySQL standard. As an err-on-the-MySQL-side strategy, I have stuck with the current double-based round API. Based on consensus in the Hive community, I can add back the capability to go the BigDecimal route, or if anyone knows how to make BigDecimal round the way MySQL does, I would be happy to correct/use that.

          Apart from this change, I also had to refactor the code in FunctionRegistry to be able to choose a 'higher denominator' API over a lower one, as alluded to in the comment above. This was required for the logarithm UDF. I have refactored things in such a way that new methods can easily be added using the existing hierarchy, without getting caught up in these internals.

          Vikram Dixit K added a comment -

          Updated to apply on latest trunk.

          Ashutosh Chauhan added a comment -

          @Vikram, the patch still doesn't apply. Can you update it? Also, note that you need to install thrift-0.9.0 now that we have updated from thrift-0.7. And can you please create a Phabricator or Review Board review request?

          Vikram Dixit K added a comment -

          Latest update with Ashutosh's comments addressed.

          Vikram Dixit K added a comment -

          Patch needs review.

          Vikram Dixit K added a comment -

          Review board URL: https://reviews.apache.org/r/8049/

          Ashutosh Chauhan added a comment -

          Vikram, I left some comments on RB.

          cdub added a comment -

          Ignore: just testing to see if JIRA contributors can attach files...

          Carl Steinbach added a comment -

          @Vikram: Just spoke with Josh and found out that he doesn't have time to work on this. I'm reassigning this to Prasad since he's already working on this patch and has done some additional changes that are needed for the JDBC driver. Let me know if you want to work on this too and we can find some way to collaborate. Thanks.

          Vikram Dixit K added a comment -

          Hi Carl, I have already done some work addressing the review comments, and I am planning to continue working on this. Let me know how you want to proceed. Thanks, Vikram.

          Ashutosh Chauhan added a comment -

          Vikram,
          The current patch also lacks test coverage. Please add the following tests:
          a) A test that loads data into a table with a decimal column by reading from a text file.
          b) A test that does a group-by / join on a decimal column.

          Prasad Mujumdar added a comment -

          Here's the updated patch with test updates, additional fixes for failed tests, and JDBC support.
          A few tests were still failing, like clientnegative/invalid_cast_from_binary_[1-6], which required out-file updates.
          Patches 7 and 8 removed the decimal round function, which breaks things like round(int/int) and round(decimal), so it's added back. I also added JDBC support for handling decimal-type data and metadata, which was missing in Josh's original patch. I will log a separate ticket for the HiveServer2 driver once both patches are committed.

          I think the outstanding issue is the implicit type conversion for UDFs. First, it changes expressions like (int/int) from double to decimal. This could be a problem for existing clients like ODBC, Perl, and Python, which expect a double. It also leads to inconsistent behavior on division by zero: 1.1/0.0 stays NaN, but 1/0 throws an exception, since it gets promoted to decimal division, which behaves differently from double (BigDecimal throws an exception on division by zero). I added a couple of fixes and modified the udf_round_2 test so that it returns NULL (which is also MySQL's default behavior). Perhaps we should change the other cases from NaN to NULL as well, and support a configuration option to fall back to the old behavior (which can be done in a separate patch).

          @Vikram, we can collaborate on this. You can add your new changes on top of it and update on reviewboard.

          Mark Grover added a comment -

          I was playing around with this and noticed that group by on a decimal column doesn't work, because BinarySortableSerDe doesn't support the DECIMAL type. It's questionable whether it ever will, since BigDecimal is arbitrarily long and BinarySortableSerDe requires data to be serialized so that values can be compared byte by byte in the same order. That would have been OK, but getReduceKeyTableDesc in PlanUtils.java is hardcoded to use BinarySortableSerDe. I will have to poke around some more to see how group by can be made to work with the decimal type (backed by BigDecimal).

          In any case, the last patch didn't apply cleanly on trunk, so I fixed some merge conflicts and am attaching a new patch (#11), a refresh of patch 10 that applies cleanly on trunk as of today.

          Mark Grover added a comment -

          Refresh of patch 10.

          Gunther Hagleitner added a comment -

          It should be possible to serialize a big decimal as <sign><scale><length><int-digits>. That way the natural ordering should be preserved at the binary level.

          Gunther Hagleitner added a comment -

          Actually, that's not going to work because of the length. But for this case I don't think we need to encode the length: the Hive key will put the bytes in a BytesWritable and skip the length header on comparison.

          Also, <sign> has to be 1 for positive and 0 for negative (to preserve order), and <scale> has to be negated for negative numbers.

          Gunther Hagleitner added a comment -

          Wrong again, but getting closer. The lengths of strings/byte arrays are encoded with a trailing "\0" in BinarySortableSerDe. So the encoding should be

          <sign byte><scale int><string of digits>

          <sign byte>: 1 for >= 0, 0 for < 0
          <scale>: the scale integer if the sign byte is 1, the negated scale integer otherwise
          <string of digits>: zero-terminated string of digits

          I'm trying this out right now.
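
          A rough standalone sketch of that encoding (illustrative only, not the actual BinarySortableSerDe code; a real sortable encoding must also flip the int's sign bit so negative scales compare correctly as unsigned bytes):

              import java.io.ByteArrayOutputStream;
              import java.io.DataOutputStream;
              import java.io.IOException;
              import java.math.BigDecimal;

              public class DecimalSortableSketch {
                // Encode as <sign byte><scale int><digits>\0 so that byte-wise
                // comparison approximates numeric order (sketch only).
                public static byte[] encode(BigDecimal d) throws IOException {
                  ByteArrayOutputStream bos = new ByteArrayOutputStream();
                  DataOutputStream out = new DataOutputStream(bos);
                  int sign = d.signum() >= 0 ? 1 : 0;
                  out.writeByte(sign);                                  // 1 for >= 0, 0 for < 0
                  out.writeInt(sign == 1 ? d.scale() : -d.scale());     // negate scale for negatives
                  out.writeBytes(d.unscaledValue().abs().toString());   // digits, no decimal point
                  out.writeByte(0);                                     // zero terminator
                  return bos.toByteArray();
                }
              }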

          Mark Grover added a comment -

          Thanks for your comments, Gunther Hagleitner. Yeah, the right thing going forward is to update BinarySortableSerDe to support BigDecimal. When I was thinking about the best way to serialize BigDecimal, the sign and scale parts were easy, but I wasn't able to come up with a space-efficient way to store an arbitrary number of digits so that they are in-order byte-sortable. Correct me if I am wrong, but it seems you are suggesting one byte per digit, which would work (if the lengths are equal) but can be dangerous, since we are "exploding" an arbitrarily long integer. Having said that, given Hive's historical philosophy of ignoring malicious users, I am OK with moving forward with that approach.

          On a related note, let's talk about the scale.
          13267 has a scale of 0 and digits 13267
          132.67 has a scale of 2 and digits 13267
          132670 has a scale of -1 and digits 13267

          So, it seems like a lower scale always means a bigger number for positive numbers, so shouldn't we do

          <scale>: -scale for positive numbers (sign byte 1) and scale for negative numbers (sign byte 0)

          instead of

          <scale>: scale integer if sign byte 1, -scale integer otherwise

          which you suggested in your previous comment? I am basically asking to flip it around, since negative numbers have sign byte 0 and positive numbers have sign byte 1.
          BTW, feel free to contact me offline if you want to bounce around some ideas!
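
          For reference, BigDecimal's scale() matches the list above; a quick standalone check:

              import java.math.BigDecimal;

              public class ScaleDemo {
                public static void main(String[] args) {
                  // scale() counts digits to the right of the decimal point;
                  // a negative scale means trailing zeros before the point.
                  System.out.println(new BigDecimal("13267").scale());          // 0
                  System.out.println(new BigDecimal("132.67").scale());         // 2
                  System.out.println(new BigDecimal("1.3267E+5").scale());      // -1 (value 132670)
                  System.out.println(new BigDecimal("132.67").unscaledValue()); // 13267
                }
              }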

          Gunther Hagleitner added a comment -

          HIVE-2693-12-SortableSerDe.patch adds the DECIMAL type to the LazySortableSerDe. It also adds tests for order by asc/desc, group by, distinct, and join.

          The serialization is similar to what was described before: <sign><factor><digits>

          where:
          <sign> = -1, 0, 1 (zero has its own sign now)
          <factor> = the position of the first digit before the decimal point (>1), or of the first non-zero digit after the decimal point
          <digits> = the string of digits of the number, without the decimal point

          Mark Grover added a comment -

          Thanks for contributing, Gunther Hagleitner! I am just about to take a closer look at your latest patch, but as we discussed offline, a packed byte (radix 256) might be a better solution than a digit-based (radix 10) serialization. Since we need a sentinel character to terminate this arbitrarily long sequence of bytes (regardless of the radix we use), we have two options:
          A. Use a smaller radix and reserve a particular byte as a terminator.
          B. Use a larger radix (like 256) and escape the terminator whenever it appears in the content, similar to what we do for strings.

          I would prefer the latter (option B). Any particular concerns from anybody?

          Mark Grover added a comment -

          I added some test data and more tests to patch 12 and found a few more interesting issues:
          1. I added new tests related to where clauses. The where clause doesn't seem to be working as expected. I will take another look to see if I am doing something wrong, but that's my first impression.
          2. I added more test data where the decimal column has values like 3.14 and 3.140. This is an interesting case, since we would like to maintain compatibility with MySQL where possible. If I remember correctly (from trying it a few days ago), MySQL considers 3.14 and 3.140 to be equivalent; they would therefore be treated as the same in equi-joins, where clauses, etc. Adding this new data led me to see that order by is non-deterministic over a decimal column. Again, something we should look into further. FWIW, 3.14 rows are correctly being joined to 3.140 rows, so that's good!
          3. I added some more test data with NULLs in the decimal column to make sure those are read and handled properly when the table is loaded.
          I will submit a new patch with these added tests and data shortly.

          Gunther Hagleitner, about patch 12: do we need a separate sign (0) for zero? Would it not suffice to use the same sign as positive numbers? That would make it consistent with the other data types as well. I've been having a bit of a busy morning but will review the rest of the patch shortly!

          Mark Grover added a comment -

          Based on top of patch 12, but with more tests and some more interesting test data. Also takes out an extraneous file, Hive.g.orig, which shouldn't be present in the patch.

          Mark Grover added a comment -

          Ok, took a better look at patch 12. Looks good overall.
          On second thought, radix 10 is good for now. I am not going to vouch for unnecessary optimization unless there is evidence that we need it.

          Gunther Hagleitner:
          1. I understand the intent of the +/-2 you are applying to the sign byte. Is that really necessary? Sure, you can have a sign that is the same as the terminator, but the terminator is not considered until you get to the variable-length part. What do you think?
          2. Were you able to test your serialization and deserialization code? We should make sure it works the way we expect. Do you need some help with testing it?

          So, once we fix the issues I mentioned in the previous comment (where clauses, deterministic order by), we should be good to review and post.
          Thanks again!

          Gunther Hagleitner added a comment -

          Mark, thanks for the additional tests, I'll take a closer look this afternoon.

          To answer the questions:

          1: I introduced the +/-2 at a point in the debugging stage where paranoia took over. I can remove it; that will make the code more readable.
          2: It is only implicitly tested by all the queries that use a reduce stage. I agree that a test of just that code would be good. Is there a place in the current unit tests that already does that, which I could use as a model?

          Sign byte: I introduced a value for zero to avoid a "factor" of negative infinity. If you lump 0 into either the positive or negative bucket, it becomes the number that has an infinite number of zeros before the first non-zero digit (after the decimal point). MIN_INT might have been an option, but it seems cleaner to just give the sign three states (-1, 0, 1). The BigDecimal class in Java itself, for instance, more or less arbitrarily defines the precision of 0 (i.e., the number of unscaled digits) as 1.

          Non-deterministic order: 3.14 and 3.140 are indeed equal. Their representation should be exactly the same (<1>,<1>,<314>). Given that, I'm not sure how to enforce a deterministic order, or even what it would be. Are you suggesting 3.14 should always appear before 3.140?

          I am worried about your comments on the where clause. I'll take a look at the tests. But you're saying it's not working right?

          Mark Grover added a comment -

          1. Yeah, let's take the +/-2 out, please.
          2. I am not aware of any such unit tests off the top of my head. I can poke around the code and will post if I find something; it probably won't be until tomorrow.

          I didn't realize that about the sign bit. Yeah, (-1,0,1) is good then.

          Non-deterministic order: you are right that they are equal (and they should be). However, if you diff patch 12 against patch 13, you will find that the order in which order by displayed 1.0 and 1 got switched. The only thing that changed was me adding some data. All I expect is for the order to remain consistent and deterministic, which doesn't seem to be the case at present.

          One thing that did stand out was in serialize():

          BigDecimal dec = boi.getPrimitiveJavaObject(o).stripTrailingZeros();

          stripTrailingZeros() seems interesting. I am just handwaving right now and need to look more before I can assert further, but could this be part of the problem?

          Yeah, the where clause is not working. The tests didn't give the expected output, so I tested using the Hive CLI (which I built after applying the patch), and it doesn't work there either. You're welcome to take a look; I will try to find some time tonight or tomorrow morning to look into this as well.
          These two problems might be related, but look at this:

          hive> select cast(3.14 as decimal) from decimal_3 limit 1;
          3.140000000000000124344978758017532527446746826171875

          That doesn't look right to me.

          Gunther Hagleitner added a comment -

          Ordering: stripTrailingZeros() does not change the value of the BigDecimal and therefore doesn't change the serialized output; I don't think it is the culprit. The order of the output depends on what Hadoop does in the shuffle/sort phase. My guess is that since you changed the input, the results mapred came up with were different. Hadoop doesn't guarantee a stable sort, AFAIK.

          Where clause: yes, it's one and the same problem. We end up comparing 3.14 with 3.140000...1243... Still debugging, but it seems we're going through double on the way to BigDecimal, which changes the number.

          Gunther Hagleitner added a comment -

          Ok. I think I've convinced myself that the cast statement above is executed correctly, but the result is definitely not very intuitive.

          When you say: "cast (3.14 as decimal)" you ask hive to convert the double 3.14 to a bigdecimal in unlimited context, i.e. no rounding. Thus 3.1400001234... is correct. If you ask "cast('3.14' as decimal)" you get 3.14. Which is what you want. Rounding would be another option.

          There was a similar issue with big integer: HIVE-2509. They 'solved' the problem by at least having a shorthand for big int literals (i.e.: 0L). BigDecimal should at the very least allow something like that too. E.g.: 3.14D or 3.14BD.

          So, should we just introduce a literal shorthand and leave the behavior as is?
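
          The effect is easy to reproduce in plain Java; the double literal carries the binary rounding error before BigDecimal ever sees it:

              import java.math.BigDecimal;

              public class CastDemo {
                public static void main(String[] args) {
                  // The double literal 3.14 is only the nearest binary double to 3.14,
                  // so converting it exactly exposes the representation error.
                  System.out.println(new BigDecimal(3.14));
                  // prints 3.140000000000000124344978758017532527446746826171875
                  System.out.println(new BigDecimal("3.14")); // prints 3.14
                }
              }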

          Gunther Hagleitner added a comment -

          One more thing: it would be great to stop auto-converting double to decimal. "WHERE decimal_column = 3.14" should preferably fail in semantic analysis, so that users know they have to cast from string or use a shorthand.

          Mark Grover added a comment -

          Fair enough. I would say let's not worry about the literal in this JIRA; we can create another JIRA and deal with it separately. I personally don't think it's a big deal, because had the 3.14 been read from a file in HDFS (when doing, say, LOAD DATA LOCAL INPATH), it would have been correctly interpreted as a BigDecimal.

          Good point about the Hadoop shuffle/sort not being stable; I was hoping it was, but I can see why it's not.

          So, that leaves us with a hurdle and a half:
          1: the where clause
          0.5: rigorous testing of the serialization and deserialization code for the decimal type

          Mark Grover added a comment -

          That actually explains why where clauses aren't working: we are comparing a BigDecimal with a Double.
          I tried the same where clause but with a string, and that worked:

          SELECT * FROM DECIMAL_3 WHERE key='3.14';
          3.14	3
          3.14	3
          3.14	3
          3.140	4

          So, here is the million-dollar question: our users (understandably) are going to forget the quotes. Even if we introduce a new literal as part of this JIRA, they are going to forget the literal. I can think of two options:
          1. Somehow promote the double 3.14 to a BigDecimal 3.14 (maintaining precision), maybe via double -> string -> BigDecimal?
          2. Throw an error, like you suggested.

          What do you think, Gunther Hagleitner?

          Mark Grover added a comment -

          Created HIVE-3820 for the Decimal type literal.

          Gunther Hagleitner added a comment -

          Solution number 1 would be nice, but I don't think we'll be able to do it. Any time we convert to double in an intermediate step we lose precision, and I don't think it can be recovered by going through a string or anything else afterwards. That seems to be just in the nature of floating-point operations. We could round to fewer digits in the conversion, for instance, but that will probably just cause issues elsewhere.

          Not converting to double also seems a bad choice. That would change some very fundamental behavior in Hive: all of a sudden "select 3.14 * 2 ..." would return a BigDecimal instead of a double. Not good.

          So, I am in favor of throwing an error. At least we fail early and don't send users on a wild goose chase. It's not ideal either: the where clause issue wouldn't have happened, for instance. But we probably don't want to bar explicit casts ("cast(3.14 as decimal)"), even though they probably won't do what anyone thinks they do. Or do we want to disallow them?

          Mark Grover added a comment -

          I figured out why CAST(3.4 as DECIMAL) doesn't work as expected. It's because there is no UDF equivalent to UDFToDouble (we should call it UDFToDecimal) to do the cast from double to decimal. The associated implicit conversion costs also need to be updated in FunctionRegistry for the newly introduced decimal type. I am working on a patch. Will keep you posted!

          Gunther Hagleitner added a comment -

          Very cool. So there is a way to go directly from string to decimal in casts? Looking forward to it.

          Gunther Hagleitner added a comment -
          • Changed the +/-2 in the serialization of the sign to +/-1. I can't change it to just use the output of compareTo, because negative values would then be sorted incorrectly.
          • Added BigDecimal to TestStatsSerde, TestLazyBinarySerde and (most importantly) TestSortableBinarySerde. This does round-trip serialization/deserialization of a large struct and should cover the requested testing.
          • Fixed a bug found with TestSortableBinarySerde. Good call on the additional testing; nasty bug.
          • Changed BigDecimalWritable.equals: it uses compareTo(...) == 0 now. With big decimals, equals is not consistent with compareTo.
          • Added BigDecimal as the Java object for DECIMAL in PrimitiveTypeEntry, to make the reflection-based object inspector work.
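
          The equals/compareTo mismatch is standard BigDecimal behavior; a quick illustration:

              import java.math.BigDecimal;

              public class EqualsVsCompareTo {
                public static void main(String[] args) {
                  BigDecimal a = new BigDecimal("3.14");
                  BigDecimal b = new BigDecimal("3.140");
                  System.out.println(a.equals(b));          // false: equals also compares scale
                  System.out.println(a.compareTo(b) == 0);  // true: numerically equal
                }
              }
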
          Mark Grover added a comment -

          Gunther, to answer your earlier question: yes, the trick is to convert double (the default) -> String -> BigDecimal in the casting code. I just verified with my latest changes that cast(3.4 as decimal) gives the correct, precise BigDecimal. An added benefit is that where clauses now work, with no quotes or literals required. Let me tidy up a few loose ends and I will post the patch.
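
          To illustrate why the String hop matters, a sketch with plain JDK calls (the class name is made up for illustration; this is not the patch's exact code):

          import java.math.BigDecimal;

          public class CastSketch {
            public static void main(String[] args) {
              double d = 3.4; // the literal as it arrives, already a double
              // Constructing from the double captures its binary approximation:
              System.out.println(new BigDecimal(d));
              // prints 3.399999999999999911182158029987476766109466552734375
              // Going through the double's shortest decimal representation recovers "3.4":
              System.out.println(new BigDecimal(Double.toString(d))); // prints 3.4
            }
          }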

          About your latest patch: thanks for the additional testing. I agree that we should use compareTo over equals. There is another place where equals() is being used, in PrimitiveObjectInspectorUtils.java; I will change that to compareTo() in my next patch. Was the nasty bug you are referring to related to the equals/compareTo inconsistency?

          Thanks again, patch coming soon.

          Gunther Hagleitner added a comment -

          The nasty bug was that the deserialization code didn't read the final byte, which works fine if you serialize only one column but generates incorrect results in the general case.

          Mark Grover added a comment -

          Latest patch that makes where clauses and casts to decimal types work. I added more tests too. This patch is rebased on top of patch 14, so it includes all the bug fixes from there as well. We are ready for review!

          Mark Grover added a comment -

          New review at https://reviews.facebook.net/D7491

          Gunther Hagleitner added a comment -

          Mark,

          If the conversion code goes through double -> String -> BigDecimal, I think it's basically rounding the result of the operation. You can achieve the same by setting the MathContext in the BigDecimal. That might seem better, but it still causes issues. For instance: new BigDecimal(String.valueOf(Double.parseDouble("0.99999999999999999"))) = 1.0, while new BigDecimal("0.99999999999999999") = 0.99999999999999999.
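
          A quick sketch of that pitfall with plain JDK calls (illustration only; the class name is made up):

          import java.math.BigDecimal;

          public class PrecisionSketch {
            public static void main(String[] args) {
              String s = "0.99999999999999999";
              // Round-tripping through double snaps to the nearest representable double, 1.0:
              System.out.println(new BigDecimal(String.valueOf(Double.parseDouble(s)))); // 1.0
              // Parsing the literal directly keeps every digit:
              System.out.println(new BigDecimal(s)); // 0.99999999999999999
            }
          }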

          Maybe documenting that we're rounding to a specific precision is the way forward, but I still think that not implicitly casting + error is the safer choice.

          Thoughts?

          Mark Grover added a comment -

          Gunther, I see where you are coming from. However, in my opinion, a closer analogy is new BigDecimal(Double.valueOf("0.99999999999999999").toString()), which is what's being done now, as opposed to new BigDecimal(Double.valueOf("0.99999999999999999")), which was being done previously.

          I am inclined towards keeping the present behavior and documenting it. However, if you feel strongly about not implicitly casting, I am happy to reconsider.

          Gunther Hagleitner added a comment -

          No, I don't feel strongly. I was playing with your patch this afternoon and am starting to think this is better. For instance, one would run into the same problem with BigInteger, where at some point you'd run out of space in the double and have to start using a big int literal. So it seems consistent to handle BigDecimal that way as well. Also, most cases are going to be covered by the cast, and when someone runs into issues there are ways to handle them.

          Mark Grover added a comment -

          Sweet! I will self-review the code tomorrow and if you get a chance, it'd be great if you could do that too!

          Gunther Hagleitner added a comment -

          New patch that addresses all issues in the review:

          • Removed mathematical UDFs that implicitly converted to double (Log, Exp, Sqrt). Should open a new jira to provide big decimal implementations
          • Fixed Ceil, Floor, Round to work with decimals of any scale
          • Added additional tests for UDFs
          • Type conversion: the code keeps conversion rules backwards compatible (e.g., integer division returns double); numerical values are converted to big decimal only if needed (e.g., one half of a binop is already decimal), and conversion to lower types is otherwise preferred
          • Switched to mathematical equivalence in UDFToBoolean (was comparison by value + scale)
          Phabricator added a comment -

          hagleitn requested code review of "HIVE-2693 [jira] Add DECIMAL data type".
          Reviewers: JIRA

          HIVE-2693: Add DECIMAL data type

          Add support for the DECIMAL data type. HIVE-2272 (TIMESTAMP) provides a nice template for how to do this.

          TEST PLAN
          New tests decimal_1.q, decimal_2.q, decimal_3.q (Casts, udfs, joins, aggregates, order/sort, etc)
          Decimal has been added to SerDe test classes (TestBinarySortable, etc)
          Decimal has been added to UDF test files
          Decimal has been added to jdbc test file

          REVISION DETAIL
          https://reviews.facebook.net/D7683

          AFFECTED FILES
          data/files/datatypes.txt
          data/files/kv7.txt
          jdbc/src/java/org/apache/hadoop/hive/jdbc/HiveBaseResultSet.java
          jdbc/src/java/org/apache/hadoop/hive/jdbc/HiveResultSetMetaData.java
          jdbc/src/java/org/apache/hadoop/hive/jdbc/JdbcColumn.java
          jdbc/src/java/org/apache/hadoop/hive/jdbc/Utils.java
          jdbc/src/test/org/apache/hadoop/hive/jdbc/TestJdbcDriver.java
          metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java
          ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java
          ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g
          ql/src/java/org/apache/hadoop/hive/ql/parse/TypeCheckProcFactory.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFAbs.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFBaseNumericOp.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFBaseNumericUnaryOp.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFCeil.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFFloor.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPDivide.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPMinus.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPMod.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPMultiply.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPNegative.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPPlus.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFOPPositive.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFPosMod.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFPower.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFRound.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToBoolean.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToByte.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToDouble.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToFloat.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToInteger.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToLong.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToShort.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToString.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCorrelation.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCovariance.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCovarianceSample.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFStd.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFStdSample.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVarianceSample.java
          ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFToDecimal.java
          ql/src/test/queries/clientpositive/decimal_1.q
          ql/src/test/queries/clientpositive/decimal_2.q
          ql/src/test/queries/clientpositive/decimal_3.q
          ql/src/test/queries/clientpositive/udf7.q
          ql/src/test/results/clientnegative/invalid_cast_from_binary_1.q.out
          ql/src/test/results/clientnegative/invalid_cast_from_binary_2.q.out
          ql/src/test/results/clientnegative/invalid_cast_from_binary_3.q.out
          ql/src/test/results/clientnegative/invalid_cast_from_binary_4.q.out
          ql/src/test/results/clientnegative/invalid_cast_from_binary_5.q.out
          ql/src/test/results/clientnegative/invalid_cast_from_binary_6.q.out
          ql/src/test/results/clientnegative/wrong_column_type.q.out
          ql/src/test/results/clientpositive/decimal_1.q.out
          ql/src/test/results/clientpositive/decimal_2.q.out
          ql/src/test/results/clientpositive/decimal_3.q.out
          ql/src/test/results/clientpositive/udf7.q.out
          serde/if/serde.thrift
          serde/src/gen/thrift/gen-cpp/serde_constants.cpp
          serde/src/gen/thrift/gen-cpp/serde_constants.h
          serde/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/serde/serdeConstants.java
          serde/src/gen/thrift/gen-php/org/apache/hadoop/hive/serde/Types.php
          serde/src/gen/thrift/gen-py/org_apache_hadoop_hive_serde/constants.py
          serde/src/gen/thrift/gen-rb/serde_constants.rb
          serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java
          serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java
          serde/src/java/org/apache/hadoop/hive/serde2/io/BigDecimalWritable.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyBigDecimal.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyFactory.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/primitive/LazyBigDecimalObjectInspector.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/primitive/LazyPrimitiveObjectInspectorFactory.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryBigDecimal.java
          serde/src/test/org/apache/hadoop/hive/serde2/lazybinary/MyTestClassSmaller.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryFactory.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
          serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUtils.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorConverters.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/PrimitiveObjectInspector.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/BigDecimalObjectInspector.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/JavaBigDecimalObjectInspector.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorConverter.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorFactory.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/SettableBigDecimalObjectInspector.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/WritableBigDecimalObjectInspector.java
          serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/WritableConstantBigDecimalObjectInspector.java
          serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoFactory.java
          serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java
          serde/src/test/org/apache/hadoop/hive/serde2/binarysortable/MyTestClass.java
          serde/src/test/org/apache/hadoop/hive/serde2/binarysortable/TestBinarySortableSerDe.java
          serde/src/test/org/apache/hadoop/hive/serde2/lazybinary/MyTestClassBigger.java
          serde/src/test/org/apache/hadoop/hive/serde2/lazybinary/TestLazyBinarySerDe.java


          Gunther Hagleitner added a comment -

          Sorry for the noise. Here's the right review: https://reviews.facebook.net/D7689

          Tried to fix the fact that some new files show up as copied files in the review, but the end result seems right.

          Namit Jain added a comment -

          comments on phabricator

          Namit Jain added a comment -

          Looks mostly good.

          Most of the comments are minor - the only major ones are around lack of testing.

          Gunther Hagleitner added a comment -

          Thanks for the review! I am in the process of adding the requested tests and taking care of the other comments.

          Gunther Hagleitner added a comment -

          This one should address the comments. Added the requested new tests: https://reviews.facebook.net/D7779

          Mark Grover added a comment -

          Some comments on the reviews.

          I agree with Namit that we should deal with the ambiguous UDF issue separately in a different JIRA. Created a subtask for that: HIVE-3859

          Gunther Hagleitner added a comment -

          I thought I had added a response on the ambiguous UDF issue on the review, but it seems I never submitted that. It's there now.

          I think it would be rather cumbersome to remove that piece from the patch. We would have to remove all the conversion code from the file (including any implicit conversion of numbers to decimal) and then rewrite all the tests to work around this, just to re-introduce it in a new jira and change the tests back to what they were.

          It seems, Mark, that you have an idea for rewriting this piece altogether using different costs for conversions. I think that's more intrusive (the current code just adds a case for multiple choices) and could be handled in another jira.

          Gunther Hagleitner added a comment -

          HIVE-2693-18.patch is the same as the one in the review, plus the couple-line change to add the Text conversion for big decimal (as reported by Mark), plus a unit test.

          Gunther Hagleitner added a comment -

          HIVE-2693-19.patch is the same as patch 18, but rebased (it didn't apply cleanly anymore because of CRLF conversion).

          Mark Grover added a comment -

          Non-committer +1

          Namit, any thoughts on the UDF method selection logic?

          Gunther Hagleitner added a comment -

          Rebased. HIVE-2693-20.patch applies cleanly again.

          Gunther Hagleitner added a comment -

          HIVE-2693-21.patch addresses Ashutosh's review.

          • Fixes/adds comments
          • Error message
          • Additional conversion tests
          • ASF headers
          Carl Steinbach added a comment -

          +1. Changes look good to me.

          Ashutosh and Namit: let me know if you want more time to look at this. Otherwise I'll plan to commit it in 24 hours. Thanks.

          Ashutosh Chauhan added a comment -

          Latest patch looks good to me. +1

          Namit Jain added a comment -

          Carl, give me some time. I will try to take a look today/tomorrow.
          Can you give me ~2 days? Go ahead if you don't hear from me.

          Carl Steinbach added a comment -

          @Namit: Sure thing. I'll wait until Sunday to commit. Is that cool?

          Namit Jain added a comment -

          All, does https://reviews.facebook.net/D7779 contain the latest patch?

          Namit Jain added a comment -

          I see that some comments from Ashutosh have not been addressed.

          Namit Jain added a comment -

          minor comments on phabricator

          Ashutosh Chauhan added a comment -

          @Namit: I think you looked at the phabricator patch, but if you look at the latest patch attached here on the jira, HIVE-2693-21.patch, my comments have been addressed.

          Mark Grover added a comment -

          Thanks for the review, Namit, Carl and Ashutosh!

          Ashutosh's comments were addressed in patch 21. I think we missed updating the review. Gunther, can you please update the review with patch 21?

          Namit Jain: Your review comments about ceil, round, floor, etc. are correct. MySQL doesn't return decimal on floor(decimal), etc., so I will fix that.

          Mark Grover added a comment -

          Updated patch with Namit's suggestions from code review:

          • Updated ceil and floor UDFs to return long instead of decimal type
          • Updated unit tests to reflect the above change
          Gunther Hagleitner added a comment -

          I don't think ceil/floor should return long. If the decimal is larger than the max long value, it will be truncated and return an incorrect result, which isn't necessary if we stay in the same type. As far as compatibility goes: judging from the docs, MySQL just promises that exact-value input goes to an exact-value type and floating point to floating point; returning decimal would be in line with this. Oracle/SQL Server stay in the same datatype.
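
          To make the truncation concern concrete, a sketch with plain BigDecimal (not the UDF code; the sample value and class name are made up):

          import java.math.BigDecimal;
          import java.math.RoundingMode;

          public class CeilOverflowSketch {
            public static void main(String[] args) {
              // 2^63 + 0.5, just past Long.MAX_VALUE (9223372036854775807):
              BigDecimal big = new BigDecimal("9223372036854775808.5");
              // Staying in decimal keeps the exact answer:
              System.out.println(big.setScale(0, RoundingMode.CEILING)); // 9223372036854775809
              // Narrowing to long silently wraps around:
              System.out.println(big.longValue()); // -9223372036854775808
            }
          }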

          Gunther Hagleitner added a comment -

          As requested: Created https://reviews.facebook.net/D7875

          This is the same as HIVE-2693-21.patch, which, if we agree on the decimal return type for floor/ceil, is the patch to use. It has the fixes Ashutosh requested.

          Mark Grover added a comment -

          Gunther Hagleitner, I see what you mean. The reason I had changed the type was to be consistent with the other methods in the UDF in terms of the "scale" of the return type, albeit at the cost of losing precision. In any case, given what you found in the documentation, I agree that it makes sense to keep the return type decimal.

          However, the scale of the return value in the floor/ceil/round UDFs for the decimal-typed methods seems incorrect to me. We should not be setting the return values back to the original scale.

          This seems to lead to inconsistencies in ceil/floor and a bug in the round UDF. For example:

          mysql> select ceil(-0.33) from t;
          +-------------+
          | ceil(-0.33) |
          +-------------+
          |           0 |
          +-------------+
          1 row in set (0.00 sec)
          

          However, patch 21 gives ceil(-0.33) as 0.00. With the proposed change, Hive's result would be consistent in representation with MySQL's.

          Here is what seems like a bug in the round UDF:
          In MySQL:
          round(1E-99,2)=0.00
          But Hive gives:
          round(1E-99,2)=0E-99
          which seems to be an incorrect representation.

          The proposed change will make Hive give the correct result (consistent with MySQL): round(1E-99,2)=0.00
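
          For reference, both behaviors fall out of plain BigDecimal (a sketch, not the UDF code; the class name is made up):

          import java.math.BigDecimal;
          import java.math.RoundingMode;

          public class RoundScaleSketch {
            public static void main(String[] args) {
              BigDecimal tiny = new BigDecimal("1E-99");
              // Rounding to 2 decimal places, as round(x, 2) is defined:
              System.out.println(tiny.setScale(2, RoundingMode.HALF_UP)); // 0.00
              // Setting the scale back afterwards (the patch-21 behavior) yields 0E-99:
              System.out.println(tiny.setScale(2, RoundingMode.HALF_UP).setScale(99)); // 0E-99
            }
          }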

          What do you think? If you agree, I can upload the new patch. Please let me know. Thanks!

          Namit Jain added a comment -

          Either approach is fine, as long as we document it clearly.

          Intuitively, returning a long makes sense, barring the corner cases. But I am open.

          Gunther Hagleitner added a comment -

          Mark: Scale is different from the return type. It seems both you and Namit are open to using decimal as the return type with documentation. Given that decimal avoids overflow issues, I think we can settle on the current (patch 21) implementation for this.

          As far as the scale/representation of the returned decimal goes: I don't see a bug. 0.00, 0, and 0E-99 are all the same number; it's just the representation that differs. And since all these numbers are considered the same, they will be handled correctly by the system (we don't consider identical numbers with different scales different). Keeping the representation in line with MySQL has other implications: for one, MySQL doesn't use scientific notation; also, MySQL decimals are defined with the additional parameters precision and scale, which control, among other things, the representation.

          My proposal: keep HIVE-2693-21.patch as is and check it in. It passed muster with Ashutosh, Carl and Namit and gives correct results in Mark's cases as well. Open a new jira to extend the feature and give users control over representation with an optional precision/scale (e.g., decimal(5,2)).

          Mark Grover added a comment -

          Gunther: thanks for the feedback! I understand what you are saying about different representations of the same number. That's why I referred to the scale change as an inconsistency for floor/ceil but a bug for round. I called it an inconsistency for floor/ceil because the results are the exact same numbers, and, as we have discussed before, the decimal patch considers the same number with different representations equal (by using compareTo() instead of equals()). However, I considered it a bug for round because, by definition, round(x,d) rounds the number x to d decimal places. While 0.00, 0 and 0E-99 are all the same number (with different representations), only one of them has the 2 decimal places expected from round(1E-99, 2). Here is an example where it could cause problems: suppose someone were using the thrift client to issue Hive queries from C++ and issued a query like this:

          select round(mycol,2) from mytable;
          

          and split the output on the decimal point to obtain the fractional part; they would expect the fractional part to fit in a string of length 2. However, given the present implementation, that's not the case.

          Consistency with MySQL aside, I don't think we should be setting the value back to the original scale, especially in round.

          Having said the above, I agree that they are just different representations of the same number, so if you feel strongly about not changing this, I'll happily +1 patch 21.

          Gunther Hagleitner added a comment -

          Mark: Thanks for the clarification. I had misunderstood you earlier. I think you do bring up an important point about controlling representation; however, I think my point that this is best fixed with an optional scoped decimal is still valid.

          The example of the poor C++ thrift developer eventually got me thinking: why would setting the scale in round even have any effect? We strip trailing zeros internally from all decimals, so 0.00 and 0E-99 should always come out as simply 0.

          (I think we've talked about this before: we have to strip zeros because otherwise sorting, joins, group-bys, etc. would all internally distinguish by value and scale instead of just value, which would be bad because we allow mixed-scale input and any operation performed on decimals can change the scale.)

          So given that, the reason you are seeing the sticky scale on "round()" is this: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6480539

          A bug in the JVM: stripTrailingZeros doesn't work for the number "0".
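
          The bug is easy to reproduce on the JDKs current at the time (a sketch of the JDK behavior and of the style of workaround, not the patch itself; the bug is fixed in later JDKs and the class name is made up):

          import java.math.BigDecimal;

          public class StripZerosSketch {
            public static void main(String[] args) {
              BigDecimal z = new BigDecimal("0.00").stripTrailingZeros();
              // On JDKs affected by bug 6480539, the zeros are NOT stripped:
              System.out.println(z); // prints 0.00 on affected JDKs
              // Workaround: special-case zero.
              if (z.compareTo(BigDecimal.ZERO) == 0) {
                z = BigDecimal.ZERO;
              }
              System.out.println(z); // 0
            }
          }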

          I've put a workaround for that bug in the code (simply storing BigDecimal.ZERO when the value is "0"), and that gives at least consistent behavior between floor, ceiling and round. However, it still does not allow you to control the representation. For that (given that we have to strip internally), I still suggest another jira that lets you define types that do specifically that.

          It also helps the C++ coder: with this, round(mycol, 2) gives you at most 2 decimal digits (instead of potentially unlimited many), so if they initialize that buffer right, they should be fine.

          So can you take a look at patch 23? It only special-cases the ZERO value (and removes the meaningless setting to original scale in round). Does that seem reasonable?

          Mark Grover added a comment -

          Gunther: I took a look at patch 23 and it looks good to me! Good catch on stripTrailingZeros(), BTW!

          Namit Jain, Carl Steinbach, Ashutosh Chauhan: thanks for reviewing the change. It seems like you all were pretty happy with patch 21, but it turned out to have a bug in the round UDF that we noticed while going through Namit's review comments. That bug has now been fixed in patch 23. Would one of you please take a final look at the patch and commit it if there are no concerns?

          FWIW, here is the diff between patch 21 and patch 23:
          It basically stops setting the original scale on the return value of the round UDF and adds a workaround for the stripTrailingZeros() bug in BigDecimal. Apart from that, it's just updates to the test results.

          diff --git ql/src/java/org/apache/hadoop/hive/ql/udf/UDFRound.java ql/src/java/org/apache/hadoop/hive/ql/udf/UDFRound.java
          index af8186a..1c807ef 100644
          --- ql/src/java/org/apache/hadoop/hive/ql/udf/UDFRound.java
          +++ ql/src/java/org/apache/hadoop/hive/ql/udf/UDFRound.java
          @@ -72,8 +72,7 @@ private BigDecimalWritable evaluate(BigDecimalWritable n, int i) {
                 return null;
               }
               BigDecimal bd = n.getBigDecimal();
          -    int origScale = bd.scale();
          -    bd = n.getBigDecimal().setScale(i, RoundingMode.HALF_UP).setScale(origScale);
          +    bd = n.getBigDecimal().setScale(i, RoundingMode.HALF_UP);
               bigDecimalWritable.set(bd);
               return bigDecimalWritable;
             }
          diff --git ql/src/test/results/clientpositive/decimal_3.q.out ql/src/test/results/clientpositive/decimal_3.q.out
          index ad6190a..30e6d92 100644
          --- ql/src/test/results/clientpositive/decimal_3.q.out
          +++ ql/src/test/results/clientpositive/decimal_3.q.out
          @@ -219,7 +219,7 @@ POSTHOOK: Input: default@decimal_3
           -0.333
           -0.33
           -0.3
          -0.00
          +0
           1E-99
           0.01
           0.02
          @@ -258,7 +258,7 @@ POSTHOOK: Input: default@decimal_3
           -0.333	0
           -0.33	0
           -0.3	0
          -0.00	0
          +0	0
           1E-99	0
           0.01	0
           0.02	0
          diff --git ql/src/test/results/clientpositive/decimal_udf.q.out ql/src/test/results/clientpositive/decimal_udf.q.out
          index be05146..e55ad9d 100644
          --- ql/src/test/results/clientpositive/decimal_udf.q.out
          +++ ql/src/test/results/clientpositive/decimal_udf.q.out
          @@ -145,7 +145,7 @@ POSTHOOK: query: SELECT key + value FROM DECIMAL_UDF
           POSTHOOK: type: QUERY
           POSTHOOK: Input: default@decimal_udf
           #### A masked pattern was here ####
          -0E+2
          +0
           1E+99
           1E-99
           0
          @@ -229,7 +229,7 @@ POSTHOOK: Input: default@decimal_udf
           -2.2E+3
           1E+99
           1E-99
          -0.0
          +0
           1.5E+2
           15
           1.5
          @@ -238,7 +238,7 @@ POSTHOOK: Input: default@decimal_udf
           3E+2
           3E+1
           3
          -0.0
          +0
           0.2
           0.02
           0.3
          @@ -390,44 +390,44 @@ POSTHOOK: query: SELECT key - key FROM DECIMAL_UDF
           POSTHOOK: type: QUERY
           POSTHOOK: Input: default@decimal_udf
           #### A masked pattern was here ####
          -0E+2
          -0E+99
          -0E-99
          -0
          -0E+2
          -0E+1
          -0
          -0.0
          -0.00
          -0E+2
          -0E+1
          -0
          -0
          -0.0
          -0.00
          -0.0
          -0.00
          -0.000
          -0.0
          -0.00
          -0.000
          -0
          -0
          -0.00
          -0.00
          -0.00
          -0.000
          -0.00
          -0.000
          -0
          -0.0
          -0.00
          -0.00
          -0.00
          -0.00
          -0E-25
          -0E-9
          -0E-8
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
           PREHOOK: query: EXPLAIN SELECT key - value FROM DECIMAL_UDF
           PREHOOK: type: QUERY
           POSTHOOK: query: EXPLAIN SELECT key - value FROM DECIMAL_UDF
          @@ -475,13 +475,13 @@ POSTHOOK: Input: default@decimal_udf
           1E+99
           1E-99
           0
          -0E+2
          -0E+1
          +0
          +0
           0
           0.1
           0.01
          -0E+2
          -0E+1
          +0
          +0
           0
           0
           0.2
          @@ -555,7 +555,7 @@ POSTHOOK: Input: default@decimal_udf
           -6.6E+3
           1E+99
           1E-99
          -0.0
          +0
           5E+1
           5
           0.5
          @@ -564,7 +564,7 @@ POSTHOOK: Input: default@decimal_udf
           1E+2
           1E+1
           1
          -0.0
          +0
           0.2
           0.02
           0.3
          @@ -798,26 +798,26 @@ POSTHOOK: type: QUERY
           POSTHOOK: Input: default@decimal_udf
           #### A masked pattern was here ####
           -1.936E+7
          -0E+99
          -0E-99
          +0
          +0
           0
           1E+4
           1E+2
           1
          -0.0
          -0.00
          +0
          +0
           4E+4
           4E+2
           4
           0
          -0.0
          -0.00
          -0.0
          -0.00
          -0.000
          -0.0
          -0.00
          -0.000
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
           1
           4
           9.42
          @@ -879,26 +879,26 @@ POSTHOOK: type: QUERY
           POSTHOOK: Input: default@decimal_udf
           #### A masked pattern was here ####
           -9.68E+6
          -0E+98
          -0E-100
          -0.0
          +0
          +0
          +0
           5E+3
           5E+1
           0.5
          -0.00
          -0.000
          +0
          +0
           2E+4
           2E+2
           2
          -0.0
          -0.00
          -0.000
          -0.00
          -0.000
          -0.0000
          -0.00
          -0.000
          -0.0000
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
          +0
           0.5
           2
           4.71
          @@ -1359,8 +1359,8 @@ POSTHOOK: Input: default@decimal_udf
           #### A masked pattern was here ####
           -2.2E+3
           5E+98
          -0E-65
          -0E-65
          +0
          +0
           5E+1
           5
           0.5
          @@ -1369,7 +1369,7 @@ POSTHOOK: Input: default@decimal_udf
           1E+2
           1E+1
           1
          -0E-65
          +0
           0.1
           0.01
           0.15
          @@ -1707,9 +1707,9 @@ POSTHOOK: Input: default@decimal_udf
           1
           1
           1
          -0.0
          -0.00
          -0.000
          +0
          +0
          +0
           1
           2
           4
          @@ -1774,22 +1774,22 @@ POSTHOOK: Input: default@decimal_udf
           #### A masked pattern was here ####
           -4.4E+3
           1E+99
          -0E-99
          +0
           0
           1E+2
           1E+1
           1
          -0.0
          -0.00
          +0
          +0
           2E+2
           2E+1
           2
           0
          -0.0
          -0.00
          -0.0
          -0.00
          -0.000
          +0
          +0
          +0
          +0
          +0
           -1
           -1
           -1
          @@ -1807,7 +1807,7 @@ POSTHOOK: Input: default@decimal_udf
           3
           3
           3
          -0E-25
          +0
           -1234567891
           1.23456789E+9
           PREHOOK: query: -- round
          @@ -1857,7 +1857,7 @@ POSTHOOK: Input: default@decimal_udf
           #### A masked pattern was here ####
           -4.4E+3
           1E+99
          -0E-99
          +0
           0
           1E+2
           1E+1
          @@ -2027,22 +2027,22 @@ NULL
           NULL
           1
           1
          -0.0
          -0.00
          -0.000
          +0
          +0
          +0
           1
           1
           0
           NULL
          -0.0
          -0.00
          +0
          +0
           0.1
           0.01
           0.001
           0.1
           0.01
           0.001
          -0.0
          +0
           0
           1
           -0.12
          diff --git serde/src/java/org/apache/hadoop/hive/serde2/io/BigDecimalWritable.java serde/src/java/org/apache/hadoop/hive/serde2/io/BigDecimalWritable.java
          index 1d0a31b..db009c4 100644
          --- serde/src/java/org/apache/hadoop/hive/serde2/io/BigDecimalWritable.java
          +++ serde/src/java/org/apache/hadoop/hive/serde2/io/BigDecimalWritable.java
          @@ -57,6 +57,10 @@ public BigDecimalWritable(BigDecimal value) {
           
             public void set(BigDecimal value) {
               value = value.stripTrailingZeros();
          +    if (value.compareTo(BigDecimal.ZERO) == 0) {
          +      // Special case for 0, because java doesn't strip zeros correctly on that number.
          +      value = BigDecimal.ZERO;
          +    }
               set(value.unscaledValue().toByteArray(), value.scale());
             }
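
For context, the zero special case in this patch works around a long-standing JDK issue (JDK-6480539, fixed in Java 8): BigDecimal.stripTrailingZeros() returns zero values unchanged, so 0.00, 0E+2, 0E-99, etc. keep their scale instead of collapsing to plain 0. That is also why the test-output diffs above replace values like 0E+2 and 0E-65 with a single canonical 0. Below is a minimal standalone sketch of the quirk and of the normalization the patch applies; the class name is hypothetical and for illustration only:

import java.math.BigDecimal;

public class StripZerosDemo {
  public static void main(String[] args) {
    BigDecimal zero = new BigDecimal("0.000");

    // On affected JDKs (pre-Java 8), stripTrailingZeros() hands zero
    // values back unchanged: this prints "0.000" with scale 3.
    BigDecimal stripped = zero.stripTrailingZeros();
    System.out.println(stripped + ", scale=" + stripped.scale());

    // The patch's workaround: compareTo() compares numeric value and
    // ignores scale, so any zero (0.0, 0E+2, 0E-99, ...) can be
    // normalized to BigDecimal.ZERO, which prints as plain "0".
    if (stripped.compareTo(BigDecimal.ZERO) == 0) {
      stripped = BigDecimal.ZERO;
    }
    System.out.println(stripped + ", scale=" + stripped.scale()); // "0, scale=0"
  }
}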
          
          Carl Steinbach added a comment -

          +1 on patch 23. Running tests. Will commit if everything passes.

          Carl Steinbach added a comment -

          Committed to trunk. Thanks Josh, Vikram, Prasad, Mark, and Gunther!

          Mark Grover added a comment -

          Thanks for committing this, Carl!

FYI, I have updated the wiki pages with info about the Decimal type.
          https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
          https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL


  People

    • Assignee:
      Prasad Mujumdar
    • Reporter:
      Carl Steinbach
    • Votes:
      4
    • Watchers:
      24
