Calcite / CALCITE-1661

Support aggregation functions on DECIMAL in DruidAdapter

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.12.0
    • Fix Version/s: 1.12.0
    • Component/s: druid
    • Labels:
      None

      Description

      Currently, the choice between fractional and integer aggregations is based on the following code (line 699 in DruidQuery.java):

      final boolean b = aggCall.getType().getSqlTypeName() == SqlTypeName.DOUBLE;
      

      Since Hive might use other fractional types for the aggregation, we might end up using the wrong type of aggregation in Druid. We could extend the check as follows:

      final boolean b = SqlTypeName.FRACTIONAL_TYPES.contains(aggCall.getType().getSqlTypeName());
      
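      For illustration, here is a minimal standalone sketch of the difference between the two checks. The enum below is a stand-in modeled on the types named in this discussion, not Calcite's actual SqlTypeName class; the point is only that a DOUBLE-only comparison misclassifies the other fractional types, which would then get an integer (long) aggregation in Druid.

      ```java
      import java.util.EnumSet;

      // Stand-in for Calcite's SqlTypeName, covering just the types discussed here.
      enum SqlTypeName {
          TINYINT, SMALLINT, INTEGER, BIGINT, DECIMAL, FLOAT, DOUBLE;

          static final EnumSet<SqlTypeName> FRACTIONAL_TYPES =
              EnumSet.of(DECIMAL, FLOAT, DOUBLE);
      }

      public class AggTypeCheck {
          // Old check: only DOUBLE is treated as fractional.
          static boolean isFractionalOld(SqlTypeName t) {
              return t == SqlTypeName.DOUBLE;
          }

          // Proposed check: any member of FRACTIONAL_TYPES.
          static boolean isFractionalNew(SqlTypeName t) {
              return SqlTypeName.FRACTIONAL_TYPES.contains(t);
          }

          public static void main(String[] args) {
              // FLOAT and DECIMAL slip through the old check.
              System.out.println(isFractionalOld(SqlTypeName.FLOAT));   // false
              System.out.println(isFractionalNew(SqlTypeName.FLOAT));   // true
              System.out.println(isFractionalNew(SqlTypeName.DECIMAL)); // true
          }
      }
      ```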

          Activity

          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, could you take a look at the PR in https://github.com/apache/calcite/pull/385 ? This problem is not reproducible in Calcite, since all columns coming from Druid metrics are considered either DOUBLE or LONG, and if a CAST is in the way, we will not push it to Druid. However, when we create our own tables in Druid from Hive, we might have other data types (FLOAT, DECIMAL) coming from the Druid table. Thanks.

          julianhyde Julian Hyde added a comment -

          I see that FRACTIONAL_TYPES = approximate types (DOUBLE + FLOAT) plus DECIMAL. Should we include DECIMAL? I find that it's a mistake to mix exact types with inexact types.

          To be clear, I'm not proposing to redefine FRACTIONAL_TYPES. I'm proposing that we use APPROX_TYPES for recognizing aggregation function types.

          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, I had to choose FRACTIONAL types because the problem in Hive was with Decimal, so it was intentional. If we use only approximate types, we will still face this issue for decimal type. Maybe another option is to check the scale of the decimal, and generate one type or another depending on that?

          julianhyde Julian Hyde added a comment -

          So, if Hive has a column c of type DECIMAL(6, 2) stored inside Druid, and a user executes SUM(c), they are going to get a DOUBLE (i.e. 64 bit floating point) result. Rounding may occur. Is the user OK with that?
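          The rounding concern can be seen in a small standalone sketch, with no Druid involved: summing DECIMAL(6, 2)-style values in 64-bit floating point accumulates binary representation error, while an exact decimal sum does not.

          ```java
          import java.math.BigDecimal;

          public class DecimalVsDouble {
              // Sum n copies of 0.10 exactly, as a DECIMAL(6, 2) column would hold them.
              static BigDecimal sumExact(int n) {
                  BigDecimal total = BigDecimal.ZERO;
                  BigDecimal tenth = new BigDecimal("0.10");
                  for (int i = 0; i < n; i++) {
                      total = total.add(tenth);
                  }
                  return total;
              }

              // Sum n copies of 0.10 in 64-bit floating point, as a double aggregation would.
              static double sumApprox(int n) {
                  double total = 0.0;
                  for (int i = 0; i < n; i++) {
                      total += 0.10;
                  }
                  return total;
              }

              public static void main(String[] args) {
                  System.out.println(sumExact(10_000));  // 1000.00
                  System.out.println(sumApprox(10_000)); // close to, but not exactly, 1000.0
              }
          }
          ```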

          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited

          You are right, that is a fair point. Rounding might happen. Overflow might happen too, though I do not know whether there is deterministic behavior for handling it, e.g., defined by the SQL standard.

          When I proposed this fix, I was assuming that this is something that we will document properly on the Hive side, so we can support DECIMAL type (similar to what we do with the handling of NULL semantics by Druid). Are you suggesting that we should not support DECIMAL in Druid?

          We could check the precision and scale to control overflow, though the user needs to be aware of what is going on in Druid and that rounding might happen.

          Ashutosh Chauhan, what do you think? Julian Hyde, could you give your opinion about this too?
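          The scale-based option mentioned above could look something like the following. This is a hypothetical helper, not Calcite code, and the names AggKind and choose are invented for illustration: a DECIMAL(p, 0) column holds only integers, so a long aggregation is exact for it, while any nonzero scale would need a floating-point sum (accepting possible rounding).

          ```java
          // Hypothetical sketch: pick an aggregation kind from a type name, using
          // the scale for DECIMAL as suggested in this discussion.
          public class AggKindChooser {
              enum AggKind { LONG_SUM, DOUBLE_SUM }

              static AggKind choose(String typeName, int scale) {
                  switch (typeName) {
                      case "DOUBLE":
                      case "FLOAT":
                      case "REAL":
                          return AggKind.DOUBLE_SUM;
                      case "DECIMAL":
                          // Scale 0 means integer values only: a long sum is exact.
                          // Nonzero scale falls back to a (possibly rounded) double sum.
                          return scale == 0 ? AggKind.LONG_SUM : AggKind.DOUBLE_SUM;
                      default:
                          return AggKind.LONG_SUM;
                  }
              }
          }
          ```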

          julianhyde Julian Hyde added a comment -

          I don't think we should support DECIMAL in Druid. Druid doesn't support DECIMAL any more than it supports, say, a geospatial data type.

          ashutoshc Ashutosh Chauhan added a comment -

          I agree: if a column is declared as DECIMAL in Hive, users would expect calculations to be precise, and doing them in DOUBLE would surprise them.
          However, I think it is worth considering adding a mode accept.approx.resultset (false by default) in Hive/Calcite to allow this. We could use this mode elsewhere too, e.g., for topN queries, which also return an approximate result set.
          In my experience, in the world of big data, at least some users are willing to trade accuracy for speed.

          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, Ashutosh Chauhan, I have updated the PR in https://github.com/apache/calcite/pull/385 . Could you take a look? Thanks

          julianhyde Julian Hyde added a comment -

          Jesus Camacho Rodriguez, looks good, but please update site/_docs/adapter.md for the new connect string property.

          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/0372d23 . Thanks for the feedback, Julian Hyde.
          julianhyde Julian Hyde added a comment - - edited

          Re-opening. This change breaks DruidAdapterIT:

          • All tests in DruidAdapterIT pass against a269a2b, CALCITE-1623.
          • As of 49888a6 there is one failure, DruidAdapterIT.testFilterTimestamp. Nishant Bangarwa mentioned this already. I'll look into it. I suspect that it is a plan improvement due to CALCITE-1601.
          • With the next change, 0372d23, there are 25 failures. The tests take much longer than usual (13 minutes), and it seems that plans have changed to do much of the aggregation in Calcite rather than in Druid.
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, I am working on it.

          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited

          Fixed up in http://git-wip-us.apache.org/repos/asf/calcite/commit/952214a .

          Checked that all unit tests and Druid tests pass, except DruidAdapterIT.testFilterTimestamp.

          julianhyde Julian Hyde added a comment -

          Thanks, that's perfect. I have a subsequent change that will fix DruidAdapterIT.testFilterTimestamp.

          julianhyde Julian Hyde added a comment -

          Resolved in release 1.12.0 (2017-03-24).


            People

            • Assignee: jcamachorodriguez Jesus Camacho Rodriguez
            • Reporter: jcamachorodriguez Jesus Camacho Rodriguez
            • Votes: 0
            • Watchers: 2
