Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1358

Push filters on time dimension to Druid

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: 1.9.0
    • Component/s: druid
    • Labels:
      None

      Description

      Porting work done in HIVE-14217.

      Logic should split the filter conditions into two parts: those predicates referred to the time dimension, and predicates referred to other columns.

      Then, the predicates on the time dimension should be translated into Druid intervals, possibly consolidating those ranges e.g. to detect overlapping. The other predicates will go into the Druid filter field.

        Issue Links

          Activity

          Hide
          julianhyde Julian Hyde added a comment -

          How does this relate to CALCITE-1334? See the work that I have already done in https://github.com/julianhyde/calcite/tree/1334-extract-to-range. In particular, see the test cases that work on the EXTRACT function.

          Show
          julianhyde Julian Hyde added a comment - How does this relate to CALCITE-1334 ? See the work that I have already done in https://github.com/julianhyde/calcite/tree/1334-extract-to-range . In particular, see the test cases that work on the EXTRACT function.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, I checked the initial work in CALCITE-1334, and both issues complement each other well. In particular:

          • CALCITE-1334 helps identifying ranges out of filter expressions expressed in terms of standard SQL functions.
          • CALCITE-1358 consolidates those filters e.g. checking for overlaps, and translate them into Druid valid intervals. In fact, the consolidation logic is generic enough and could probably factorized in a follow-up.

          In addition, HIVE-14217 adds support for the recognition of Druid default granularity functions in Group By clause. While this is not standard SQL, it should be quite useful, as e.g. Druid minute granularity would mean to group by:

          extract(year from __time), extract(month from __time),..., extract(minute from __time)

          (let alone the fact that the actual value of __time column would change)

          ==

          I have pulled all the Calcite work out of HIVE-14217 and pushed it to a new branch:
          https://github.com/jcamachor/calcite/tree/calcite-druid

          There are a couple of items remaining to make it ready to go into Calcite, specifically 1) adding a results reader for new supported Druid query types, and 2) adding new tests for those query types as well as for more complex Filter predicates. I am going to pull other issues from Hive for 1.9.0; Julian Hyde, if you want to take over the work in the branch in the meantime, please feel free.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , I checked the initial work in CALCITE-1334 , and both issues complement each other well. In particular: CALCITE-1334 helps identifying ranges out of filter expressions expressed in terms of standard SQL functions. CALCITE-1358 consolidates those filters e.g. checking for overlaps, and translate them into Druid valid intervals. In fact, the consolidation logic is generic enough and could probably factorized in a follow-up. In addition, HIVE-14217 adds support for the recognition of Druid default granularity functions in Group By clause. While this is not standard SQL, it should be quite useful, as e.g. Druid minute granularity would mean to group by: extract(year from __time), extract(month from __time),..., extract(minute from __time) (let alone the fact that the actual value of __time column would change) == I have pulled all the Calcite work out of HIVE-14217 and pushed it to a new branch: https://github.com/jcamachor/calcite/tree/calcite-druid There are a couple of items remaining to make it ready to go into Calcite, specifically 1) adding a results reader for new supported Druid query types, and 2) adding new tests for those query types as well as for more complex Filter predicates. I am going to pull other issues from Hive for 1.9.0; Julian Hyde , if you want to take over the work in the branch in the meantime, please feel free.
          Hide
          julianhyde Julian Hyde added a comment - - edited

          In addition, HIVE-14217 adds support for the recognition of Druid default granularity functions in Group By clause.

          Jesus Camacho Rodriguez, What syntax did you use for this? We have the FLOOR(dateTime TO timeUnit) function in Calcite (it's non standard but quite I believe other DBs have it, and we lean heavily on it in streaming queries) and you can use this to group in minute granularity:

          select productId, count(*)
          from Orders
          group by floor(orderTime to minute), productId
          

          Would floor solve your use-case?

          Show
          julianhyde Julian Hyde added a comment - - edited In addition, HIVE-14217 adds support for the recognition of Druid default granularity functions in Group By clause. Jesus Camacho Rodriguez , What syntax did you use for this? We have the FLOOR(dateTime TO timeUnit) function in Calcite (it's non standard but quite I believe other DBs have it, and we lean heavily on it in streaming queries) and you can use this to group in minute granularity: select productId, count(*) from Orders group by floor(orderTime to minute), productId Would floor solve your use-case?
          Hide
          julianhyde Julian Hyde added a comment -

          Jesus Camacho Rodriguez, I have CALCITE-1334 almost ready to check in. For instance, it converts

          EXTRACT(YEAR FROM __time) BETWEEN 2010 AND 2020
          AND EXTRACT(MONTH FROM __time) = 2
          AND EXTRACT(DAY FROM __TIME) = 29
          

          to

          AND(>=($8, 2011-01-01),
            AND(>=($8, 2011-01-01), <($8, 2020-01-01)),
            OR(AND(>=($8, 2011-02-01), <($8, 2011-03-01)),
              AND(>=($8, 2012-02-01), <($8, 2012-03-01)),
              AND(>=($8, 2013-02-01), <($8, 2013-03-01)),
              AND(>=($8, 2014-02-01), <($8, 2014-03-01)),
              AND(>=($8, 2015-02-01), <($8, 2015-03-01)),
              AND(>=($8, 2016-02-01), <($8, 2016-03-01)),
              AND(>=($8, 2017-02-01), <($8, 2017-03-01)),
              AND(>=($8, 2018-02-01), <($8, 2018-03-01)),
              AND(>=($8, 2019-02-01), <($8, 2019-03-01))),
            OR(AND(>=($8, 2012-02-29), <($8, 2012-03-01)),
              AND(>=($8, 2016-02-29), <($8, 2016-03-01))))
          

          As you can see, the result is correct (it returns the leap days in 2012 and 2016) but redundant. You can delete all but the last 2 lines, because they are wider intervals that subsume the ones at the end.

          Have you done this "interval folding" as part of HIVE-14217, and will it be in CALCITE-1358? If you have, I won't write it again!

          Show
          julianhyde Julian Hyde added a comment - Jesus Camacho Rodriguez , I have CALCITE-1334 almost ready to check in. For instance, it converts EXTRACT(YEAR FROM __time) BETWEEN 2010 AND 2020 AND EXTRACT(MONTH FROM __time) = 2 AND EXTRACT(DAY FROM __TIME) = 29 to AND(>=($8, 2011-01-01), AND(>=($8, 2011-01-01), <($8, 2020-01-01)), OR(AND(>=($8, 2011-02-01), <($8, 2011-03-01)), AND(>=($8, 2012-02-01), <($8, 2012-03-01)), AND(>=($8, 2013-02-01), <($8, 2013-03-01)), AND(>=($8, 2014-02-01), <($8, 2014-03-01)), AND(>=($8, 2015-02-01), <($8, 2015-03-01)), AND(>=($8, 2016-02-01), <($8, 2016-03-01)), AND(>=($8, 2017-02-01), <($8, 2017-03-01)), AND(>=($8, 2018-02-01), <($8, 2018-03-01)), AND(>=($8, 2019-02-01), <($8, 2019-03-01))), OR(AND(>=($8, 2012-02-29), <($8, 2012-03-01)), AND(>=($8, 2016-02-29), <($8, 2016-03-01)))) As you can see, the result is correct (it returns the leap days in 2012 and 2016) but redundant. You can delete all but the last 2 lines, because they are wider intervals that subsume the ones at the end. Have you done this "interval folding" as part of HIVE-14217 , and will it be in CALCITE-1358 ? If you have, I won't write it again!
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Yes, this is a good idea. I did not know it could be done that way; I could use floor to infer the granularity, instead of defining custom functions. I will change the code accordingly.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Yes, this is a good idea. I did not know it could be done that way; I could use floor to infer the granularity, instead of defining custom functions. I will change the code accordingly.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          That is exactly the part that was already implemented as part of HIVE-14217, thus no need to reimplement it!

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - That is exactly the part that was already implemented as part of HIVE-14217 , thus no need to reimplement it!
          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/962eac5 .
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Resolved in release 1.9.0 (2016-09-22)

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Resolved in release 1.9.0 (2016-09-22)

            People

            • Assignee:
              jcamachorodriguez Jesus Camacho Rodriguez
              Reporter:
              jcamachorodriguez Jesus Camacho Rodriguez
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development