Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.12.0
    • Component/s: druid
    • Labels:
      None

      Description

      The druid calcite adapter throw an exception when an IN filter is used.
      This happens only in hive because in calcite project the IN filter is transformed to OR automatically.
      Since this rule does not kick in HIVE and it is better to use the native IN filter from druid instead having huge number of OR clauses i will send a patch that adds the IN filter.

        Issue Links

          Activity

          Hide
          bslim slim bouguerra added a comment -

          Julian Hyde can you please checkout this patch https://github.com/apache/calcite/pull/381.
          Still looking how to add IT test, i can not find a way to disable the automatic rewriting of IN to OR, any pointers ?

          Show
          bslim slim bouguerra added a comment - Julian Hyde can you please checkout this patch https://github.com/apache/calcite/pull/381 . Still looking how to add IT test, i can not find a way to disable the automatic rewriting of IN to OR, any pointers ?
          Hide
          julianhyde Julian Hyde added a comment -

          Disabling the rewriting of IN to OR is probably not a good idea. After all, people could write the query as an OR and we'd like that to work too.

          I really need a test case. It's difficult to visualize what's happening without one.

          Show
          julianhyde Julian Hyde added a comment - Disabling the rewriting of IN to OR is probably not a good idea. After all, people could write the query as an OR and we'd like that to work too. I really need a test case. It's difficult to visualize what's happening without one.
          Hide
          bslim slim bouguerra added a comment -

          Julian Hyde i agree that we should keep the OR but i think that re-writing IN to bunch of OR is not necessary. In fact i bet all the DB stores perform better with IN than multiple OR clauses.
          As you can see here https://issues.apache.org/jira/browse/HIVE-16025 the re-writing is not happening because HIVE likes to keep IN rather than evaluating all the ORs individually.

          Show
          bslim slim bouguerra added a comment - Julian Hyde i agree that we should keep the OR but i think that re-writing IN to bunch of OR is not necessary. In fact i bet all the DB stores perform better with IN than multiple OR clauses. As you can see here https://issues.apache.org/jira/browse/HIVE-16025 the re-writing is not happening because HIVE likes to keep IN rather than evaluating all the ORs individually.
          Hide
          julianhyde Julian Hyde added a comment -

          Disabling IN expansion isn't really an option. Yes, most engines benefit from having ordered sequences of ranges when they are doing scans. And Calcite can help convert into SARGs. But at an earlier stage it is useful to expand IN so that we can apply de Morgan's rules. Also, we can't alter the behavior just for Druid because Calcite doesn't (and shouldn't) know which engine is going to be running the query.

          Calcite isn't expecting to see IN, just as it isn't expecting to see BETWEEN. If Hive passes INs to Calcite, even if we fix this code, something else will break.

          Show
          julianhyde Julian Hyde added a comment - Disabling IN expansion isn't really an option. Yes, most engines benefit from having ordered sequences of ranges when they are doing scans. And Calcite can help convert into SARGs . But at an earlier stage it is useful to expand IN so that we can apply de Morgan's rules. Also, we can't alter the behavior just for Druid because Calcite doesn't (and shouldn't) know which engine is going to be running the query. Calcite isn't expecting to see IN, just as it isn't expecting to see BETWEEN. If Hive passes INs to Calcite, even if we fix this code, something else will break.
          Hide
          bslim slim bouguerra added a comment -

          Julian Hyde FYI looking at native druid calcite implementation, we are using native druid IN filter as well for IN clause.

          Show
          bslim slim bouguerra added a comment - Julian Hyde FYI looking at native druid calcite implementation, we are using native druid IN filter as well for IN clause.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          IN and BETWEEN are included in the isValidFilter method in DruidQuery, that is why we get this Exception: we push the Filter, but then we do not know how to generate the JSON for its predicate. To avoid the Exception, we would either need to 1) remove IN/BETWEEN from isValidFilter (which would prevent us from pushing them to Druid), or 2) extend the JSON generator as slim bouguerra did.

          I prefer option 2), as Hive does not expand those kind of predicates, but we can still push them to Druid. The problem is that it is not possible to add tests to DruidIT because of the Calcite IN clause expansion, but the solution is not to alter Calcite behavior for all cases. Why don't we simply add a unit test that verify the JSON that we are generating from the RexNode? That would fix the problem.
          In addition, can we add the case to generate JSON for the BETWEEN clause too (if Druid does not support BETWEEN natively, we can just generate it as a range predicate)? Otherwise, we will get the same Exception for BETWEEN clause.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - IN and BETWEEN are included in the isValidFilter method in DruidQuery, that is why we get this Exception: we push the Filter, but then we do not know how to generate the JSON for its predicate. To avoid the Exception, we would either need to 1) remove IN/BETWEEN from isValidFilter (which would prevent us from pushing them to Druid), or 2) extend the JSON generator as slim bouguerra did. I prefer option 2), as Hive does not expand those kind of predicates, but we can still push them to Druid. The problem is that it is not possible to add tests to DruidIT because of the Calcite IN clause expansion, but the solution is not to alter Calcite behavior for all cases. Why don't we simply add a unit test that verify the JSON that we are generating from the RexNode? That would fix the problem. In addition, can we add the case to generate JSON for the BETWEEN clause too (if Druid does not support BETWEEN natively, we can just generate it as a range predicate)? Otherwise, we will get the same Exception for BETWEEN clause.
          Hide
          julianhyde Julian Hyde added a comment -

          IN and BETWEEN are not valid operators in the RexNode language. Hive has been sending calls to IN and BETWEEN to Calcite, bypassing the validation that happens when you go from SQL via SqlToRelConverter. If you want to create these calls within the Druid adapter I guess it won't do much harm, but if these {{RexNode}}s ever come into contact with the existing rules, bad things may happen. If that happens, you'll have to change the code in Hive that generates {{RexNode}}s. I don't see any point in supporting IN and BETWEEN in the wider Calcite code, because they are syntactic sugar.

          I agree with Jesus Camacho Rodriguez that we need to see a unit test going from RexNode to Druid JSON.

          Show
          julianhyde Julian Hyde added a comment - IN and BETWEEN are not valid operators in the RexNode language. Hive has been sending calls to IN and BETWEEN to Calcite, bypassing the validation that happens when you go from SQL via SqlToRelConverter. If you want to create these calls within the Druid adapter I guess it won't do much harm, but if these {{RexNode}}s ever come into contact with the existing rules, bad things may happen. If that happens, you'll have to change the code in Hive that generates {{RexNode}}s. I don't see any point in supporting IN and BETWEEN in the wider Calcite code, because they are syntactic sugar. I agree with Jesus Camacho Rodriguez that we need to see a unit test going from RexNode to Druid JSON.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          I dont think its a good idea to force runtime to execute OR expressions even when user has specified IN or BETWEEN. Simply because in most runtimes (Hive & Druid) in/between executes way faster than OR tree. By insisting on translating IN/BETWEEN into ORs you are forcing perf penalty unnecessarily.

          Show
          ashutoshc Ashutosh Chauhan added a comment - I dont think its a good idea to force runtime to execute OR expressions even when user has specified IN or BETWEEN. Simply because in most runtimes (Hive & Druid) in/between executes way faster than OR tree. By insisting on translating IN/BETWEEN into ORs you are forcing perf penalty unnecessarily.
          Hide
          julianhyde Julian Hyde added a comment -

          Ashutosh Chauhan, Sure, it's nice if we can generate sargs for Hive and Druid and other runtimes.

          However, we can and should do that without adding IN and BETWEEN to Calcite's RexNode language. SqlKind.OR is used in 27 places, so that gives you an idea of the scope of adding support for IN. BETWEEN would be similar. I'd estimate that the patch would be 1000-2000 lines. And even then, things would keep breaking, because we've increased the surface area (e.g. someone writes a query with JOIN ... ON ... BETWEEN and we haven't tested all of the 23 rules that handle Join).

          My suggestion is write a converter from RexNode that contains only (AND, OR, =, <=, etc.) to one that also includes scalar-IN and BETWEEN. It would happen in the Druid adapter just before Rel-to-JSON generation, performance impact negligible, and a straightforward coding task. Plus it will work for queries that were originally written using OR.

          Show
          julianhyde Julian Hyde added a comment - Ashutosh Chauhan , Sure, it's nice if we can generate sargs for Hive and Druid and other runtimes. However, we can and should do that without adding IN and BETWEEN to Calcite's RexNode language. SqlKind.OR is used in 27 places, so that gives you an idea of the scope of adding support for IN. BETWEEN would be similar. I'd estimate that the patch would be 1000-2000 lines. And even then, things would keep breaking, because we've increased the surface area (e.g. someone writes a query with JOIN ... ON ... BETWEEN and we haven't tested all of the 23 rules that handle Join). My suggestion is write a converter from RexNode that contains only (AND, OR, =, <=, etc.) to one that also includes scalar-IN and BETWEEN. It would happen in the Druid adapter just before Rel-to-JSON generation, performance impact negligible, and a straightforward coding task. Plus it will work for queries that were originally written using OR.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Whether we go into Calcite with IN/BETWEEN clauses, or we generate them via converter before generating the Druid JSON query, we need this fix, so I would suggest we check it in.

          Concerning the IN/BETWEEN integration, it goes beyond the scope of the Druid adapter. As it is mentioned above, Hive has been sending calls to IN and BETWEEN to Calcite, which maybe it is preventing some optimizations from kicking in; we need to explore that.

          My suggestion is write a converter from RexNode that contains only (AND, OR, =, <=, etc.) to one that also includes scalar-IN and BETWEEN.

          In Hive, we already have such converter rule for IN clause, not for BETWEEN though. We are also missing the piece that would decompose IN/BETWEEN clauses into AND, OR, =, <=, etc.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Whether we go into Calcite with IN/BETWEEN clauses, or we generate them via converter before generating the Druid JSON query, we need this fix, so I would suggest we check it in. Concerning the IN/BETWEEN integration, it goes beyond the scope of the Druid adapter. As it is mentioned above, Hive has been sending calls to IN and BETWEEN to Calcite , which maybe it is preventing some optimizations from kicking in; we need to explore that. My suggestion is write a converter from RexNode that contains only (AND, OR, =, <=, etc.) to one that also includes scalar-IN and BETWEEN. In Hive, we already have such converter rule for IN clause, not for BETWEEN though. We are also missing the piece that would decompose IN/BETWEEN clauses into AND, OR, =, <=, etc.
          Hide
          ashutoshc Ashutosh Chauhan added a comment -

          Agree that patch which slim bouguerra has for generating IN in druid json is required regardless. That can go in independent of decision we make for adding IN/Between support in RexNode.

          Show
          ashutoshc Ashutosh Chauhan added a comment - Agree that patch which slim bouguerra has for generating IN in druid json is required regardless. That can go in independent of decision we make for adding IN/Between support in RexNode.
          Hide
          bslim slim bouguerra added a comment -

          @all i have added UTs to the current PR.

          Show
          bslim slim bouguerra added a comment - @all i have added UTs to the current PR.
          Hide
          julianhyde Julian Hyde added a comment -

          slim bouguerra, I don't think you should use Mockito. RexNode is quite straightforward to instantiate: see RexBuilderTest and RexProgramTest for examples. Using the real thing will make the tests easier to read, maintain and extend.

          I think you should add tests for some other types, such as timestamp and decimal.

          Show
          julianhyde Julian Hyde added a comment - slim bouguerra , I don't think you should use Mockito. RexNode is quite straightforward to instantiate: see RexBuilderTest and RexProgramTest for examples. Using the real thing will make the tests easier to read, maintain and extend. I think you should add tests for some other types, such as timestamp and decimal.
          Hide
          julianhyde Julian Hyde added a comment -

          I have reviewed; looks good. I have staged in my https://github.com/julianhyde/calcite/tree/1656-druid-prune branch until the breakage due to CALCITE-1661 is cleared up, then I will commit to master.

          Show
          julianhyde Julian Hyde added a comment - I have reviewed; looks good. I have staged in my https://github.com/julianhyde/calcite/tree/1656-druid-prune branch until the breakage due to CALCITE-1661 is cleared up, then I will commit to master.
          Hide
          julianhyde Julian Hyde added a comment -

          slim bouguerra, I found another issue. The test fails on JDK 9. Can you take a look, please. Here's the output:

          Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.619 sec <<< FAILURE! - in org.apache.calcite.adapter.druid.DruidQueryFilterTest
          testInFilter(org.apache.calcite.adapter.druid.DruidQueryFilterTest)  Time elapsed: 0.619 sec  <<< ERROR!
          java.lang.NoClassDefFoundError: Could not initialize class org.mockito.internal.creation.cglib.ClassImposterizer$3
                  at org.apache.calcite.adapter.druid.DruidQueryFilterTest.testInFilter(DruidQueryFilterTest.java:52)
          
          testBetweenFilterStringCase(org.apache.calcite.adapter.druid.DruidQueryFilterTest)  Time elapsed: 0.618 sec  <<< ERROR!
          java.lang.ExceptionInInitializerError
                  at org.apache.calcite.adapter.druid.DruidQueryFilterTest.testBetweenFilterStringCase(DruidQueryFilterTest.java:81)
          Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make protected final java.lang.Class java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain) throws java.lang.ClassFormatError accessible: module java.base does not "opens java.lang" to unnamed module @2cc3dec5
                  at org.apache.calcite.adapter.druid.DruidQueryFilterTest.testBetweenFilterStringCase(DruidQueryFilterTest.java:81)
          
          
          Show
          julianhyde Julian Hyde added a comment - slim bouguerra , I found another issue. The test fails on JDK 9. Can you take a look, please. Here's the output: Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 0.619 sec <<< FAILURE! - in org.apache.calcite.adapter.druid.DruidQueryFilterTest testInFilter(org.apache.calcite.adapter.druid.DruidQueryFilterTest) Time elapsed: 0.619 sec <<< ERROR! java.lang.NoClassDefFoundError: Could not initialize class org.mockito.internal.creation.cglib.ClassImposterizer$3 at org.apache.calcite.adapter.druid.DruidQueryFilterTest.testInFilter(DruidQueryFilterTest.java:52) testBetweenFilterStringCase(org.apache.calcite.adapter.druid.DruidQueryFilterTest) Time elapsed: 0.618 sec <<< ERROR! java.lang.ExceptionInInitializerError at org.apache.calcite.adapter.druid.DruidQueryFilterTest.testBetweenFilterStringCase(DruidQueryFilterTest.java:81) Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make protected final java.lang.Class java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain) throws java.lang.ClassFormatError accessible: module java.base does not "opens java.lang" to unnamed module @2cc3dec5 at org.apache.calcite.adapter.druid.DruidQueryFilterTest.testBetweenFilterStringCase(DruidQueryFilterTest.java:81)
          Hide
          bslim slim bouguerra added a comment -

          Julian Hyde please checkout the new commit.

          Show
          bslim slim bouguerra added a comment - Julian Hyde please checkout the new commit.
          Hide
          julianhyde Julian Hyde added a comment -

          I'm not sure which was your "new commit". Anyway, I fixed the problem with Mockito on JDK 9 by using upgrading Mockito and using mockito-core rather than mockito-all; the same fix as Josh Elser made in CALCITE-1568. So, it all works now.

          Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/f8546915. Thanks for the PR, slim bouguerra!

          Show
          julianhyde Julian Hyde added a comment - I'm not sure which was your "new commit". Anyway, I fixed the problem with Mockito on JDK 9 by using upgrading Mockito and using mockito-core rather than mockito-all; the same fix as Josh Elser made in CALCITE-1568 . So, it all works now. Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/f8546915 . Thanks for the PR, slim bouguerra !
          Hide
          julianhyde Julian Hyde added a comment -

          Resolved in release 1.12.0 (2017-03-24).

          Show
          julianhyde Julian Hyde added a comment - Resolved in release 1.12.0 (2017-03-24).

            People

            • Assignee:
              bslim slim bouguerra
              Reporter:
              bslim slim bouguerra
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development