Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-794

Detect cycles when computing statistics

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: None
    • Labels:
      None

      Description

      The graph of RelNodes is allowed to be cyclic. This causes problems when evaluating certain metadata, for example RelMetataQuery.areColumnsUnique. While computing the value for RelNode r, it might recurse through say a Project and hit r again. This causes a stack overflow.

      We solve this by adding a map or set of active RelNodes. The map is stored within RelMetadataQuery, which can now be instantiated, and its methods are no longer static. The first call should instantiate a RelMetadataQuery, but all subsequent calls for metadata (perhaps several kinds of metadata) will use the same RelMetadataQuery instance, hence the same map.

      Also add a RelMetadataQuery argument to the static "handler" methods in RelMdColumnUniqueness and similar classes.

      This is a breaking change for people who have written a metadata handler, and might be subtle to detect, because the methods are invoked via reflection.

      For code that is just using RelMetadataQuery methods, the change is still breaking, but the break points and remedy will be obvious: the methods are no longer static, so they need to change RelMetadataQuery.foo() to RelMetadataQuery.instance().foo().

        Issue Links

          Activity

          Hide
          julianhyde Julian Hyde added a comment -

          Resolved in release 1.6.0 (2016-01-22).

          Show
          julianhyde Julian Hyde added a comment - Resolved in release 1.6.0 (2016-01-22).
          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/cabdcf44 .
          Hide
          julianhyde Julian Hyde added a comment -

          I have logged CALCITE-1048 as a follow-up, and will rename CALCITE_794_FIXED to CALCITE_1048_FIXED.

          Show
          julianhyde Julian Hyde added a comment - I have logged CALCITE-1048 as a follow-up, and will rename CALCITE_794_FIXED to CALCITE_1048_FIXED.
          Hide
          julianhyde Julian Hyde added a comment -

          Jinfeng Ni, Thanks for reviewing. I didn't expect you to be able to compile/run against my branch; I just wanted a review to confirm that we are heading in a direction that would be useful to Drill, and you've done that.

          I'll reply to your comments here.

          What's the reasoning that we use the min of rowCount from the Rels in a RelSubset?

          I figured that if we'd found a "clever" algorithm (say if an implementation of a join has noticed that one of the inputs has distinct rows and therefore it can use a semi-join) then it would have both a lower estimated row count and a better estimate. But it's just a hunch, not backed up by any empirical evidence.

          Why do we still need this flag? I assume the patch CALCITE_794 would fix the problem.

          Good point. I will log a follow-up JIRA case for all issues not fixed in CALCITE-794, and will rename the flag to match.

          If the logic to detect cycle belongs to RelMetadataQuery, does it make more sense to catch CyclicMetadataException in RelMetadataQuery, and return default value there?

          You might be right. We aren't dealing with cyclic metadata very systematically yet. I think it will evolve as we learn more. Now we have the mechanism to detect cycles, it seems to be very easy to fix them when they arise. My guess is that the provider of the specific kind of metadata is in a better position to provide a sensible default than RelMetadataQuery, but I don't feel strongly about it. I'll look into your idea when I finalize this patch.

          Could you please add comments here to explain why the code is applied for these types of Rel nodes? What about other Rel nodes like Sort?

          That method, public Boolean areColumnsUnique(RelSubset rel, RelMetadataQuery mq, ImmutableBitSet columns, boolean ignoreNulls), is still somewhat of a hack, of a similar nature to the other problems with the Bug.CALCITE_794_FIXED flag. We found that if we went too deep into a RelSubset (and its descendants) then things blew up. There is still work to be done there.

          Show
          julianhyde Julian Hyde added a comment - Jinfeng Ni , Thanks for reviewing. I didn't expect you to be able to compile/run against my branch; I just wanted a review to confirm that we are heading in a direction that would be useful to Drill, and you've done that. I'll reply to your comments here. What's the reasoning that we use the min of rowCount from the Rels in a RelSubset? I figured that if we'd found a "clever" algorithm (say if an implementation of a join has noticed that one of the inputs has distinct rows and therefore it can use a semi-join) then it would have both a lower estimated row count and a better estimate. But it's just a hunch, not backed up by any empirical evidence. Why do we still need this flag? I assume the patch CALCITE_794 would fix the problem. Good point. I will log a follow-up JIRA case for all issues not fixed in CALCITE-794 , and will rename the flag to match. If the logic to detect cycle belongs to RelMetadataQuery, does it make more sense to catch CyclicMetadataException in RelMetadataQuery, and return default value there? You might be right. We aren't dealing with cyclic metadata very systematically yet. I think it will evolve as we learn more. Now we have the mechanism to detect cycles, it seems to be very easy to fix them when they arise. My guess is that the provider of the specific kind of metadata is in a better position to provide a sensible default than RelMetadataQuery, but I don't feel strongly about it. I'll look into your idea when I finalize this patch. Could you please add comments here to explain why the code is applied for these types of Rel nodes? What about other Rel nodes like Sort? That method, public Boolean areColumnsUnique(RelSubset rel, RelMetadataQuery mq, ImmutableBitSet columns, boolean ignoreNulls) , is still somewhat of a hack, of a similar nature to the other problems with the Bug.CALCITE_794_FIXED flag. We found that if we went too deep into a RelSubset (and its descendants) then things blew up. There is still work to be done there.
          Hide
          jni Jinfeng Ni added a comment -

          Looks like it's not an easy job to cherry-pick CALCITE-794 onto Drill's forked Calcite, as Drill still uses Calcite 1.4.0. I run into bunch of problems which make me think maybe we had better wait until we move Drill onto Calcite 1.5.0.

          I did look through the code change and put several comments in Julian's Calcite-794 branch: https://github.com/julianhyde/calcite/commit/516f934e317e12d38cf4050f009b2772a9b0987e

          Show
          jni Jinfeng Ni added a comment - Looks like it's not an easy job to cherry-pick CALCITE-794 onto Drill's forked Calcite, as Drill still uses Calcite 1.4.0. I run into bunch of problems which make me think maybe we had better wait until we move Drill onto Calcite 1.5.0. I did look through the code change and put several comments in Julian's Calcite-794 branch: https://github.com/julianhyde/calcite/commit/516f934e317e12d38cf4050f009b2772a9b0987e
          Hide
          jni Jinfeng Ni added a comment -

          Drill is still using Calcite 1.4.0 as the base of a forked version. I'll cherry-pick CALCITE-794, and see if it causes any problem in Drill side. Hopefully, I will get result back in 1-2 days.

          Jacques Nadeau, it would be great if you can take a look at CALCITE-794 and its impact on Drill side too.

          Show
          jni Jinfeng Ni added a comment - Drill is still using Calcite 1.4.0 as the base of a forked version. I'll cherry-pick CALCITE-794 , and see if it causes any problem in Drill side. Hopefully, I will get result back in 1-2 days. Jacques Nadeau , it would be great if you can take a look at CALCITE-794 and its impact on Drill side too.
          Hide
          julianhyde Julian Hyde added a comment -
          Show
          julianhyde Julian Hyde added a comment - Thanks Maryann Xue ; I have added your change as https://github.com/julianhyde/calcite/commit/f83e56fdfbb4a73a01509d24db9f93853302cf42 and I will squash later.
          Hide
          maryannxue Maryann Xue added a comment -

          Tested out the branch with Calcite-Phoenix and found one bug (https://github.com/julianhyde/calcite/commit/516f934e317e12d38cf4050f009b2772a9b0987e#commitcomment-15024485). Otherwise, it looks good to me.

          Show
          maryannxue Maryann Xue added a comment - Tested out the branch with Calcite-Phoenix and found one bug ( https://github.com/julianhyde/calcite/commit/516f934e317e12d38cf4050f009b2772a9b0987e#commitcomment-15024485 ). Otherwise, it looks good to me.
          Hide
          julianhyde Julian Hyde added a comment -

          As of https://github.com/julianhyde/calcite/commit/779d72523690d8f4c41a5c209c108937a7bb3057 all tests on the 794-cycles branch pass, and performance is at par with the master branch. So, we can merge into master.

          To achieve adequate performance, I had to prevent some of the RelMdXxxx methods from looking into all of the children of RelSubset. To be clear, this leaves us no worse off than master branch, and we now have an API that can detect cycles. I think the solution is to add some caching, maybe within the RelMetadataQuery.

          I still need reviewers from Hive and Drill. Jesus Camacho Rodriguez and Jacques Nadeau, can you review or suggest someone.

          Show
          julianhyde Julian Hyde added a comment - As of https://github.com/julianhyde/calcite/commit/779d72523690d8f4c41a5c209c108937a7bb3057 all tests on the 794-cycles branch pass, and performance is at par with the master branch. So, we can merge into master. To achieve adequate performance, I had to prevent some of the RelMdXxxx methods from looking into all of the children of RelSubset. To be clear, this leaves us no worse off than master branch, and we now have an API that can detect cycles. I think the solution is to add some caching, maybe within the RelMetadataQuery. I still need reviewers from Hive and Drill. Jesus Camacho Rodriguez and Jacques Nadeau , can you review or suggest someone.
          Hide
          maryannxue Maryann Xue added a comment -

          Yes, I see where visiting all rels for deducing metadata of a RelSubset will be used, for Predicates, maxRowCount, etc. Actually what I'm saying is that most of them (if not all) should do it this way rather than use the "best" rel. And rowCount is an example that I think should go through all rels to be strictly correct in non-monotonic cases, while we now choose to use "best" and cache it for performance benefits.

          Show
          maryannxue Maryann Xue added a comment - Yes, I see where visiting all rels for deducing metadata of a RelSubset will be used, for Predicates, maxRowCount, etc. Actually what I'm saying is that most of them (if not all) should do it this way rather than use the "best" rel. And rowCount is an example that I think should go through all rels to be strictly correct in non-monotonic cases, while we now choose to use "best" and cache it for performance benefits.
          Hide
          julianhyde Julian Hyde added a comment -

          Predicates are a good example where you don't want to just use the result from the "best". You want to union the predicates of all the rels in the subset (or in fact the set).

          Show
          julianhyde Julian Hyde added a comment - Predicates are a good example where you don't want to just use the result from the "best". You want to union the predicates of all the rels in the subset (or in fact the set).
          Hide
          maryannxue Maryann Xue added a comment -

          I see, Julian Hyde. Thanks for the explanation! So speaking of metadata caching, CALCITE-830 is just another problem, right? Certain metadata can be represented by the "best" RelNode in a RelSubset in most cases, like rowCount and columnUniqueness(CALCITE-938), but there might be exceptions. So I'm not sure if we should sacrifice this efficiency of returning the "best" RelNode's metadata vs. traversing all RelNodes in the RelSubset for some rare cases which (if ever exist) we haven't found in any real query examples yet.

          Show
          maryannxue Maryann Xue added a comment - I see, Julian Hyde . Thanks for the explanation! So speaking of metadata caching, CALCITE-830 is just another problem, right? Certain metadata can be represented by the "best" RelNode in a RelSubset in most cases, like rowCount and columnUniqueness( CALCITE-938 ), but there might be exceptions. So I'm not sure if we should sacrifice this efficiency of returning the "best" RelNode's metadata vs. traversing all RelNodes in the RelSubset for some rare cases which (if ever exist) we haven't found in any real query examples yet.
          Hide
          julianhyde Julian Hyde added a comment -

          The performance impact of creating the RelMetadataQuery object should be insignificant compared to other inefficiencies that exist. RelMetadataQuery.instance creates a RelMetadataQuery and it contains a HashSet.

          One inefficiency is that ReflectiveRelMetadataProvider uses Proxy.newProxyInstance to create implementations of interfaces, and Method.invoke to call the methods. I think method handles would be much better. See CALCITE-604.

          A larger inefficiency is that I'm not sure we're doing effective caching. I think we make many calls to, say, get the estimated row count of a particular RelNode. We have to be careful not to introduce too much caching: if we add a RelNode to a RelSubset we need to make sure that that improvement is seen by any RelNode that uses that RelSubset or anything downstream of it.

          But still, if you sample the stack during a test that uses volcano heavily, Calcite will likely be in a metadata call, both before and after this issue is fixed. That seems wrong.

          Show
          julianhyde Julian Hyde added a comment - The performance impact of creating the RelMetadataQuery object should be insignificant compared to other inefficiencies that exist. RelMetadataQuery.instance creates a RelMetadataQuery and it contains a HashSet. One inefficiency is that ReflectiveRelMetadataProvider uses Proxy.newProxyInstance to create implementations of interfaces, and Method.invoke to call the methods. I think method handles would be much better. See CALCITE-604 . A larger inefficiency is that I'm not sure we're doing effective caching. I think we make many calls to, say, get the estimated row count of a particular RelNode. We have to be careful not to introduce too much caching: if we add a RelNode to a RelSubset we need to make sure that that improvement is seen by any RelNode that uses that RelSubset or anything downstream of it. But still, if you sample the stack during a test that uses volcano heavily, Calcite will likely be in a metadata call, both before and after this issue is fixed. That seems wrong.
          Hide
          maryannxue Maryann Xue added a comment -

          Julian Hyde, Sure, I am very interested actually. One quick question before I dive into the code, what is the performance impact? I had been thinking about cycle detection (with no knowledge of this issue's existence) when I worked on CALCITE-938 and CALCITE-1018, but was just not sure what the performance would be. Does the current solution create a map each time a RelMetadataQuery is instantiated? Would it be better (or worse, or is it even possible) that we maintain a global map in the VolcanoPlanner?

          Show
          maryannxue Maryann Xue added a comment - Julian Hyde , Sure, I am very interested actually. One quick question before I dive into the code, what is the performance impact? I had been thinking about cycle detection (with no knowledge of this issue's existence) when I worked on CALCITE-938 and CALCITE-1018 , but was just not sure what the performance would be. Does the current solution create a map each time a RelMetadataQuery is instantiated? Would it be better (or worse, or is it even possible) that we maintain a global map in the VolcanoPlanner?
          Hide
          julianhyde Julian Hyde added a comment -

          Maryann Xue, If you are interested in working on this issue as part of a longer term fix for CALCITE-1018, let me know. I am dusting off the 794-cycles branch right now, and will rebase and see how the test suite is looking.

          Show
          julianhyde Julian Hyde added a comment - Maryann Xue , If you are interested in working on this issue as part of a longer term fix for CALCITE-1018 , let me know. I am dusting off the 794-cycles branch right now, and will rebase and see how the test suite is looking.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, thanks for the details.

          I will checkout 794-cycles as you suggested and continue the work done in that branch.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , thanks for the details. I will checkout 794-cycles as you suggested and continue the work done in that branch.
          Hide
          julianhyde Julian Hyde added a comment -

          Jesus Camacho Rodriguez, Regarding AggregateJoinTransposeRule and cycles, note that that I spent some time working with Ashutosh Chauhan yesterday. Since metadata calls were more expensive than the rule, if you generated a call stack Calcite would probably be evaluating metadata; but the real problem was the cycle in the Hep program. We solved the loop by adding HepProgramBuilder.addMatchLimit(1) to a program.

          Regarding actual cycles - which show up as StackOverflowError - I did quite a bit of work on this issue but did not manage to complete it. You can see my work in https://github.com/julianhyde/calcite/tree/794-cycles. The key innovation is that before making a metadata call you create an instance of RelMetadataQuery, and that has sufficient state to detect re-entrant calls. Once you are "inside" the metadata system - e.g. if you are implementing a provider for row count that needs to get selectivity - you use the RelMetadataQuery object that was passed in.

          This would be a big API change, but just about acceptable for a minor release, in my opinion.

          I can't remember what work was left to do. I think you should checkout 794-cycles and see which tests still fail. It wouldn't be very hard to rebase onto current master.

          Show
          julianhyde Julian Hyde added a comment - Jesus Camacho Rodriguez , Regarding AggregateJoinTransposeRule and cycles, note that that I spent some time working with Ashutosh Chauhan yesterday. Since metadata calls were more expensive than the rule, if you generated a call stack Calcite would probably be evaluating metadata; but the real problem was the cycle in the Hep program. We solved the loop by adding HepProgramBuilder.addMatchLimit(1) to a program. Regarding actual cycles - which show up as StackOverflowError - I did quite a bit of work on this issue but did not manage to complete it. You can see my work in https://github.com/julianhyde/calcite/tree/794-cycles . The key innovation is that before making a metadata call you create an instance of RelMetadataQuery, and that has sufficient state to detect re-entrant calls. Once you are "inside" the metadata system - e.g. if you are implementing a provider for row count that needs to get selectivity - you use the RelMetadataQuery object that was passed in. This would be a big API change, but just about acceptable for a minor release, in my opinion. I can't remember what work was left to do. I think you should checkout 794-cycles and see which tests still fail. It wouldn't be very hard to rebase onto current master.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, as we introduced AggregateJoinTransposeRule in Hive, we are seeing huge stack calls for the areColumnsUnique method that lead to slow/never-ending query compilation.

          We have a temporary workaround to avoid that call in AggregateJoinTransposeRule in Hive (HIVE-12508), but we should fix this problem from the root so we do not hit it with other metadata providers.

          Please, let me know the status (I saw there are some related issues such as CALCITE-790 or CALCITE-604) and if you are working on these issues; otherwise, I could start taking a look at them. Thanks

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , as we introduced AggregateJoinTransposeRule in Hive, we are seeing huge stack calls for the areColumnsUnique method that lead to slow/never-ending query compilation. We have a temporary workaround to avoid that call in AggregateJoinTransposeRule in Hive ( HIVE-12508 ), but we should fix this problem from the root so we do not hit it with other metadata providers. Please, let me know the status (I saw there are some related issues such as CALCITE-790 or CALCITE-604 ) and if you are working on these issues; otherwise, I could start taking a look at them. Thanks

            People

            • Assignee:
              julianhyde Julian Hyde
              Reporter:
              julianhyde Julian Hyde
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development