Calcite / CALCITE-987

Push limit 0 will result in an infinite loop

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.6.0
    • Component/s: core
    • Labels:
      None

      Description

      We use checkInputForCollationAndLimit in RelMdUtil.java to check the input's row count. However, it calls RelMetadataQuery.getRowCount, which validates the row count, and the validation changes a row count of 0 to 1. This results in an infinite loop when pushing a limit of 0. The affected rules include SortUnionTransposeRule and any Sort*TransposeRule that calls checkInputForCollationAndLimit.
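      To illustrate the loop, here is a minimal sketch (hypothetical names, not the actual Calcite code): because the validated row count is never less than 1, a guard comparing it against a limit of 0 always concludes the limit is still worth pushing.

```java
// Hypothetical sketch of the bug, not actual Calcite code.
public class LimitPushLoopSketch {
  // Mimics the validation in RelMetadataQuery.getRowCount: a row count
  // of 0 is changed to 1.
  static double validatedRowCount(double raw) {
    return Math.max(raw, 1.0);
  }

  // A guard in the spirit of checkInputForCollationAndLimit: push the
  // limit only while the input may produce more rows than the limit.
  static boolean shouldPushLimit(double inputRowCount, int limit) {
    return validatedRowCount(inputRowCount) > limit;
  }
}
```

      With limit 0, the guard is true even after the limit has been pushed and the input truly produces 0 rows, so the rule keeps firing.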

      Attachments

      1. CALCITE-987.01.patch
        5 kB
        Pengcheng Xiong
      2. CALCITE-987.02.patch
        4 kB
        Pengcheng Xiong
      3. CALCITE-987.03.patch
        11 kB
        Pengcheng Xiong
      4. CALCITE-987.04.patch
        11 kB
        Pengcheng Xiong

          Activity

          Julian Hyde added a comment -

          Resolved in release 1.6.0 (2016-01-22).

          Julian Hyde added a comment -

          Mike Hinchey, can you log a new issue for this? It sounds as if your issue is important and 987 is done.

          Mike Hinchey added a comment - edited

          I moved my comment to CALCITE-1005.

          Julian Hyde added a comment -

          The above issues are fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/24b07471.
          Julian Hyde added a comment -

          I disagree with you both about Intersect. If we have intersect(A, B, C), and A, B, C have max row counts of 1000, unknown, 40, then the answer is 40. Intersect's max row count is the minimum of its inputs' max row counts, and it is safe to treat unknown as +infinity.

          I agree regarding estimateJoinedRows. I will add the null check.
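          The min-with-unknown-as-infinity rule described above can be sketched like this (hypothetical helper, not Calcite code):

```java
// Hypothetical sketch: Intersect's max row count is the minimum of its
// inputs' bounds, with an unknown bound (null) treated as +infinity,
// i.e. simply skipped.
public class IntersectMaxRowCountSketch {
  static Double maxRowCount(Double... inputBounds) {
    double result = Double.POSITIVE_INFINITY;
    for (Double bound : inputBounds) {
      if (bound != null && bound < result) {
        result = bound;
      }
    }
    return result;
  }
}
```

          For bounds 1000, unknown, 40 this yields 40, matching the example.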

          Jesus Camacho Rodriguez added a comment -

          Julian Hyde, thanks, and sorry for checking this in without reviewing it exhaustively. A couple of comments:

          • In estimateJoinedRows, a not null check should be done for Double left and Double right too.
          • I agree with Pengcheng Xiong that at least in RelMdMaxRowCount, as we are not returning an estimate, it would be better to return null if we cannot get the row count from all children?
          Pengcheng Xiong added a comment -

          Thanks Julian Hyde for the jumbo patch. Sorry, those mistakes were my bad. I reviewed the code and I have a question: for the intersect case, shouldn't the logic be the following?

          Double rowCount = Double.POSITIVE_INFINITY;
          for (RelNode input : rel.getInputs()) {
            Double partialRowCount = RelMetadataQuery.getMaxRowCount(input);
            if (partialRowCount == null) {
              return null;
            }
            if (partialRowCount < rowCount) {
              rowCount = partialRowCount;
            }
          }
          return rowCount;

          Julian Hyde added a comment - edited

          Pengcheng Xiong and Jesus Camacho Rodriguez, please review https://github.com/julianhyde/calcite/tree/987-max-row-count. I believe I've fixed the above issues.
          Julian Hyde added a comment -

          I'm working on a fix for all of the above.

          Julian Hyde added a comment -

          Plus, Jesus Camacho Rodriguez's subsequent rule for TableScan, calling Table.getRows(), is wrong.

          Julian Hyde added a comment -

          Sorry, some more post-commit review:

          • The rule for Join is wrong for outer joins: it should add 1 for each outer side. Also, left and right could be null.
          • The rule for Aggregate is wrong if the group key is empty: an Aggregate with an empty group key always returns 1 row, even if its input is empty.
          • The rule for Aggregate is wrong if there are grouping sets.
          • The rule for Sort adds offset and limit. Should it not subtract offset from limit?
          • Add a rule for Values. It will be extremely useful, especially since we represent the empty relation as a Values with 0 rows.
          • Rules for Intersect and Minus would be somewhat useful: return getMaxRowCount(inputs[0]).
          • The description of CALCITE-987 in testSortUnionTranspose3 is wrong.
          • The javadoc for RelMetadataQuery.getMaxRowCount has a copy-paste error.

          I don't know whether it would be possible to add a test to RelMetadataTest. I see testRowCountEmp is disabled. I should investigate why.
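          One plausible reading of the Sort point above can be sketched as follows (a hypothetical helper under stated assumptions, not the actual Calcite implementation): the offset consumes rows from the input and fetch caps the output, so the bound is min(inputMax - offset, fetch) rather than offset + fetch.

```java
// Hypothetical sketch of a Sort max-row-count rule; not actual Calcite
// code. Assumes `fetch` is the LIMIT/FETCH value and `offset` the OFFSET.
public class MaxRowCountFixSketch {
  static Double sort(Double inputMax, Integer offset, Integer fetch) {
    // Unknown input bound acts as +infinity.
    double bound = (inputMax == null) ? Double.POSITIVE_INFINITY : inputMax;
    if (offset != null) {
      bound = Math.max(bound - offset, 0d); // offset skips rows
    }
    if (fetch != null) {
      bound = Math.min(bound, fetch); // fetch caps the output
    }
    return bound;
  }
}
```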

          Jesus Camacho Rodriguez added a comment -

          Thanks Pengcheng Xiong. Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/33f737e.
          Pengcheng Xiong added a comment -

          Jesus Camacho Rodriguez and Julian Hyde, thanks for your comments. I have attached a new patch after addressing your comments. Thanks.

          Julian Hyde added a comment -

          I don't think the default implementation of maxRowCount should call getRows. The default implementation should return Double.POSITIVE_INFINITY. We want to be able to remove limits based on maxRowCount, so it needs to be a safe bound. If getRows returns 999 it is not safe to remove a limit 1000, because getRows is allowed to be wrong. But if maxRowCount returns 999 we can safely remove a limit 1000.
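          The safety argument can be captured in a tiny sketch (hypothetical helper, not Calcite code):

```java
// Hypothetical sketch: a limit may be removed only against a hard upper
// bound, never against an estimate. null means "unknown", which acts as
// +infinity and is never safe.
public class LimitRemovalSketch {
  static boolean canRemoveLimit(Double maxRowCount, long limit) {
    return maxRowCount != null && maxRowCount <= limit;
  }
}
```

          If maxRowCount returns 999, a LIMIT 1000 can be dropped; an estimated 999 from getRows never justifies that, because the estimate is allowed to be wrong.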

          Jesus Camacho Rodriguez added a comment -

          Pengcheng Xiong, I checked the new patch, LGTM, thanks!

          I would just slightly change the following block for Union:

          if (partialRowCount == null) {
            continue;
          }
          

          so it returns null in case it cannot get the count for one of the inputs. If you agree, I will just change it when I check in the patch.
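          The suggested change amounts to propagating null instead of skipping it; a sketch (hypothetical helper, not the patch itself):

```java
// Hypothetical sketch: Union's max row count is the sum of its inputs'
// bounds, and one unknown input makes the whole sum unknown.
public class UnionMaxRowCountSketch {
  static Double maxRowCount(Double... inputBounds) {
    double total = 0d;
    for (Double bound : inputBounds) {
      if (bound == null) {
        return null; // was "continue" in the patch under review
      }
      total += bound;
    }
    return total;
  }
}
```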

          Julian Hyde added a comment -

          Yes, it needs to be a new metadata provider.

          Suppose r has an estimated row count of 5 rows. sort(r limit 10) will have an estimated row count of 5, maximum row count of 10. Est and Max are not always the same.

          Jesus Camacho Rodriguez added a comment -

          Julian Hyde, +1 to the idea of adding a maximum row count; I think we had even discussed that previously.

          I have just checked the latest patch by Pengcheng Xiong. The patch relies on the estimated row count to obtain the maximum row count. I think we should instead provide a genuinely different metadata provider and method.
          In some cases the estimated row count might equal the max row count, but not in all cases. For instance, consider a join operator with two inputs that produce n and m tuples, respectively. If no PK/FK relationship is present, we rely on estimated selectivity to calculate the estimated row count, but the maximum row count should not be the result of applying Math.ceil to the estimated row count; rather, the max row count should be n * m.
          We do not need to implement maximum row count for every operator in this patch; we can delegate the implementation of the method to the different systems using Calcite. But at least the patch should provide the implementation of max row count for the Sort operator (which might equal the estimated row count in this case) so it works as expected with the LimitJoinTranspose and LimitUnionTranspose rules.
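          The n * m argument for joins can be sketched as follows (hypothetical helper, not Calcite code):

```java
// Hypothetical sketch: with no PK/FK information, the hard upper bound
// of a join over inputs producing at most n and m rows is n * m; an
// unknown input bound makes the join bound unknown.
public class JoinMaxRowCountSketch {
  static Double maxRowCount(Double leftMax, Double rightMax) {
    if (leftMax == null || rightMax == null) {
      return null;
    }
    return leftMax * rightMax;
  }
}
```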

          Julian Hyde added a comment -

          Now regarding limit 0 specifically. We knew it was a shortcut when we used estimated row count to avoid pushing limit twice. I like the idea that the estimated row count should never be less than one – it seems appropriate for the vast majority of the things people use estimated row count for.

          I think we should add a new kind of metadata, maximum row count. Use that, rather than estimated row count, to make sure that we don't push limit twice.

          Julian Hyde added a comment -

          Regarding metadata validation.

          Whether the validation is done by the metadata provider or consumer or by the framework is a matter of taste. But clearly someone has to do it!

          validateResult is the framework trying to apply a light sanity check. But consumers should also do some validation: I think it would be foolish for a consumer to divide by a value before they've checked that it is not null and greater than zero.

          It would seem to me that if a method has very few providers and a lot of consumers then it would be better to validate in the providers, and vice versa. I don't feel strongly about this. But if you decide to change the policy, make sure to do it throughout Calcite and update the documentation.
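          Consumer-side validation might look like this sketch (hypothetical helper, not Calcite code):

```java
// Hypothetical sketch: a consumer checks a metadata value before dividing
// by it, instead of relying on the provider to clamp it to >= 1.
public class MetadataConsumerSketch {
  static double safeRatio(double numerator, Double rowCount) {
    if (rowCount == null || rowCount <= 0d) {
      return 1.0; // conservative fallback when the metadata is unusable
    }
    return numerator / rowCount;
  }
}
```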

          Pengcheng Xiong added a comment -

          Jesus Camacho Rodriguez, thanks for your comments. I partially agree with your opinion. But please also note that we had a similar discussion before about "whether to use 0 or 1 in explain when a Hive operator outputs 0 rows". We followed what Postgres does, i.e., validate it to 1. So, if getRowCount is used externally, e.g., in explain, I would prefer to validate it to 1; if it is used internally, e.g., in pushing limit through union/join, I would prefer not to validate it. Thus, I would like to have two functions, which is what I did in the patch. Let's see Julian Hyde's opinion on this.

          Jesus Camacho Rodriguez added a comment -

          The reason to use the validateResult method seems to be:

              // Never let the result go below 1, as it will result in incorrect
              // calculations if the row-count is used as the denominator in a
              // division expression.  Also, cap the value at the max double value
              // to avoid calculations using infinity.
          

          IMO, the metadata providers should not validate the result, as this is just metadata information. Instead, the method that calls e.g. getDistinctRowCount should be the one that checks that the value is not zero if it wants to use it as a denominator in a division. Thus, I would actually change the original getDistinctRowCount method so it does not call validateResult.

          Julian Hyde, what's your take on this?

          Pengcheng Xiong added a comment -

          According to Laljo John Pullokkaran's suggestion, Jesus Camacho Rodriguez, could you please review it? Thanks.


  People

    • Assignee: Pengcheng Xiong
    • Reporter: Pengcheng Xiong
    • Votes: 0
    • Watchers: 4