Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-1366

Metadata provider should not pull predicates up through GROUP BY ()

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: 1.9.0
    • Component/s: core
    • Labels:
      None

      Description

      In org.apache.calcite.rel.metadata.RelMdPredicates.java:

      ...
          for (RexNode r : inputInfo.pulledUpPredicates) {
            ImmutableBitSet rCols = RelOptUtil.InputFinder.bits(r);
            if (groupKeys.contains(rCols)) {
              r = r.accept(new RexPermuteInputsShuttle(m, input));
              aggPullUpPredicates.add(r);
            }
          }
      ...
      

      The check does not take into account that rCols might be empty, and then r cannot be pulled up e.g. count(*).

        Issue Links

          Activity

          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Julian Hyde, this is a small fix for a corner case: https://github.com/apache/calcite/pull/276

          Could you take a look? Thanks

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Julian Hyde , this is a small fix for a corner case: https://github.com/apache/calcite/pull/276 Could you take a look? Thanks
          Hide
          julianhyde Julian Hyde added a comment - - edited

          Can you justify why zero columns should be a special case? You can think of COUNT(*) is just the 0-column variant of COUNT(x) (1 column), COUNT(x, y) (2 columns), etc.

          Maybe we shouldn't be pulling up aggregates? In which case there would also be a problem with SELECT x, SUM(x) ... GROUP BY x.

          Show
          julianhyde Julian Hyde added a comment - - edited Can you justify why zero columns should be a special case? You can think of COUNT(*) is just the 0-column variant of COUNT(x) (1 column), COUNT(x, y) (2 columns), etc. Maybe we shouldn't be pulling up aggregates? In which case there would also be a problem with SELECT x, SUM(x) ... GROUP BY x .
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited

          I have been thinking further about this.

          The example that took my attention was:

          select count(*) from emp where false;
          

          We should not pull up the predicate in that case, but currently we do.

          However, if the query is:

          select a, count(a) from emp where false group by a;
          

          We can pull it up.

          My understanding is that indeed we can pull a predicate through Aggregate iff all its references are to grouping columns; thus, the code highlighted in the description of this issue is correct. The special case is not about the predicate then, but about Aggregate operator without grouping columns. If you agree, I could change the fix accordingly.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - - edited I have been thinking further about this. The example that took my attention was: select count(*) from emp where false; We should not pull up the predicate in that case, but currently we do. However, if the query is: select a, count(a) from emp where false group by a; We can pull it up. My understanding is that indeed we can pull a predicate through Aggregate iff all its references are to grouping columns ; thus, the code highlighted in the description of this issue is correct. The special case is not about the predicate then, but about Aggregate operator without grouping columns. If you agree, I could change the fix accordingly.
          Hide
          julianhyde Julian Hyde added a comment - - edited

          I'm not sure what it means to pull up a predicate on an aggregate function. We know that count(*) or count(a) is positive; we know that any predicate on a also applies to min(a); but not much else. (These don't work for GROUP BY (), because it can yield an empty group when applied to an empty table.)

          My understanding is that indeed we can pull a predicate through Aggregate iff all its references are to grouping columns; thus, the code highlighted in the description of this issue is correct.

          I agree. When thinking about such queries, I add group by ()), thus select count(*) from emp where false becomes select count(*) from emp where false group by ().

          Show
          julianhyde Julian Hyde added a comment - - edited I'm not sure what it means to pull up a predicate on an aggregate function. We know that count(*) or count(a) is positive; we know that any predicate on a also applies to min(a) ; but not much else. (These don't work for GROUP BY () , because it can yield an empty group when applied to an empty table.) My understanding is that indeed we can pull a predicate through Aggregate iff all its references are to grouping columns; thus, the code highlighted in the description of this issue is correct. I agree. When thinking about such queries, I add group by ()) , thus select count(*) from emp where false becomes select count(*) from emp where false group by () .
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Thus Julian Hyde, I do not know if I understand your point of view correctly. Is it that we should not pull up any predicate in the presence of aggregate functions (we would need to demonstrate or find evidence that this is correct)? Or do you think that adding a special case for the empty grouping columns should be enough to guarantee correctness?

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Thus Julian Hyde , I do not know if I understand your point of view correctly. Is it that we should not pull up any predicate in the presence of aggregate functions (we would need to demonstrate or find evidence that this is correct)? Or do you think that adding a special case for the empty grouping columns should be enough to guarantee correctness?
          Hide
          julianhyde Julian Hyde added a comment -

          I wasn't clear, because I'm confused about what it even means to pull up a predicate on an aggregate function. Can you revisit the example and explain what predicates are pulled up?

          By the way, it's probably not helpful, but the pedantic logician in me has to point this out: if you have a relational expression that you're sure returns 0 rows, you can validly put any predicate you like on it. Consider select a, b from t where false. If you put a predicate a < 10, it would be valid - no rows will violate that predicate. False implies anything. So, for these purposes, examples with where false are probably not the best.

          Show
          julianhyde Julian Hyde added a comment - I wasn't clear, because I'm confused about what it even means to pull up a predicate on an aggregate function. Can you revisit the example and explain what predicates are pulled up? By the way, it's probably not helpful, but the pedantic logician in me has to point this out: if you have a relational expression that you're sure returns 0 rows, you can validly put any predicate you like on it. Consider select a, b from t where false . If you put a predicate a < 10 , it would be valid - no rows will violate that predicate. False implies anything . So, for these purposes, examples with where false are probably not the best.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          false for the example is a deliberate choice, a predicate that does not contain input references and filters everything. true would be the other choice, but it is less interesting as it filters nothing.

          In the predicates metadata provider, when we pull up predicates though any operator op, the statement that we make is : predicate x applies on the op input, thus we know it transitively applies on op output too.

          I understand that Aggregate operator with no grouping columns is a special case, as it produces a single row with the values output by its Aggregate functions. Thus, I do not see how any predicate on its input can be inferred to be valid on its output too.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - false for the example is a deliberate choice, a predicate that does not contain input references and filters everything. true would be the other choice, but it is less interesting as it filters nothing. In the predicates metadata provider, when we pull up predicates though any operator op , the statement that we make is : predicate x applies on the op input, thus we know it transitively applies on op output too . I understand that Aggregate operator with no grouping columns is a special case, as it produces a single row with the values output by its Aggregate functions. Thus, I do not see how any predicate on its input can be inferred to be valid on its output too.
          Hide
          julianhyde Julian Hyde added a comment - - edited

          WHERE FALSE and WHERE TRUE (or something that constant-reduces to these, such as WHERE 1 < 2) seem to be the only cases that could be pulled through GROUP BY() and both of them are so degenerate as to be useless.

          So, if we have

          SELECT k0, k1, k2, agg0(), agg1(m1), agg2(m1, m2), agg3(k1)
          FROM t
          GROUP BY k0, k1, k2
          

          then we can pull up predicates only on columns k0, k1, k2. The aggregates are new expressions, so have no predicates.

          For particular aggregate functions we may be able to generate new predicates, e.g. count(*) and count(c) are always non-negative (and in fact positive if we know groups are non-empty), and if c has a lower bound then it is also a lower bound for min(c). If you agree, I'll log a new JIRA case to infer predicates for aggregate functions.

          Show
          julianhyde Julian Hyde added a comment - - edited WHERE FALSE and WHERE TRUE (or something that constant-reduces to these, such as WHERE 1 < 2 ) seem to be the only cases that could be pulled through GROUP BY() and both of them are so degenerate as to be useless. So, if we have SELECT k0, k1, k2, agg0(), agg1(m1), agg2(m1, m2), agg3(k1) FROM t GROUP BY k0, k1, k2 then we can pull up predicates only on columns k0, k1, k2. The aggregates are new expressions, so have no predicates. For particular aggregate functions we may be able to generate new predicates, e.g. count(*) and count(c) are always non-negative (and in fact positive if we know groups are non-empty), and if c has a lower bound then it is also a lower bound for min(c) . If you agree, I'll log a new JIRA case to infer predicates for aggregate functions.
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          I think the follow-up on inferring predicates from the aggregate functions is a cool idea and can help us to fold some expressions even further, +1 on creating a new issue to tackle it.

          In the meantime, I think it makes sense this fix goes in.

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - I think the follow-up on inferring predicates from the aggregate functions is a cool idea and can help us to fold some expressions even further, +1 on creating a new issue to tackle it. In the meantime, I think it makes sense this fix goes in.
          Hide
          julianhyde Julian Hyde added a comment -

          Yes, it can go in. I am testing and will commit shortly.

          I believe this bug only occurs for one, degenerate case: WHERE FALSE GROUP BY (). Since GROUP BY () doesn't project any columns there are only two predicates possible: TRUE and FALSE. TRUE is fine (albeit trivial). FALSE is valid on the input to the Aggregate (the input is empty) but GROUP BY () has the mysterious effect of converting an empty relation to a single-row relation. So the predicate FALSE no longer holds.

          Show
          julianhyde Julian Hyde added a comment - Yes, it can go in. I am testing and will commit shortly. I believe this bug only occurs for one, degenerate case: WHERE FALSE GROUP BY () . Since GROUP BY () doesn't project any columns there are only two predicates possible: TRUE and FALSE . TRUE is fine (albeit trivial). FALSE is valid on the input to the Aggregate (the input is empty) but GROUP BY () has the mysterious effect of converting an empty relation to a single-row relation. So the predicate FALSE no longer holds.
          Hide
          julianhyde Julian Hyde added a comment -

          I've logged https://issues.apache.org/jira/browse/CALCITE-1368 for predicates on aggregate functions.

          Show
          julianhyde Julian Hyde added a comment - I've logged https://issues.apache.org/jira/browse/CALCITE-1368 for predicates on aggregate functions.
          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/29daa451 .
          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Resolved in release 1.9.0 (2016-09-22)

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Resolved in release 1.9.0 (2016-09-22)

            People

            • Assignee:
              jcamachorodriguez Jesus Camacho Rodriguez
              Reporter:
              jcamachorodriguez Jesus Camacho Rodriguez
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development