IMPALA-5671: Union node may evaluate all children even if limit is reached

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: None
    • Component/s: Backend
    • Labels: None

      Description

      The loop inside UnionNode::GetNextMaterialized() does not break when the limit has been reached: the only ways out of the loop are for the children to be exhausted or for the current row batch to become full.

      If you have a union node with a limit of 1 and two children, where the first is very cheap to evaluate and returns one row but the second is very expensive, the union node will still try to fill an entire row batch and end up waiting on the second child, even though the node could finish after reading one row from the first child.
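
      To make the loop shape concrete, here is a minimal, self-contained C++ sketch of the pattern described above. It is not Impala's actual code: RowBatch, Child, kBatchSize, and ReachedLimit() are illustrative stand-ins, and the early return marks where the proposed break on the limit would go.

        // Toy model of a union's materialization loop; all names are
        // stand-ins, not Impala's real classes.
        #include <cstddef>
        #include <cstdio>
        #include <functional>
        #include <vector>

        constexpr std::size_t kBatchSize = 1024;  // typical row batch size

        struct RowBatch {
          std::vector<int> rows;
          bool AtCapacity() const { return rows.size() >= kBatchSize; }
        };

        // A child that yields `remaining` rows, paying `cost` per row.
        struct Child {
          int remaining;
          std::function<void()> cost;
          bool GetNext(int* row) {
            if (remaining == 0) return false;
            cost();
            *row = --remaining;
            return true;
          }
        };

        struct UnionNode {
          std::vector<Child> children;
          std::size_t child_idx = 0;
          long limit = -1;  // -1 means no limit
          long rows_returned = 0;

          bool ReachedLimit() const {
            return limit >= 0 && rows_returned >= limit;
          }

          void GetNextMaterialized(RowBatch* batch) {
            while (child_idx < children.size() && !batch->AtCapacity()) {
              int row;
              if (!children[child_idx].GetNext(&row)) {
                ++child_idx;  // current child exhausted; move to the next
                continue;
              }
              batch->rows.push_back(row);
              ++rows_returned;
              // The early-out this report asks for. Without it, the loop
              // keeps pulling from the expensive second child until the
              // batch is full or all children are exhausted.
              if (ReachedLimit()) return;
            }
          }
        };

        int main() {
          UnionNode node;
          node.limit = 1;
          node.children.push_back(Child{1, [] {}});  // cheap: one row
          node.children.push_back(Child{1, [] { /* stands in for the cross join */ }});
          RowBatch batch;
          node.GetNextMaterialized(&batch);
          std::printf("returned %zu row(s)\n", batch.rows.size());  // prints 1
        }

      The cost of the early return is that callers may now receive a batch that is only partially filled because of the limit, which is the complexity trade-off weighed in the comments below.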

      The result is a query that takes much longer to complete than it should. In the example below, l produces a handful of rows almost immediately, while r has to evaluate a self cross join of lineitem; a LIMIT 2 could be satisfied from l alone, yet the query still waits on r:

      with l as (select 1 from functional.alltypes group by month),
           r as (select count(*) from lineitem a cross join lineitem b)
      select * from l union all (select * from r) limit 2
      

        Activity

        Henry Robinson added a comment -

        FYI Taras Bobrovytsky.
        Taras Bobrovytsky added a comment -

        We thought the code would be simpler and more straightforward if we didn't try to return partially filled row batches due to the limit. Breaking out of the loop on the limit would improve performance rarely enough that it didn't seem worth complicating the code. Maybe we can revisit this decision.
        Henry Robinson added a comment -

        I think it would be worth revisiting if the code doesn't get too much more complex - this issue can lead to arbitrary differences between the time that the query is ready to return rows, and the time the user actually sees those rows.

        What would become more complex if GetNextMaterialized() could return a half-full batch due to the limit? (I would guess it can already do that anyhow if the last child only provides a few rows).

        Tim Armstrong added a comment -

        We don't provide any guarantees in general about the difference in time between when a row is produced internally and when it is returned to the user. I think there are lots of cases where producing the next row can be arbitrarily expensive (e.g. scans with selective predicates, moving to the next partition in a spilled join, etc.).

        I can see why it would be nice to return sooner, but it seems like something we'd need end-to-end guarantees across the whole query pipeline to do in general. For example, a local change in the UnionNode would be ineffective if the limit was actually on a node higher in the pipeline, such as when the union is on the left side of a join node with a limit.

        Henry Robinson added a comment -

        I agree, we shouldn't make guarantees, but this in particular seems like a case that might confuse users who can look at the plan, see that the query produced enough rows quite quickly, and wonder what Impala is doing for the next however-long. There's also an efficiency argument: burning CPU and IO time on a plan sub-tree whose results are guaranteed to be discarded seems worth addressing. A smaller point is that the responsiveness of the query is highly dependent on the limit size - if it's a multiple of the row batch size (and the limit is satisfied by the first child) the query will return rows very quickly.

        I agree that if the limit can't be pushed down (like in the join node case you mention), there's no easy optimization to apply, but would the fix here be too complex to justify for this case?


          People

          • Assignee: Unassigned
          • Reporter: Henry Robinson
          • Votes: 0
          • Watchers: 3
