Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10347

Explore approaches to optimizing queries that will likely be short-circuited by limits

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Distributed Exec

    Description

      Based on discussion with amansinha, there are opportunities beyond IMPALA-10314 to optimize queries where there is a limit and the query is unlikely to scan many files.

      The problem is that we do all the work to generate scan ranges and schedule them upfront, which adds a lot of overhead if only a small number of files actually need to be processed.

      A couple of ideas we had:

      • Parallelize and/or otherwise optimize the scan range generation
      • Speculatively execute the query on a subset of files and then cancel and retry if we hit the limit
      • Incrementally generate scan ranges and assign them to executors so that scan range generation and execution can be overlapped. This is the most general solution but also has a lot of knock-on implications for other subsystems, like cardinality/memory estimation, scheduling, query execution, query coordination, etc.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              tarmstrong Tim Armstrong
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: