IMPALA-10314

Planning time for simple SELECT with LIMIT could be improved

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 3.4.0
    • Fix Version/s: Impala 4.0.0
    • Component/s: Frontend
    • Labels: None

    Description

      Consider a table t1 with the following characteristics:

      HDFS, Parquet format, external table
      number of partitions in t1: 39000 (two-level partitioning)
      number of columns: 72
      number of files: 350000
      

      The planning time for the following query, which has a LIMIT but no ORDER BY, is fairly long:

      select * from t1 limit 10;
      
      Query Compilation: 4s411ms
         - Single node plan created: 3s812ms (3s259ms)
      

      The bulk of the time is spent in HdfsScanNode.computeScanRangeLocations(), which iterates over all the partitions and the file descriptors within each partition to assign scan ranges based on data affinity. For trivial LIMIT queries, especially those with small LIMIT values, we should look at ways to improve the planning time.
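
To make the cost concrete, here is a minimal Java sketch of the two traversal shapes. It is not Impala code and not necessarily how this issue was fixed: `ScanRangeSketch`, `makeTable`, `visitAll`, and `visitUntil` are hypothetical names, partitions are modeled as plain lists of file paths, and the early-exit variant assumes a handful of files is enough to satisfy a small LIMIT (a real planner would also have to handle files that hold very few rows).

```java
import java.util.ArrayList;
import java.util.List;

public class ScanRangeSketch {

    // Builds a hypothetical table: each partition is just a list of file paths.
    static List<List<String>> makeTable(int numPartitions, int filesPerPartition) {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<String> files = new ArrayList<>();
            for (int f = 0; f < filesPerPartition; f++) {
                files.add("p" + p + "/f" + f + ".parq");
            }
            partitions.add(files);
        }
        return partitions;
    }

    // Full traversal: touches every file of every partition, so the work
    // grows with the total file count regardless of the LIMIT value.
    static int visitAll(List<List<String>> partitions) {
        int visited = 0;
        for (List<String> files : partitions) {
            for (String file : files) {
                visited++; // stand-in for per-file scan range assignment
            }
        }
        return visited;
    }

    // Early-exit traversal: stops once maxFiles files have been collected.
    // Assumption: maxFiles files hold enough rows for the query's LIMIT.
    static List<String> visitUntil(List<List<String>> partitions, int maxFiles) {
        List<String> picked = new ArrayList<>();
        for (List<String> files : partitions) {
            for (String file : files) {
                if (picked.size() >= maxFiles) {
                    return picked;
                }
                picked.add(file);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        // Roughly the shape of t1: 39000 partitions with about 9 files each.
        List<List<String>> t1 = makeTable(39000, 9);
        System.out.println("files visited (full): " + visitAll(t1));
        System.out.println("files visited (early exit): " + visitUntil(t1, 10).size());
    }
}
```

The point of the sketch is only that the full traversal does work proportional to the number of files in the table, while the early-exit variant does work proportional to the number of files it actually needs.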

      
      


            People

              Assignee: Aman Sinha (amansinha)
              Reporter: Aman Sinha (amansinha)