Apache HAWQ / HAWQ-923

More data skipping optimization for IO intensive performance enhancement

Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: backlog
    • Component/s: Query Execution
    • Labels: None

    Description

      see email discussion here: http://mail-archives.apache.org/mod_mbox/hawq-dev/201607.mbox/%3CCA+F1uf=TjCiOezkvpHSPpAOG-jg0-0AzqTUsgr7RV+EsV44kFQ@mail.gmail.com%3E

      Data skipping can avoid a great deal of unnecessary IO, so it can
      greatly improve the performance of IO-intensive queries. Besides
      eliminating scans of unnecessary table partitions based on the partition
      key range, I think more options are available now:

      (1) The Parquet and ORC formats include lightweight metadata such as
      Min/Max values and Bloom filters for each block. This metadata can be
      exploited when the predicate/filter information is available before the
      scan executes.

      However, in HAWQ today, all data in a Parquet table must be scanned into
      memory before the predicate/filter is applied. We don't generate this
      metadata on INSERT into a Parquet table, and the scan executor doesn't
      use it either. Some scan APIs may need to be refactored so that the
      predicate/filter information is available before the base relation scan
      executes.
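
To make the idea in (1) concrete, here is a minimal Python sketch of min/max block skipping. It is only an illustration under stated assumptions: `Block`, `make_block`, and `scan_range` are hypothetical names, not HAWQ or Parquet APIs, and a real implementation would keep the statistics in the file metadata rather than in memory.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Block:
    """Hypothetical data block that carries per-block min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_block(values: Iterable[int]) -> Block:
    """Build a block and record its statistics, as an INSERT path might."""
    data = list(values)
    return Block(values=data, min_value=min(data), max_value=max(data))


def scan_range(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (matching rows, number of blocks read) for lo <= v <= hi."""
    rows: List[int] = []
    blocks_read = 0
    for block in blocks:
        # The predicate is known before the scan, so a block whose
        # [min, max] range cannot overlap [lo, hi] is never read.
        if block.max_value < lo or block.min_value > hi:
            continue
        blocks_read += 1
        rows.extend(v for v in block.values if lo <= v <= hi)
    return rows, blocks_read


blocks = [
    make_block(range(0, 100)),
    make_block(range(100, 200)),
    make_block(range(200, 300)),
]
rows, blocks_read = scan_range(blocks, 120, 130)
```

In this example only the middle block overlaps the requested range, so the other two are skipped without being read.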

      (2) Building on (1), especially with Bloom filters, more optimizer
      techniques can be explored further. E.g. Impala implemented runtime
      filtering
      (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html),
      which can be used for

      • dynamic partition pruning
      • converting join predicate to base relation predicate

      It tells the executor to wait briefly (the interval can be set via a
      GUC) before executing the base relation scan. If the values of interest
      arrive in time (e.g. the column in the join predicate has only a very
      small set of values), the scan uses them as a filter. If they don't
      arrive in time, it scans without the filter, which doesn't affect result
      correctness.
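
The wait-then-fall-back behaviour described above could look roughly like the following Python sketch. It is not how Impala or HAWQ implement it: `RuntimeFilter`, `scan_with_runtime_filter`, and the `wait_s` parameter (standing in for the GUC-controlled interval) are hypothetical, and a plain set stands in for a Bloom filter.

```python
import threading
from typing import Iterable, List, Optional, Set, Tuple

Row = Tuple[int, str]


class RuntimeFilter:
    """Hypothetical holder for join-key values produced by the build side."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._values: Optional[Set[int]] = None

    def publish(self, values: Iterable[int]) -> None:
        """Called by the build side once its key values are known."""
        self._values = set(values)
        self._ready.set()

    def wait(self, timeout_s: float) -> Optional[Set[int]]:
        """Return the key set if it arrives within timeout_s, else None."""
        if self._ready.wait(timeout_s):
            return self._values
        return None


def scan_with_runtime_filter(rows: List[Row], rf: RuntimeFilter,
                             wait_s: float) -> List[Row]:
    """Scan rows, applying the runtime filter only if it arrived in time."""
    allowed = rf.wait(wait_s)
    if allowed is None:
        # Filter did not arrive: scan everything. The join above the scan
        # still discards non-matching rows, so results stay correct.
        return list(rows)
    return [row for row in rows if row[0] in allowed]


rows = [(1, "a"), (2, "b"), (3, "c")]

arrived = RuntimeFilter()
arrived.publish({2})
filtered = scan_with_runtime_filter(rows, arrived, wait_s=0.01)

never_arrives = RuntimeFilter()
unfiltered = scan_with_runtime_filter(rows, never_arrives, wait_s=0.01)
```

Either way the join produces the same result; the filter only reduces how many rows the scan hands upward.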

      Unlike (1), this technique does not help in every case; it only wins in
      some cases. So it just adds more query plan choices/paths, and the
      optimizer needs to estimate the cost from statistics and apply it only
      when it lowers the cost.

      All in all, more techniques like these may be adoptable in HAWQ now.
      Let's start thinking about performance-related techniques; moreover, we
      need to investigate how they can be implemented in HAWQ.

      Any ideas or suggestions are welcome. Thanks.

          People

            lei_chang Lei Chang
            mli Ming Li