Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-3862

to_date() function used in analytic functions has serious performance impact

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: Impala 2.5.0
    • Fix Version/s: None
    • Component/s: Perf Investigation
    • Labels:
      None

      Description

      to_date() is usually used to make timestamp into a date that can be grouped by in analytic functions, but it has a serious impact on the performance (duration and memory) when used

      1. attempt around ~2sec, not using to_date()

      [morhidi-570-2.gce.cloudera.com:21000] > select dat, emp_no, salary, rank() OVER (PARTITION BY dat order by salary desc) from sample limit 10;
      Query: select dat, emp_no, salary, rank() OVER (PARTITION BY dat order by salary desc) from sample limit 10
      +---------------------+--------+--------+------------------+
      | dat                 | emp_no | salary | rank() OVER(...) |
      +---------------------+--------+--------+------------------+
      | 1985-01-14 00:00:00 | 110114 | 52070  | 1                |
      | 1985-02-01 00:00:00 | 87761  | 69338  | 1                |
      | 1985-02-01 00:00:00 | 297266 | 66536  | 2                |
      | 1985-02-01 00:00:00 | 273524 | 54860  | 3                |
      | 1985-02-01 00:00:00 | 278257 | 51142  | 4                |
      | 1985-02-01 00:00:00 | 229588 | 40463  | 5                |
      | 1985-02-01 00:00:00 | 235233 | 40000  | 6                |
      | 1985-02-01 00:00:00 | 200241 | 40000  | 6                |
      | 1985-02-01 00:00:00 | 234495 | 40000  | 6                |
      | 1985-02-03 00:00:00 | 453187 | 90217  | 1                |
      +---------------------+--------+--------+------------------+
      Fetched 10 row(s) in 2.07s
      [morhidi-570-2.gce.cloudera.com:21000] > summary;
      +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------+
      | Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail           |
      +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------+
      | 04:EXCHANGE  | 1      | 0ns      | 0ns      | 10    | 10         | 0 B      | -1 B          | UNPARTITIONED    |
      | 02:ANALYTIC  | 3      | 1.67ms   | 3.00ms   | 20    | -1         | 8.06 MB  | 0 B           |                  |
      | 01:SORT      | 3      | 861.03ms | 1.30s    | 4.10K | -1         | 64.06 MB | 0 B           |                  |
      | 03:EXCHANGE  | 3      | 23.00ms  | 35.00ms  | 1.91M | -1         | 0 B      | 0 B           | HASH(dat)        |
      | 00:SCAN HDFS | 3      | 240.01ms | 253.01ms | 2.84M | -1         | 33.86 MB | 64.00 MB      | employees.sample |
      +--------------+--------+----------+----------+-------+------------+----------+---------------+------------------+
      

      2. attempt around ~1min using to_date()

      [morhidi-570-2.gce.cloudera.com:21000] > select dat, emp_no, salary, rank() OVER (PARTITION BY to_date(dat) order by salary desc) from sample limit 10;
      Query: select dat, emp_no, salary, rank() OVER (PARTITION BY to_date(dat) order by salary desc) from sample limit 10
      +---------------------+--------+--------+------------------+
      | dat                 | emp_no | salary | rank() OVER(...) |
      +---------------------+--------+--------+------------------+
      | 1985-02-02 00:00:00 | 96308  | 94821  | 1                |
      | 1985-02-02 00:00:00 | 204237 | 82183  | 2                |
      | 1985-02-02 00:00:00 | 461001 | 80475  | 3                |
      | 1985-02-02 00:00:00 | 257813 | 77248  | 4                |
      | 1985-02-02 00:00:00 | 98351  | 75641  | 5                |
      | 1985-02-02 00:00:00 | 490500 | 69491  | 6                |
      | 1985-02-02 00:00:00 | 51773  | 69401  | 7                |
      | 1985-02-02 00:00:00 | 33092  | 65665  | 8                |
      | 1985-02-02 00:00:00 | 430238 | 64479  | 9                |
      | 1985-02-02 00:00:00 | 81360  | 64371  | 10               |
      +---------------------+--------+--------+------------------+
      Fetched 10 row(s) in 65.26s
      [morhidi-570-2.gce.cloudera.com:21000] > summary;
      +--------------+--------+----------+----------+-------+------------+----------+---------------+--------------------+
      | Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail             |
      +--------------+--------+----------+----------+-------+------------+----------+---------------+--------------------+
      | 04:EXCHANGE  | 1      | 0ns      | 0ns      | 10    | 10         | 0 B      | -1 B          | UNPARTITIONED      |
      | 02:ANALYTIC  | 3      | 8.67ms   | 22.00ms  | 10    | -1         | 8.18 MB  | 0 B           |                    |
      | 01:SORT      | 3      | 21.06s   | 63.17s   | 2.05K | -1         | 1.16 GB  | 0 B           |                    |
      | 03:EXCHANGE  | 3      | 15.33ms  | 29.00ms  | 2.84M | -1         | 0 B      | 0 B           | HASH(to_date(dat)) |
      | 00:SCAN HDFS | 3      | 177.34ms | 192.01ms | 2.84M | -1         | 33.82 MB | 64.00 MB      | employees.sample   |
      +--------------+--------+----------+----------+-------+------------+----------+---------------+--------------------+
      

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                morhidi Matyas Orhidi
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: