  1. IMPALA
  2. IMPALA-1964

Improve Write Performance

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: Impala 2.3.0
    • Fix Version/s: None
    • Component/s: Perf Investigation
    • Labels:

      Description

      Impala write performance is bottlenecked by the single-threaded HDFSTableSink. Consider the following use case:

      - A bare-metal cluster with 3 managers and 7 datanodes, 11 drives per datanode

      • Each drive delivered a respectable ~100 MB/s of read and write throughput

      ```
      [root@c3kuhdpnode1 ~]# hdparm -t -T /dev/sdk1
      /dev/sdk1:
      Timing cached reads: 18650 MB in 1.99 seconds = 9349.32 MB/sec
      Timing buffered disk reads: 348 MB in 3.00 seconds = 115.88 MB/sec

      [root@c3kuhdpnode1 ~]# time dd if=/dev/sdk1 of=/$drive/sdk1.zero bs=1024 count=10000000
      10000000+0 records in
      10000000+0 records out
      10240000000 bytes (10 GB) copied, 90.6001 s, 113 MB/s

      real    1m30.602s
      user    0m0.929s
      sys     0m35.295s
      ```
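      As a rough sanity check on what "fast" could mean here, the per-drive figure above can be scaled up to the whole cluster. This is only a back-of-envelope sketch: it assumes every drive can sustain the ~113 MB/s seen in the dd run simultaneously, assumes the HDFS default replication factor of 3 (the replication setting is not stated in this report), and ignores network, CPU, and encoding costs.

```python
# Back-of-envelope ceiling on aggregate disk write bandwidth for the cluster.
# Node and drive counts come from the report; the replication factor is an
# ASSUMPTION (HDFS default dfs.replication = 3), not stated in the report.

DATANODES = 7               # from the report
DRIVES_PER_NODE = 11        # from the report
MB_PER_SEC_PER_DRIVE = 113  # dd result for a single drive
REPLICATION = 3             # assumed HDFS default


def aggregate_write_mb_per_sec(nodes, drives, per_drive, replication):
    """Return (raw, effective) cluster write bandwidth in MB/s.

    raw       -- sum of all drives writing at full speed
    effective -- raw divided by the replication factor, since each
                 logical byte is written `replication` times
    """
    raw = nodes * drives * per_drive
    return raw, raw / replication


raw, effective = aggregate_write_mb_per_sec(
    DATANODES, DRIVES_PER_NODE, MB_PER_SEC_PER_DRIVE, REPLICATION
)
print(f"raw disk ceiling:       {raw} MB/s")
print(f"effective (3x replica): {effective:.0f} MB/s")
```

      Under those assumptions the disks alone would allow on the order of a few GB/s of logical writes, which is the kind of ceiling a Teragen run would be measured against.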

      - For a single CTAS query:
      - The query running by itself on the cluster took 37 min. See Profile_Before_Hashing.txt for the explain plan.
      - Ten modified versions of the query, each performing 1/10 of the writes, took ~22 min when run concurrently. See Profile_After_Hashing.txt for one of the 10 explain plans.
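      To put the two timings side by side, the speedup from splitting the writes can be computed directly. This is a minimal sketch using only the 37 min and ~22 min figures reported above; treating the 10-way split as "10 writers" is a simplification, and the function name is purely illustrative.

```python
# Compare the single CTAS run against the 10-way split run.
# Timings are taken from the report; "ways" treats each of the ten
# concurrent queries as one independent writer (a simplification).


def speedup_and_efficiency(serial_minutes, parallel_minutes, ways):
    """Return (speedup, efficiency) for a workload split `ways` times.

    speedup    -- serial time divided by parallel time
    efficiency -- speedup divided by the ideal linear speedup (`ways`)
    """
    speedup = serial_minutes / parallel_minutes
    efficiency = speedup / ways
    return speedup, efficiency


speedup, efficiency = speedup_and_efficiency(37, 22, 10)
print(f"speedup:    {speedup:.2f}x")
print(f"efficiency: {efficiency:.1%} of ideal 10x")
```

      A speedup well below the ideal 10x suggests that splitting the writes helps but does not by itself remove the bottleneck.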

      In both cases, I found that the majority of the time is spent in HDFSTableSink. I'd expect that portion of the query to write nearly as fast as a Teragen job on the same cluster, which is not what we are observing.

      This is a serious problem for a few reasons, primarily that we can't use Hive to generate Parquet instead, because it might produce Parquet files that span multiple HDFS blocks.

        Attachments

        1. Profile_Before_Hashing.txt
          136 kB
          Prateek Rungta
        2. Profile_After_Hashing.txt
          131 kB
          Prateek Rungta

            People

            • Assignee:
              Unassigned
              Reporter:
              prungta_impala_1124 Prateek Rungta