Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-231

Impala HBase scan is very slow

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • Impala 0.6
    • Impala 1.0
    • None
    • None

    Description

      Impala's HBase scan is very slow. For scanning 40000 rows from an colocated HBase table, it took ~5.7sec.

      The query is: select count from (select * from hbasetbl limit 40000);

      Majority of the time is spent inside HBaseTableScanner::Next. I've done some experiment to figure out where we spent the time by adding a few more timers. Here's the runtime profile:

      HBASE_SCAN_NODE (id=0):(5s714ms 99.98%)

      • BytesRead: 4.26 MB
      • HBaseTableScanner.ScanSetup: 483.594us
      • HBaseTableScanner::ResultScanner_next: 5s710ms
      • MemoryUsed: 0.00
      • MyOwnTimer1: 1s387ms <-- We should trim this time.
      • MyOwnTimer2: 2s798ms <--
      • MyOwnTimer3: 688.179ms
      • MyOwnTimer4: 0ns
      • MyOwnTimer5: 0ns
      • NumDisksAccessed: 0
      • PerReadThreadRawHdfsThroughput: 5.14 MB/sec
      • RowsReturned: 40.00K (40000)
      • RowsReturnedRate: 7.00 K/sec
      • ScanRangesComplete: 0
      • ScannerThreadsInvoluntaryContextSwitches: 0
      • ScannerThreadsTotalWallClockTime: 0ns
      • MaterializeTupleTime: 0ns
      • ScannerThreadsSysTime: 0ns
      • ScannerThreadsUserTime: 0ns
      • ScannerThreadsVoluntaryContextSwitches: 0
      • TotalRawHdfsReadTime: 827.604ms <-- we spent only 800ms on fetching from HBase
      • TotalReadThroughput: 775.05 KB/sec

      I've attached the code with more timers in the attached file HBaseTAbleScanner.Next

      When I increase the limit to 80000, the perf is much worse. Here's the profile:
      HBASE_SCAN_NODE (id=0):(23s027ms 99.99%)

      • BytesRead: 8.51 MB
      • HBaseTableScanner.ScanSetup: 249.40us
      • HBaseTableScanner::ResultScanner_next: 23s018ms
      • MemoryUsed: 0.00
      • MyOwnTimer1: 5s680ms
      • MyOwnTimer2: 11s401ms
      • MyOwnTimer3: 2s829ms
      • MyOwnTimer4: 0ns
      • MyOwnTimer5: 0ns
      • NumDisksAccessed: 0
      • PerReadThreadRawHdfsThroughput: 2.76 MB/sec
      • RowsReturned: 80.00K (80000)
      • RowsReturnedRate: 3.47 K/sec
      • ScanRangesComplete: 0
      • ScannerThreadsInvoluntaryContextSwitches: 0
      • ScannerThreadsTotalWallClockTime: 0ns
      • MaterializeTupleTime: 0ns
      • ScannerThreadsSysTime: 0ns
      • ScannerThreadsUserTime: 0ns
      • ScannerThreadsVoluntaryContextSwitches: 0
      • TotalRawHdfsReadTime: 3s085ms
      • TotalReadThroughput: 376.52 KB/sec

      I didn't see any disk activity. So, I dont' think HBase is going to disk. Also, the read time is only 3sec. The big chuck of time is spent within the block of MyOwnTimer1 and MyOwnTimer2. We spent 5x more time there even though we only scan 2x more rows.

      Attachments

        1. HBaseTableScanner.Next
          3 kB
          Alan Choi

        Activity

          People

            alan@cloudera.com Alan Choi
            alan@cloudera.com Alan Choi
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: