IMPALA-7827

Investigate increasing disk utilization by overlapping file open with reads


Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 3.2.0
    • None
    • Backend
    • None

    Description

      Disk IO threads are responsible for both the HDFS file open and the reads for ScanRanges. Most HDFS file opens are served from the file handle cache. On a cache miss, however, the Disk IO thread is tied up waiting on a round trip to the NameNode. Depending on the number of Disk IO threads and the speed of the NameNode, all of the Disk IO threads could be blocked in HDFS file open calls, even when other ScanRanges already have file handles available in the cache. In particular, spinning disks have a single Disk IO thread per disk. If this thread gets tied up in an open call, the disk goes idle.

      It might make sense for the open call to be serviced by a separate thread pool. The ScanRange would go through a separate state transition that opens the file handle. The Disk IO thread can process ScanRanges that already have an open file handle (cached or otherwise) while the open call is in progress.
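
      The proposal above can be sketched roughly as follows. This is a toy model under stated assumptions, not Impala code: DemoScanRange, RangeQueue, SlowOpen and RunOverlapDemo are hypothetical names, the "open pool" is a single thread, and a 200 ms sleep stands in for the NameNode round trip. It only illustrates that the disk thread keeps reading ranges with cached handles while a cache-miss open is in flight.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical, simplified stand-in for a scan range (not Impala's ScanRange).
struct DemoScanRange {
  std::string file;
  bool handle_cached;  // true if the file handle cache already holds a handle
};

// Minimal thread-safe FIFO used to pass ranges between threads.
class RangeQueue {
 public:
  void Push(DemoScanRange r) {
    {
      std::lock_guard<std::mutex> l(mu_);
      q_.push_back(std::move(r));
    }
    cv_.notify_one();
  }
  // Blocks until a range is available; returns false once closed and drained.
  bool Pop(DemoScanRange* out) {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return !q_.empty() || closed_; });
    if (q_.empty()) return false;
    *out = std::move(q_.front());
    q_.pop_front();
    return true;
  }
  void Close() {
    {
      std::lock_guard<std::mutex> l(mu_);
      closed_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<DemoScanRange> q_;
  bool closed_ = false;
};

// Assumption: a sleep simulates the NameNode round trip on a cache miss.
void SlowOpen(const DemoScanRange&) {
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
}

// Returns the file names in the order the single "disk thread" read them.
std::vector<std::string> RunOverlapDemo(const std::vector<DemoScanRange>& ranges) {
  RangeQueue needs_open, ready;
  std::vector<std::string> read_order;

  // Opener thread: performs the slow open, then hands the range to the disk thread.
  std::thread opener([&] {
    DemoScanRange r;
    while (needs_open.Pop(&r)) {
      SlowOpen(r);
      ready.Push(r);
    }
    ready.Close();  // no more ranges will become ready
  });

  // Disk thread: only ever sees ranges whose file handle is already open.
  std::thread disk([&] {
    DemoScanRange r;
    while (ready.Pop(&r)) read_order.push_back(r.file);
  });

  // Extra state transition: cache misses go through the opener first.
  for (const auto& r : ranges) (r.handle_cached ? ready : needs_open).Push(r);
  needs_open.Close();

  opener.join();
  disk.join();
  assert(read_order.size() == ranges.size());
  return read_order;
}
```

      Calling RunOverlapDemo({{"a", false}, {"b", true}, {"c", true}}) should read "b" and "c" before "a", because "a" is still waiting on its simulated open. A real design would also have to decide how many opener threads to run and how to handle open failures, which this sketch ignores.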

      This is complicated by the fact that a file handle can't be used by multiple threads simultaneously. To do the state transition properly, it needs to be clear up front whether a new file handle is necessary. Keeping a file handle cache at the RequestContext level and using preads (see IMPALA-6403) might make this clear.
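
      As background on why positional reads help here, the sketch below uses the POSIX pread() call on a local temporary file. It is an analogy only and does not use the HDFS client API; ReadAt and SharedHandleDemo are hypothetical helper names. Because pread() takes an explicit offset and does not move the descriptor's shared file position, two threads can read through the same open handle without coordinating seeks.

```cpp
#include <stdlib.h>
#include <unistd.h>

#include <string>
#include <thread>

// Reads up to `len` bytes at `offset` without touching the shared file position.
std::string ReadAt(int fd, off_t offset, size_t len) {
  std::string buf(len, '\0');
  ssize_t n = pread(fd, &buf[0], len, offset);
  buf.resize(n > 0 ? static_cast<size_t>(n) : 0);
  return buf;
}

// Two threads read different regions through one shared descriptor.
// Returns true if both threads saw the bytes they expected.
bool SharedHandleDemo() {
  char path[] = "/tmp/pread_demoXXXXXX";
  int fd = mkstemp(path);  // creates and opens a unique temporary file
  if (fd < 0) return false;

  const char data[] = "AAAABBBB";
  if (write(fd, data, 8) != 8) {
    close(fd);
    unlink(path);
    return false;
  }

  std::string first, second;
  std::thread t1([&] { first = ReadAt(fd, 0, 4); });
  std::thread t2([&] { second = ReadAt(fd, 4, 4); });
  t1.join();
  t2.join();

  close(fd);
  unlink(path);
  return first == "AAAA" && second == "BBBB";
}
```

      With seek-then-read semantics the two threads would race on the handle's position, which is the kind of sharing problem described above.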

          People

            Assignee: Unassigned
            Reporter: Joe McDonnell (joemcdonnell)