IMPALA-6364: Lock contention in FileHandleCache results in >2x slowdown for remote HDFS reads

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.10.0, Impala 2.11.0
    • Fix Version/s: Impala 2.12.0

    Description

      IMPALA-4623 introduced a locking scheme to the file handle cache, which has 16 partitions. This results in lock contention between IO threads, which limits system throughput.

      Most IO threads end up in one of these two stacks:

      #0  0x0000000002085d47 in base::internal::SpinLockDelay(int volatile*, int, int) ()
      #1  0x0000000002085c29 in base::SpinLock::SlowLock() ()
      #2  0x00000000010fa76d in impala::io::FileHandleCache<16ul>::GetFileHandle(hdfs_internal* const&, std::string*, long, bool, bool*) ()
      #3  0x00000000010f6e22 in impala::io::DiskIoMgr::GetCachedHdfsFileHandle(hdfs_internal* const&, std::string*, long, impala::io::RequestContext*, bool) ()
      #4  0x00000000010fd514 in impala::io::ScanRange::Open(bool) ()
      #5  0x00000000010f691f in impala::io::DiskIoMgr::ReadRange(impala::io::DiskIoMgr::DiskQueue*, impala::io::RequestContext*, impala::io::ScanRange*) ()
      #6  0x00000000010f6dc4 in impala::io::DiskIoMgr::WorkLoop(impala::io::DiskIoMgr::DiskQueue*) ()
      #7  0x0000000000d13333 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) ()
      #8  0x0000000000d13a74 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
      #9  0x000000000128ea3a in thread_proxy ()
      #10 0x00007f49f2bbadc5 in start_thread () from /lib64/libpthread.so.0
      #11 0x00007f49f28e976d in clone () from /lib64/libc.so.6
      
      #0  0x0000000002085d47 in base::internal::SpinLockDelay(int volatile*, int, int) ()
      #1  0x0000000002085c29 in base::SpinLock::SlowLock() ()
      #2  0x00000000010f9929 in impala::io::FileHandleCache<16ul>::ReleaseFileHandle(std::string*, impala::io::HdfsFileHandle*, bool) ()
      #3  0x00000000010fe69e in impala::io::ScanRange::Close() ()
      #4  0x00000000010f6565 in impala::io::DiskIoMgr::HandleReadFinished(impala::io::DiskIoMgr::DiskQueue*, impala::io::RequestContext*, std::unique_ptr<impala::io::BufferDescriptor, std::default_delete<impala::io::BufferDescriptor> >) ()
      #5  0x00000000010f695b in impala::io::DiskIoMgr::ReadRange(impala::io::DiskIoMgr::DiskQueue*, impala::io::RequestContext*, impala::io::ScanRange*) ()
      #6  0x00000000010f6dc4 in impala::io::DiskIoMgr::WorkLoop(impala::io::DiskIoMgr::DiskQueue*) ()
      #7  0x0000000000d13333 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) ()
      #8  0x0000000000d13a74 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
      #9  0x000000000128ea3a in thread_proxy ()
      #10 0x00007f49f2bbadc5 in start_thread () from /lib64/libpthread.so.0
      #11 0x00007f49f28e976d in clone () from /lib64/libc.so.6
      

      Increasing the number of partitions to 256 made the contention go away. A simple fix would be to make the number of partitions a startup flag and set it to 256.
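
To illustrate why more partitions reduce contention, here is a minimal sketch of a lock-partitioned cache whose partition count is chosen at runtime rather than fixed as a template parameter. It is not Impala's FileHandleCache: the class name `PartitionedHandleCache`, its methods, the use of `std::mutex` (the stacks above show a spin lock), and the `int` handle type are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a file handle cache; not Impala's implementation.
class PartitionedHandleCache {
 public:
  // The partition count is a runtime value (for example, read from a startup
  // flag) instead of a compile-time constant such as 16.
  explicit PartitionedHandleCache(std::size_t num_partitions)
      : partitions_(num_partitions) {
    assert(num_partitions > 0);
  }

  std::size_t num_partitions() const { return partitions_.size(); }

  // Maps a file name to the partition whose lock guards it.
  std::size_t PartitionIndex(const std::string& fname) const {
    return std::hash<std::string>{}(fname) % partitions_.size();
  }

  // Stores a handle; only the owning partition's lock is taken.
  void Put(const std::string& fname, int handle) {
    Partition& p = partitions_[PartitionIndex(fname)];
    std::lock_guard<std::mutex> l(p.lock);
    p.handles[fname] = handle;
  }

  // Returns true and sets *handle if fname is cached.
  bool Get(const std::string& fname, int* handle) {
    Partition& p = partitions_[PartitionIndex(fname)];
    std::lock_guard<std::mutex> l(p.lock);
    auto it = p.handles.find(fname);
    if (it == p.handles.end()) return false;
    *handle = it->second;
    return true;
  }

 private:
  struct Partition {
    std::mutex lock;
    std::unordered_map<std::string, int> handles;
  };
  std::vector<Partition> partitions_;
};
```

Two threads contend only when their file names map to the same partition, so with many IO threads and a roughly uniform hash, going from 16 to 256 partitions makes a collision on any one lock much less likely.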

      Attachments

        1. remote_hdfs_scan_pstack.txt
          2.57 MB
          Mostafa Mokhtar
        2. d2402_cdh5.13_profile.txt
          75 kB
          Mostafa Mokhtar
        3. d2402_cdh5.12_profile.txt
          73 kB
          Mostafa Mokhtar

        Activity

          twood@cloudera.com Tim Wood added a comment -

          Usually you want a prime number of buckets in a hash table, so e.g., 257 instead of 256.
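
The concern behind the prime-bucket advice can be shown with a small sketch. It assumes, purely for illustration, hash values that are all multiples of a fixed stride; the helper `DistinctBuckets` is hypothetical and not part of Impala.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Counts how many distinct buckets are hit when hash values that are all
// multiples of `stride` are reduced modulo `num_buckets`.
std::size_t DistinctBuckets(std::uint64_t stride, std::uint64_t num_buckets,
                            std::uint64_t num_keys) {
  assert(num_buckets > 0);
  std::set<std::uint64_t> used;
  for (std::uint64_t i = 0; i < num_keys; ++i) {
    used.insert((i * stride) % num_buckets);
  }
  return used.size();
}
```

With a stride of 16 and 1000 keys, `DistinctBuckets(16, 256, 1000)` uses only 16 of the 256 buckets, while `DistinctBuckets(16, 257, 1000)` uses all 257. When the hash function already mixes its input bits well, this difference largely disappears, so the choice matters mainly for weak or patterned hashes.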

          joemcdonnell Joe McDonnell added a comment -

          Code change up for code review:
          https://gerrit.cloudera.org/#/c/8945/

          This changes the code to bypass the file handle cache when the file handles are going to be used exclusively by a single ScanRange. This is true for all remote files. It is also true when the file handle cache is disabled. This avoids getting the lock entirely.

          This also makes the number of partitions used by the file handle cache configurable.

          joemcdonnell Joe McDonnell added a comment -

          commit d1a0510bfe0a168256d37904aca3a30994306454
          Author: Joe McDonnell <joemcdonnell@cloudera.com>
          Date: Wed Jan 3 19:02:19 2018 -0800

          IMPALA-6364: Bypass file handle cache for ineligible files

          Currently, all HdfsFileHandles are owned and constructed
          by the file handle cache. When the file handle cache
          is disabled or the file handle is not eligible for
          caching, the HdfsFileHandle is stored exclusively in
          ScanRange::exclusive_hdfs_fh_, but the HdfsFileHandle still
          comes from the file handle cache. It is created via a call to
          DiskIoMgr::GetCachedHdfsFileHandle() with 'require_new_handle'
          set to true and destroyed via
          DiskIoMgr::ReleaseCachedHdfsFileHandle() with 'destroy_handle'
          set to true.

          Recent testing has revealed that the lock on the file handle
          cache is a bottleneck for workloads with many small remote
          files. There is no benefit to storing these exclusive file
          handles in the file handle cache, as they do not participate
          in the caching.

          This change introduces DiskIoMgr::GetExclusiveHdfsFileHandle()
          and DiskIoMgr::ReleaseExclusiveHdfsFileHandle(). These are
          equivalent to the Get/ReleaseCachedHdfsFileHandle() calls, except
          they bypass the file handle cache and create/destroy the
          file handle directly. ScanRange::Open()/Close(), which
          populates and frees ScanRange::exclusive_hdfs_fh_, now uses
          these new calls rather than accessing the file handle cache.
          This avoids the locking entirely, solving the bottleneck.

          To draw a distinction between the two codepaths, HdfsFileHandle
          is now an abstract class with two subclasses:

          • CachedHdfsFileHandles cover all handles that live in file handle
            cache. Get/ReleaseCachedHdfsFileHandle() use this subclass.
          • ExclusiveHdfsFileHandles cover all cases where a file handle
            does not come from the cache. The new
            Get/ReleaseExclusiveHdfsFileHandle() use this subclass.

          Separately, testing revealed that increasing the number of
          partitions for the file handle cache also fixes the contention
          problem. This changes the file handle cache to make the number
          of partitions configurable via startup parameter
          num_file_handle_cache_partitions. This allows mitigation of
          future bottlenecks without a patch.

          Change-Id: I4ab52b0884a909a4faeb6692f32d45878ea2838f
          Reviewed-on: http://gerrit.cloudera.org:8080/8945
          Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
          Tested-by: Impala Public Jenkins
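
The split described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the code from the change: the real methods are members of DiskIoMgr with different signatures, and the `FromCache()` accessor, the `std::string` file name, and the free-function form used here are assumptions.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for the abstract base class described above.
class HdfsFileHandle {
 public:
  virtual ~HdfsFileHandle() = default;
  // Hypothetical accessor: tells the two kinds of handle apart.
  virtual bool FromCache() const = 0;
  const std::string& fname() const { return fname_; }

 protected:
  explicit HdfsFileHandle(std::string fname) : fname_(std::move(fname)) {}

 private:
  std::string fname_;
};

// Handles that live in the file handle cache.
class CachedHdfsFileHandle : public HdfsFileHandle {
 public:
  explicit CachedHdfsFileHandle(std::string fname)
      : HdfsFileHandle(std::move(fname)) {}
  bool FromCache() const override { return true; }
};

// Handles owned by exactly one scan range; never stored in the cache.
class ExclusiveHdfsFileHandle : public HdfsFileHandle {
 public:
  explicit ExclusiveHdfsFileHandle(std::string fname)
      : HdfsFileHandle(std::move(fname)) {}
  bool FromCache() const override { return false; }
};

// Creates the handle directly: no cache lookup and no partition lock.
std::unique_ptr<ExclusiveHdfsFileHandle> GetExclusiveHdfsFileHandle(
    const std::string& fname) {
  return std::make_unique<ExclusiveHdfsFileHandle>(fname);
}

// Destroys the handle directly, again without touching the cache.
void ReleaseExclusiveHdfsFileHandle(
    std::unique_ptr<ExclusiveHdfsFileHandle> fh) {
  fh.reset();
}
```

Because the exclusive path never enters the cache, IO threads that open and close many small remote files no longer serialize on the cache's locks, which is the bottleneck visible in the stacks in the description.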


          People

            joemcdonnell Joe McDonnell
            mmokhtar Mostafa Mokhtar