[IMPALA-6364] Lock contention in FileHandleCache results in >2x slowdown for remote HDFS reads - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: Impala 2.10.0, Impala 2.11.0
Fix Version/s: Impala 2.12.0
Component/s: None
Labels: None

Description

IMPALA-4623 introduced a locking scheme for the file handle cache that uses 16 buckets. This results in lock contention between IO threads, which limits system throughput.

Most IO threads end up in one of these two stacks:

#0  0x0000000002085d47 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#1  0x0000000002085c29 in base::SpinLock::SlowLock() ()
#2  0x00000000010fa76d in impala::io::FileHandleCache<16ul>::GetFileHandle(hdfs_internal* const&, std::string*, long, bool, bool*) ()
#3  0x00000000010f6e22 in impala::io::DiskIoMgr::GetCachedHdfsFileHandle(hdfs_internal* const&, std::string*, long, impala::io::RequestContext*, bool) ()
#4  0x00000000010fd514 in impala::io::ScanRange::Open(bool) ()
#5  0x00000000010f691f in impala::io::DiskIoMgr::ReadRange(impala::io::DiskIoMgr::DiskQueue*, impala::io::RequestContext*, impala::io::ScanRange*) ()
#6  0x00000000010f6dc4 in impala::io::DiskIoMgr::WorkLoop(impala::io::DiskIoMgr::DiskQueue*) ()
#7  0x0000000000d13333 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) ()
#8  0x0000000000d13a74 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
#9  0x000000000128ea3a in thread_proxy ()
#10 0x00007f49f2bbadc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f49f28e976d in clone () from /lib64/libc.so.6

#0  0x0000000002085d47 in base::internal::SpinLockDelay(int volatile*, int, int) ()
#1  0x0000000002085c29 in base::SpinLock::SlowLock() ()
#2  0x00000000010f9929 in impala::io::FileHandleCache<16ul>::ReleaseFileHandle(std::string*, impala::io::HdfsFileHandle*, bool) ()
#3  0x00000000010fe69e in impala::io::ScanRange::Close() ()
#4  0x00000000010f6565 in impala::io::DiskIoMgr::HandleReadFinished(impala::io::DiskIoMgr::DiskQueue*, impala::io::RequestContext*, std::unique_ptr<impala::io::BufferDescriptor, std::default_delete<impala::io::BufferDescriptor> >) ()
#5  0x00000000010f695b in impala::io::DiskIoMgr::ReadRange(impala::io::DiskIoMgr::DiskQueue*, impala::io::RequestContext*, impala::io::ScanRange*) ()
#6  0x00000000010f6dc4 in impala::io::DiskIoMgr::WorkLoop(impala::io::DiskIoMgr::DiskQueue*) ()
#7  0x0000000000d13333 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) ()
#8  0x0000000000d13a74 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
#9  0x000000000128ea3a in thread_proxy ()
#10 0x00007f49f2bbadc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f49f28e976d in clone () from /lib64/libc.so.6

Increasing the number of partitions to 256 made the contention go away. A simple fix would be to make the number of partitions a startup flag and set it to 256.
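
To illustrate the proposed fix, here is a minimal sketch of a cache whose partition count is chosen at runtime instead of being fixed by a template parameter. It is not Impala's FileHandleCache: the names PartitionedHandleCache and Handle are hypothetical, and std::mutex stands in for the spin lock seen in the stacks above.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a cached file handle; not Impala's HdfsFileHandle.
struct Handle {
  std::string path;
};

// Sketch of a cache split into independently locked partitions. The partition
// count is a constructor argument, so it could be driven by a startup flag.
class PartitionedHandleCache {
 public:
  explicit PartitionedHandleCache(std::size_t num_partitions)
      : partitions_(num_partitions == 0 ? 1 : num_partitions) {}

  std::size_t num_partitions() const { return partitions_.size(); }

  // Maps a file name to the partition whose lock guards it.
  std::size_t PartitionIndex(const std::string& fname) const {
    return std::hash<std::string>{}(fname) % partitions_.size();
  }

  // Returns the cached handle for fname, creating it on first use. Only the
  // lock of the owning partition is taken, so threads working on files in
  // different partitions do not block each other.
  std::shared_ptr<Handle> Get(const std::string& fname) {
    Partition& p = partitions_[PartitionIndex(fname)];
    std::lock_guard<std::mutex> guard(p.lock);
    std::shared_ptr<Handle>& slot = p.handles[fname];
    if (!slot) slot = std::make_shared<Handle>(Handle{fname});
    return slot;
  }

 private:
  struct Partition {
    std::mutex lock;
    std::unordered_map<std::string, std::shared_ptr<Handle>> handles;
  };
  std::vector<Partition> partitions_;
};
```

Constructing the sketch with 256 instead of 16 spreads the same set of files over 16 times as many locks, which is the effect described above.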

Attachments

remote_hdfs_scan_pstack.txt (2.57 MB), 03/Jan/18 23:14, Mostafa Mokhtar
d2402_cdh5.13_profile.txt (75 kB), 03/Jan/18 23:18, Mostafa Mokhtar
d2402_cdh5.12_profile.txt (73 kB), 03/Jan/18 23:09, Mostafa Mokhtar

Activity

Tim Wood added a comment - 04/Jan/18 02:47

Usually you want a prime number of buckets in a hash table, so e.g., 257 instead of 256.
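
A small sketch of why bucket count can matter: when keys share a common factor with the bucket count and the hash does not mix bits, only a fraction of the buckets are used. The helper DistinctBuckets is hypothetical and uses the key itself as the hash value. With a well-mixing hash function the difference between 256 and 257 is typically much smaller than shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Counts how many distinct buckets are hit when num_keys keys of the form
// i * stride are reduced modulo num_buckets (identity hash, for illustration).
std::size_t DistinctBuckets(std::uint64_t stride, std::size_t num_keys,
                            std::size_t num_buckets) {
  std::set<std::size_t> used;
  for (std::size_t i = 0; i < num_keys; ++i) {
    used.insert(static_cast<std::size_t>((i * stride) % num_buckets));
  }
  return used.size();
}
```

For 1000 keys that are multiples of 16, DistinctBuckets(16, 1000, 256) returns 16, while DistinctBuckets(16, 1000, 257) returns 257, because 16 and 257 share no common factor.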

Joe McDonnell added a comment - 04/Jan/18 23:56

Code change up for code review:
https://gerrit.cloudera.org/#/c/8945/

This changes the code to bypass the file handle cache when the file handles are going to be used exclusively by a single ScanRange. This is true for all remote files. It is also true when the file handle cache is disabled. This avoids getting the lock entirely.

This also makes the number of partitions used by the file handle cache configurable.

Joe McDonnell added a comment - 05/Jan/18 22:59

commit d1a0510bfe0a168256d37904aca3a30994306454
Author: Joe McDonnell <joemcdonnell@cloudera.com>
Date: Wed Jan 3 19:02:19 2018 -0800

IMPALA-6364: Bypass file handle cache for ineligible files

Currently, all HdfsFileHandles are owned and constructed
by the file handle cache. When the file handle cache
is disabled or the file handle is not eligible for
caching, the HdfsFileHandle is stored exclusively in
ScanRange::exclusive_hdfs_fh_, but the HdfsFileHandle still
comes from the file handle cache. It is created via a call to
DiskIoMgr::GetCachedHdfsFileHandle() with 'require_new_handle'
set to true and destroyed via
DiskIoMgr::ReleaseCachedHdfsFileHandle() with 'destroy_handle'
set to true.

Recent testing has revealed that the lock on the file handle
cache is a bottleneck for workloads with many small remote
files. There is no benefit to storing these exclusive file
handles in the file handle cache, as they do not participate
in the caching.

This change introduces DiskIoMgr::GetExclusiveHdfsFileHandle()
and DiskIoMgr::ReleaseExclusiveHdfsFileHandle(). These are
equivalent to the Get/ReleaseCachedHdfsFileHandle() calls, except
they bypass the file handle cache and create/destroy the
file handle directly. ScanRange::Open()/Close(), which
populates and frees ScanRange::exclusive_hdfs_fh_, now uses
these new calls rather than accessing the file handle cache.
This avoids the locking entirely, solving the bottleneck.

To draw a distinction between the two codepaths, HdfsFileHandle
is now an abstract class with two subclasses:

- CachedHdfsFileHandles cover all handles that live in file handle
  cache. Get/ReleaseCachedHdfsFileHandle() use this subclass.
- ExclusiveHdfsFileHandles cover all cases where a file handle
  does not come from the cache. The new
  Get/ReleaseExclusiveHdfsFileHandle() use this subclass.

Separately, testing revealed that increasing the number of
partitions for the file handle cache also fixes the contention
problem. This changes the file handle cache to make the number
of partitions configurable via startup parameter
num_file_handle_cache_partitions. This allows mitigation of
future bottlenecks without a patch.

Change-Id: I4ab52b0884a909a4faeb6692f32d45878ea2838f
Reviewed-on: http://gerrit.cloudera.org:8080/8945
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
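
The class split described in the commit message can be pictured roughly as follows. This is a simplified sketch, not the committed code: the class names follow the commit message, but the method is_cached(), the free functions GetExclusiveHandle() and ReleaseExclusiveHandle(), and all bodies are assumptions made for illustration. The real handles wrap an HDFS file, which is omitted here.

```cpp
#include <memory>
#include <string>
#include <utility>

// Abstract base: callers can hold either kind of handle through this type.
class HdfsFileHandle {
 public:
  virtual ~HdfsFileHandle() = default;
  virtual bool is_cached() const = 0;  // hypothetical accessor
  const std::string& fname() const { return fname_; }

 protected:
  explicit HdfsFileHandle(std::string fname) : fname_(std::move(fname)) {}

 private:
  std::string fname_;
};

// Handle that lives in the file handle cache.
class CachedHdfsFileHandle : public HdfsFileHandle {
 public:
  explicit CachedHdfsFileHandle(std::string fname)
      : HdfsFileHandle(std::move(fname)) {}
  bool is_cached() const override { return true; }
};

// Handle owned by a single scan range; never stored in the cache.
class ExclusiveHdfsFileHandle : public HdfsFileHandle {
 public:
  explicit ExclusiveHdfsFileHandle(std::string fname)
      : HdfsFileHandle(std::move(fname)) {}
  bool is_cached() const override { return false; }
};

// Hypothetical bypass path: the handle is created directly, with no cache
// lookup and therefore no cache lock.
std::unique_ptr<ExclusiveHdfsFileHandle> GetExclusiveHandle(
    const std::string& fname) {
  return std::unique_ptr<ExclusiveHdfsFileHandle>(
      new ExclusiveHdfsFileHandle(fname));
}

// Destroys the handle directly, again without touching the cache.
void ReleaseExclusiveHandle(std::unique_ptr<ExclusiveHdfsFileHandle> fh) {
  fh.reset();
}
```

In this sketch the exclusive path never touches shared state, so any number of IO threads can open and close remote files without contending on a cache lock.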
