Details
- Type: Improvement
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: Impala 3.2.0
- Component/s: None
- Labels: ghx-label-4
Description
When running in a public cloud (e.g. AWS with S3) or in certain private cloud settings (e.g. data stored in an object store), compute and storage are no longer co-located. This breaks the typical pattern in which Impala query fragment instances are scheduled where the data is located. In this setting, the network bandwidth requirements of both the NICs and the top-of-rack switches go up considerably, because network traffic now includes data fetches in addition to the shuffling exchange traffic of intermediate results.
To ease the pressure on the network, one can build a storage-backed cache on the compute nodes to hold the working set. With deterministic scan range scheduling, each compute node should hold non-overlapping partitions of the data set.
An initial prototype of the cache was posted at https://gerrit.cloudera.org/#/c/12683/, but it could likely benefit from a better eviction algorithm (e.g. LRU instead of FIFO) and better locking (e.g. not holding the lock while doing IO).
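As a rough illustration of the cache structure being proposed (not the actual implementation in the prototype above), the sketch below shows a node-local cache with LRU eviction in which the mutex only guards the cache metadata, so block IO can be done outside the critical section. The class and member names (DataCache, Lookup, Insert) are made up for this example, and cached bytes are kept in memory as a stand-in for the local storage backend.
{code:cpp}
// Minimal sketch, not the Impala implementation: an LRU block cache whose lock
// only guards metadata. The cache key would typically encode
// (filename, mtime, offset); here it is an opaque string.
#include <cstdint>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

class DataCache {
 public:
  explicit DataCache(size_t capacity_bytes) : capacity_bytes_(capacity_bytes) {}

  // On a hit, copies the cached bytes into 'out', promotes the entry to the
  // front of the LRU list and returns true.
  bool Lookup(const std::string& key, std::vector<uint8_t>* out) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = entries_.find(key);
    if (it == entries_.end()) return false;
    lru_.splice(lru_.begin(), lru_, it->second.lru_pos);  // Mark as most recent.
    *out = it->second.data;  // A real cache would instead record where to read on local disk.
    return true;
  }

  // Inserts a copy of 'data', evicting least-recently-used entries as needed.
  // Any disk IO for persisting the data would happen before this call, so the
  // lock is never held across IO.
  void Insert(const std::string& key, std::vector<uint8_t> data) {
    if (data.size() > capacity_bytes_) return;  // Too large to cache.
    std::lock_guard<std::mutex> l(lock_);
    if (entries_.count(key) > 0) return;  // Already cached.
    while (used_bytes_ + data.size() > capacity_bytes_) {
      auto victim = entries_.find(lru_.back());  // Least recently used entry.
      used_bytes_ -= victim->second.data.size();
      entries_.erase(victim);
      lru_.pop_back();
    }
    lru_.push_front(key);
    used_bytes_ += data.size();
    entries_.emplace(key, Entry{std::move(data), lru_.begin()});
  }

 private:
  struct Entry {
    std::vector<uint8_t> data;                 // Stand-in for an on-disk extent.
    std::list<std::string>::iterator lru_pos;  // Position in the LRU list.
  };

  std::mutex lock_;
  const size_t capacity_bytes_;
  size_t used_bytes_ = 0;
  std::list<std::string> lru_;                      // Front = most recently used.
  std::unordered_map<std::string, Entry> entries_;  // Key -> cached bytes.
};
{code}
In a real storage-backed cache the payload would live in files on local disk, with the in-memory map holding only the offset and length of each cached block; reading and writing those files outside the lock is the kind of locking improvement suggested above, and sharding the cache into several such instances is a common way to further reduce contention.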
Attachments
Issue Links
- breaks
  - IMPALA-8562 Data cache should skip scan range with mtime == -1 (Resolved)
  - IMPALA-8496 test_data_cache.py is flaky (Resolved)
  - IMPALA-8512 Data cache tests failing on older CentOS 6 versions (Resolved)
- is related to
  - IMPALA-4568 Cache Parquet footer to speedup scans & predicate evaluation against Min/Max indexes (Open)
  - IMPALA-8691 Query hint for disabling data caching (Open)
  - IMPALA-9819 Separate data cache and HDFS scan node runtime profile metrics (Open)
  - IMPALA-8542 Access trace collection for data cache (Resolved)
  - IMPALA-8690 Better eviction algorithm for data cache (Resolved)
- relates to
  - IMPALA-6057 Cache Remote Reads (Resolved)
  - IMPALA-8495 Impala Doc: Document Data Read Cache (Closed)