Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
- None
- None
Description
In HDFS, scheduling PlanFragments according to block locations can improve the data locality of queries. However, there are scenarios in which loading and keeping the block locations brings no benefit and sometimes even becomes a burden.
In a Hadoop cluster with ~1000 nodes, the Impala cluster is deployed on only tens of compute nodes (i.e. nodes with small disks but more memory and powerful CPUs). Data locality is poor since most blocks have no replica on the Impala nodes. Network bandwidth is 1 Gbit/s, so remote reads are acceptable. Queries are only required to finish within 5 minutes.
Block location info is useless since the scheduler always comes up with the same plan.
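To illustrate why the locations add nothing in this scenario, here is a toy model of locality-aware assignment. It is only a sketch and not Impala's actual scheduler: the function name, the host names, and the round-robin fallback are all assumptions made for illustration. When no replica lives on an Impala node, the assignment is the same with or without block locations.

```python
from itertools import cycle


def assign_scan_ranges(block_replicas, executors):
    """Toy model: prefer an executor holding a replica, else round-robin."""
    executor_set = set(executors)
    fallback = cycle(executors)  # used only when no replica is local
    assignment = []
    for replicas in block_replicas:
        local = [host for host in replicas if host in executor_set]
        assignment.append(local[0] if local else next(fallback))
    return assignment


# Hypothetical hosts: Impala runs on 3 nodes, replicas sit on storage-only nodes.
executors = ["impala-1", "impala-2", "impala-3"]
known_locations = [
    ["dn-17", "dn-402", "dn-9"],
    ["dn-88", "dn-5", "dn-640"],
    ["dn-3", "dn-71", "dn-250"],
    ["dn-12", "dn-13", "dn-14"],
]
no_locations = [[] for _ in known_locations]  # block locations never loaded

plan_with = assign_scan_ranges(known_locations, executors)
plan_without = assign_scan_ranges(no_locations, executors)
print(plan_with == plan_without)  # True: every read is remote either way
```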
load_catalog_in_background is set to false since there are several PB of data in the Hive warehouse. If it is set to true, the Impala cluster cannot start up (catalogd keeps loading block locations until they fill its memory and it crashes).
Accessing a Hive table containing >10,000 partitions for the first time hangs for a long time. For some large tables it never finishes. Users are annoyed when they only want to describe the table or select a few partitions of it.
Block location info is a burden here since loading it dominates the query time, and in the end only a small portion of the loaded block location info is actually used.
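A rough count makes the imbalance concrete. In the sketch below only the 10,000-partition figure comes from this report. The files per partition, blocks per file, replication factor, and number of partitions a query touches are assumed values chosen purely for illustration.

```python
def location_entries(partitions, files_per_partition, blocks_per_file, replication):
    """Number of (block, replica host) entries for a set of partitions."""
    return partitions * files_per_partition * blocks_per_file * replication


# Assumed table shape (illustrative only, except the partition count).
FILES_PER_PARTITION = 50
BLOCKS_PER_FILE = 4
REPLICATION = 3

loaded = location_entries(10_000, FILES_PER_PARTITION, BLOCKS_PER_FILE, REPLICATION)
# Assume the query selects 5 partitions.
used = location_entries(5, FILES_PER_PARTITION, BLOCKS_PER_FILE, REPLICATION)

print(loaded)         # entries loaded on first access to the table
print(used)           # entries relevant to the query
print(used / loaded)  # fraction of the loaded entries that is used
```

Under these assumptions the first access loads 6,000,000 entries to serve a query that needs 3,000 of them.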
There are many ETL pipelines ingesting data into the Hive warehouse. Some tables are updated by replacing the whole data set. Some partitioned tables are updated by inserting new partitions.
Ad hoc queries used to be served by Presto. To introduce Impala as a replacement for Presto, we would have to add a REFRESH table step at the end of each pipeline, which takes great effort (many code changes to the existing warehouse).
IMPALA-4272 could solve this but has made no progress. If the file and block location metadata cache could be disabled, things would be simple.
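For reference, the extra step each pipeline would need might look like the following sketch. The helper names, the table name, and the impalad address are hypothetical. The REFRESH statement and the impala-shell -i and -q options are existing Impala features, but the exact invocation for a given deployment (authentication, port) is an assumption.

```python
import subprocess


def refresh_statement(table, partition=None):
    """Build a REFRESH statement, optionally limited to one partition."""
    stmt = f"REFRESH {table}"
    if partition:
        parts = []
        for key, value in partition.items():
            # Quote string partition values, leave numeric ones bare.
            rendered = f"'{value}'" if isinstance(value, str) else str(value)
            parts.append(f"{key}={rendered}")
        stmt += f" PARTITION ({', '.join(parts)})"
    return stmt


def impala_shell_command(statement, impalad="localhost:21000"):
    """Argument list for running one statement through impala-shell."""
    return ["impala-shell", "-i", impalad, "-q", statement]


def refresh_after_pipeline(table, partition=None, impalad="localhost:21000"):
    """Final pipeline step: tell Impala to reload the table's file metadata."""
    cmd = impala_shell_command(refresh_statement(table, partition), impalad)
    subprocess.run(cmd, check=True)  # raises if impala-shell fails


# Hypothetical table; printing the command instead of running it.
print(impala_shell_command(refresh_statement("warehouse.events", {"year": 2018, "month": 6})))
```

Every pipeline that replaces a table or adds partitions would need a call like refresh_after_pipeline appended to it, which is the code change this request aims to avoid.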
IMPALA-3127 is related, but we hope it is possible to not keep the block locations.
Attachments
Issue Links
- is related to: IMPALA-7077 Add a configuration for the maximum number of partitions to load (Open)
- relates to: IMPALA-5931 Don't synthesize block metadata in the catalog for S3/ADLS (Closed)