Details
Type: Improvement
Status: Open
Priority: Critical
Resolution: Unresolved
Description
File names under a table usually share substrings such as the query id, job id, or task id. We can compress them to save memory. This matters most for tables with the small-files problem, where the memory footprint of the metadata cache is dominated by encodedFileDescriptors.
An experiment shows that an HdfsTable with 67708 partitions and 3167561 files on S3 takes 605MB of memory, 80% of which is spent on encodedFileDescriptors. Each encodedFileDescriptor is a byte array that takes 160B. Code:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723
Files of that table are created by Spark jobs. Here are some file names inside the same partition:
part-00000-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00001-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00002-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00003-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00004-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00005-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00006-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00007-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00008-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00009-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00010-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00011-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00012-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00013-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00014-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00015-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
By compressing the encodedFileDescriptors inside the same partition, we should be able to save a significant amount of memory in this case. Compressing all of them together across the whole table might save even more, but it would hurt performance when a coordinator loads specific partitions from catalogd.
We could consider doing this only for partitions whose number of files exceeds a threshold (e.g. 10).
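To make the proposal concrete, here is a minimal sketch of per-partition compression. It is not Impala code: the class PartitionFdCompressor, its methods, and the constant MIN_FILES_TO_COMPRESS are hypothetical names, and DEFLATE (java.util.zip) is only one possible codec. The real encodedFileDescriptors hold more than the file name, so the example uses the file names above as stand-ins.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper (not part of Impala): packs one partition's descriptors into one blob. */
final class PartitionFdCompressor {
  // Assumed threshold taken from the "e.g. 10" in the proposal.
  static final int MIN_FILES_TO_COMPRESS = 10;

  /** Only partitions whose file count exceeds the threshold are worth compressing. */
  static boolean shouldCompress(int numFiles) {
    return numFiles > MIN_FILES_TO_COMPRESS;
  }

  /** Length-prefixes each descriptor and deflates the concatenation as a single stream. */
  static byte[] compress(byte[][] descriptors) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    // Closing the DataOutputStream also finishes the deflater and flushes into 'bytes'.
    try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(bytes))) {
      out.writeInt(descriptors.length);
      for (byte[] d : descriptors) {
        out.writeInt(d.length);
        out.write(d);
      }
    }
    return bytes.toByteArray();
  }

  /** Inverse of compress(): restores every descriptor of the partition. */
  static byte[][] decompress(byte[] blob) throws IOException {
    try (DataInputStream in = new DataInputStream(
        new InflaterInputStream(new ByteArrayInputStream(blob)))) {
      byte[][] descriptors = new byte[in.readInt()][];
      for (int i = 0; i < descriptors.length; i++) {
        descriptors[i] = new byte[in.readInt()];
        in.readFully(descriptors[i]);
      }
      return descriptors;
    }
  }

  public static void main(String[] args) throws IOException {
    String suffix =
        "-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000";
    byte[][] names = new byte[16][];
    int rawSize = 0;
    for (int i = 0; i < names.length; i++) {
      names[i] = String.format("part-%05d%s", i, suffix).getBytes(StandardCharsets.UTF_8);
      rawSize += names[i].length;
    }
    if (shouldCompress(names.length)) {
      byte[] blob = compress(names);
      System.out.println("raw=" + rawSize + "B compressed=" + blob.length + "B");
    }
  }
}
```

The trade-off in this sketch is that reading any single descriptor requires inflating the whole partition's blob, which is one reason to keep the compression unit at the partition level rather than the table level.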