IMPALA-13177

Compress encodedFileDescriptors inside the same partition

Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Catalog

    Description

      File names under a table usually share substrings such as the query id, job id, or task id, so they can be compressed to save memory. This matters most for tables with the small-files problem, where encodedFileDescriptors dominate the memory footprint of the metadata cache.

      In an experiment, an HdfsTable with 67708 partitions and 3167561 files on S3 takes 605MB of memory, about 80% of which is spent on encodedFileDescriptors. Each encodedFileDescriptor is a byte array that takes 160B. Code:
      https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723
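
As a quick sanity check, the figures above can be tied together with a few lines of arithmetic. This is only a sketch: it assumes "605MB" means MiB and that the 160B per descriptor is the full per-array cost; the constant names are made up for illustration.

```python
# Figures quoted in the description above.
NUM_PARTITIONS = 67_708
NUM_FILES = 3_167_561
BYTES_PER_DESCRIPTOR = 160
TABLE_FOOTPRINT_MIB = 605  # assumption: "605MB" is read as MiB

# Total memory held by the per-file byte arrays.
descriptor_bytes = NUM_FILES * BYTES_PER_DESCRIPTOR
descriptor_mib = descriptor_bytes / (1024 * 1024)

# Fraction of the table footprint spent on descriptors.
share = descriptor_mib / TABLE_FOOTPRINT_MIB

# Average number of files (and thus descriptors) per partition.
files_per_partition = NUM_FILES / NUM_PARTITIONS

print(f"descriptors: {descriptor_mib:.1f} MiB ({share:.0%} of the table)")
print(f"average files per partition: {files_per_partition:.1f}")
```

Under those assumptions the descriptors add up to roughly 483 MiB, which matches the reported 80%, and a partition holds about 47 files on average.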

      Files of that table are created by Spark jobs. Here are some file names inside the same partition:

      part-00000-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00001-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00002-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00003-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00004-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00005-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00006-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00007-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00008-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00009-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00010-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00011-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00012-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00013-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00014-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
      part-00015-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000 

      By compressing the encodedFileDescriptors inside the same partition, we should be able to save a significant amount of memory in this case. Compressing all of them across the whole table might save even more, but it would hurt performance when a coordinator loads specific partitions from catalogd.

      We could do this only for partitions whose number of files exceeds a threshold (e.g. 10).
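
The idea can be sketched outside Impala as follows. This is a minimal illustration, not the proposed implementation: it uses the file names listed above as stand-ins for the real encodedFileDescriptors (which carry more than the name), picks zlib as the codec, and the names `MIN_FILES_TO_COMPRESS`, `pack`, and `unpack` are hypothetical.

```python
import struct
import zlib

# Hypothetical threshold: only partitions with more files than this are compressed.
MIN_FILES_TO_COMPRESS = 10


def pack(descriptors):
    """Return (is_compressed, payload) for one partition's descriptors."""
    if len(descriptors) <= MIN_FILES_TO_COMPRESS:
        # Small partition: keep the plain list of byte arrays.
        return False, list(descriptors)
    # Length-prefix each descriptor so the list can be split again later,
    # then compress the whole partition as a single blob.
    framed = b"".join(struct.pack(">I", len(d)) + d for d in descriptors)
    return True, zlib.compress(framed)


def unpack(is_compressed, payload):
    """Inverse of pack(): recover the list of descriptor byte arrays."""
    if not is_compressed:
        return list(payload)
    framed = zlib.decompress(payload)
    out, pos = [], 0
    while pos < len(framed):
        (length,) = struct.unpack_from(">I", framed, pos)
        pos += 4
        out.append(framed[pos:pos + length])
        pos += length
    return out


# Stand-in data: the 16 Spark file names from the partition shown above.
SUFFIX = "-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000"
names = [f"part-{i:05d}{SUFFIX}".encode() for i in range(16)]

is_compressed, payload = pack(names)
raw_size = sum(len(n) for n in names)
print(f"raw: {raw_size} bytes, stored: {len(payload)} bytes")
```

Because the names differ only in the task number, the shared suffix is stored roughly once in the compressed blob. The trade-off is that reading any single descriptor requires decompressing the whole partition, which is why the blob is scoped to a partition rather than the table.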

      Attachments

        1. Selection_124.png
          57 kB
          Quanlong Huang

          People

            Assignee: Quanlong Huang (stigahuang)
            Reporter: Quanlong Huang (stigahuang)