Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-9695

Support incomplete partition spec in REFRESH statement

    XMLWordPrintableJSON

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Catalog
    • Labels:
      None
    • Epic Color:
      ghx-label-9

      Description

      We support explicitly specify a partition in the REFRESH statement. When users have several partitions to refresh, they have to trigger several REFRESH statements. Each REFRESH statement requires the table lock so they'll be executed in the catalogd one by one. What's worse, the table is updated (catalog version bumped) several times, which may cause catalogd propagates it several times to the coordinators. It's bad for huge tables that contain a large number of partitions. Their catalog objects have huge size since catalogd can't send incremental updates for only changed partitions.

      A possible scenario is hourly partitioned tables that have more than one level partition keys:

      create table hourly_part_tbl (id int, msg string)
      partitioned by (hour_id bigint, event_type bigint)
      

      Let's say there are 20 event_types. Every hour there will be 10 partitions generated with a new hour_id. If the retention time for this table is 2 years, the total number of partitions will be 2 * 365 * 24 * 20 = 175,200. The catalog object size for this table wil be huge, especially there will be many columns and hence incrementa stats in practise.

      Every hour, users have to run 20 REFRESH statements one by one on this table. The catalog server will send 20 updates to coordinators for this table. It's possible that catalogd is always busy in loading metadata for this table in a busy cluster (with many other tables).

      One possible solution is using REFRESH without the partition spec. Unfortunately, we still load FileStatus for all loaded partitions. It's possible that this single statement can't finish in an hour.

      Another solution is support REFRESH statement with incomplete partition spec. So users can use one statement:

      REFRESH hourly_part_tbl PARTITION(hour_id=xxx);
      

      Then catalogd only needs to acquire the table lock once and send its catalog update once.

      It'd also be usefull if we support non-equality predicates in the partition spec:

      REFRESH hourly_part_tbl PARTITION(hour_id >= xxx);
      

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              stigahuang Quanlong Huang
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: