IMPALA-7961

Concurrent catalog heavy workloads can cause queries with SYNC_DDL to fail fast

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.12.0, Impala 3.1.0
    • Fix Version/s: Impala 3.2.0
    • Component/s: Catalog
    • Labels: None

    Description

      When the catalog server is under heavy load from concurrent updates to catalog objects, queries run with SYNC_DDL can fail with the following message.

      User facing error message:

      ERROR: CatalogException: Couldn't retrieve the catalog topic version for the SYNC_DDL operation after 3 attempts.The operation has been successfully executed but its effects may have not been broadcast to all the coordinators.
      

      Exception from the catalog server log:

      I1031 00:00:49.168761 1127039 CatalogServiceCatalog.java:1903] Operation using SYNC_DDL is waiting for catalog topic version: 236535. Time to identify topic version (msec): 1088
      I1031 00:00:49.168824 1125528 CatalogServiceCatalog.java:1903] Operation using SYNC_DDL is waiting for catalog topic version: 236535. Time to identify topic version (msec): 12625
      I1031 00:00:49.168851 1131986 jni-util.cc:230] org.apache.impala.catalog.CatalogException: Couldn't retrieve the catalog topic version for the SYNC_DDL operation after 3 attempts.The operation has been successfully executed but its effects may have not been broadcast to all the coordinators.
              at org.apache.impala.catalog.CatalogServiceCatalog.waitForSyncDdlVersion(CatalogServiceCatalog.java:1891)
              at org.apache.impala.service.CatalogOpExecutor.execDdlRequest(CatalogOpExecutor.java:336)
              at org.apache.impala.service.JniCatalog.execDdl(JniCatalog.java:146)
      ::::
      

      What this means

      The catalog operation itself succeeds (the change has been committed to HMS and to the Catalog server cache), but the Catalog server notices that broadcasting the change is taking longer than expected (for whatever reason) and, instead of continuing to wait, fails fast. The coordinators are expected to eventually sync up in the background.
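
      The fail-fast behavior described above can be modeled with a short sketch. This is a simplified illustration, not Impala's actual code: the class SyncDdlFailFastSketch, the method findTopicVersion, the lookup callback, and the local CatalogException stand-in are all hypothetical names, and the limit of 3 attempts is taken only from the error message in the log.

```java
import java.util.OptionalLong;
import java.util.function.LongFunction;

/** Hypothetical model of a bounded, fail-fast wait; not Impala's actual code. */
public class SyncDdlFailFastSketch {
    /** Stand-in for the CatalogException seen in the catalog server log. */
    public static class CatalogException extends Exception {
        public CatalogException(String msg) { super(msg); }
    }

    /**
     * Asks the (hypothetical) lookup up to maxAttempts times which topic version
     * carries the change. Returns that version, or throws once attempts run out.
     * The lookup receives the 1-based attempt number.
     */
    public static long findTopicVersion(LongFunction<OptionalLong> lookup, int maxAttempts)
            throws CatalogException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            OptionalLong version = lookup.apply(attempt);
            if (version.isPresent()) {
                return version.getAsLong();
            }
            // A real server would block here until the next topic update before
            // retrying; that wait is omitted to keep the sketch runnable.
        }
        // By this point the DDL itself has already been applied; only the
        // "which topic version do I wait for" step has given up.
        throw new CatalogException("Couldn't retrieve the catalog topic version for the "
            + "SYNC_DDL operation after " + maxAttempts + " attempts.");
    }

    public static void main(String[] args) {
        // Lightly loaded catalog: the version is identified on the second attempt.
        try {
            long v = findTopicVersion(
                a -> a >= 2 ? OptionalLong.of(236535L) : OptionalLong.empty(), 3);
            System.out.println("wait for topic version " + v);
        } catch (CatalogException e) {
            System.out.println("unexpected: " + e.getMessage());
        }
        // Heavily loaded catalog: the version is never identified within 3 attempts.
        try {
            findTopicVersion(a -> OptionalLong.empty(), 3);
        } catch (CatalogException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

      The second call in main mirrors the reported symptom: the caller receives an error even though the underlying change was committed.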

      Problem

      • This violates the contract of the SYNC_DDL query option, since the query returns before the change is known to have reached all coordinators.
      • This is a behavioral regression from the pre-IMPALA-5058 state, in which queries would wait indefinitely for SYNC_DDL-based changes to propagate.
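
      For contrast, the contract that SYNC_DDL is expected to honor (return only once the change has reached every coordinator, however long that takes) can be sketched as an unbounded wait. This is an illustrative model only: the class SyncDdlContractSketch, the ack and awaitPropagation methods, the coordinator names, and the idea of per-coordinator acknowledgements are assumptions made for the example, not a description of how Impala tracks propagation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the SYNC_DDL contract; not Impala's actual code. */
public class SyncDdlContractSketch {
    // Last catalog topic version each coordinator has applied (names are made up).
    private final Map<String, Long> appliedVersion = new HashMap<>();

    public SyncDdlContractSketch(String... coordinators) {
        for (String c : coordinators) {
            appliedVersion.put(c, 0L);
        }
    }

    /** Called when a coordinator reports that it applied a topic update. */
    public synchronized void ack(String coordinator, long version) {
        // Keep the highest version seen so a stale ack cannot move a coordinator back.
        appliedVersion.merge(coordinator, version, Math::max);
        notifyAll();
    }

    /** True once every coordinator has applied at least the given version. */
    public synchronized boolean allAtLeast(long version) {
        return appliedVersion.values().stream().allMatch(v -> v >= version);
    }

    /** Blocks with no timeout and no attempt limit until the change is everywhere. */
    public synchronized void awaitPropagation(long version) throws InterruptedException {
        while (!allAtLeast(version)) {
            wait();
        }
    }

    public static void main(String[] args) throws Exception {
        SyncDdlContractSketch catalog = new SyncDdlContractSketch("coord-1", "coord-2");
        Thread updates = new Thread(() -> {
            catalog.ack("coord-1", 236535L);
            catalog.ack("coord-2", 236535L);
        });
        updates.start();
        catalog.awaitPropagation(236535L); // returns only after both acks
        updates.join();
        System.out.println("all coordinators caught up: " + catalog.allAtLeast(236535L));
    }
}
```

      The trade-off is the one this issue describes: an unbounded wait keeps the contract but can hang when propagation stalls, while a bounded wait avoids the hang at the cost of returning early.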

      Notes

      • Introduced by IMPALA-5058
      • Based on the occurrences of this issue, we narrowed it down to a specific kind of DDL (see Jira comments).
      • My understanding is that this also applies to Catalog V2 (LocalCatalog mode), since we still rely on the CatalogServer for DDL orchestration and DDLs therefore take this code path.

      Attachments

        1. 0001-Repro-of-IMPALA-7961.patch
          2 kB
          Bharath Vissapragada

            People

              Assignee: Bharath Vissapragada (bharathv)
              Reporter: Bharath Vissapragada (bharathv)
              Votes: 0
              Watchers: 6
