IMPALA-2201

Compute [incremental] stats may not persist the stats if the data was loaded from Hive with hive.stats.autogather=true.

Details

    Description

      Symptoms of This Bug

      • Stats have been computed, but the row count reverts to -1 after an INVALIDATE METADATA
      • COMPUTE [INCREMENTAL] STATS appears not to set the row count

      Example scenario where this bug may happen:
      1. A new partition with new data is loaded into a table via Hive
      2. Hive has hive.stats.autogather=true
      3. Stats on the new partition are computed in Impala with COMPUTE INCREMENTAL STATS <partition>
      4. At this point, SHOW TABLE STATS shows the correct row count
      5. INVALIDATE METADATA is run on the table in Impala
      6. The row count reverts back to -1 because the stats have not been persisted

      Explanation for This Bug
      Here is why the row count is reset to -1. When hive.stats.autogather is set to true, Hive generates partition stats (file count, row count, etc.) after creating the partition. If you then run COMPUTE INCREMENTAL STATS in Impala, it computes the same row count that Hive already stored, so the following check in Impala's CatalogOpExecutor.java is not satisfied and StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK is not set:

      ...
            // Update table stats
            if (existingRowCount == null || !existingRowCount.equals(newRowCount)) {
              // The existing row count value wasn't set or has changed.
              msPartition.putToParameters(StatsSetupConst.ROW_COUNT, newRowCount);
              msPartition.putToParameters(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK,
                  StatsSetupConst.TRUE);
              updatedPartition = true;
            }
      ...
      

      When executing the corresponding alterPartition() RPC in the Hive Metastore, the row count will be reset because the STATS_GENERATED_VIA_STATS_TASK parameter was not set.
      Snippet from Hive's MetaStoreUtils.java:

      ...
      public static boolean updatePartitionStatsFast(PartitionSpecProxy.PartitionIterator part, Warehouse wh,
            boolean madeDir, boolean forceRecompute) throws MetaException {
      ...
              if(!params.containsKey(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK)) {
                // invalidate stats requiring scan since this is a regular ddl alter case
                for (String stat : StatsSetupConst.statsRequireCompute) {
                  params.put(stat, "-1");
                }
                params.put(StatsSetupConst.COLUMN_STATS_ACCURATE, StatsSetupConst.FALSE);
              }
      ...
      

      So if partition stats already exist but were not computed by Impala, and COMPUTE INCREMENTAL STATS arrives at the same row count, the stats are reset to -1.
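
The interaction between the two snippets can be reproduced with a small toy model. This is only a sketch, not Impala or Hive code: the class and method names are hypothetical, the partition parameters are reduced to a plain map, and apart from 'numRows' the key strings are assumed stand-ins for the StatsSetupConst constants. It also assumes the partition starts out without the STATS_GENERATED_VIA_STATS_TASK marker.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the interaction described above; not Impala or Hive code. */
public class StatsResetSketch {
  // Stand-in keys. "numRows" appears in the workaround; the other strings are assumed.
  static final String ROW_COUNT = "numRows";
  static final String GENERATED_VIA_STATS_TASK = "STATS_GENERATED_VIA_STATS_TASK";
  static final String COLUMN_STATS_ACCURATE = "COLUMN_STATS_ACCURATE";

  /** Mirrors the conditional update in the CatalogOpExecutor snippet. */
  static void impalaComputeStats(Map<String, String> params, String newRowCount) {
    String existingRowCount = params.get(ROW_COUNT);
    if (existingRowCount == null || !existingRowCount.equals(newRowCount)) {
      params.put(ROW_COUNT, newRowCount);
      params.put(GENERATED_VIA_STATS_TASK, "true");
    }
  }

  /** Mirrors the reset in the updatePartitionStatsFast() snippet (row count only). */
  static void metastoreAlterPartition(Map<String, String> params) {
    if (!params.containsKey(GENERATED_VIA_STATS_TASK)) {
      params.put(ROW_COUNT, "-1");
      params.put(COLUMN_STATS_ACCURATE, "false");
    }
  }

  /** Runs the stats update followed by alterPartition; returns the persisted row count. */
  static String persistedRowCount(String existingRowCount, String computedRowCount) {
    Map<String, String> params = new HashMap<>();
    if (existingRowCount != null) {
      params.put(ROW_COUNT, existingRowCount);
    }
    impalaComputeStats(params, computedRowCount);
    metastoreAlterPartition(params);
    return params.get(ROW_COUNT);
  }

  public static void main(String[] args) {
    // Hive autogather already stored the same count: the stats are wiped.
    System.out.println(persistedRowCount("100", "100")); // -1
    // No prior row count: the stats persist.
    System.out.println(persistedRowCount(null, "100"));  // 100
    // Row count set to -1 beforehand: the stats persist.
    System.out.println(persistedRowCount("-1", "100"));  // 100
  }
}
```

The third case shows why resetting numRows to -1 before recomputing helps: the existing and new row counts differ, so the marker is set and the Metastore leaves the stats alone.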

      Note that in Hive versions after CDH 5.3 this bug no longer occurs because the Hive Metastore does not call updatePartitionStatsFast() in the above workflow.

      Workarounds
      1. Disable stats autogathering in Hive when loading the data

      SET hive.stats.autogather=false;
      

      2. Manually alter the numRows to -1 before doing COMPUTE [INCREMENTAL] STATS in Impala

      ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES ('numRows'='-1');
      

      3. When already in the broken "-1" state, re-computing the stats for the affected partition fixes the problem

      Proposed Solution
      While this is arguably a Hive bug, I'd recommend that Impala unconditionally update the stats when running COMPUTE STATS. Making the behavior dependent on the existing metadata state is brittle and hard to reason about and debug, especially with Impala's metadata caching, where issues in stats persistence are only observable after an INVALIDATE METADATA.
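
As a rough illustration of the proposal, the update could drop the comparison against the existing row count entirely. The sketch below is not the actual fix: the class and method names are hypothetical, the partition parameters are modeled as a plain map, and the marker key string is an assumed stand-in for the StatsSetupConst constant.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of an unconditional stats update; not the actual Impala change. */
public class UnconditionalStatsUpdateSketch {
  // Stand-in keys. "numRows" appears in the workaround; the marker string is assumed.
  static final String ROW_COUNT = "numRows";
  static final String GENERATED_VIA_STATS_TASK = "STATS_GENERATED_VIA_STATS_TASK";

  /** Always writes the row count and the marker, whatever the existing value is. */
  static boolean updatePartitionStats(Map<String, String> params, String newRowCount) {
    params.put(ROW_COUNT, newRowCount);
    params.put(GENERATED_VIA_STATS_TASK, "true");
    return true; // the partition is always treated as updated
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put(ROW_COUNT, "100"); // same count already stored by Hive autogather
    updatePartitionStats(params, "100");
    System.out.println(params.containsKey(GENERATED_VIA_STATS_TASK)); // true
  }
}
```

With the marker always present, the Metastore check quoted earlier would not reset the row count, at the cost of sending an alterPartition() call even when nothing changed.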

            People

              alex.behm Alexander Behm