IMPALA-5197

Parquet scan may incorrectly report "Corrupt Parquet file" in the logs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend

    Description

      With IMPALA-5186, dhecht noticed messages like:

      I0407 12:57:05.306138 85140 status.cc:114] Corrupt Parquet file 'hdfs://vc0332.halxg.cloudera.com:8020/user/hive/warehouse/tpch_100_parquet.db/partsupp/3444dbb2ccec395e-45da764500000007_1009013170_data.0.parq': column 'ps_partkey' had 1024 remaining values but expected 0
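
When scanning impalad logs for occurrences of this message, it can help to pull the file, column, and value counts out of each line. The following is a minimal sketch, assuming the message always has the shape of the single line quoted above; the pattern and the helper name `parse_corrupt_line` are illustrative and not part of Impala.

```python
import re

# Pattern modeled on the one log line quoted in this report.
# Assumption: other variants of the message may exist and will not match.
CORRUPT_RE = re.compile(
    r"Corrupt Parquet file '(?P<path>[^']+)': "
    r"column '(?P<column>[^']+)' had (?P<remaining>\d+) remaining values "
    r"but expected (?P<expected>\d+)"
)


def parse_corrupt_line(line):
    """Return the fields of a 'Corrupt Parquet file' log line, or None."""
    match = CORRUPT_RE.search(line)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "column": match.group("column"),
        "remaining": int(match.group("remaining")),
        "expected": int(match.group("expected")),
    }
```

Grouping the parsed results by `path` and `column` would show whether the warnings cluster on particular files or are spread across the table.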
      

      I spent a bit more time investigating this. The problem can be reproduced, but only with difficulty, and from what I can tell it is non-deterministic.

      The stress test executes various COMPUTE STATS statements on the tables under test with different MT_DOP settings, and it applies a memory limit to each statement.

      Sometimes this triggers the "Corrupt Parquet file" warnings. When that happens, the COMPUTE STATS statement fails with "memory limit exceeded".

      For example, these queries reproduced the problem on the first try:

      set mem_limit=1225m;
      set mt_dop=16;
      compute stats tpcds_300_decimal_parquet.store_sales;
      
      set mem_limit=527m;
      set mt_dop=4;
      compute stats tpcds_300_decimal_parquet.store_sales;
      

      These memory limits are right on the edge of what the statement apparently needs. Sometimes the statement appeared to succeed completely; other times it failed under the memory limit, but no "Corrupt Parquet file" messages were printed.
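
Because the outcome varies from run to run, repeating the same statements several times is one way to look for the warning. The sketch below only generates the statement text for a list of (memory limit in MB, MT_DOP) pairs; the helper names `repro_statements` and `repro_script` and the `repeats` parameter are hypothetical, and how the resulting script is submitted to Impala is left open.

```python
def repro_statements(table, mem_limit_mb, mt_dop):
    """Build the three statements used in the report for one attempt."""
    return [
        f"set mem_limit={mem_limit_mb}m;",
        f"set mt_dop={mt_dop};",
        f"compute stats {table};",
    ]


def repro_script(table, settings, repeats=1):
    """Concatenate attempts for each (mem_limit_mb, mt_dop) pair.

    `repeats` is an illustrative knob: the failure is non-deterministic,
    so each setting may need to be tried more than once.
    """
    lines = []
    for mem_limit_mb, mt_dop in settings:
        for _ in range(repeats):
            lines.extend(repro_statements(table, mem_limit_mb, mt_dop))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The two settings quoted in the report.
    print(repro_script(
        "tpcds_300_decimal_parquet.store_sales",
        [(1225, 16), (527, 4)],
        repeats=3,
    ))
```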

            People

              Assignee: Michael Ho (kwho)
              Reporter: Michael Brown (mikeb)
              Votes: 0
              Watchers: 4
