IMPALA-5197

Parquet scan may incorrectly report "Corrupt Parquet file" in the logs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend

    Description

      With IMPALA-5186, Daniel Hecht noticed messages like:

      I0407 12:57:05.306138 85140 status.cc:114] Corrupt Parquet file 'hdfs://vc0332.halxg.cloudera.com:8020/user/hive/warehouse/tpch_100_parquet.db/partsupp/3444dbb2ccec395e-45da764500000007_1009013170_data.0.parq': column 'ps_partkey' had 1024 remaining values but expected 0
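When grepping impalad logs for this message, it can help to pull out the file, the column, and the two counts. The following is a minimal sketch: the regular expression is modeled only on the single log line quoted above, and the names `CORRUPT_RE`, `CorruptReport`, and `parse_corrupt_line` are hypothetical helpers, not part of Impala.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the one log line quoted in this issue (assumption:
# other Impala versions may word the message differently).
CORRUPT_RE = re.compile(
    r"Corrupt Parquet file '(?P<path>[^']+)': "
    r"column '(?P<column>[^']+)' had (?P<remaining>\d+) remaining values "
    r"but expected (?P<expected>\d+)"
)


class CorruptReport(NamedTuple):
    """Fields extracted from one "Corrupt Parquet file" log line."""
    path: str
    column: str
    remaining: int
    expected: int


def parse_corrupt_line(line: str) -> Optional[CorruptReport]:
    """Return the parsed fields, or None if the line is not a corrupt-file report."""
    m = CORRUPT_RE.search(line)
    if m is None:
        return None
    return CorruptReport(
        path=m["path"],
        column=m["column"],
        remaining=int(m["remaining"]),
        expected=int(m["expected"]),
    )
```

Counting the parsed reports per column or per file across a stress run gives a quick view of whether the message is tied to particular data files.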
      

      I spent a bit more time investigating this. The problem can be reproduced, though with some difficulty, and from what I can tell it is non-deterministic.

      The stress test executes various COMPUTE STATS statements on the tables under test, with different MT_DOP settings. It also applies a memory limit to each statement.

      Occasionally a statement triggers these "Corrupt Parquet file" warnings. When that happens, the COMPUTE STATS fails with "memory limit exceeded".

      For example, these queries reproduced the problem on the first try:

      set mem_limit=1225m;
      set mt_dop=16;
      compute stats tpcds_300_decimal_parquet.store_sales;
      
      set mem_limit=527m;
      set mt_dop=4;
      compute stats tpcds_300_decimal_parquet.store_sales;
      

      These memory limits are right on the edge of what the statement apparently needs. Sometimes the statement would appear to succeed completely; other times it would fail under the memory limit, but without printing any "Corrupt Parquet file" messages.
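Because the outcome varies from run to run, a repro attempt usually means rerunning the same statements several times. The sketch below is one way to do that, not the stress test itself. It assumes `impala-shell` is on the PATH, that an impalad is reachable at `localhost:21000`, and that `-q` accepts several semicolon-separated statements (if it does not in your version, write them to a file and use `-f`). The function names are hypothetical.

```python
import subprocess
from typing import Callable, List


def build_statements(mem_limit: str, mt_dop: int, table: str) -> str:
    """Build the SET/COMPUTE STATS sequence used in the repro above."""
    return (
        f"set mem_limit={mem_limit}; "
        f"set mt_dop={mt_dop}; "
        f"compute stats {table};"
    )


def build_command(query: str, impalad: str = "localhost:21000") -> List[str]:
    """Build an impala-shell invocation (the impalad address is an assumption)."""
    return ["impala-shell", "-i", impalad, "-q", query]


def run_attempts(query: str, attempts: int,
                 runner: Callable = subprocess.run) -> List[int]:
    """Run the query `attempts` times and collect the shell's exit codes."""
    codes = []
    for _ in range(attempts):
        result = runner(build_command(query), capture_output=True, text=True)
        codes.append(result.returncode)
    return codes


if __name__ == "__main__":
    q = build_statements("527m", 4, "tpcds_300_decimal_parquet.store_sales")
    print(run_attempts(q, attempts=5))
```

A non-zero exit code only indicates that the statement failed. Whether a given failure also produced the "Corrupt Parquet file" message still has to be checked in the impalad log.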


          People

            Assignee: Michael Ho (kwho)
            Reporter: Michael Brown (mikeb)