IMPALA-3376: Off-by-one error with RLE-encoded definition levels when writing Parquet files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: Impala 2.5.0
    • Fix Version: Impala 2.7.0
    • Component: Backend

    Description

      It appears that Impala has an off-by-one error when writing the RLE-encoded definition levels of a Parquet file. According to the data page statistics there should be N values, but the run lengths (repeat counts) of the RLE-encoded definition levels add up to N+1.
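
The check described above can be sketched outside of Impala. The following is a minimal illustration, not Impala's code: it walks the run headers of Parquet's RLE/bit-packed hybrid encoding and sums the number of levels they claim to hold. The function names and the synthetic single-run buffer are assumptions made for the example; the two counts are taken from the log lines in this report. Note that bit-packed runs are padded to groups of 8 values, so a surplus is only suspicious for pure RLE runs, as in this case (literal_count_=0).

```python
def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 varint; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def write_uleb128(value):
    """Encode an unsigned integer as an LEB128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_rle_run(value, run_length, bit_width):
    """Build a single RLE run: header (run_length << 1) plus the repeated value."""
    value_bytes = (bit_width + 7) // 8
    return write_uleb128(run_length << 1) + value.to_bytes(value_bytes, "little")


def count_encoded_levels(buf, bit_width):
    """Sum the value counts claimed by every run header in the buffer."""
    value_bytes = (bit_width + 7) // 8
    pos, total = 0, 0
    while pos < len(buf):
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values each.
            groups = header >> 1
            total += groups * 8
            pos += groups * bit_width
        else:
            # RLE run: (header >> 1) repetitions of one value.
            total += header >> 1
            pos += value_bytes
    return total


# Synthetic buffer mimicking the logged case: one RLE run of 173601 levels
# for a data page whose header says it holds 173600 values.
num_buffered_values = 173600
levels = encode_rle_run(1, 173601, bit_width=1)
encoded = count_encoded_levels(levels, bit_width=1)
print(encoded - num_buffered_values)  # prints 1
```

A difference of 1 here corresponds to the N+1 versus N discrepancy described above.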

      The best reproduction I could come up with relies on applying the attached patch that adds logging in the right places to make the problem obvious.

      The patch was produced by running "git diff > add_logging.patch"

      1. Apply the patch and compile.
      2. Start Impala with a single impalad.
      3. Create a new Parquet table as follows:

      set num_nodes=1;
      set num_scanner_threads=1;
      create table bug_test stored as parquet as
      select l_linenumber from tpch_parquet.lineitem limit 180000;
      

      4. Run a query that shows the problem:

      set num_nodes=1;
      set num_scanner_threads=1;
      select count(l_linenumber) from bug_test;
      

      5. You should find the following in the logs:

      E0419 00:33:50.665441  6551 hdfs-parquet-scanner.cc:1218] Bugchase num_buffered_values_=173600
      E0419 00:33:50.665606  6551 rle-encoding.h:251] Bugchase repeat_count_=173601 literal_count_=0
      

      6. As you can see, the repeat count is greater than the number of values in the data page.

      For the most part, this issue appears to be benign because Impala, Hive and parquet-tools produce the correct result. On the other hand, we will need to be cognizant of this discrepancy moving forward and possibly maintain workarounds on the read path.
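
One possible shape for such a read-path workaround is to clamp the decoded runs to the value count from the data page header. This is a hypothetical sketch under that assumption, not the fix that was applied to Impala; `clamp_runs` and its `(value, run_length)` input format are invented for the illustration.

```python
def clamp_runs(runs, num_values):
    """Trim decoded (value, run_length) pairs so they cover exactly num_values levels.

    Surplus levels at the end (such as the extra one in this report) are dropped.
    A page that encodes too few levels is treated as corrupt.
    """
    remaining = num_values
    clamped = []
    for value, run_length in runs:
        if remaining == 0:
            break
        take = min(run_length, remaining)
        clamped.append((value, take))
        remaining -= take
    if remaining:
        raise ValueError(f"definition levels are short by {remaining} values")
    return clamped


# The logged case: a single run of 173601 levels for a 173600-value page.
print(clamp_runs([(1, 173601)], 173600))  # prints [(1, 173600)]
```

Trusting the page header over the run lengths matches the observation that readers currently produce correct results despite the extra level.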

      FWIW, I tried using parquet-tools to show the problem without having to patch Impala but I was not successful.

      I also tried writing the same test table with Hive and confirmed that Hive writes the file correctly.

      Attachments

        1. add_logging.patch
          1.0 kB
          Alexander Behm

          People

            Assignee: twmarshall Thomas Tauber-Marshall
            Reporter: alex.behm Alexander Behm