IMPALA-11134

Impala returns "Couldn't skip rows in file" error for old Parquet file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: Impala 4.1.0

    Description

      Impala returns a "Couldn't skip rows in file" error for old Parquet files written by an old Impala version (e.g. Impala 2.5, 2.6).

      In a DEBUG build, Impala crashes on a DCHECK:

      F0217 18:21:34.449540 24288 parquet-column-readers.cc:1611] d3407555528be8a8:5ea3fceb00000001] Check failed: num_buffered_values_ > 0 (-1 vs. 0)
      

      The problem is that in some old Parquet files there can be a mismatch between 'num_values' in a page and the number of encoded def/rep levels. These files usually have one more def/rep level encoded than 'num_values' indicates.

      In SkipTopLevelRows() we skip values based on how many def levels are left:
      https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314

      Since there are more def levels than values, num_buffered_values_ becomes -1. I looked at Parquet files written by newer Impala versions, and there the number of def levels matches the number of values.

      The workaround is fairly easy: we can also take num_buffered_values_ into account when calculating 'read_count', i.e. min(min(num_buffered_values_, num_rows - i), repeated_run_length), so that we can deal with such files.
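
The effect of the extra bound can be illustrated with a minimal, self-contained sketch. This is not the actual Impala code: PageState, SkipRows and the clamp_to_buffered flag are hypothetical names made up for the illustration, all remaining def levels are treated as a single repeated run, and the sketch simply stops at the end of the page instead of advancing to the next one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-page state of a column reader.
struct PageState {
  int64_t num_buffered_values;  // taken from 'num_values' of the page
  int64_t def_levels_left;      // def levels actually encoded in the page
};

// Skips up to 'num_rows' top-level rows within one page and returns how many
// rows were skipped. With clamp_to_buffered == false, read_count depends only
// on the def levels that are left. With clamp_to_buffered == true, read_count
// is additionally bounded by num_buffered_values, as proposed above.
int64_t SkipRows(PageState* page, int64_t num_rows, bool clamp_to_buffered) {
  assert(num_rows >= 0);
  int64_t i = 0;
  while (i < num_rows && page->def_levels_left > 0) {
    // Simplification: all remaining def levels form one repeated run.
    int64_t repeated_run_length = page->def_levels_left;
    int64_t read_count = std::min(num_rows - i, repeated_run_length);
    if (clamp_to_buffered) {
      read_count = std::min(read_count, page->num_buffered_values);
      if (read_count == 0) break;  // page exhausted, ignore surplus levels
    }
    page->def_levels_left -= read_count;
    page->num_buffered_values -= read_count;
    i += read_count;
  }
  return i;
}
```

For a page with num_values = 3 but 4 encoded def levels, skipping 4 rows without the clamp drives num_buffered_values to -1, which matches the failed DCHECK above. With the clamp, only 3 rows are skipped from this page and num_buffered_values ends at 0.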

          People

            Assignee: Zoltán Borók-Nagy
            Reporter: Zoltán Borók-Nagy