IMPALA-4725

Wrong field resolution of nested Parquet fields

Details

    Description

      It seems that the Parquet reader mixes up values from different definition levels when reading arrays of structs that contain nested structs, if scalar fields in the inner and outer structs have the same type and the same position.

      Consider the following schema:

      message hive_schema {
        optional group array_column (LIST) {
          repeated group bag {
            optional group array_element {
              optional group c11 {
                optional int32 c21;
                optional int32 c22;
                optional int32 c23;
                optional int64 c24;
              }
              optional int32 c12;
              optional int64 c13;
              optional int64 c14;
            }
          }
        }
      }
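For reference, Parquet stores every leaf field of this schema as its own column, and the maximum definition level of a column is the number of optional or repeated fields on its path. The sketch below lists those columns for the schema above. The tuple-based `SCHEMA` model and the `leaf_columns` helper are hypothetical, written only for illustration; they are not part of Impala, Hive, or any Parquet library. Note that the outer fields (c12, c13, c14) end up with the smaller maximum definition level; the description uses "higher" for fields that sit closer to the root of the nesting.

```python
# Hypothetical in-memory model of the schema above:
# each node is (name, repetition, type_name_or_list_of_children).
SCHEMA = ("array_column", "optional", [
    ("bag", "repeated", [
        ("array_element", "optional", [
            ("c11", "optional", [
                ("c21", "optional", "int32"),
                ("c22", "optional", "int32"),
                ("c23", "optional", "int32"),
                ("c24", "optional", "int64"),
            ]),
            ("c12", "optional", "int32"),
            ("c13", "optional", "int64"),
            ("c14", "optional", "int64"),
        ]),
    ]),
])


def leaf_columns(node, path=(), def_level=0):
    """Yield (dotted_path, type, max_definition_level) for every leaf field."""
    name, repetition, body = node
    path = path + (name,)
    # Every optional or repeated field on the path adds one definition level.
    if repetition != "required":
        def_level += 1
    if isinstance(body, str):
        yield ".".join(path), body, def_level
    else:
        for child in body:
            yield from leaf_columns(child, path, def_level)


LEAVES = list(leaf_columns(SCHEMA))
for column_path, column_type, max_def in LEAVES:
    print(f"{column_path:45} {column_type:6} max definition level {max_def}")
```

The four fields inside c11 come out with a maximum definition level of 5, and c12, c13 and c14 with 4.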
      

      If we use Hive to populate a Parquet-based table that has just this one column and a single row, reading the values back with Hive gives the correct result. Impala, however, returns values from a higher definition level if they are stored at the same position and are of the same type. In the example below we have populated each field with a value matching its name for clarity, e.g. field c13 holds the integer 13.

      Query in Hive:

      SELECT
        array_column[0].c11.c21,
        array_column[0].c11.c22,
        array_column[0].c11.c23,
        array_column[0].c11.c24,
        array_column[0].c12,
        array_column[0].c13,
        array_column[0].c14
      FROM sample;
      

      Result from Hive (as expected):

      c21 c22 c23 c24 c12 c13 c14
      21  22  23  24  12  13  14
      

      Query in Impala:

      SELECT
        c11.c21 AS c21,
        c11.c22 AS c22,
        c11.c23 AS c23,
        c11.c24 AS c24,
        c12,
        c13,
        c14
      FROM sample.array_column;
      

      Result from Impala (incorrect):

      c21 c22 c23 c24 c12 c13 c14
      21  12  23  14  12  13  14
      

      As can be seen above, c22 is 12 when it should be 22; the value is taken from c12, and both fields are int32. Likewise, c24 is 14 when it should be 24; the value is taken from c14, and both fields are int64. Note that c23 is correct: it is int32, whereas c13, the field at the same position in the outer struct, is int64. The pattern seems to be that if a field at a higher definition level has the same type and the same position, the reader prefers its value over the correct one.
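The suspected pattern can be written down as a small check. This is only a sketch of the hypothesis stated above (same position and same type means the outer value wins); it is not taken from the Impala reader, and the `predicted_collisions` helper and the two field lists are illustrative names. The lists mirror the field order of the outer struct and of c11 in the schema.

```python
# Fields in declaration order, as (name, type). "struct" marks the nested group.
OUTER = [("c11", "struct"), ("c12", "int32"), ("c13", "int64"), ("c14", "int64")]
INNER = [("c21", "int32"), ("c22", "int32"), ("c23", "int32"), ("c24", "int64")]


def predicted_collisions(inner, outer):
    """Map each inner field to the outer field the hypothesis says shadows it.

    A collision is predicted when the fields at the same position in the
    inner and outer struct have the same type.
    """
    collisions = {}
    for (inner_name, inner_type), (outer_name, outer_type) in zip(inner, outer):
        if inner_type == outer_type:
            collisions[inner_name] = outer_name
    return collisions


COLLISIONS = predicted_collisions(INNER, OUTER)
print(COLLISIONS)
```

For this schema the check flags c22 (shadowed by c12) and c24 (shadowed by c14) and leaves c21 and c23 alone, which matches the Impala result shown above.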

      We have only seen this issue when the struct c11 is placed as the first field in the outer struct.

      Attaching a sample Parquet file. The statement for creating a table on top of that file is:

      CREATE EXTERNAL TABLE sample (
      array_column array<struct<c11:struct<c21:int,c22:int,c23:int,c24:bigint>,c12:int,c13:bigint,c14:bigint>>
      )
      STORED AS PARQUET LOCATION "/path/to/sample";
      

      Attachments

        1. sample.parq
          1 kB
          Petter von Dolwitz

            People

              alex.behm Alexander Behm
              Pettax Petter von Dolwitz