IMPALA-4725: Wrong field resolution of nested Parquet fields

    Details

    • Docs Text:
      This needs to be documented once it has been fixed.
    • Target Version:
      Impala 2.9

      Description

      It seems like values from different definition levels get mixed up by the Parquet reader when reading arrays of structs that contain nested structs, if the scalar values in the structs have the same type and the same position.

      Consider the following schema:

      message hive_schema {
        optional group array_column (LIST) {
          repeated group bag {
            optional group array_element {
              optional group c11 {
                optional int32 c21;
                optional int32 c22;
                optional int32 c23;
                optional int64 c24;
              }
              optional int32 c12;
              optional int64 c13;
              optional int64 c14;
            }
          }
        }
      }
      

      If we populate a Parquet-based table using Hive with just this one column and a single row, we get the correct result when reading the values back using Hive. Impala, however, returns values from a higher definition level if they are stored at the same position and have the same type. In the example below we have populated the fields with values matching the field names for clarity, e.g. field c13 holds the integer 13.

      Query in Hive:

      SELECT
        array_column[0].c11.c21,
        array_column[0].c11.c22,
        array_column[0].c11.c23,
        array_column[0].c11.c24,
        array_column[0].c12,
        array_column[0].c13,
        array_column[0].c14
      FROM sample;
      

      Result from Hive (as expected):

      c21 c22 c23 c24 c12 c13 c14
      21  22  23  24  12  13  14
      

      Query in Impala:

      SELECT
        c11.c21 AS c21,
        c11.c22 AS c22,
        c11.c23 AS c23,
        c11.c24 AS c24,
        c12,
        c13,
        c14
      FROM sample.array_column;
      

      Result from Impala (incorrect):

      c21 c22 c23 c24 c12 c13 c14
      21  12  23  14  12  13  14
      

      As can be seen above, the value of c22 (which should be 22) is 12; it is taken from c12, and both fields are int32. Likewise, the value of c24 (which should be 24) is 14, taken from c14; both fields are int64. Note that the value of c23 is correct: it has a different type than the corresponding field at the higher definition level. The pattern seems to be that if there is a value of the same type at the same position at a higher definition level, the reader prefers it over the correct value.

      We have only seen this issue when the struct c11 is placed as the first field in the outer struct.

      Attaching a sample Parquet file. The statement for creating a table on top of that file is:

      CREATE EXTERNAL TABLE sample (
        array_column array<struct<c11:struct<c21:int,c22:int,c23:int,c24:bigint>,c12:int,c13:bigint,c14:bigint>>
      )
      STORED AS PARQUET LOCATION "/path/to/sample";
      
      1. sample.parq
        1 kB
        Petter von Dolwitz

        Activity

        jbapple Jim Apple added a comment -

        Petter, this is no longer a Cloudera project. The fact that the issue tracker is on cloudera.org is a temporary situation that will be remedied ASAP. Impala is an Apache incubating project.

        Pettax Petter von Dolwitz added a comment -

        Many thanks for your efforts on this one!

        Can I take the opportunity to ask (this still being a Cloudera project) whether you think this patch is a candidate for inclusion in future maintenance releases of already released Impala versions, i.e. the Impala versions bundled with CDH 5.9.x and CDH 5.10.x, or whether it will only be included in 2.9.0 and onwards?

        alex.behm Alexander Behm added a comment -

        commit 7d8acee81437c040255c3219bcd3e72f10b3b772
        Author: Alex Behm <alex.behm@cloudera.com>
        Date: Fri Feb 24 19:04:59 2017 -0800

        IMPALA-4725: Query option to control Parquet array resolution.

        Summary of changes:
        Introduces a new query option PARQUET_ARRAY_RESOLUTION to
        control the path-resolution behavior for Parquet files
        with nested arrays. The values are:

        • THREE_LEVEL
          Assumes arrays are encoded with the 3-level representation.
          Also resolves arrays encoded with a single level.
          Does not attempt a 2-level resolution.
        • TWO_LEVEL
          Assumes arrays are encoded with the 2-level representation.
          Also resolves arrays encoded with a single level.
          Does not attempt a 3-level resolution.
        • TWO_LEVEL_THEN_THREE_LEVEL
          First tries to resolve assuming the 2-level representation,
          and if unsuccessful, tries the 3-level representation.
          Also resolves arrays encoded with a single level.
          This is the current Impala behavior and is used as the
          default value for compatibility.

        Note that 'failure' to resolve a schema path with a given
        array-resolution policy does not necessarily mean a warning or
        error is returned by the query. A mismatch might be treated
        like a missing field which is necessary to support schema
        evolution. There is no way to reliably distinguish the
        'bad resolution' and 'legitimately missing field' cases.

        The new query option is independent of and can be combined
        with the existing PARQUET_FALLBACK_SCHEMA_RESOLUTION.
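
        As an aside (not part of the commit message itself), here is a minimal impala-shell sketch of how the two options might be combined for the sample table from the description. The option names and values are taken from this commit; the POSITION value of PARQUET_FALLBACK_SCHEMA_RESOLUTION is assumed from existing Impala behavior.

        SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=POSITION;  -- keep index-based field resolution
        SET PARQUET_ARRAY_RESOLUTION=THREE_LEVEL;         -- assume the standard 3-level array layout
        SELECT c11.c22 AS c22, c12 FROM sample.array_column;

        With THREE_LEVEL forced, the 2-level compatibility attempt is skipped, so the ambiguity described in this issue cannot pick the wrong column.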

        Background:
        Arrays can be represented in several ways in Parquet:

        • Three Level Encoding (standard)
        • Two Level Encoding (legacy)
        • One Level Encoding (legacy)

        More details are in the "Lists" section of the spec:
        https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

        Unfortunately, there is no reliable metadata within Parquet files
        to indicate which encoding was used. There is even the possibility
        of having mixed encodings within the same file if there are multiple
        arrays.
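
        For illustration (a sketch based on the spec's backward-compatibility rules, not part of the commit message), the same array-of-int column could appear in any of these shapes; the group and field names below are only examples:

        3-level (standard):
          optional group my_array (LIST) {
            repeated group list {
              optional int32 element;
            }
          }

        2-level (legacy):
          optional group my_array (LIST) {
            repeated int32 element;
          }

        1-level (legacy):
          repeated int32 my_array;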

        As a result, Impala currently tries to auto-detect the file encoding
        when resolving a schema path in a Parquet file using the
        TWO_LEVEL_THEN_THREE_LEVEL policy.

        However, regardless of whether a Parquet data file uses the 2-level
        or 3-level encoding, the index-based resolution may return incorrect
        results if the representation in the Parquet file does not
        exactly match the attempted array-resolution policy. Intuitively,
        when attempting a 2-level resolution on a 3-level file, the matched
        schema node may not be deep enough in the schema tree, but could still
        be a scalar node with the expected type. Similarly, when attempting a
        3-level resolution on a 2-level file, a level may be incorrectly
        skipped.

        The name-based policy generally does not have this problem because it
        avoids traversing incorrect schema paths. However, the index-based
        resolution allows a different set of schema-evolution operations,
        so just using name-based resolution is not an acceptable workaround
        in all cases.

        Testing:

        • Added new Parquet data files that show how incorrect results
          can be returned with a mismatched file encoding and resolution
          policy. Added both 2-level and 3-level versions of the data.
        • Added a new test in test_nested_types.py that shows the behavior
          with the new PARQUET_ARRAY_RESOLUTION query option.
        • Locally ran test_scanners.py and test_nested_types.py on core.

        Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
        Reviewed-on: http://gerrit.cloudera.org:8080/6250
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Impala Public Jenkins

        jbapple Jim Apple added a comment -

        Alex has a +2'd but merge-conflicted patch here: https://gerrit.cloudera.org/#/c/6250/

        alex.behm Alexander Behm added a comment -

        Thinking about this a little more, I don't think we should support the 3_THEN_2 variant. Incorrect results can be returned for files with mixed encodings, regardless of whether a 2-level or 3-level resolution is attempted. Having such mixed files is a really bad idea and we should not add another query option to encourage/support these bad files. We need to preserve 2_THEN_3 because that's Impala's current default, but we should seriously consider retiring that as well in the distant future.

        lv Lars Volker added a comment -

        Greg Rahn, Alexander Behm, and I discussed this in person. We agreed to add a query option to control the resolution order when using indexed field resolution. The possible strategies would be 2-level resolution and 3-level resolution, each with or without falling back to the other in case of failure:

        • 2_ONLY
        • 3_ONLY
        • 2_THEN_3 (This would be the current behavior)
        • 3_THEN_2

        Parquet specifies 3-level resolution as the default, so we should switch the default to 3_ONLY or 3_THEN_2 upon the next compatibility-breaking release.
        We also came to the conclusion that this behavior should be well documented, so that it is easy to detect and work around.

        lv Lars Volker added a comment -

        Good idea, done.

        jbapple Jim Apple added a comment -

        Do you want to mark that a P1 blocker, too?

        lv Lars Volker added a comment -

        Jim Apple - I started looking into this and talked to Alexander Behm about it. We figured that we will try to fix IMPALA-4675 first as a workaround for this issue.

        Alexander Behm - After looking into the issue here, why do we try 2-level resolution first instead of 3-level resolution? Parquet seems to suggest that the latter is the default and should be tried first. I also couldn't figure out why both work in this case; it is on my list to step through the code with a debugger and see why 2-level resolution yields a positive result.

        jbapple Jim Apple added a comment -

        Lars Volker, this has a target version of 2.9. How is the investigation going?

        alex.behm Alexander Behm added a comment -

        Thanks for this fantastic report, Petter von Dolwitz.

        The bug is in the Parquet field-resolution logic and not in the scanning/interpretation of the definition levels. The bug stems from the fact that there are multiple ways to represent an array<struct<>> in Parquet: a 2-level or a 3-level representation. See the "Backward-compatibility rules" in:
        https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

        The problematic code is in ParquetSchemaResolver::ResolvePath() of parquet-metadata-utils.cc. We first try resolving according to the 2-level representation and then try the 3-level representation. Unfortunately, for some schemas there is ambiguity between the 2-level and 3-level representations. The provided schema uses the standard 3-level representation, but some field references resolve correctly assuming a 2-level representation, which means the wrong column values will be returned.

        I verified that Impala returns correct results when disabling the 2-level compatibility resolution.
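
        To make the ambiguity concrete (this walk-through is reconstructed from the schema and results in the description, not taken from the code or the commit), consider the table path c11.c22, which resolves by index as [0, 1]:

        3-level reading (correct):
          array_column (LIST) -> bag is only the repetition level -> array_element is the struct
          -> field 0 = c11 -> field 1 = c22 (int32) = 22

        2-level reading (attempted first, wrongly accepted):
          array_column (LIST) -> bag itself is taken as the element struct
          -> field 0 = array_element, matched as "c11" -> field 1 = c12 (int32) = 12

        Because c12 has the same index and type as c22, the 2-level attempt "succeeds" and the wrong column is read; the same happens for c24/c14 (both int64). For c21 and c23 the 2-level attempt hits a group and a bigint respectively, fails, and the 3-level attempt then returns the correct values. This matches the reported values 21, 12, 23, 14 for c21 through c24.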

        tarmstrong Tim Armstrong added a comment -

        This definitely looks like a bug; we should aim to fix it in the next release. Thanks for providing such a clear test case.


          People

          • Assignee:
            alex.behm Alexander Behm
          • Reporter:
            Pettax Petter von Dolwitz
