PARQUET-364

Parquet-avro cannot decode Avro/Thrift array of primitive array (e.g. array<array<int>>)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.6.0, 1.7.0, 1.8.0
    • Fix Version/s: 1.8.2
    • Component/s: parquet-mr
    • Labels: None

    Description

      The problematic Avro and Thrift schemas are:

      record AvroArrayOfArray {
        array<array<int>> int_arrays_column;
      }
      

      and

      struct ThriftListOfList {
        1: list<list<i32>> intArraysColumn;
      }
      

      They are converted to the following structurally equivalent Parquet schemas by parquet-avro 1.7.0 and parquet-thrift 1.7.0 respectively:

      message AvroArrayOfArray {
        required group int_arrays_column (LIST) {
          repeated group array (LIST) {
            repeated int32 array;
          }
        }
      }
      

      and

      message ParquetSchema {
        required group intListsColumn (LIST) {
          repeated group intListsColumn_tuple (LIST) {
            repeated int32 intListsColumn_tuple_tuple;
          }
        }
      }
      

      AvroIndexedRecordConverter cannot decode such records correctly because the second-level repeated group (named array in the Avro-derived schema and intListsColumn_tuple in the Thrift-derived one) doesn't pass the AvroIndexedRecordConverter.isElementType() check. To fix this, isElementType() should also check for the field name "array" and the field name suffix "_tuple".
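
      The name-based check can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual parquet-avro code: the class LegacyListNames and the method isLegacyElementName are hypothetical names, and the sketch works on plain strings rather than Parquet schema types. It only assumes the naming seen in the schemas above, where parquet-avro names the repeated field "array" and parquet-thrift appends "_tuple".

```java
final class LegacyListNames {

    // Hypothetical helper (not part of parquet-avro): returns true when the
    // name of a repeated field matches one of the legacy conventions shown in
    // the schemas above, meaning the repeated field should be treated as the
    // list element itself rather than as a wrapper around the element.
    static boolean isLegacyElementName(String repeatedFieldName) {
        return repeatedFieldName.equals("array")          // parquet-avro convention
            || repeatedFieldName.endsWith("_tuple");      // parquet-thrift convention
    }

    public static void main(String[] args) {
        // Names taken from the two Parquet schemas in this issue, plus two
        // names that follow neither convention.
        String[] names = {"array", "intListsColumn_tuple", "list", "element"};
        for (String name : names) {
            System.out.println(name + " -> " + isLegacyElementName(name));
        }
    }
}
```

      A real fix would apply this test to the repeated group's name inside isElementType(), alongside whatever structural checks that method already performs.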

      Attachments

        1. bad-thrift.parquet
          1 kB
          Cheng Lian
        2. bad-avro.parquet
          0.7 kB
          Cheng Lian

            People

              Assignee: Ryan Blue (rdblue)
              Reporter: Cheng Lian (lian cheng)
              Votes: 0
              Watchers: 2
