  1. Parquet
  2. PARQUET-1879

Apache Arrow cannot read a Parquet file with a Map field written by Parquet-Avro 1.11.0

Details

    Description

      This follows on from my StackOverflow question about an issue I'm having getting Snowflake (Cloud DB) to load Parquet files written with Parquet-Avro 1.11.0.


      The problem only appears when using a map schema field in the Avro schema. For example:

          {
            "name": "FeatureAmounts",
            "type": {
              "type": "map",
              "values": "records.MoneyDecimal"
            }
          }
      

      When Parquet-Avro writes the file, it ends up with a bad Parquet schema, for example:

      message record.ResponseRecord {
        required binary GroupId (STRING);
        required int64 EntryTime (TIMESTAMP(MILLIS,true));
        required int64 HandlingDuration;
        required binary Id (STRING);
        optional binary ResponseId (STRING);
        required binary RequestId (STRING);
        optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
        required group FeatureAmounts (MAP) {
          repeated group map (MAP_KEY_VALUE) {
            required binary key (STRING);
            required fixed_len_byte_array(12) value (DECIMAL(28,15));
          }
        }
      }
      

      From the great answer to my StackOverflow question, it seems the issue is that Parquet-Avro 1.11.0 still uses the legacy MAP_KEY_VALUE converted type, which has no logical type equivalent. From the comment on LogicalTypeAnnotation:

      // This logical type annotation is implemented to support backward compatibility with ConvertedType.
      // The new logical type representation in parquet-format doesn't have any key-value type,
      // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
      

      However, it seems this annotation is still being written by the latest release, 1.11.0, which then causes Apache Arrow to fail with:

      Logical type Null can not be applied to group node
      

      It appears that Arrow only looks for the new logical types Map and List, so the legacy MAP_KEY_VALUE annotation on the inner group causes this error.
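
The failure described above can be illustrated with a toy model. This is only a sketch of the reasoning in this report, not Arrow's actual code: the function names and the lookup table are hypothetical, and the mapping of MAP_KEY_VALUE to a "Null" logical type is an assumption based on the error message and the LogicalTypeAnnotation comment.

```python
# Toy model (NOT Arrow's implementation): group nodes may only carry a
# nested logical type (Map or List) or no logical type at all.
NESTED_LOGICAL_TYPES = {"Map", "List"}

# Hypothetical lookup from the annotation seen in the schema dump to a
# logical type name. MAP_KEY_VALUE has no logical type equivalent, so it is
# assumed here to surface as "Null", matching the error text in this report.
ANNOTATION_TO_LOGICAL = {
    "MAP": "Map",
    "LIST": "List",
    "MAP_KEY_VALUE": "Null",
}


def check_group_node(name, annotation=None):
    """Return the logical type for a group node, or raise if it is not allowed."""
    if annotation is None:
        return "None"  # un-annotated groups are fine
    logical = ANNOTATION_TO_LOGICAL.get(annotation, "Null")
    if logical not in NESTED_LOGICAL_TYPES:
        raise ValueError(
            f"Logical type {logical} can not be applied to group node"
        )
    return logical


# Outer group from the written schema: accepted.
print(check_group_node("FeatureAmounts", "MAP"))

# Inner repeated group as written by Parquet-Avro 1.11.0: rejected.
try:
    check_group_node("map", "MAP_KEY_VALUE")
except ValueError as err:
    print(err)

# Inner repeated group in the spec layout (no annotation): accepted.
print(check_group_node("key_value"))
```

In this model only the inner `map` group is rejected, which is consistent with the outer `FeatureAmounts (MAP)` group being valid on its own.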

      I have seen in the parquet-format LogicalTypes documentation that a map should look something like:

      // Map<String, Integer>
      required group my_map (MAP) {
        repeated group key_value {
          required binary key (UTF8);
          optional int32 value;
        }
      }
      

      Is this on the correct path?
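
For anyone comparing files, the two layouts can be told apart from a textual schema dump. The following is a minimal sketch using only the Python standard library; the helper name `find_legacy_map_groups` is hypothetical, and it assumes the schema is available as text in the format shown in this report.

```python
import re
from typing import List

# Matches lines such as "repeated group map (MAP_KEY_VALUE) {" and captures
# the name of the repeated group.
LEGACY_MAP_GROUP = re.compile(r"repeated\s+group\s+(\w+)\s*\(MAP_KEY_VALUE\)")


def find_legacy_map_groups(schema_text: str) -> List[str]:
    """Return the names of repeated groups annotated with MAP_KEY_VALUE."""
    return LEGACY_MAP_GROUP.findall(schema_text)


# Map field as written by Parquet-Avro 1.11.0 (copied from this report).
WRITTEN_SCHEMA = """
required group FeatureAmounts (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (STRING);
    required fixed_len_byte_array(12) value (DECIMAL(28,15));
  }
}
"""

# Map layout from the parquet-format example quoted in this report.
SPEC_SCHEMA = """
required group my_map (MAP) {
  repeated group key_value {
    required binary key (UTF8);
    optional int32 value;
  }
}
"""

print(find_legacy_map_groups(WRITTEN_SCHEMA))  # legacy annotation present
print(find_legacy_map_groups(SPEC_SCHEMA))     # no legacy annotation
```

The check is purely textual, so it only flags the annotation; it does not validate the rest of the map structure.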

            People

              maccamlc Matthew McMahon