[SPARK-32639] Support GroupType parquet mapkey field - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.6, 3.0.0
Fix Version/s: 3.1.0
Component/s: SQL
Labels:
None

Description

I have a parquet file, and the MessageType recorded in the file is:

message parquet_schema {
  optional group value (MAP) {
    repeated group key_value {
      required group key {
        optional binary first (UTF8);
        optional binary middle (UTF8);
        optional binary last (UTF8);
      }
      optional binary value (UTF8);
    }
  }
}

Reading the file with spark.read.parquet("000.snappy.parquet") throws an exception while Spark converts the Parquet MessageType to a Spark SQL StructType:

AssertionError(Map key type is expected to be a primitive type, but found...)

Reading the file with an explicit schema, spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet"), returns the correct result.
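The explicit-schema workaround can be wrapped in a small helper. This is a minimal PySpark-style sketch, assuming `spark` is an already created SparkSession; the helper name `read_with_explicit_schema` and the constant `MAP_SCHEMA_DDL` are hypothetical and only used for illustration.

```python
# DDL schema string copied from the report; it declares the map key as a struct.
MAP_SCHEMA_DDL = "value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>"


def read_with_explicit_schema(spark, path):
    """Read a Parquet file using a user-supplied schema.

    `spark` is assumed to be an existing SparkSession (hypothetical helper,
    not part of Spark). Because the schema is supplied by the caller, Spark
    does not have to infer it from the Parquet MessageType in the file.
    """
    return spark.read.schema(MAP_SCHEMA_DDL).parquet(path)
```

Calling `read_with_explicit_schema(spark, "000.snappy.parquet")` mirrors the one-liner in the report; on affected versions it sidesteps the failing schema inference rather than fixing it.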

According to the Parquet format documentation (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), a map key in the Parquet format does not need to be a primitive type.
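To illustrate what a conversion that accepts group-typed map keys could look like, here is a toy model in plain Python. It is not Spark's converter: the classes `Primitive`, `Group`, and `MapGroup` and the function `to_ddl_type` are hypothetical, and Parquet types are simplified to a Spark type name. The point is only that the map branch recurses into the key instead of asserting that it is primitive.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Primitive:
    name: str
    spark_type: str  # simplified stand-in for a Parquet physical + logical type


@dataclass
class Group:
    name: str
    fields: List["Field"]


@dataclass
class MapGroup:
    name: str
    key: "Field"
    value: "Field"


Field = Union[Primitive, Group, MapGroup]


def to_ddl_type(field: Field) -> str:
    """Render a toy Parquet field as a Spark DDL type string."""
    if isinstance(field, Primitive):
        return field.spark_type
    if isinstance(field, Group):
        inner = ", ".join(f"{f.name}:{to_ddl_type(f)}" for f in field.fields)
        return f"STRUCT<{inner}>"
    if isinstance(field, MapGroup):
        # The key is converted recursively, so a group key becomes a STRUCT.
        return f"MAP<{to_ddl_type(field.key)}, {to_ddl_type(field.value)}>"
    raise TypeError(f"unsupported field: {field!r}")


# Toy equivalent of the MessageType shown in the description.
schema = MapGroup(
    name="value",
    key=Group(
        "key",
        [
            Primitive("first", "STRING"),
            Primitive("middle", "STRING"),
            Primitive("last", "STRING"),
        ],
    ),
    value=Primitive("value", "STRING"),
)

ddl = f"{schema.name} {to_ddl_type(schema)}"
print(ddl)
```

The resulting string matches the explicit schema used in the workaround above.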

Note: This Parquet file was not written by Spark. Spark stores an additional sparkSchema string in the Parquet files it writes, and when reading such files it uses that stored schema directly instead of converting the Parquet MessageType to a Spark SQL StructType.
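The note above amounts to a two-way choice when a schema is inferred from a file footer. The following is a rough sketch of that decision in plain Python, not Spark code. The function `schema_source` is hypothetical, and the footer key name is an assumption based on my understanding of Spark's source; verify it against the Spark version in use.

```python
# Assumed name of the footer key under which Spark stores its serialized schema.
SPARK_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata"


def schema_source(footer_key_value_metadata):
    """Return which path a reader would take to obtain the schema.

    `footer_key_value_metadata` is a dict modelling the key/value metadata
    in a Parquet file footer.
    """
    if SPARK_METADATA_KEY in footer_key_value_metadata:
        # Spark-written file: the stored schema is used as-is.
        return "spark-metadata"
    # File from another writer: the MessageType must be converted,
    # which is the path that raised the AssertionError in this report.
    return "converted-message-type"


print(schema_source({SPARK_METADATA_KEY: '{"type":"struct","fields":[]}'}))
print(schema_source({}))
```

This is why the problem only shows up with files produced by writers other than Spark.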

I will submit a PR later.

Attachments

000.snappy.parquet (0.8 kB), attached 17/Aug/20 15:12 by Chen Zhang

Issue Links

links to

[Github] Pull Request #29451 (izchen)

People

Assignee: Chen Zhang

Reporter: Chen Zhang

Votes: 0

Watchers: 4

Dates

Created: 17/Aug/20 15:10

Updated: 28/Aug/20 16:52

Resolved: 28/Aug/20 16:51