[SPARK-16518] Schema Compatibility of Parquet Data Source

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

Currently, we do not check schema compatibility when appending to a data source table, and different file formats behave differently. This JIRA summarizes what I observed for Parquet data source tables.

Scenario 1: Data type mismatch
The existing schema is (col1 int, col2 string)
The schema of the appended dataset is (col1 int, col2 int)

Case 1: when spark.sql.parquet.mergeSchema is false, the error we got:

Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure:
 Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)

Case 2: when spark.sql.parquet.mergeSchema is true, the error we got:

Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException:
 Failed merging schema of file file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-00000-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
root
 |-- a: integer (nullable = false)
 |-- b: string (nullable = true)
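
The mismatch in Scenario 1 could in principle be detected before the append instead of surfacing as a read-time failure. The following is a minimal sketch in plain Python, not Spark's implementation: schemas are modeled as lists of (name, type) pairs, and `find_type_mismatches` is a hypothetical helper name introduced only for illustration.

```python
from typing import List, Tuple

# Simplified stand-in for a real schema: ordered (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def find_type_mismatches(existing: Schema, appending: Schema) -> List[str]:
    """Return one message per column that exists in both schemas with different types."""
    existing_types = dict(existing)
    problems = []
    for name, new_type in appending:
        old_type = existing_types.get(name)
        # Columns absent from the existing schema are not type mismatches.
        if old_type is not None and old_type != new_type:
            problems.append(
                f"{name}: existing type {old_type}, appended type {new_type}"
            )
    return problems


# The two schemas from Scenario 1.
existing = [("col1", "int"), ("col2", "string")]
appending = [("col1", "int"), ("col2", "int")]
mismatches = find_type_mismatches(existing, appending)
print(mismatches)
```

A check of this kind would report col2 for Scenario 1 regardless of the spark.sql.parquet.mergeSchema setting, since it compares declared types only.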

Scenario 2: More columns in the appended dataset
The existing schema is (col1 int, col2 string)
The schema of the appended dataset is (col1 int, col2 string, col3 int)

Case 1: when spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
Case 2: when spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int, col2 string, col3 int).

Scenario 3: Fewer columns in the appended dataset
The existing schema is (col1 int, col2 string)
The schema of the appended dataset is (col1 int)

Case 1: when spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
Case 2: when spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int).
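
Scenarios 2 and 3 differ only in which side carries the additional column, so a compatibility check could classify both with a single column-set comparison. This is a rough sketch under the same simplification (schemas as lists of (name, type) pairs); `diff_columns` is a hypothetical name and does not correspond to any Spark API.

```python
from typing import Dict, List, Tuple

# Simplified stand-in for a real schema: ordered (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def diff_columns(existing: Schema, appending: Schema) -> Dict[str, List[str]]:
    """Report columns only in the appended schema ("extra") or only in the existing one ("missing")."""
    existing_names = [name for name, _ in existing]
    appending_names = [name for name, _ in appending]
    return {
        "extra": [n for n in appending_names if n not in existing_names],
        "missing": [n for n in existing_names if n not in appending_names],
    }


existing = [("col1", "int"), ("col2", "string")]

# Scenario 2: the appended dataset has one more column.
scenario2 = diff_columns(existing, [("col1", "int"), ("col2", "string"), ("col3", "int")])
# Scenario 3: the appended dataset has one fewer column.
scenario3 = diff_columns(existing, [("col1", "int")])

print(scenario2)
print(scenario3)
```

Whether an extra or missing column should be rejected, padded with nulls, or merged is a policy decision that this sketch leaves open.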

Issue Links

is related to: SPARK-16842 Concern about disallowing user-given schema for Parquet and ORC (Closed)

Sub-Tasks

1. Support for conversion from compatible schema for Parquet data source when data types are not matched (Status: Resolved, Assignee: Unassigned)
2. Do not allow downcast in INT32 based types for non-vectorized Parquet reader (Status: Resolved, Assignee: Unassigned)
