Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- 2.0.0
- None
Description
Currently, we do not check schema compatibility when appending to an existing table. Different file formats behave differently. This JIRA summarizes only what I observed for Parquet data source tables.
Scenario 1 Data type mismatch:
The existing schema is (col1 int, col2 string)
The schema of the appending dataset is (col1 int, col2 int)
Case 1: when spark.sql.parquet.mergeSchema is false, the error we got:
Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
    at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)
Case 2: when spark.sql.parquet.mergeSchema is true, the error we got:
Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException: Failed merging schema of file file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-00000-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
root
 |-- a: integer (nullable = false)
 |-- b: string (nullable = true)
Scenario 2 More columns in the appending dataset:
The existing schema is (col1 int, col2 string)
The schema of the appending dataset is (col1 int, col2 string, col3 int)
Case 1: when spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
Case 2: when spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int, col2 string, col3 int).
Scenario 3 Fewer columns in the appending dataset:
The existing schema is (col1 int, col2 string)
The schema of the appending dataset is (col1 int)
Case 1: when spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
Case 2: when spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int).
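The three scenarios above could in principle be told apart before the write happens. The following is only a minimal sketch of such a check, written in plain Python rather than against Spark's API: the function name check_append_compatibility, the representation of a schema as (name, type) pairs, and the wording of the messages are all hypothetical and are not taken from Spark.

```python
# Hypothetical pre-append schema check; NOT Spark's actual implementation.
from typing import List, Tuple

# Assumption: a schema is an ordered list of (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def check_append_compatibility(existing: Schema, appending: Schema) -> List[str]:
    """Return a list of problems; an empty list means the schemas match."""
    problems = []
    existing_types = dict(existing)
    appending_types = dict(appending)

    # Scenario 1: same column name, different data type.
    for name, dtype in appending:
        if name in existing_types and existing_types[name] != dtype:
            problems.append(
                f"type mismatch for {name}: "
                f"existing {existing_types[name]}, appending {dtype}"
            )

    # Scenario 2: the appending dataset has columns the table lacks.
    extra = [name for name, _ in appending if name not in existing_types]
    if extra:
        problems.append("extra columns: " + ", ".join(extra))

    # Scenario 3: the appending dataset lacks columns the table has.
    missing = [name for name, _ in existing if name not in appending_types]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))

    return problems


existing = [("col1", "int"), ("col2", "string")]
scenario1 = check_append_compatibility(existing, [("col1", "int"), ("col2", "int")])
scenario2 = check_append_compatibility(
    existing, [("col1", "int"), ("col2", "string"), ("col3", "int")]
)
scenario3 = check_append_compatibility(existing, [("col1", "int")])
```

Whether extra or missing columns should be rejected or merged is a policy decision that this sketch leaves open; it only reports the differences.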
Issue Links
- is related to SPARK-16842 Concern about disallowing user-given schema for Parquet and ORC (Closed)
1. Support for conversion from compatible schema for Parquet data source when data types are not matched | Resolved | Unassigned
2. Do not allow downcast in INT32 based types for non-vectorized Parquet reader | Resolved | Unassigned