[HUDI-552] Fix the schema mismatch in Row-to-Avro conversion - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.5.1
Component/s: spark
Labels:
- pull-request-available

Description

When `FilebasedSchemaProvider` supplies the source schema in Avro and data is ingested from `ParquetDFSSource` with the same schema, the DeltaStreamer fails. A new test case is added to demonstrate the error.

Further investigation shows that the root cause lies in how Spark writes Parquet files: all fields are automatically converted to nullable for compatibility reasons. Even if the source Avro schema has non-null fields, `AvroConversionUtils.createRdd` still uses the `dataType` from the DataFrame to convert each Row to an Avro record. Under Spark's logic that `dataType` has nullable fields, even though the field names are identical to those in the source Avro schema. The Avro records produced by the conversion therefore have a schema that differs from the source schema, with nullability as the only difference. Before the records are inserted, other operations use the source schema file, and serialization/deserialization fails because of this schema mismatch.
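The mismatch described above can be modeled without Spark or Avro installed. The following is a minimal sketch, not Hudi code: the `trip` schema, the helper names, and the `[type, "null"]` union ordering are all assumptions chosen for illustration. It only shows that making every field nullable yields a schema with the same field names that is no longer equal to the source schema.

```python
import copy
import json

# Hypothetical source schema with non-null fields (not the schema from this issue).
SOURCE_SCHEMA = {
    "type": "record",
    "name": "trip",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "rider", "type": "string"},
    ],
}


def as_nullable(avro_type):
    """Wrap an Avro type in a union with "null" unless it already allows null.

    The [type, "null"] ordering is an assumption made for this sketch.
    """
    if isinstance(avro_type, list):
        return avro_type if "null" in avro_type else avro_type + ["null"]
    return [avro_type, "null"]


def schema_derived_from_dataframe(schema):
    """Model of the schema derived from the DataFrame: same names, all fields nullable."""
    derived = copy.deepcopy(schema)
    for field in derived["fields"]:
        field["type"] = as_nullable(field["type"])
    return derived


def field_names(schema):
    """Return the field names of a record schema in declaration order."""
    return [field["name"] for field in schema["fields"]]


derived_schema = schema_derived_from_dataframe(SOURCE_SCHEMA)
names_match = field_names(derived_schema) == field_names(SOURCE_SCHEMA)
schemas_match = derived_schema == SOURCE_SCHEMA
changed_fields = [
    derived["name"]
    for source, derived in zip(SOURCE_SCHEMA["fields"], derived_schema["fields"])
    if source["type"] != derived["type"]
]

print(json.dumps(derived_schema, indent=2))
print("field names match:", names_match)
print("schemas match:", schemas_match)
print("fields whose type changed:", changed_fields)
```

Any step that compares or serializes against the source schema file sees two different schemas here, even though a check on field names alone would pass.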

The attached screenshots show the modified Avro schema in `AvroConversionUtils.createRdd` and the original source schema file.

Note that for some Avro schemas, the DeltaStreamer sync may succeed but generate corrupt data. This corrupt-data behavior was originally reported by liujinhui.
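One way a nullability-only mismatch could produce wrong values without an error is sketched below. It relies on the Avro binary encoding rules: a long is written as a zigzag varint, and a union value is prefixed with the zero-based index of its branch, also written as a long. The two-field record, the values, and the `["long", "null"]` branch order are assumptions for illustration; the sketch does not claim that this is the exact code path in the DeltaStreamer.

```python
def encode_long(value):
    """Encode an integer as an Avro long: zigzag, then base-128 varint."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while zigzag & ~0x7F:
        out.append((zigzag & 0x7F) | 0x80)
        zigzag >>= 7
    out.append(zigzag)
    return bytes(out)


def decode_long(buf, pos):
    """Decode one Avro long starting at pos; return (value, next_pos)."""
    shift = 0
    result = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos


def write_nullable_record(values):
    """Write each value as the "long" branch (index 0) of a ["long", "null"] union."""
    out = bytearray()
    for value in values:
        out += encode_long(0)      # union branch index
        out += encode_long(value)  # the value itself
    return bytes(out)


def read_non_null_record(buf, field_count):
    """Read field_count plain longs, as a reader using the non-null schema would."""
    pos = 0
    values = []
    for _ in range(field_count):
        value, pos = decode_long(buf, pos)
        values.append(value)
    return values


written = [7, 9]
payload = write_nullable_record(written)
decoded = read_non_null_record(payload, field_count=2)

print("written:", written)
print("bytes:  ", payload.hex())
print("decoded:", decoded)
```

In this model the reader treats the union branch index as the first field's value, so it returns different numbers than were written and raises no error. Other field types, such as strings, might instead fail outright, which is consistent with the issue reporting both failures and corrupt data.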

Attachments

- Screen Shot 2020-01-18 at 12.31.23 AM.png, 18/Jan/20 08:31, 160 kB, Ethan Guo (this is the old account; please use "yihua")
- Screen Shot 2020-01-18 at 12.15.09 AM.png, 18/Jan/20 08:16, 588 kB, Ethan Guo (this is the old account; please use "yihua")
- Screen Shot 2020-01-18 at 12.13.08 AM.png, 18/Jan/20 08:16, 56 kB, Ethan Guo (this is the old account; please use "yihua")
- Screen Shot 2020-01-18 at 12.12.58 AM.png, 18/Jan/20 08:16, 167 kB, Ethan Guo (this is the old account; please use "yihua")

Issue Links

links to GitHub Pull Request #1246

People

Assignee: Ethan Guo (this is the old account; please use "yihua")

Reporter: Ethan Guo (this is the old account; please use "yihua")

Votes: 0

Watchers: 1

Dates

Created: 18/Jan/20 07:42

Updated: 03/Feb/20 03:16

Resolved: 19/Jan/20 00:48

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

20m