[SPARK-23007] Add schema evolution test suite for file-based data sources - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.1
Fix Version/s: 2.4.0
Component/s: SQL, Tests
Labels:
None

Description

A schema can evolve in several ways, and the following are already supported in file-based data sources.

1. Add a column
2. Remove a column
3. Change a column position
4. Change a column type

This issue aims to guarantee users backward-compatible schema evolution coverage on file-based data sources and to prevent future regressions by adding explicit schema evolution test suites.

Here, we consider only safe evolution without data loss. For example, a data type should evolve from a smaller type to a larger one, such as `int` to `long`, not vice versa.
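
The "smaller to larger" rule can be sketched as a check over ordered chains of types. This is only an illustration, not Spark's implementation: the names `WIDENING_CHAINS` and `is_safe_widening` are hypothetical, and the chains themselves (`byte`, `short`, `int`, `long` and `float`, `double`) are an assumption extrapolated from the `int` to `long` example above.

```python
# Hypothetical sketch, not Spark's API.
# Assumed widening chains, ordered from smaller to larger types.
WIDENING_CHAINS = [
    ["byte", "short", "int", "long"],
    ["float", "double"],
]


def is_safe_widening(from_type, to_type):
    """Return True if reading from_type data as to_type cannot lose data,
    under the assumed chains above."""
    for chain in WIDENING_CHAINS:
        if from_type in chain and to_type in chain:
            # Safe only when moving to the same or a later (larger) type.
            return chain.index(from_type) <= chain.index(to_type)
    # Conservative default: across chains or for unknown types,
    # only an unchanged type is treated as safe.
    return from_type == to_type


print(is_safe_widening("int", "long"))
print(is_safe_widening("long", "int"))
```

The check is deliberately conservative: any pair of types that does not appear in the same chain is rejected unless the type is unchanged.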

As of today, in the master branch, file-based data sources have the following schema evolution coverage.

File Format	Coverage	Note
TEXT	N/A	Schema consists of a single string column.
CSV	1, 2, 4
JSON	1, 2, 3, 4
ORC	1, 2, 3, 4	Native vectorized ORC reader has the widest coverage.
PARQUET	1, 2, 3
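
For quick lookups, the coverage table can be transcribed into a small data structure. This is a minimal sketch: the names `EVOLUTIONS`, `COVERAGE`, and `formats_supporting` are hypothetical, and the values simply mirror the table as reported in this issue, not the current state of any Spark release.

```python
# Evolution kinds, numbered as in the list in the description.
EVOLUTIONS = {
    1: "Add a column",
    2: "Remove a column",
    3: "Change a column position",
    4: "Change a column type",
}

# Coverage per file format, transcribed from the table above.
# TEXT is N/A (single string column), modeled here as an empty set.
COVERAGE = {
    "TEXT": set(),
    "CSV": {1, 2, 4},
    "JSON": {1, 2, 3, 4},
    "ORC": {1, 2, 3, 4},
    "PARQUET": {1, 2, 3},
}


def formats_supporting(evolution):
    """Return the sorted list of formats whose coverage includes the given kind."""
    return sorted(fmt for fmt, kinds in COVERAGE.items() if evolution in kinds)


for number, name in EVOLUTIONS.items():
    print(number, name, formats_supporting(number))
```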

Issue Links

blocks

SPARK-20901 Feature parity for ORC with Parquet

Open

is related to

SPARK-35461 Error when reading dictionary-encoded Parquet int column when read schema is bigint

Open

links to

[Github] Pull Request #20208 (dongjoon-hyun)

People

Assignee: Dongjoon Hyun

Reporter: Dongjoon Hyun

Votes: 0

Watchers: 3

Dates

Created: 09/Jan/18 18:30

Updated: 20/May/21 18:36

Resolved: 12/Jul/18 21:10