Description
When executing
sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()
the order of the columns in the merged schema is not deterministic; it can differ from run to run.
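A minimal repro sketch of the setup above, assuming the pre-built sqlContext from spark-shell; the /tmp paths and the column names a..e are only placeholders:

// File A has columns a, b, c; file B has columns c, d, e.
sqlContext.range(0, 10).selectExpr("id as a", "id as b", "id as c")
  .write.parquet("/tmp/pathOne")
sqlContext.range(0, 10).selectExpr("id as c", "id as d", "id as e")
  .write.parquet("/tmp/pathTwo")

// Depending on which file's schema happens to be merged first,
// this may print a, b, c, d, e or c, d, e, a, b.
sqlContext.read.option("mergeSchema", "true")
  .parquet("/tmp/pathOne", "/tmp/pathTwo")
  .printSchema()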
This is because of FileStatusCache in HadoopFsRelation (which ParquetRelation extends). FileStatusCache.listLeafFiles() returns a Set[FileStatus], which does not preserve the order of the underlying Array[FileStatus].
So, after retrieving the list of leaf files (including _metadata and _common_metadata), ParquetRelation.mergeSchemasInParallel() merges the Sets of _metadata, _common_metadata and part-files (separately, and only if necessary). Because the Sets are unordered, the merged schema leads with the columns of whichever file happens to be processed first, including the columns the other files do not have, so the resulting column order differs between runs.
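An illustrative sketch (not the actual Spark code) of how the ordering gets lost; plain strings stand in for the FileStatus entries:

val leafFiles = Array("part-00000-A.parquet", "part-00001-A.parquet",
  "part-00000-B.parquet", "part-00001-B.parquet", "_metadata", "_common_metadata")

println(leafFiles.mkString(", "))        // original, deterministic order
println(leafFiles.toSet.mkString(", "))  // hash-based Set; iteration order may differ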
I think this can be resolved by using a LinkedHashSet.
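A sketch of the idea: scala.collection.mutable.LinkedHashSet still de-duplicates, but keeps insertion order, so the listing would stay in the order the files were discovered:

import scala.collection.mutable

val ordered = mutable.LinkedHashSet.empty[String]
ordered ++= Seq("part-00000-A.parquet", "part-00001-A.parquet",
  "part-00000-B.parquet", "part-00001-B.parquet", "_metadata", "_common_metadata")

println(ordered.mkString(", "))          // always prints in insertion order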
In a simple view:
If file A has columns 1, 2, 3 and file B has columns 3, 4, 5, we cannot ensure which columns show up first, since the order is not deterministic.
1. Read the file list (A and B).
2. The order is not deterministic: it may be (A, B) or (B, A), as said above.
3. The retrieved schemas are merged by reduceOption in that order, (A, B) or (B, A) (perhaps this should be reduceLeftOption or reduceRightOption instead).
4. The output columns would be 1, 2, 3, 4, 5 for (A, B), or 3, 4, 5, 1, 2 for (B, A), as the sketch below illustrates.
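An illustrative sketch in plain Scala, with a hypothetical mergeFields helper standing in for the real schema merge (it appends the fields of the right schema that the left one does not already contain), showing how reduceOption makes the result depend on the encounter order:

// Hypothetical stand-in for the schema merge; not the Spark implementation.
def mergeFields(left: Seq[String], right: Seq[String]): Seq[String] =
  left ++ right.filterNot(left.contains)

val a = Seq("1", "2", "3")   // columns of file A
val b = Seq("3", "4", "5")   // columns of file B

println(Seq(a, b).reduceOption(mergeFields).get)   // List(1, 2, 3, 4, 5)
println(Seq(b, a).reduceOption(mergeFields).get)   // List(3, 4, 5, 1, 2)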