Description
When executing
sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()
the order of the columns in the merged schema is not deterministic; it can differ from run to run.
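A minimal repro sketch of the setup above, assuming the pre-built sqlContext from spark-shell; the /tmp paths and the column names a..e are only placeholders:

// File A has columns a, b, c; file B has columns c, d, e.
sqlContext.range(0, 10).selectExpr("id as a", "id as b", "id as c")
  .write.parquet("/tmp/pathOne")
sqlContext.range(0, 10).selectExpr("id as c", "id as d", "id as e")
  .write.parquet("/tmp/pathTwo")

// Depending on which file's schema happens to be merged first,
// this may print a, b, c, d, e or c, d, e, a, b.
sqlContext.read.option("mergeSchema", "true")
  .parquet("/tmp/pathOne", "/tmp/pathTwo")
  .printSchema()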
This is because of FileStatusCache in HadoopFsRelation (which ParquetRelation extends). FileStatusCache.listLeafFiles() returns a Set[FileStatus], which does not preserve the order of the underlying Array[FileStatus].
So, after retrieving the list of leaf files (including _metadata and _common_metadata), ParquetRelation.mergeSchemasInParallel() merges the Sets of _metadata, _common_metadata and part-files (separately, and only if necessary). Because the Sets are unordered, the merged schema leads with the columns of whichever file happens to be processed first, including the columns the other files do not have, so the resulting column order differs between runs.
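An illustrative sketch (not the actual Spark code) of how the ordering gets lost; plain strings stand in for the FileStatus entries:

val leafFiles = Array("part-00000-A.parquet", "part-00001-A.parquet",
  "part-00000-B.parquet", "part-00001-B.parquet", "_metadata", "_common_metadata")

println(leafFiles.mkString(", "))        // original, deterministic order
println(leafFiles.toSet.mkString(", "))  // hash-based Set; iteration order may differ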
I think this can be resolved by using a LinkedHashSet.
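A sketch of the idea: scala.collection.mutable.LinkedHashSet still de-duplicates, but keeps insertion order, so the listing would stay in the order the files were discovered:

import scala.collection.mutable

val ordered = mutable.LinkedHashSet.empty[String]
ordered ++= Seq("part-00000-A.parquet", "part-00001-A.parquet",
  "part-00000-B.parquet", "part-00001-B.parquet", "_metadata", "_common_metadata")

println(ordered.mkString(", "))          // always prints in insertion order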
In a simple view:
If file A has columns 1, 2, 3 and file B has columns 3, 4, 5, we cannot ensure which columns show up first, since the order is not deterministic.
1. Read the file list (A and B).
2. The order is not deterministic: it may be (A, B) or (B, A), as said above.
3. The retrieved schemas are merged by reduceOption in that order, (A, B) or (B, A) (perhaps this should be reduceLeftOption or reduceRightOption instead).
4. The output columns would be 1, 2, 3, 4, 5 for (A, B), or 3, 4, 5, 1, 2 for (B, A), as the sketch below illustrates.
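An illustrative sketch in plain Scala, with a hypothetical mergeFields helper standing in for the real schema merge (it appends the fields of the right schema that the left one does not already contain), showing how reduceOption makes the result depend on the encounter order:

// Hypothetical stand-in for the schema merge; not the Spark implementation.
def mergeFields(left: Seq[String], right: Seq[String]): Seq[String] =
  left ++ right.filterNot(left.contains)

val a = Seq("1", "2", "3")   // columns of file A
val b = Seq("3", "4", "5")   // columns of file B

println(Seq(a, b).reduceOption(mergeFields).get)   // List(1, 2, 3, 4, 5)
println(Seq(b, a).reduceOption(mergeFields).get)   // List(3, 4, 5, 1, 2)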