SPARK-11500

Non-deterministic order of columns when merging schemas


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0, 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      When executing

      sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()

      the order of the columns is not deterministic; they can show up in a different order from run to run.
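      For example, the behaviour can be reproduced with something like the sketch below (the paths, the spark-shell sqlContext and the column names are only illustrative):

      // Two Parquet datasets whose schemas only partially overlap (illustrative paths).
      sqlContext.range(0, 10).selectExpr("id AS a", "id AS b").write.parquet("/tmp/pathOne")
      sqlContext.range(0, 10).selectExpr("id AS b", "id AS c").write.parquet("/tmp/pathTwo")

      // With schema merging enabled, the printed column order can differ between runs.
      sqlContext.read.option("mergeSchema", "true")
        .parquet("/tmp/pathOne", "/tmp/pathTwo")
        .printSchema()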

      This is caused by FileStatusCache in HadoopFsRelation (which ParquetRelation extends). FileStatusCache.listLeafFiles() returns a Set[FileStatus], which loses the ordering of the underlying Array[FileStatus].

      So, after the list of leaf files (including _metadata and _common_metadata) is retrieved, ParquetRelation.mergeSchemasInParallel() merges the sets of _metadata, _common_metadata and part-files (separately and only if necessary). Because those sets are unordered, the merged schema leads with the columns of whichever file happens to come first, including columns the other files do not have.

      I think this can be resolved by using a LinkedHashSet.
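      As a plain-Scala illustration of the idea (not the actual Spark code), a HashSet forgets insertion order while a LinkedHashSet keeps it:

      import scala.collection.mutable

      val files = Array("part-00000", "part-00001", "_metadata", "_common_metadata")

      // A plain HashSet gives no guarantee about iteration order.
      val asHashSet = mutable.HashSet(files: _*)
      println(asHashSet.toArray.mkString(", "))        // order is unspecified

      // A LinkedHashSet preserves insertion order, so converting back to an Array is stable.
      val asLinkedHashSet = mutable.LinkedHashSet(files: _*)
      println(asLinkedHashSet.toArray.mkString(", "))  // part-00000, part-00001, _metadata, _common_metadata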

      Put simply, if file A has columns 1, 2, 3 and file B has columns 3, 4, 5, we cannot ensure which columns show first, because the file order is not deterministic:

      1. Read the file list (A and B).

      2. The order is not deterministic, so it may be (A, B) or (B, A), as described above.

      3. The retrieved schemas are merged by reduceOption (which perhaps should be reduceLeftOption or reduceRightOption) in whichever order they arrived, (A, B) or (B, A).

      4. The output columns would be 1, 2, 3, 4, 5 for (A, B), or 3, 4, 5, 1, 2 for (B, A); see the sketch after this list.
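      The sketch below imitates the merge with a hypothetical mergeColumns helper over column-name lists (it is not the actual Spark merge logic) to show how the result depends on the input order:

      // The left schema's columns come first; only columns missing on the left are appended.
      def mergeColumns(left: Seq[String], right: Seq[String]): Seq[String] =
        left ++ right.filterNot(left.contains)

      val a = Seq("1", "2", "3")   // columns of file A
      val b = Seq("3", "4", "5")   // columns of file B

      println(Seq(a, b).reduceOption(mergeColumns))  // Some(List(1, 2, 3, 4, 5))
      println(Seq(b, a).reduceOption(mergeColumns))  // Some(List(3, 4, 5, 1, 2))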


          People

            Assignee: gurwls223 (Hyukjin Kwon)
            Reporter: gurwls223 (Hyukjin Kwon)
