  1. Spark
  2. SPARK-23007

Add schema evolution test suite for file-based data sources

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.4.0
    • Component/s: SQL, Tests
    • Labels:
      None

      Description

      A schema can evolve in several ways, and the following are already supported by file-based data sources.

      1. Add a column
      2. Remove a column
      3. Change a column position
      4. Change a column type
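
The four kinds of change listed above can be told apart mechanically. The following is a minimal sketch in plain Python, not Spark code: it assumes a schema is an ordered list of (column name, type name) pairs, and the `Schema` alias and `classify_evolution` helper are hypothetical names introduced only for illustration.

```python
from typing import List, Tuple

# Assumption: a schema is an ordered list of (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def classify_evolution(old: Schema, new: Schema) -> List[str]:
    """Return which of the four kinds of change separate `old` from `new`."""
    old_types = dict(old)
    new_types = dict(new)
    changes = []

    # 1. A column exists in the new schema but not in the old one.
    if any(name not in old_types for name in new_types):
        changes.append("1. Add a column")

    # 2. A column exists in the old schema but not in the new one.
    if any(name not in new_types for name in old_types):
        changes.append("2. Remove a column")

    # 3. Columns present in both schemas appear in a different relative order.
    shared_old = [name for name, _ in old if name in new_types]
    shared_new = [name for name, _ in new if name in old_types]
    if shared_old != shared_new:
        changes.append("3. Change a column position")

    # 4. A column present in both schemas has a different type.
    if any(old_types[name] != new_types[name] for name in shared_old):
        changes.append("4. Change a column type")

    return changes
```

For example, going from `[("a", "int"), ("b", "string")]` to `[("b", "string"), ("a", "long"), ("c", "double")]` is reported as kinds 1, 3, and 4 at once, which is why a test suite would likely want to exercise each kind in isolation as well as in combination.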

      This issue aims to guarantee backward-compatible schema evolution for users of file-based data sources and to prevent future regressions by adding explicit schema evolution test suites.

      Here, we consider only safe evolution without data loss. For example, a data type should evolve from a smaller type to a larger one, such as `int` to `long`, not vice versa.
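
The "small to larger" rule can be sketched as a simple ordering check. This is an illustrative plain-Python sketch, not Spark's implementation: the two widening chains below are an assumption chosen for the example (only `int` to `long` is named in this issue), and `is_safe_widening` is a hypothetical helper. It is deliberately conservative and treats any change across the two chains as unsafe.

```python
# Assumed widening chains, for illustration only; not taken from Spark's source.
INTEGRAL_CHAIN = ["byte", "short", "int", "long"]
FRACTIONAL_CHAIN = ["float", "double"]


def is_safe_widening(old_type: str, new_type: str) -> bool:
    """Return True if changing old_type to new_type cannot lose data,
    under the assumed chains above. Unchanged types count as safe."""
    for chain in (INTEGRAL_CHAIN, FRACTIONAL_CHAIN):
        if old_type in chain and new_type in chain:
            # Safe only when moving to the same or a later (larger) type.
            return chain.index(old_type) <= chain.index(new_type)
    # Types outside the chains, or in different chains: safe only if unchanged.
    return old_type == new_type
```

Under these assumptions, `is_safe_widening("int", "long")` holds while `is_safe_widening("long", "int")` does not, matching the direction described above.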

      As of today, in the master branch, file-based data sources cover the following kinds of schema evolution (the numbers refer to the list above).

      File Format   Coverage      Note
      TEXT          N/A           Schema consists of a single string column.
      CSV           1, 2, 4
      JSON          1, 2, 3, 4
      ORC           1, 2, 3, 4    Native vectorized ORC reader has the widest coverage.
      PARQUET       1, 2, 3
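
The coverage table above can also be held as data, which is one way a test suite might decide which cases to run per format. This is a plain-Python sketch under that assumption; the `COVERAGE` mapping transcribes the table, and `formats_supporting` is a hypothetical helper, not part of Spark.

```python
# Coverage per file format, transcribed from the table above.
# Numbers: 1 = add a column, 2 = remove a column,
#          3 = change a column position, 4 = change a column type.
COVERAGE = {
    "TEXT": set(),  # N/A: the schema is a single string column
    "CSV": {1, 2, 4},
    "JSON": {1, 2, 3, 4},
    "ORC": {1, 2, 3, 4},
    "PARQUET": {1, 2, 3},
}


def formats_supporting(kind: int) -> list:
    """Return the formats, in alphabetical order, that cover the given kind of change."""
    return sorted(fmt for fmt, kinds in COVERAGE.items() if kind in kinds)
```

Read this way, the table shows that a column position change (3) is not covered for CSV and a column type change (4) is not covered for PARQUET.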

              People

              • Assignee:
                dongjoon Dongjoon Hyun
                Reporter:
                dongjoon Dongjoon Hyun