[SPARK-23007] Add schema evolution test suite for file-based data sources


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1
    • Fix Version/s: 2.4.0
    • Component/s: SQL, Tests
    • Labels: None

    Description

      A schema can evolve in several ways, and the following are already supported in file-based data sources.

      1. Add a column
      2. Remove a column
      3. Change a column position
      4. Change a column type
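The four kinds of evolution listed above can be illustrated with a small stand-alone sketch. It does not use Spark; `read_with_schema` is a hypothetical helper that projects JSON-lines records onto a reader schema by column name, which is only an assumption about how a name-based reader might behave. Python has a single unbounded `int`, so the type change is shown as integer to floating point.

```python
import json

def read_with_schema(lines, schema):
    """Project JSON-lines records onto `schema`, a list of (name, cast) pairs.

    Hypothetical helper for illustration only; this is not Spark's reader.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = []
        for name, cast in schema:
            value = record.get(name)  # a column missing from the data reads as None
            row.append(None if value is None else cast(value))
        rows.append(tuple(row))
    return rows

# Data written with the original schema: a (integer), b (string).
written = ['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}']

# 1. Add a column: the new column `c` is absent from old data.
added = read_with_schema(written, [("a", int), ("b", str), ("c", str)])
# 2. Remove a column: `b` is simply not read.
removed = read_with_schema(written, [("a", int)])
# 3. Change a column position: columns are matched by name, not by index.
reordered = read_with_schema(written, [("b", str), ("a", int)])
# 4. Change a column type: integer values are read as a wider numeric type.
widened = read_with_schema(written, [("a", float), ("b", str)])
```

In each case the data on disk is unchanged; only the schema used for reading differs.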

      This issue aims to guarantee backward-compatible schema evolution coverage for file-based data sources and to prevent future regressions by adding explicit schema evolution test suites.

      Here, we consider only safe evolution without data loss. For example, a data type should evolve from a smaller type to a larger one, such as `int` to `long`, not vice versa.
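A minimal sketch of the "small to large" rule follows. The type names and their ordering in `INTEGRAL_ORDER` are assumptions chosen for illustration, and `is_safe_widening` and `to_int32` are hypothetical helpers, not part of any Spark API. The second helper emulates what happens when a 64-bit value is forced into 32 bits, which is the data loss the rule is meant to avoid.

```python
# Assumed widening order for integral types, narrowest first (illustrative only).
INTEGRAL_ORDER = ["byte", "short", "int", "long"]

def is_safe_widening(old_type, new_type):
    """Return True when every value of old_type also fits in new_type."""
    old_rank = INTEGRAL_ORDER.index(old_type)  # raises ValueError for unknown names
    new_rank = INTEGRAL_ORDER.index(new_type)
    return old_rank <= new_rank

def to_int32(value):
    """Emulate truncating an integer to a signed 32-bit value (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31

# A value that fits in a 64-bit `long` but not in a 32-bit `int`.
big = 2**31
```

Here `is_safe_widening("int", "long")` is true and `is_safe_widening("long", "int")` is false, while `to_int32(big)` no longer equals `big`.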

      As of today, in the master branch, file-based data sources have the following schema evolution coverage, where the numbers refer to the list above.

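      | File Format | Coverage   | Note                                                  |
      |-------------|------------|-------------------------------------------------------|
      | TEXT        | N/A        | Schema consists of a single string column.            |
      | CSV         | 1, 2, 4    |                                                       |
      | JSON        | 1, 2, 3, 4 |                                                       |
      | ORC         | 1, 2, 3, 4 | Native vectorized ORC reader has the widest coverage. |
      | PARQUET     | 1, 2, 3    |                                                       |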
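As a rough sketch, the coverage table above can be written down as data and expanded into the (format, evolution) pairs a test suite would need to exercise. The names `EVOLUTIONS`, `COVERAGE`, `test_matrix`, and `not_covered` are hypothetical and do not come from the Spark code base. TEXT is left out because the table marks it as N/A.

```python
# Evolution kinds, numbered as in the description.
EVOLUTIONS = {
    1: "Add a column",
    2: "Remove a column",
    3: "Change a column position",
    4: "Change a column type",
}

# Coverage per file format, copied from the table (TEXT is N/A and omitted).
COVERAGE = {
    "CSV": {1, 2, 4},
    "JSON": {1, 2, 3, 4},
    "ORC": {1, 2, 3, 4},
    "PARQUET": {1, 2, 3},
}

def test_matrix():
    """List every (format, evolution) pair that the table marks as covered."""
    return [(fmt, EVOLUTIONS[n]) for fmt, nums in COVERAGE.items() for n in sorted(nums)]

def not_covered(fmt):
    """Return the evolution numbers the table does not list for a format."""
    return sorted(set(EVOLUTIONS) - COVERAGE[fmt])
```

For instance, `not_covered("PARQUET")` yields the type-change case and `not_covered("CSV")` yields the column-position case, matching the gaps visible in the table.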

          People

            Assignee: Dongjoon Hyun
            Reporter: Dongjoon Hyun