Spark / SPARK-23786

CSV schema validation - column names are not checked


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Here is a CSV file that contains two columns of the same type:

      $ cat marina.csv
      depth, temperature
      10.2, 9.0
      5.5, 12.3
      

      If we define the schema with correct types but wrong column names (reversed order):

      import org.apache.spark.sql.types.{DoubleType, StructType}
      val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType)
      

      Spark reads the CSV file without any errors:

      val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
      ds.show
      

      and outputs the wrong result, with the depth values under temperature and vice versa:

      +-----------+-----+
      |temperature|depth|
      +-----------+-----+
      |       10.2|  9.0|
      |        5.5| 12.3|
      +-----------+-----+
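
      The output above is consistent with the schema being applied to the data by position rather than by name. The following is a minimal sketch of that difference in plain Scala, with no Spark involved. The object and function names (`PositionalVsByName`, `bindByPosition`, `bindByName`) are hypothetical and only illustrate the idea; they are not Spark's actual code path.

```scala
// Hypothetical illustration only; this is not how Spark is implemented.
object PositionalVsByName {
  val header: Seq[String] = Seq("depth", "temperature")      // header of marina.csv
  val schemaNames: Seq[String] = Seq("temperature", "depth") // user-supplied schema (reversed)
  val row: Seq[Double] = Seq(10.2, 9.0)                       // first data row

  // Positional binding: the i-th schema field receives the i-th value,
  // regardless of what the header says.
  def bindByPosition(names: Seq[String], values: Seq[Double]): Map[String, Double] =
    names.zip(values).toMap

  // Name-based binding: each schema field looks up its value through the header.
  // Returns Left with a message if a schema field is absent from the header.
  def bindByName(
      names: Seq[String],
      header: Seq[String],
      values: Seq[Double]): Either[String, Map[String, Double]] = {
    val byHeader = header.zip(values).toMap
    val missing = names.filterNot(byHeader.contains)
    if (missing.nonEmpty) Left(s"Columns not found in header: ${missing.mkString(", ")}")
    else Right(names.map(n => n -> byHeader(n)).toMap)
  }
}
```

      With positional binding, `temperature` receives 10.2, which matches the table above. With name-based binding, `temperature` receives 9.0, which is what the file actually says.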
      

      The correct behavior would be either to raise an error or to read the columns according to their names in the schema.
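
      Both behaviors can be approximated from user code. The sketch below makes several assumptions: the object name `ReadCsvByName` and its helpers are hypothetical, `readStrict` relies on the `enforceSchema` CSV option available in Spark 2.4.0 and later, and `readByName` is a workaround that reads the header names, trims them (the sample header has a space after the comma), and then selects and casts columns by name.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructType}

// Hypothetical helper object; names are illustrative, not part of Spark.
object ReadCsvByName {

  // Option 1: ask Spark to validate the header against the schema.
  // With enforceSchema=false (Spark 2.4.0+), field names are checked by
  // position, so a mismatch is expected to fail when the data is read.
  def readStrict(spark: SparkSession, path: String, schema: StructType): DataFrame =
    spark.read
      .schema(schema)
      .option("header", "true")
      .option("enforceSchema", "false")
      .csv(path)

  // Option 2: read columns as strings using the header names, then reorder
  // and cast them to match the requested schema.
  def readByName(spark: SparkSession, path: String, schema: StructType): DataFrame = {
    val raw = spark.read
      .option("header", "true")
      .option("ignoreLeadingWhiteSpace", "true")
      .csv(path)
    // Trim header names explicitly so " temperature" becomes "temperature".
    val trimmed = raw.toDF(raw.columns.map(_.trim): _*)
    val missing = schema.fieldNames.filterNot(trimmed.columns.contains)
    require(missing.isEmpty, s"Columns missing from CSV header: ${missing.mkString(", ")}")
    trimmed.select(schema.fields.map(f => col(f.name).cast(f.dataType)): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("csv-by-name").getOrCreate()
    // Recreate the sample file from the description in a temporary location.
    val path = Files.createTempFile("marina", ".csv")
    Files.write(path, "depth, temperature\n10.2, 9.0\n5.5, 12.3\n".getBytes("UTF-8"))
    val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType)
    readByName(spark, path.toString, schema).show()
    spark.stop()
  }
}
```

      The first option is closer to "output an error"; the second is closer to "read columns according to their names", at the cost of an extra projection and cast.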


          People

            Assignee: Max Gekk (maxgekk)
            Reporter: Max Gekk (maxgekk)
            Votes: 1
            Watchers: 3


              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified
