Spark / SPARK-23786

CSV schema validation - column names are not checked

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Here is a CSV file that contains two columns of the same type:

      $ cat marina.csv
      depth, temperature
      10.2, 9.0
      5.5, 12.3
      

      If we define the schema with correct types but wrong column names (reversed order):

      import org.apache.spark.sql.types.{DoubleType, StructType}
      val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType)
      

      Spark reads the CSV file without any errors:

      val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
      ds.show
      

      and outputs the wrong result (the depth values appear under temperature and vice versa):

      +-----------+-----+
      |temperature|depth|
      +-----------+-----+
      |       10.2|  9.0|
      |        5.5| 12.3|
      +-----------+-----+
      

      The correct behavior would be either to raise an error or to read the columns according to their names in the schema.
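
The two behaviors proposed above can be sketched in plain Scala, without any Spark dependency. This is only an illustration of the idea, not the code that resolved this issue: `HeaderCheck`, `parseHeader`, `validate`, and `columnOrder` are hypothetical names introduced here, and the header is assumed to be a simple comma-separated line without quoting.

```scala
// Hypothetical helper (not part of Spark): compares CSV header names
// with the field names of a user-supplied schema.
object HeaderCheck {

  /** Splits a header line on commas and trims whitespace around each name. */
  def parseHeader(line: String): List[String] =
    line.split(",").map(_.trim).toList

  /** Option 1: report an error unless the names match position by position. */
  def validate(header: Seq[String], schemaFields: Seq[String]): Either[String, Unit] =
    if (header == schemaFields) Right(())
    else Left(
      s"CSV header (${header.mkString(", ")}) does not match " +
        s"schema (${schemaFields.mkString(", ")})")

  /** Option 2: for each schema field, find the index of the header column
    * with the same name. Returns None if any schema field is missing. */
  def columnOrder(header: Seq[String], schemaFields: Seq[String]): Option[List[Int]] = {
    val positions = header.zipWithIndex.toMap
    val indices = schemaFields.flatMap(positions.get).toList
    if (indices.length == schemaFields.length) Some(indices) else None
  }
}
```

For the file above, `HeaderCheck.parseHeader("depth, temperature")` yields the names `depth` and `temperature`. Validating them against the reversed schema returns a `Left` with an error message, while `columnOrder` returns the indices needed to read the columns by name. In Spark itself, the schema side of the comparison is available as `schema.fieldNames`. Since Spark 2.4.0 the CSV reader also accepts an `enforceSchema` option; setting it to `false` makes the reader validate the schema against the CSV header when `header` is `true`.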

            People

              Assignee: maxgekk Max Gekk
              Reporter: maxgekk Max Gekk
              Votes: 1
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Original Estimate: 24h
                  Remaining Estimate: 24h
                  Time Spent: Not Specified