Spark / SPARK-23786

CSV schema validation - column names are not checked


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

Here is a CSV file that contains two columns of the same type:

      $ cat marina.csv
      depth, temperature
      10.2, 9.0
      5.5, 12.3
      

      If we define the schema with correct types but wrong column names (reversed order):

      import org.apache.spark.sql.types.{DoubleType, StructType}

      val schema = new StructType().add("temperature", DoubleType).add("depth", DoubleType)
      

      Spark reads the CSV file without any errors:

      val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
      ds.show
      

      and outputs a wrong result:

      +-----------+-----+
      |temperature|depth|
      +-----------+-----+
      |       10.2|  9.0|
      |        5.5| 12.3|
      +-----------+-----+
      

      The correct behavior would be to either report an error or read the columns according to their names in the schema.
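      The validation requested here amounts to comparing the CSV header's column names against the user-supplied schema's field names, position by position. A minimal, Spark-independent sketch of such a check (the function name and exact comparison rules are illustrative, not Spark's actual implementation; Spark resolves column names case-insensitively by default, which the sketch mimics):

      ```python
      def header_mismatches(header, schema_names):
          """Return (index, header_name, schema_name) triples at positions where
          the CSV header disagrees with the user-supplied schema. Names are
          compared case-insensitively after trimming whitespace."""
          return [
              (i, h.strip(), s)
              for i, (h, s) in enumerate(zip(header, schema_names))
              if h.strip().lower() != s.lower()
          ]

      # The reversed schema from the description would be flagged at both positions:
      print(header_mismatches(["depth", " temperature"], ["temperature", "depth"]))
      # -> [(0, 'depth', 'temperature'), (1, 'temperature', 'depth')]

      # A schema that matches the header produces no mismatches:
      print(header_mismatches(["depth", " temperature"], ["depth", "temperature"]))
      # -> []
      ```

      In the released fix (Spark 2.4.0), a check along these lines is exposed through the CSV option `enforceSchema`: with `.option("enforceSchema", false)`, Spark validates the header's column names against the specified schema instead of silently applying the schema positionally.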

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Maxim Gekk (maxgekk)
              • Reporter:
                Maxim Gekk (maxgekk)
              • Votes:
                1
              • Watchers:
                3


                  Time Tracking

                  • Original Estimate: 24h
                  • Remaining Estimate: 24h
                  • Time Spent: Not Specified