  1. Spark
  2. SPARK-40808

Infer schema for CSV files - wrong behavior using header + merge schema

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.2
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Hello.
      I am writing unit tests for functionality in my application that reads data from CSV files using Spark.

      I read the data with these options:

      header=True
      mergeSchema=True
      inferSchema=True
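
For reference, a read with these three options might look like the PySpark sketch below. The function name `read_csv_with_options` and its `path` argument are hypothetical; the report does not show the actual call, which lives in the attached test_csv.py.

```python
def read_csv_with_options(spark, path):
    """Read CSV input with the three options listed in this report.

    `spark` is an existing SparkSession (assumed to be created elsewhere);
    `path` may be a single file, a directory, or a list of paths.
    """
    return (
        spark.read
        .option("header", True)       # first line of each file holds column names
        .option("mergeSchema", True)  # option as used in the report
        .option("inferSchema", True)  # infer column types from the data
        .csv(path)
    )
```

Calling `read_csv_with_options(spark, path).printSchema()` would then show the inferred column types.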

      When I read this single file:

      File1:
      "int_col","string_col","decimal_col","date_col"
      1,"hello",1.43,2022-02-23
      2,"world",5.534,2021-05-05
      3,"my name",86.455,2011-08-15
      4,"is ohad",6.234,2002-03-22

      I get this schema:

      int_col=int
      string_col=string
      decimal_col=double
      date_col=string

      When I duplicate this file, I get the same schema.

      The strange part is that when I add a second file with a new int column, Spark seems to get confused and infers the existing columns, previously identified as int and double, as string:

      File1:
      "int_col","string_col","decimal_col","date_col"
      1,"hello",1.43,2022-02-23
      2,"world",5.534,2021-05-05
      3,"my name",86.455,2011-08-15
      4,"is ohad",6.234,2002-03-22
      File2:
      "int_col","string_col","decimal_col","date_col","int2_col"
      1,"hello",1.43,2022-02-23,234
      2,"world",5.534,2021-05-05,5
      3,"my name",86.455,2011-08-15,32
      4,"is ohad",6.234,2002-03-22,2
      

      Result:

      int_col=string
      string_col=string
      decimal_col=string
      date_col=string
      int2_col=int
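
To make the two-file case easy to reproduce, the sample files quoted above can be written to disk with a short script. This is a sketch using only the Python standard library; the file names `file1.csv` and `file2.csv` and the helper `write_sample_files` are hypothetical and are not taken from the attached test_csv.py.

```python
import tempfile
from pathlib import Path

# Header and rows exactly as quoted in this report.
HEADER = '"int_col","string_col","decimal_col","date_col"'
ROWS = [
    '1,"hello",1.43,2022-02-23',
    '2,"world",5.534,2021-05-05',
    '3,"my name",86.455,2011-08-15',
    '4,"is ohad",6.234,2002-03-22',
]
# Values of the extra int2_col column that only File2 has.
INT2_VALUES = ["234", "5", "32", "2"]


def write_sample_files(directory):
    """Write File1 and File2 into `directory` and return their paths."""
    directory = Path(directory)
    file1 = directory / "file1.csv"
    file2 = directory / "file2.csv"
    file1.write_text("\n".join([HEADER] + ROWS) + "\n")
    file2_lines = [HEADER + ',"int2_col"'] + [
        f"{row},{extra}" for row, extra in zip(ROWS, INT2_VALUES)
    ]
    file2.write_text("\n".join(file2_lines) + "\n")
    return str(file1), str(file2)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path1, path2 = write_sample_files(tmp)
        print(Path(path2).read_text())
```

Reading the directory that holds both files with the options listed earlier should then reproduce the schema shown above on an affected version (3.2.2, per this report).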

      When I read only the second file, the schema looks fine:

      File2:
      "int_col","string_col","decimal_col","date_col","int2_col"
      1,"hello",1.43,2022-02-23,234
      2,"world",5.534,2021-05-05,5
      3,"my name",86.455,2011-08-15,32
      4,"is ohad",6.234,2002-03-22,2

      Result:

      int_col=int
      string_col=string
      decimal_col=double
      date_col=string
      int2_col=int

      In conclusion, it looks like there is a bug when combining the two features: header recognition and schema merging.
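
One possible workaround, offered as an assumption rather than a confirmed fix for this issue, is to read each file on its own (which produced the expected schema above) and then combine the results by column name. The sketch below relies on `DataFrame.unionByName` with `allowMissingColumns=True`, which is available from Spark 3.1; the helper name `read_and_merge_by_name` is hypothetical.

```python
from functools import reduce


def read_and_merge_by_name(spark, paths):
    """Read each CSV file separately, then union the results by column name.

    `spark` is an existing SparkSession; `paths` is a non-empty list of files.
    """
    if not paths:
        raise ValueError("paths must contain at least one file")
    # Infer each file's schema independently, so one file's header
    # never takes part in another file's type inference.
    frames = [
        spark.read.option("header", True).option("inferSchema", True).csv(p)
        for p in paths
    ]
    # Columns missing on one side (for example int2_col in File1) become nulls.
    return reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        frames,
    )
```

This costs one schema inference pass per file, so it is a better fit for a handful of files than for a large directory.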

      Attachments

        1. test_csv.py
          4 kB
          ohad

          People

            Assignee: Unassigned
            Reporter: ohad (ohadm)