Description
Hello.
I am writing unit tests for some functionality in my application that reads data from CSV files using Spark.
I am reading the data using:
header=True mergeSchema=True inferSchema=True
When I read this single file:
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
I get this schema:
int_col=int
string_col=string
decimal_col=double
date_col=string
When I duplicate this file (i.e. read two identical copies), I get the same schema.
The strange part is when I add a second file with a new int column: Spark seems to get confused and infers the columns that were already identified as int and double as string:
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
result:
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
When I read only the second file, the schema looks fine:
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
result:
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
In conclusion, it looks like there is a bug in the interaction between the two features: header recognition and schema merging.