Description
Hello.
I am writing unit tests for some functionality in my application that reads data from CSV files using Spark.
I am reading the data using:
header=True mergeSchema=True inferSchema=True
When I read this single file:
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22
I get this schema:
int_col=int
string_col=string
decimal_col=double
date_col=string
When I duplicate this file (i.e. read two identical copies), I get the same schema.
The strange part is when I add a second file with a new int column: Spark seems to get confused and infers the columns that were already identified as int and double as string:
File1:
"int_col","string_col","decimal_col","date_col"
1,"hello",1.43,2022-02-23
2,"world",5.534,2021-05-05
3,"my name",86.455,2011-08-15
4,"is ohad",6.234,2002-03-22

File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
result:
int_col=string
string_col=string
decimal_col=string
date_col=string
int2_col=int
When I read only the second file, the schema looks fine:
File2:
"int_col","string_col","decimal_col","date_col","int2_col"
1,"hello",1.43,2022-02-23,234
2,"world",5.534,2021-05-05,5
3,"my name",86.455,2011-08-15,32
4,"is ohad",6.234,2002-03-22,2
result:
int_col=int
string_col=string
decimal_col=double
date_col=string
int2_col=int
In conclusion, it looks like there is a bug in the interaction between the two features: header recognition and schema merging.