[SPARK-13309] Incorrect type inference for CSV data. - ASF JIRA

Type inference for CSV data does not work as expected when the data is sparse.
For example, consider the following dataset and its inferred schema:

A,B,C,D
1,,,
,1,,
,,1,
,,,1

root
|-- A: integer (nullable = true)
|-- B: integer (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)

Here all four fields should have been inferred as integer, but C and D are inferred as string.

Another dataset:

A,B,C,D
1,,1,

and the inferred schema:

root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)

Here, fields A and C should be inferred as integer, but every field is inferred as string.
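
To make the expected behavior concrete, here is a minimal sketch in plain Python of per-column inference that ignores empty cells. It is not Spark's or spark-csv's implementation; the helper names `infer_column_type` and `infer_schema` are hypothetical, and falling back to string for a column with no values at all is an assumption chosen to match the B and D columns of the second dataset.

```python
import csv
import io


def infer_column_type(values):
    """Return 'integer', 'double', or 'string' for one column.

    Empty cells are treated as nulls and do not influence the result.
    """
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        # Assumption: a column with no values at all falls back to string.
        return "string"
    # Try the narrowest type first, widening only when a value fails to parse.
    for type_name, parser in (("integer", int), ("double", float)):
        try:
            for v in non_empty:
                parser(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(text):
    """Infer a {column name: type name} mapping from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return {
        name: infer_column_type([row[i] for row in data])
        for i, name in enumerate(header)
    }


# The two datasets from this report.
sparse = "A,B,C,D\n1,,,\n,1,,\n,,1,\n,,,1\n"
single = "A,B,C,D\n1,,1,\n"

print(infer_schema(sparse))
print(infer_schema(single))
```

Under these assumptions, the first dataset yields integer for all four columns, and the second yields integer for A and C and string for B and D.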

The same issue has been discussed for the spark-csv package. See https://github.com/databricks/spark-csv/issues/216 for reference.

links to

[Github] Pull Request #11194 (tanwanirahul)