Description
Type inference for CSV data does not work as expected when the data is sparse.
For instance, consider the following dataset and the schema that gets inferred for it:
{code}
A,B,C,D
1,,,
,1,,
,,1,
,,,1
{code}
{code}
root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: string (nullable = true)
 |-- D: string (nullable = true)
{code}
Here all four fields should have been inferred as integers, but C and D come back as strings.
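For completeness, a minimal reproduction sketch in Scala, assuming Spark's built-in CSV reader (Spark 2.x) and an illustrative file path /tmp/sparse.csv that holds the data above:
{code}
// Minimal sketch: read the sparse CSV above with schema inference enabled.
// The session setup and file path are illustrative, not taken from the report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-sparse-inference")
  .getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/sparse.csv")

df.printSchema()  // expected: all four columns as integer; observed: C and D come back as string
{code}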
Another dataset:
{code}
A,B,C,D
1,,1,
{code}
and the inferred schema:
{code}
root
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)
 |-- C: string (nullable = true)
 |-- D: string (nullable = true)
{code}
Here, fields A & C should be inferred as Integer types.
The same issue has been discussed for the spark-csv package; see https://github.com/databricks/spark-csv/issues/216 for reference.
It was fixed there by https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d. I will try to submit a PR with the patch soon.
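For illustration only, and not the actual spark-csv or Spark code path, the desired behavior can be sketched as a per-column fold in which an empty cell contributes no type evidence and therefore leaves the running type unchanged, rather than widening the column to string. The helper name inferField and the row fold below are purely hypothetical:
{code}
// Hypothetical, simplified per-field inference; names are illustrative only.
import org.apache.spark.sql.types._

import scala.util.Try

def inferField(typeSoFar: DataType, field: String): DataType = {
  if (field == null || field.isEmpty) {
    typeSoFar                                  // empty cell: keep whatever has been inferred so far
  } else if (Try(field.toInt).isSuccess) {
    if (typeSoFar == NullType || typeSoFar == IntegerType) IntegerType else StringType
  } else {
    StringType                                 // fall back for anything that cannot be parsed
  }
}

// Folding over the rows of the first example yields IntegerType for every column.
val rows = Seq(
  Seq("1", "", "", ""),
  Seq("", "1", "", ""),
  Seq("", "", "1", ""),
  Seq("", "", "", "1"))

val inferred = rows.foldLeft(Seq.fill(4)(NullType: DataType)) { (types, row) =>
  types.zip(row).map { case (t, v) => inferField(t, v) }
}
// inferred == Seq(IntegerType, IntegerType, IntegerType, IntegerType)
{code}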