Spark / SPARK-13309

Incorrect type inference for CSV data.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Type inference for CSV data does not work as expected when the data is sparse, i.e. when most cells are empty.
      For instance, consider the following dataset and its inferred schema:

      A,B,C,D
      1,,,
      ,1,,
      ,,1,
      ,,,1
      
      root
      |-- A: integer (nullable = true)
      |-- B: integer (nullable = true)
      |-- C: string (nullable = true)
      |-- D: string (nullable = true)
      

      Here, every column contains exactly one non-empty value, and that value is an integer, so all four fields should have been inferred as integer. Instead, C and D were inferred as string.

      Another dataset:

      A,B,C,D
      1,,1,
      

      and the inferred schema:

      root
      |-- A: string (nullable = true)
      |-- B: string (nullable = true)
      |-- C: string (nullable = true)
      |-- D: string (nullable = true)
      

      Here, fields A and C each contain an integer value and should be inferred as integer, but all four fields were inferred as string.
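
      The expected behavior for both datasets can be illustrated with a small, Spark-independent sketch. It is not the Spark or spark-csv implementation; the function names and the two-type model (integer, string) are assumptions made for illustration. The idea is that an empty cell carries no type information, so it should leave the type inferred so far unchanged rather than widening it to string.

```python
import csv
import io


def merge_cell(value, type_so_far):
    """Merge one cell into the type inferred so far (hypothetical helper).

    type_so_far is None while no non-empty value has been seen.
    """
    if value == "":
        # An empty cell is treated as null: no evidence, no change.
        return type_so_far
    try:
        int(value)
        cell_type = "integer"
    except ValueError:
        cell_type = "string"
    if type_so_far is None or type_so_far == cell_type:
        return cell_type
    # Conflicting evidence widens to string.
    return "string"


def infer_schema(text):
    """Infer (column, type) pairs from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    types = [None] * len(header)
    for row in data:
        # Extra cells beyond the header width are ignored.
        for i, value in enumerate(row[: len(header)]):
            types[i] = merge_cell(value, types[i])
    # Assumption: a column with no non-empty values falls back to string.
    return [(name, t or "string") for name, t in zip(header, types)]


sparse = "A,B,C,D\n1,,,\n,1,,\n,,1,\n,,,1\n"
single_row = "A,B,C,D\n1,,1,\n"

print(infer_schema(sparse))
print(infer_schema(single_row))
```

      Under these assumptions, every column of the first dataset comes out as integer, and in the second dataset A and C come out as integer while B and D, which have no values at all, fall back to string.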

      The same issue has been discussed for the spark-csv package. Please see https://github.com/databricks/spark-csv/issues/216 for reference.

      It was fixed there in https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d. I will try to submit a PR with the patch soon.

          People

            tanwanirahul Rahul Tanwani
            tanwanirahul Rahul Tanwani
            Hossein Falaki Hossein Falaki