  1. Spark
  2. SPARK-12420 Have a built-in CSV data source implementation
  3. SPARK-13108

Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Minor
    • Resolution: Later
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      This library uses Hadoop's TextInputFormat, which uses LineRecordReader.

      According to MAPREDUCE-232, TextInputFormat does not appear to guarantee support for all encodings; officially it supports only UTF-8 (as noted in a comment at LineRecordReader#L147).

      According to MAPREDUCE-232#comment-13183601, it still appears to work with most encodings, but not with UTF-16/32.
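
      A minimal sketch of why a byte-oriented line reader can break on UTF-16. It assumes the reader splits records on the single byte 0x0A; `splitOnNewlineByte` is a hypothetical helper written for this illustration, not Hadoop's actual code. In UTF-8 a newline is exactly that byte, while in UTF-16BE it is the pair 0x00 0x0A, so the split leaves a stray 0x00 at the end of the preceding record.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper: split on the raw byte 0x0A, ignoring the text encoding.
def splitOnNewlineByte(bytes: Array[Byte]): List[Array[Byte]] = {
  val idx = bytes.indexOf(0x0A.toByte)
  if (idx < 0) List(bytes)
  else bytes.take(idx) :: splitOnNewlineByte(bytes.drop(idx + 1))
}

val text = "2012,Tesla\n1997,Ford"

// UTF-8: the newline is the single byte 0x0A, so each record decodes cleanly.
val utf8Lines = splitOnNewlineByte(text.getBytes(StandardCharsets.UTF_8))
  .map(new String(_, StandardCharsets.UTF_8))

// UTF-16BE: the newline is 0x00 0x0A, so the first record keeps a dangling
// 0x00 byte and no longer decodes to the original text.
val utf16Chunks = splitOnNewlineByte(text.getBytes(StandardCharsets.UTF_16BE))
val utf16Lines = utf16Chunks.map(new String(_, StandardCharsets.UTF_16BE))

println(utf8Lines)
println(utf16Lines)
```

      The odd-length first chunk is consistent with the extra trailing character on each row in the output reported below.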

      In more detail:

      I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into `cars_utf-16.csv` as below:

      iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
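
      Where `iconv` is not available, a similar conversion can be sketched in Scala with `java.nio`. The helper name `transcode` is hypothetical, and the whole file is read into memory, so this is only meant for small test files. Note that Java's `UTF-16` charset writes big-endian bytes with a byte-order mark when encoding, which may differ from what a given `iconv` build emits.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper: re-encode a small file from one charset to another.
def transcode(src: Path, dst: Path, from: Charset, to: Charset): Path = {
  // Decode the source bytes using the source charset...
  val text = new String(Files.readAllBytes(src), from)
  // ...and write them back out encoded in the target charset.
  Files.write(dst, text.getBytes(to))
}

// Example call, assuming the file exists in the working directory:
// transcode(Paths.get("cars_iso-8859-1.csv"), Paths.get("cars_utf-16.csv"),
//   StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16)
```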
      

      and ran the code below:

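      val cars = "cars_utf-16.csv"
      sqlContext.read
        .format("csv")
        .option("header", "true")     // the first line holds the column names
        .option("charset", "utf-16")
        .option("delimiter", "þ")     // option values are Strings, not Char literals
        .load(cars)
        .show()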
      

      This produces the wrong results below:

      +----+-----+-----+--------------------+------+
      |year| make|model|             comment|blank�|
      +----+-----+-----+--------------------+------+
      |2012|Tesla|    S|          No comment|     �|
      |   �| null| null|                null|  null|
      |1997| Ford| E350|Go get one now th...|     �|
      |2015|Chevy|Volt�|                null|  null|
      |   �| null| null|                null|  null|
      +----+-----+-----+--------------------+------+
      

      Instead of the correct results below:

      +----+-----+-----+--------------------+-----+
      |year| make|model|             comment|blank|
      +----+-----+-----+--------------------+-----+
      |2012|Tesla|    S|          No comment|     |
      |1997| Ford| E350|Go get one now th...|     |
      |2015|Chevy| Volt|                null| null|
      +----+-----+-----+--------------------+-----+
      

          People

            Assignee: Unassigned
            Reporter: Hyukjin Kwon (gurwls223)
            Votes: 0
            Watchers: 6