Details
- Issue Type: Sub-task
- Status: Closed
- Priority: Minor
- Resolution: Later
- Affects Version: 2.0.0
Description
This library uses Hadoop's TextInputFormat, which uses LineRecordReader.
According to MAPREDUCE-232, TextInputFormat does not guarantee support for every encoding; officially it supports only UTF-8 (as noted in the comment at LineRecordReader#L147).
According to MAPREDUCE-232#comment-13183601, most encodings still work in practice, but UTF-16 and UTF-32 do not.
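The UTF-16 failure follows from how LineRecordReader splits records: it scans raw bytes for the single newline byte 0x0A, while UTF-16 encodes every character, including '\n', as two bytes. A minimal standalone sketch of that mismatch (an illustration of the byte-splitting behavior, not the actual Hadoop code):

```scala
import java.nio.charset.StandardCharsets

object Utf16SplitDemo {
  // Split a byte array on the single byte 0x0A, the way a
  // byte-oriented line reader does.
  def splitOnNewlineByte(bytes: Array[Byte]): Seq[Array[Byte]] = {
    val chunks = scala.collection.mutable.Buffer[Array[Byte]]()
    var start = 0
    for (i <- bytes.indices if bytes(i) == 0x0A) {
      chunks += bytes.slice(start, i)
      start = i + 1
    }
    chunks += bytes.slice(start, bytes.length)
    chunks.toSeq
  }

  def main(args: Array[String]): Unit = {
    val text = "year,make\n2012,Tesla\n"
    // In UTF-16LE, '\n' is the two bytes 0x0A 0x00, so splitting on
    // the raw 0x0A byte leaves a stray 0x00 at the start of the next
    // chunk, shifting every subsequent code unit by one byte.
    val bytes = text.getBytes(StandardCharsets.UTF_16LE)
    splitOnNewlineByte(bytes).foreach { c =>
      // The first chunk decodes cleanly; later chunks come out garbled.
      println(new String(c, StandardCharsets.UTF_16LE))
    }
  }
}
```

The stray 0x00 also ends up in the last column of each record, which matches the replacement characters seen in the output below.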
In more detail:
I tested this on Mac OS. I converted `cars_iso-8859-1.csv` into `cars_utf-16.csv` as below:
iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
and ran the code below:
val cars = "cars_utf-16.csv"
sqlContext.read
  .format("csv")
  .option("charset", "utf-16")
  .option("delimiter", 'þ')
  .load(cars)
  .show()
This produces the wrong result below:
+----+-----+-----+--------------------+------+
|year| make|model|             comment|blank�|
+----+-----+-----+--------------------+------+
|2012|Tesla|    S|          No comment|     �|
|   �| null| null|                null|  null|
|1997| Ford| E350|Go get one now th...|     �|
|2015|Chevy|Volt�|                null|  null|
|   �| null| null|                null|  null|
+----+-----+-----+--------------------+------+
instead of the correct result below:
+----+-----+-----+--------------------+-----+
|year| make|model|             comment|blank|
+----+-----+-----+--------------------+-----+
|2012|Tesla|    S|          No comment|     |
|1997| Ford| E350|Go get one now th...|     |
|2015|Chevy| Volt|                null| null|
+----+-----+-----+--------------------+-----+
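Until the reader handles these encodings, one workaround is to transcode the file to UTF-8 before loading it, as the iconv step above does. The same transcoding can be done from Scala (a sketch with illustrative file names, not part of this library):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object RecodeToUtf8 {
  // Decode the whole file as UTF-16 (the decoder consumes the BOM
  // and picks the byte order from it) and rewrite it as UTF-8,
  // which LineRecordReader handles correctly.
  def utf16ToUtf8(in: String, out: String): Unit = {
    val text = new String(Files.readAllBytes(Paths.get(in)), StandardCharsets.UTF_16)
    Files.write(Paths.get(out), text.getBytes(StandardCharsets.UTF_8))
  }
}
```

This reads the entire file into memory, so it is only suitable as a preprocessing step for files small enough to fit on one machine.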