[SPARK-14194] spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.5.2, 2.1.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

We have CSV content like below,

Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
"1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), Municipality,....","USA", "1234567"

Because there is a '\n\r' line break in the middle of the row (specifically, inside the quoted Address column), the Spark code below creates a DataFrame with two rows (excluding the header row) when it should create one. Since we have specified the double quote (") as the quote character, why is a line break inside a quoted field treated as a record separator? This causes problems when processing the resulting DataFrame.

DataFrame df = sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", delim)
    .option("quote", quote)
    .option("escape", escape)
    .load(sourceFile);
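
To illustrate the behaviour described above, here is a minimal plain-Java sketch (not Spark's or spark-csv's actual implementation) contrasting two ways of splitting CSV content into records. The class name `QuoteAwareRecords` and both method names are hypothetical and exist only for this illustration. The naive splitter ends a record at every line break, which is how a line-oriented reader sees the file; the quote-aware splitter ignores line breaks that fall inside a quoted field. The sketch assumes quotes are escaped by doubling ("") and does not handle a separate escape character such as the one passed via the `escape` option.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareRecords {

    // Naive approach: every line break ends a record, even inside quotes.
    static List<String> naiveSplit(String content) {
        List<String> records = new ArrayList<>();
        for (String line : content.split("\\R")) {
            if (!line.isEmpty()) {
                records.add(line);
            }
        }
        return records;
    }

    // Quote-aware approach: a line break only ends a record when we are
    // outside a quoted field. A doubled quote ("") toggles the state twice,
    // so it leaves the in-quotes state unchanged.
    static List<String> quoteAwareSplit(String content, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;
                current.append(c);
            } else if ((c == '\n' || c == '\r') && !inQuotes) {
                // End of record; skip empty pieces between '\n' and '\r'.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened version of the sample from the description:
        // a header row plus one data row with a line break in a quoted cell.
        String content = "Sl.NO,Employee_Name,Address\n\r"
                + "\"1\",\"ABCD\",\"XZ Street \n\rMunicipality\"\n\r";
        System.out.println("naive records:       " + naiveSplit(content).size());
        System.out.println("quote-aware records: " + quoteAwareSplit(content, '"').size());
    }
}
```

With this sample, the naive splitter reports a header plus two rows, matching the symptom in this report, while the quote-aware splitter reports a header plus one row. The linked SPARK-19610 tracks multi-line support in Spark's own CSV reader; Spark 2.2 and later document a `multiLine` CSV option for this case, which is worth checking against the documentation for the version in use.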

Issue Links

duplicates: SPARK-19610 multi line support for CSV (Resolved)

People

Assignee: Unassigned
Reporter: Kumaresh C R
Votes: 3
Watchers: 5

Dates

Created: 28/Mar/16 10:42
Updated: 12/Dec/22 18:11
Resolved: 28/Feb/17 23:08