Spark / SPARK-14194

Spark CSV reader splits a row incorrectly when a quoted cell contains a CRLF (newline) character

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.5.2, 2.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      We have CSV content like the following:

      Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
      "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), Municipality,....","USA", "1234567"

      Because there is a '\n\r' sequence in the middle of the row (in the Address column, to be exact), the Spark code below creates a DataFrame with two rows (excluding the header row) instead of one. Since we have specified the double quote (") as the quote character, why is the line break inside the quoted cell treated as a record terminator? This causes problems when processing the resulting DataFrame.

      // delim, quote, and escape hold the caller-supplied CSV settings; quote is the double quote (").
      DataFrame df = sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("delimiter", delim)
          .option("quote", quote)
          .option("escape", escape)
          .load(sourceFile);
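
      To illustrate the difference between line-based and quote-aware record splitting that this report is about, here is a minimal sketch in plain Java with no Spark dependency. The class and method names (QuoteAwareSplit, splitRecords) are hypothetical, the sample data is a shortened stand-in for the row above, and the sketch assumes that '"' is the quote character and that a literal quote inside a field is written as a doubled quote. It is not how spark-csv is implemented; it only shows why splitting on line breaks first yields an extra row.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {

    // Splits CSV text into records, treating line breaks inside a quoted
    // field as data rather than as record terminators.
    public static List<String> splitRecords(String text, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == quote) {
                // A doubled quote toggles the flag twice, so the state is preserved.
                inQuotes = !inQuotes;
                current.append(c);
            } else if ((c == '\n' || c == '\r') && !inQuotes) {
                // End of record; empty fragments (two-character line endings,
                // blank lines) are skipped.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one header line and one data row whose
        // quoted Address cell contains an embedded line break.
        String csv = "Sl.NO,Address\r\n"
                   + "\"1\",\"XZ Street \r\nMunicipality\"\r\n";
        System.out.println(csv.split("\r\n").length);       // naive line split: prints 3
        System.out.println(splitRecords(csv, '"').size());  // quote-aware split: prints 2
    }
}
```

      The naive split reports three lines (header plus two fragments of the same row), which matches the extra row described above; the quote-aware split reports the header and a single data record.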

            People

              Assignee: Unassigned
              Reporter: Kumaresh C R (crkumaresh24)
              Votes: 3
              Watchers: 5
