Spark / SPARK-34529

spark.read.csv throws exception "'lineSep' can contain only 1 character" when parsing Windows line endings (CR LF)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.3, 3.1.1, 3.2.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None

      Description

      The lineSep documentation says:

      `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator that should be used for parsing. Maximum length is 1 character.
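
The one-character limit quoted above is what a CR LF separator runs into, because "\r\n" is two characters. A minimal Python sketch of that length check follows; `check_line_sep` is a hypothetical helper written for illustration, not a Spark API, and it only approximates the requirement named in the error message.

```python
def check_line_sep(line_sep: str) -> str:
    """Hypothetical helper: accept only single-character separators."""
    if len(line_sep) != 1:
        # Same wording as the requirement reported in this issue
        raise ValueError("'lineSep' can contain only 1 character.")
    return line_sep


# CR LF is two characters ("\r" followed by "\n"), so it fails the check
print(len("\r\n"))

# A single "\n" or "\r" passes
check_line_sep("\n")
check_line_sep("\r")
```
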

      Reference: 

       https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

      Reading a CSV file with CR LF line endings using Spark, with lineSep set to "\r\n":

      src_df = (
          spark.read
          .option("header", "true")
          .option("multiLine", "true")
          .option("escape", "ǁ")
          # Two-character separator (CR LF); this option triggers the exception below
          .option("lineSep", "\r\n")
          .schema(materialusetype_Schema)
          .option("badRecordsPath", "/fh_badfile")
          .csv("<path-to-csv>/crlf.csv")
      )

      Below is the stack trace:

      java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain only 1 character.
        at scala.Predef$.require(Predef.scala:281)
        at org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:207)
        at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:58)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
        at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
        at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
        at org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
        at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
        at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
        at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192)
        at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79)
        at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
        at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
        at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
        at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
        at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
        at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013)
        at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
        at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726)
        at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3003)
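
Since the quoted documentation says the default separator handling covers `\r`, `\r\n` and `\n`, one possible workaround is to leave `lineSep` unset for CR LF files. The sketch below illustrates that idea and has not been verified against the affected versions. `csv_options` and `read_crlf_csv` are hypothetical helper names, and `spark`, `path` and `schema` are assumed to be supplied by the caller.

```python
def csv_options(line_sep=None, **options):
    """Hypothetical helper: build CSV reader options, omitting a multi-character lineSep."""
    opts = dict(options)
    if line_sep is not None and len(line_sep) == 1:
        opts["lineSep"] = line_sep
    # Otherwise lineSep stays unset, so the reader falls back to its default,
    # which the documentation describes as covering \r, \r\n and \n.
    return opts


def read_crlf_csv(spark, path, schema):
    """Read a CR LF CSV file; `spark` is assumed to be an existing SparkSession."""
    opts = csv_options(line_sep="\r\n", header="true", multiLine="true")
    return spark.read.options(**opts).schema(schema).csv(path)
```

Dropping the option instead of passing a single `\n` avoids leaving a stray `\r` at the end of each record, assuming the default handling behaves as documented.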

        Attachments

        1. image-2021-08-26-14-04-47-464.png
          21 kB
          Syedhamjath
        2. image-2021-08-26-14-06-41-055.png
          30 kB
          Syedhamjath
        3. TestData.csv
          0.5 kB
          Syedhamjath
        4. image-2021-08-26-14-12-30-397.png
          19 kB
          Syedhamjath
        5. image-2021-08-26-14-42-23-042.png
          19 kB
          Syedhamjath

            People

            • Assignee: Unassigned
            • Reporter: kc.shanmugavel Shanmugavel Kuttiyandi Chandrakasu
            • Votes: 0
            • Watchers: 6
