Spark / SPARK-39842

Inconsistent parsing of unusually encoded CSV


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: pyspark 3.3.2, local mode

    Description

      MCVE with PySpark code (the issue is presumably in Spark itself, not PySpark):

      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      spark = SparkSession.builder.appName("...").getOrCreate()
      
      min_schema = StructType(
          [
              StructField("dummy_col", StringType(), True),
              StructField("record_id", IntegerType(), nullable=False),
              StructField("dummy_after", StringType(), nullable=False),
          ]
      )
      
      
      df = (
          spark.read.option("mode", "FAILFAST")
          .option("quote", '"')
          .option("escape", '"')
          .option("inferSchema", "false")
          .option("multiline", "true")
          .option("ignoreLeadingWhiteSpace", "true")
          .option("ignoreTrailingWhiteSpace", "true")
          .schema(min_schema)
          .csv("min_repro.csv", header=True)
      ) 

      min_repro.csv:

      dummy_col,record_id,dummy_after
      "",1,", Unusual value with comma included"
      B,2,"Unusual value with escaped quote and comma ""like, this" 

      Spark behaves inconsistently between these two operations:

       

      1. collect() works fine:

      df.collect()
      
      [Row(dummy_col=None, record_id=1, dummy_after=', Unusual value with comma included'),
      Row(dummy_col='B', record_id=2, dummy_after='Unusual value with escaped quote and comma "like, this')] 
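      For reference, a parser using the standard doubled-quote CSV dialect agrees with the collect() output above. A minimal sketch with Python's stdlib csv module (not Spark; the inline string duplicates min_repro.csv):

```python
import csv
import io

# Same content as min_repro.csv: the quote char is '"' and an embedded
# quote is escaped by doubling it ("" -> "), i.e. the standard dialect
# that Spark's quote='"' + escape='"' options describe.
data = (
    'dummy_col,record_id,dummy_after\n'
    '"",1,", Unusual value with comma included"\n'
    'B,2,"Unusual value with escaped quote and comma ""like, this"\n'
)

rows = list(csv.reader(io.StringIO(data)))
# rows[0] is the header; rows[1] and rows[2] are the two data records.
```

      Both data rows come back exactly as collect() returns them, which suggests the collect() path is the correct parse.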

      2. A trivial query on the same DataFrame fails:

      if df.count() != df.select('record_id').distinct().count():
          pass 

      Error:

      Py4JJavaError: An error occurred while calling o357.count.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 13, localhost, executor driver): org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST.
      ...
      Caused by: java.lang.NumberFormatException: For input string: "Unusual value with comma included""
          at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) 

       

      I understand that the input file's encoding is unusual, but Spark still shouldn't parse the same file differently depending on which DataFrame method is used.
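      One detail worth noting: the failure only appears once a column subset is selected (select('record_id')), so a hypothesis worth testing is Spark's CSV column pruning, which parses rows against a pruned schema. This is an assumption about the cause, not a confirmed diagnosis; the config below exists in Spark, but its effect on this bug is untested here:

```python
# Assumption / untested workaround sketch: with column pruning disabled,
# select('record_id').count() should parse the full row the same way
# collect() does, which may avoid the NumberFormatException above.
spark = (
    SparkSession.builder
    .appName("...")
    .config("spark.sql.csv.parser.columnPruning.enabled", "false")
    .getOrCreate()
)
```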


          People

            Assignee: Unassigned
            Reporter: rogalski (Łukasz Rogalski)
            Votes: 0
            Watchers: 1
