Spark / SPARK-24147

.count() reports wrong size of dataframe when filtering dataframe on corrupt record field


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: PySpark, Spark Core
    • Labels: None
    • Environment: Spark version 2.2.1, PySpark, Python version 3.6.4

Description

Spark reports the wrong size of a dataframe via .count() after filtering on the corrupt record field.

Example script that reproduces the problem:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

# "ErrorField" doubles as the corrupt-record column (see columnNameOfCorruptRecord below)
SCHEMA = StructType([
    StructField("headerDouble", DoubleType(), False),
    StructField("ErrorField", StringType(), False)
])

dataframe = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "ErrorField")
    .schema(SCHEMA).csv("./x.csv")
)

total_row_count = dataframe.count()
print("total_row_count = " + str(total_row_count))

# Filter down to the rows that failed to parse
errors = dataframe.filter(col("ErrorField").isNotNull())
errors.show()
error_count = errors.count()
print("errors count = " + str(error_count))

Using input file x.csv:

headerDouble
wrong
Output. As shown, the filtered dataframe visibly contains the row, yet .count() on it reports 0.

total_row_count = 1
+------------+----------+
|headerDouble|ErrorField|
+------------+----------+
|        null|     wrong|
+------------+----------+

errors count = 0

Also discussed briefly on Stack Overflow:

      https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe

People

    • Assignee: Unassigned
    • Reporter: Rich Smith (richsmith)
    • Votes: 0
    • Watchers: 4
