  1. Spark
  2. SPARK-36277

Issue with record count of data frame while reading in DropMalformed mode

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      These are the steps to reproduce an issue with the PySpark "count" API when reading in DROPMALFORMED mode.

      I have a sample CSV file in an S3 bucket and read it with the PySpark CSV reader into two DataFrames: one without a schema, and one with an explicit schema and mode "DROPMALFORMED". When the DataFrame read with the schema is displayed, the output looks correct: the malformed records are not shown. However, calling count() on that DataFrame returns the record count of the actual file. I expect it to return the count of valid records only.

      Here is the code used:

      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      # Assumes an existing SparkSession named `spark`.
      path = "s3://noa-poc-lakeformation/data/test_files/sample.csv"

      # Read without a schema: all columns are read as strings.
      without_schema_df = spark.read.csv(path, header=True)

      schema = StructType([
          StructField("firstname", StringType(), True),
          StructField("middlename", StringType(), True),
          StructField("lastname", StringType(), True),
          StructField("id", StringType(), True),
          StructField("gender", StringType(), True),
          StructField("salary", IntegerType(), True),
      ])
      # Read with an explicit schema in DROPMALFORMED mode.
      with_schema_df = spark.read.csv(path, header=True, schema=schema, mode="DROPMALFORMED")

      print("The dataframe with schema")
      with_schema_df.show()
      print("The dataframe without schema")
      without_schema_df.show()

      cnt_with_schema = with_schema_df.count()
      print("The records count from with schema df: " + str(cnt_with_schema))
      cnt_without_schema = without_schema_df.count()
      print("The records count from without schema df: " + str(cnt_without_schema))
      

      The attached screenshot 111.PNG shows the output of the code; the input file is also attached (Inputfile.PNG, sample.csv).
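
The "valid record count" expected above can be cross-checked outside Spark. The following is only a rough sketch in plain Python, not Spark's own validation logic: it assumes a row is malformed when it does not have six fields or when its salary is neither empty nor an integer. The sample rows, `EXPECTED_COLUMNS`, `is_well_formed`, and `count_rows` are hypothetical names and data made up for illustration; the contents of the attached sample.csv are not reproduced here.

```python
import csv
import io

# Hypothetical data for illustration only; not the sample.csv attached to the issue.
SAMPLE_CSV = """firstname,middlename,lastname,id,gender,salary
James,,Smith,36636,M,3000
Michael,Rose,,40288,M,4000
Robert,,Williams,42114,M,not_a_number
Maria,Anne,Jones,39192,F
Jen,Mary,Brown,,F,-1
"""

EXPECTED_COLUMNS = 6  # matches the six fields of the schema in the report


def is_well_formed(row):
    """Assumed rule: six fields, and salary is empty (null) or parses as an integer."""
    if len(row) != EXPECTED_COLUMNS:
        return False
    salary = row[-1].strip()
    if salary == "":
        return True  # the salary column is nullable in the schema
    try:
        int(salary)
    except ValueError:
        return False
    return True


def count_rows(text):
    """Return (total data rows, rows that pass is_well_formed)."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    rows = [r for r in reader if r]  # ignore blank lines
    valid = sum(1 for r in rows if is_well_formed(r))
    return len(rows), valid


total, valid = count_rows(SAMPLE_CSV)
print("total records:", total)
print("valid records:", valid)
```

Comparing these two numbers with the output of `with_schema_df.count()` and `without_schema_df.count()` makes it easy to see which of the two counts Spark is actually reporting. Python's `int()` is only an approximation of Spark's integer parsing, so edge cases may differ.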

      Attachments

        1. 111.PNG
          57 kB
          anju
        2. Inputfile.PNG
          17 kB
          anju
        3. sample.csv
          0.2 kB
          anju

          People

            Assignee: Unassigned
            Reporter: anju (datumgirl)
            Votes: 0
            Watchers: 4
