Spark / SPARK-29058

Reading csv file with DROPMALFORMED showing incorrect record count


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL

    Description

      The Spark SQL CSV reader drops malformed records as expected in DROPMALFORMED mode, but df.count() still reports the number of rows in the file rather than the number of rows kept.

      Consider this file (fruit.csv):

      apple,red,1,3
      banana,yellow,2,4.56
      orange,orange,3,5
      

      Defining schema as follows:

      schema = "Fruit string,color string,price int,quantity int"
      

      Note that the "quantity" field is defined as an integer, but the 2nd row of the file contains a floating-point value (4.56), so that row is a corrupt record.

      >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
      >>> df.show()
      +------+------+-----+--------+
      | Fruit| color|price|quantity|
      +------+------+-----+--------+
      | apple|   red|    1|       3|
      |orange|orange|    3|       5|
      +------+------+-----+--------+
      
      >>> df.count()
      3
      

      The malformed record is dropped from the output of df.show() as expected, but df.count() returns 3, which still includes the dropped row.

      Here df.count() should return 2.
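
      To make the expected result concrete, here is a minimal plain-Python sketch of the semantics the report assumes for DROPMALFORMED: every row is parsed against the full schema, and any row that fails conversion is dropped before counting. This is only a model for illustration, not Spark's CSV parser; the names FRUIT_CSV, SCHEMA, parse_row and read_drop_malformed are hypothetical and do not exist in Spark.

```python
import csv
import io

# Contents of fruit.csv, copied from the report.
FRUIT_CSV = """apple,red,1,3
banana,yellow,2,4.56
orange,orange,3,5
"""

# Converters mirroring the schema string
# "Fruit string,color string,price int,quantity int".
SCHEMA = [("Fruit", str), ("color", str), ("price", int), ("quantity", int)]


def parse_row(fields):
    """Return a typed tuple, or None if the row does not fit SCHEMA."""
    if len(fields) != len(SCHEMA):
        return None
    try:
        return tuple(convert(value) for (_, convert), value in zip(SCHEMA, fields))
    except ValueError:
        # int("4.56") raises ValueError, so the banana row is malformed.
        return None


def read_drop_malformed(text):
    """Parse every row fully and keep only the rows that fit SCHEMA."""
    parsed = (parse_row(fields) for fields in csv.reader(io.StringIO(text)))
    return [row for row in parsed if row is not None]


rows = read_drop_malformed(FRUIT_CSV)
print(rows)       # the apple and orange rows
print(len(rows))  # 2, the count the report expects from df.count()
```

      In this model the count is taken after every column has been converted, so it matches the two rows printed by df.show().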

       

       

People

    • Assignee: Unassigned
    • Reporter: Suchintak Patnaik
    • Votes: 1
    • Watchers: 2
