Spark / SPARK-46862

Incorrect count() of a dataframe loaded from CSV datasource

Details

    Description

      The example below illustrates the issue: count() returns 4 before df.cache() is called and 5 after it, for the same DataFrame:

      >>> df=spark.read.option("multiline", "true").option("header", "true").option("escape", '"').csv("es-939111-data.csv")
      >>> df.count()
      4
      >>> df.cache()
      DataFrame[jobID: string, Name: string, City: string, Active: string]
      >>> df.count()
      5
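
      To establish which of the two counts is the right one, it may help to count the logical records independently of Spark. The following is a minimal sketch using Python's standard csv module. The SAMPLE data and the count_records helper are hypothetical and are not the contents of es-939111-data.csv. The sketch assumes that quotes inside quoted fields are escaped by doubling them, which is what option("escape", '"') implies.

      ```python
      import csv
      import io

      # Hypothetical sample data, NOT the attachment from this report.
      # It mixes escaped quotes ("") with newlines inside quoted fields.
      SAMPLE = (
          'jobID,Name,City,Active\n'
          '1,"Ann ""A"" Lee",Oslo,true\n'
          '2,"Bob\nJr.",Rome,false\n'
          '3,Cy,"New\nYork",true\n'
          '4,Di,Lima,true\n'
          '5,"Ed ""E""\nFox",Kyiv,false\n'
      )


      def count_records(text, has_header=True):
          """Count logical CSV records, treating quoted newlines as field data."""
          # newline="" keeps embedded line breaks intact for the csv module.
          reader = csv.reader(
              io.StringIO(text, newline=""), quotechar='"', doublequote=True
          )
          rows = sum(1 for row in reader if row)  # skip completely blank lines
          return rows - 1 if has_header and rows else rows


      record_count = count_records(SAMPLE)
      physical_lines = len(SAMPLE.splitlines())
      print(record_count, physical_lines)
      ```

      Running the same helper over the real file (read with open(path, newline="")) would give a reference value to compare against both df.count() results. The gap between logical records and physical lines in the sample shows why multiline input is easy to miscount.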

      Attachments

          People

            Assignee: Max Gekk (maxgekk)
            Reporter: Max Gekk (maxgekk)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
