[SPARK-46862] Incorrect count() of a dataframe loaded from CSV datasource - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0, 3.5.1, 3.4.3
Component/s: SQL
Labels:
- correctness
- pull-request-available

Description

The example below illustrates the issue: count() returns a different result before and after the DataFrame is cached.

>>> df=spark.read.option("multiline", "true").option("header", "true").option("escape", '"').csv("es-939111-data.csv")
>>> df.count()
4
>>> df.cache()
DataFrame[jobID: string, Name: string, City: string, Active: string]
>>> df.count()
5
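
One way to judge which of the two counts matches the file is to count the records independently of Spark. The following is a minimal sketch using Python's standard csv module. It assumes the file has a header row, quotes fields with double quotes, and escapes a quote by doubling it (mirroring the escape option above). The function name count_csv_records and the sample data are hypothetical; the sample is not the content of the attached file.

```python
import csv
import io


def count_csv_records(text, has_header=True):
    """Count logical CSV records, treating quoted newlines as part of a field."""
    # newline="" keeps line breaks untranslated so the csv module can
    # recognise newlines that sit inside quoted fields.
    reader = csv.reader(io.StringIO(text, newline=""), quotechar='"', doublequote=True)
    rows = [row for row in reader if row]  # skip completely blank lines
    return len(rows) - 1 if has_header and rows else len(rows)


# Hypothetical sample (NOT the attachment's contents):
# three records, one of which spans two physical lines.
sample = (
    'jobID,Name,City,Active\n'
    '1,"Ann ""AJ"" Lee","Oslo",true\n'
    '2,"Bob","New\nYork",false\n'
    '3,"Cy","Rome",true\n'
)

record_count = count_csv_records(sample)         # logical records
physical_lines = len(sample.splitlines()) - 1    # naive line count without header
print(record_count, physical_lines)
```

For a real file, the text can be read with open(path, newline="").read() and the result compared against df.count() before and after df.cache().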

Attachments

es-939111-data.csv
25/Jan/24 14:14
0.1 kB
Max Gekk

Issue Links

links to

GitHub Pull Request #44872

GitHub Pull Request #44910

People

Assignee: Max Gekk

Reporter: Max Gekk

Votes: 0

Watchers: 2

Dates

Created: 25/Jan/24 14:14

Updated: 27/Feb/24 03:25

Resolved: 26/Jan/24 08:03
