Spark / SPARK-32888

Reading a parallelized RDD with two identical records results in a zero-count DataFrame when read via spark.read.csv


Details

    • Type: Documentation
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
    • Fix Version/s: 2.4.8, 3.0.2, 3.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      • Imagine a two-row CSV file like so (where the header and the first record are duplicate rows):

      aaa,bbb

      aaa,bbb

      • The following PySpark steps reproduce the issue:
      • Create a parallelized RDD: prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
      • Create a DataFrame from it: mydf = spark.read.csv(prdd, header=True)
        * mydf.count() will return a record count of zero (when it should be 1)
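The zero count matches the behaviour this issue's documentation fix describes: when a CSV is read from an RDD of strings with header=True, every line equal to the header line is treated as a header and removed, not just the first occurrence. A minimal plain-Python sketch of that filtering (an illustration of the mechanism, not Spark's actual implementation):

```python
# Sketch of duplicate-header filtering: with header=True, any line
# identical to the header line is dropped, so a data row that happens
# to duplicate the header disappears along with it.

def read_csv_with_header(lines):
    """Split the header off and keep only lines that differ from it."""
    header = lines[0]
    rows = [line for line in lines if line != header]
    return header.split(","), [r.split(",") for r in rows]

columns, records = read_csv_with_header(["aaa,bbb", "aaa,bbb"])
print(columns)       # ['aaa', 'bbb']
print(len(records))  # 0 -- the duplicate data row was removed with the header
```

Under this rule the second "aaa,bbb" row is indistinguishable from the header, which is why the resulting DataFrame is empty rather than holding one record.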


          People

            Assignee: L. C. Hsieh (viirya)
            Reporter: Punit Shah (bullsoverbears)
            Votes: 0
            Watchers: 3
