Details
- Type: Documentation
- Status: Closed
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
Description
- Imagine a two-row CSV file like the following, where the header and the only data record are duplicate rows (a snippet that creates such a file follows the example):
aaa,bbb
aaa,bbb
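For reference, a minimal way to create this file from Python (the file name test.csv is simply the one used in the reproduction steps below):

# Write the two-line CSV described above.
with open("test.csv", "w") as f:
    f.write("aaa,bbb\n")  # header line
    f.write("aaa,bbb\n")  # data record identical to the header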
- The following PySpark steps reproduce the issue (a consolidated sketch follows this list):
- Create an RDD of the file's raw lines: prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
- Create a DataFrame from that RDD: mydf = spark.read.csv(prdd, header=True)
- mydf.count() returns a record count of zero, when it should be 1.
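A minimal end-to-end sketch of the reproduction, assuming a local SparkSession and the test.csv file created above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Read the file as plain text and flatten each Row into its string value,
# yielding an RDD of CSV lines: ["aaa,bbb", "aaa,bbb"]
prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)

# Parse the RDD of lines as CSV, treating the first line as the header.
mydf = spark.read.csv(prdd, header=True)

print(mydf.count())  # expected 1, observed 0

The likely explanation: when the input to spark.read.csv is an RDD of strings, the header option removes every line that matches the header, so the duplicate data row disappears as well; the csv() documentation now carries a note to this effect, which is consistent with this issue being resolved as a documentation fix.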