Spark / SPARK-32888

Reading a parallelized RDD with two identical records results in a zero-count DataFrame when read via spark.read.csv


Details

    • Type: Documentation
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
    • Fix Version/s: 2.4.8, 3.0.2, 3.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      • Imagine a two-row CSV file like so (where the header and the first record are duplicate rows):

      aaa,bbb

      aaa,bbb

      • The following PySpark steps reproduce the issue:
      • Create a parallelized RDD: prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
      • Create a DataFrame from it: mydf = spark.read.csv(prdd, header=True)
        * mydf.count() will return a record count of zero (when it should be 1)
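The zero count matches the behaviour this issue's documentation fix describes: when a CSV is read from an RDD of strings with header=True, every line equal to the header line is treated as a header and removed, not just the first occurrence. A minimal plain-Python sketch of that filtering (an illustration of the mechanism, not Spark's actual implementation):

```python
# Sketch of duplicate-header filtering: with header=True, any line
# identical to the header line is dropped, so a data row that happens
# to duplicate the header disappears along with it.

def read_csv_with_header(lines):
    """Split the header off and keep only lines that differ from it."""
    header = lines[0]
    rows = [line for line in lines if line != header]
    return header.split(","), [r.split(",") for r in rows]

columns, records = read_csv_with_header(["aaa,bbb", "aaa,bbb"])
print(columns)       # ['aaa', 'bbb']
print(len(records))  # 0 -- the duplicate data row was removed with the header
```

Under this rule the second "aaa,bbb" row is indistinguishable from the header, which is why the resulting DataFrame is empty rather than holding one record.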


          People

            Assignee: L. C. Hsieh (viirya)
            Reporter: Punit Shah (bullsoverbears)
            Votes: 0
            Watchers: 3
