Spark / SPARK-16589

Chained cartesian produces incorrect number of records


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0, 1.5.0, 1.6.0, 2.0.0
    • Fix Version/s: 2.0.3, 2.1.0
    • Component/s: PySpark

    Description

      Chaining cartesian calls in PySpark produces a record count lower than expected. It can be reproduced as follows:

      rdd = sc.parallelize(range(10), 1)
      rdd.cartesian(rdd).cartesian(rdd).count()
      ## 355
      
      rdd.cartesian(rdd).cartesian(rdd).distinct().count()
      ## 251
      

      It appears to be related to serialization. If we reserialize after the initial cartesian:

      from pyspark.serializers import BatchedSerializer, PickleSerializer

      rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 1)).cartesian(rdd).count()
      ## 1000
      

      or insert identity map:

      rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count()
      ## 1000
      

      it yields correct results.

      Activity

            People

              Assignee: a1ray Andrew Ray
              Reporter: zero323 Maciej Szymkiewicz
              Votes: 1
              Watchers: 7
