Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.4.0, 1.5.0, 1.6.0, 2.0.0
Description
Chaining cartesian calls in PySpark yields fewer records than expected. It can be reproduced as follows:
rdd = sc.parallelize(range(10), 1)
rdd.cartesian(rdd).cartesian(rdd).count()  ## 355
rdd.cartesian(rdd).cartesian(rdd).distinct().count()  ## 251
It looks like it is related to serialization. If we reserialize after the initial cartesian:
rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 1)).cartesian(rdd).count() ## 1000
or insert an identity map:
rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() ## 1000
it yields the correct results (10 × 10 × 10 = 1000 records).
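A plain-Python sketch of one hypothesis consistent with the observations above (this is an illustration, not Spark's actual internals): if a cartesian deserializer pairs *batches* of records positionally instead of taking the full product over individual records, and the two input streams are batched with different sizes, combinations are silently dropped. The batch sizes below are arbitrary assumptions chosen to show the effect.

```python
from itertools import product

left = list(range(10))
right = list(range(10))

def batches(items, size):
    """Split a list into consecutive chunks of the given size."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Correct: cartesian product over individual elements -> 100 pairs.
correct = [(a, b) for a in left for b in right]

# Hypothetical broken pairing: zip batches positionally, then take the
# product only within each zipped pair; with mismatched batch sizes,
# many combinations are never produced.
bad = []
for lb, rb in zip(batches(left, 4), batches(right, 2)):
    bad.extend(product(lb, rb))

print(len(correct), len(bad))  # 100 vs 20
```

This would also explain why forcing a reserialization (or an identity `map`, which rewrites the data with a uniform serializer) restores the correct count: both sides end up batched consistently before the next `cartesian`.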
Issue Links
- is related to: SPARK-17756 java.lang.ClassCastException when using cartesian with DStream.transform (Resolved)