Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Versions: 1.6.0, 2.0.0
Description
Spark currently assumes the number of partitions is less than 100000 and uses `%05d` padding for checkpoint part-file names.
If we exceed this number, the sort logic in ReliableCheckpointRDD breaks, because the part files are sorted and compared as strings. Lexicographic comparison yields the order part-10000, part-100000, ... instead of part-10000, part-10001, ..., part-100000, and the job fails while reconstructing the checkpointed RDD.
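A minimal Scala sketch (not Spark's actual code) of the failure mode: `%05d` padding stops aligning widths at index 100000, so string comparison misorders the filenames:

```scala
object PaddingDemo {
  def main(args: Array[String]): Unit = {
    // %05d pads to five digits, so indices up to 99999 align...
    println("part-%05d".format(9999))    // part-09999
    // ...but index 100000 produces a six-digit number
    println("part-%05d".format(100000))  // part-100000

    // String comparison then misorders the files:
    val sorted = Seq("part-10001", "part-100000", "part-10000").sorted
    println(sorted)  // List(part-10000, part-100000, part-10001)
  }
}
```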
Possible solutions:
- Bump the padding width to allow more partitions, or
- Sort the part files by extracting the numeric index as a sub-portion of the filename, rather than comparing full filenames as strings, and then verify the RDD
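The second option can be sketched as follows (an illustrative snippet, not the actual patch; the `part-` prefix handling is an assumption about the filename format):

```scala
object NumericSortSketch {
  def main(args: Array[String]): Unit = {
    val files = Seq("part-10001", "part-100000", "part-10000")
    // Extract the numeric suffix and compare it as an Int,
    // so part-100000 sorts after part-10001 as intended
    val ordered = files.sortBy(_.stripPrefix("part-").toInt)
    println(ordered)  // List(part-10000, part-10001, part-100000)
  }
}
```

This restores the intended numeric order regardless of how many digits the index has, so no fixed padding width is assumed.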