  1. Spark
  2. SPARK-17417

Fix sorting of part files while reconstructing RDD/partition from checkpointed files.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 2.0.0
    • Fix Version/s: 2.0.2, 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Spark currently assumes that the number of partitions is less than 100000 and uses %05d padding when naming the checkpoint part files.

      If the number of partitions exceeds this limit, the sort logic in ReliableCheckpointRDD breaks and fails, because the part files are sorted and compared as strings.

      As a result, the file names are ordered as part-10000, part-100000, ... instead of part-10000, part-10001, ..., part-100000, and the job fails while reconstructing the checkpointed RDD.
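      The ordering problem can be reproduced outside Spark. The following is a minimal sketch, not Spark's code: the object name and the partFileName helper are hypothetical and only mimic the %05d naming described above.

```scala
// Sketch only: imitates the %05d part-file naming described in this issue.
// PartFileStringSort and partFileName are illustrative names, not Spark APIs.
object PartFileStringSort {
  // Build a part-file name with at least five zero-padded digits.
  def partFileName(index: Int): String = "part-%05d".format(index)

  def main(args: Array[String]): Unit = {
    val names = Seq(9999, 10000, 10001, 99999, 100000).map(partFileName)
    // The default String ordering is lexicographic, so the six-digit name
    // "part-100000" lands between "part-10000" and "part-10001".
    names.sorted.foreach(println)
  }
}
```

      With five digits or fewer, lexicographic and numeric order agree because every name has the same length. The mismatch appears only once a six-digit index is present.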

      Possible solutions:

      • Bump the padding to allow more partitions, or
      • Sort the part files by the portion of the file name that holds the partition index, and then verify the RDD
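      The second option could look roughly like the sketch below. It assumes that file names have the form part-<digits>. The object and helper names are hypothetical, and the contiguity check is only one possible reading of "verify the RDD", not necessarily the fix that was merged.

```scala
// Sketch only: sorts part-file names by their numeric suffix.
// PartFileNumericSort and its helpers are illustrative names, not Spark APIs.
object PartFileNumericSort {
  // Parse the digits after the "part-" prefix (assumes a well-formed name).
  def partitionIndex(fileName: String): Int =
    fileName.stripPrefix("part-").toInt

  // Order files by partition index instead of by string comparison.
  def sortByIndex(fileNames: Seq[String]): Seq[String] =
    fileNames.sortBy(partitionIndex)

  // Assumed verification step: the i-th sorted file must hold partition i.
  def isContiguous(sorted: Seq[String]): Boolean =
    sorted.zipWithIndex.forall { case (name, i) => partitionIndex(name) == i }

  def main(args: Array[String]): Unit = {
    val sorted = sortByIndex(Seq("part-10001", "part-100000", "part-10000"))
    sorted.foreach(println)
  }
}
```

      Sorting by the parsed index does not depend on the padding width, so it keeps working if the padding is also increased later.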

          People

            Assignee: Dhruve Ashar
            Reporter: Dhruve Ashar
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: