SPARK-12590

Inconsistent behavior of randomSplit in YARN mode

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: MLlib, Spark Core
    • Labels: None
    • Environment: YARN mode

    Description

      I noticed inconsistent behavior from rdd.randomSplit when the source RDD has been repartitioned, but only in YARN mode; it works fine in local mode.

      Code:
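      val rdd = sc.parallelize(1 to 1000000)
      val rdd2 = rdd.repartition(64)   // repartition introduces a shuffle
      rdd.partitions.size
      rdd2.partitions.size
      // Split roughly 70/30 with a fixed seed of 1
      val Array(train, test) = rdd2.randomSplit(Array(70, 30), 1)
      // Each action below evaluates the lineage of rdd2 separately
      train.takeOrdered(10)
      test.takeOrdered(10)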

      Master: local
      Both take statements produce the same results on every run, and the numbers returned for train and test do not overlap.

      Master: YARN
      When the same statements are run in YARN mode, however, the results change on every run, and the numbers returned for train and test overlap.
      If I call randomSplit on the original rdd instead of rdd2, it works fine even on YARN.

      This suggests that the repartition is re-evaluated each time a split is computed.
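A possible explanation can be sketched without Spark. It assumes that each split samples every partition with the same seed and keeps the elements whose random draw falls in that split's range, so the draw depends on an element's position in the partition rather than on its value. The object and function names below (RandomSplitSketch, sampleRange) are hypothetical, and scala.util.Random stands in for Spark's own sampler; this is a simplified simulation, not Spark's implementation.

```scala
import scala.util.Random

object RandomSplitSketch {
  // Keep the elements whose per-position random draw falls in [lb, ub).
  // The draw depends only on the seed and the element's position, not its value.
  def sampleRange(partition: Seq[Int], lb: Double, ub: Double, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    partition.filter { _ =>
      val x = rng.nextDouble()
      lb <= x && x < ub
    }
  }

  def main(args: Array[String]): Unit = {
    val stable = (1 to 1000).toSeq
    // Same contents, different order: stands in for a recomputed shuffle output.
    val reordered = stable.reverse

    // Both splits see the same ordering, so the two ranges partition the data.
    val train1 = sampleRange(stable, 0.0, 0.7, seed = 1L)
    val test1 = sampleRange(stable, 0.7, 1.0, seed = 1L)
    println(s"same order, overlap = ${train1.intersect(test1).size}")

    // The second split sees a different ordering, so elements can land in both.
    val train2 = sampleRange(stable, 0.0, 0.7, seed = 1L)
    val test2 = sampleRange(reordered, 0.7, 1.0, seed = 1L)
    println(s"different order, overlap = ${train2.intersect(test2).size}")
  }
}
```

If the order of elements within a repartitioned partition is not guaranteed to be the same on every evaluation, the second case would match the overlap seen on YARN, while a stable order would match the local-mode result.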

      Interestingly, if I cache rdd2 before splitting it, the behavior is consistent, since the repartition is not re-evaluated for each split.
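The caching workaround might look like the following in spark-shell, where sc is the predefined SparkContext. This is a minimal sketch: the count() call used to materialize the cache and the intersection check are additions for illustration and are not part of the original report.

```scala
// Sketch for spark-shell; `sc` is the SparkContext the shell provides.
val rdd2 = sc.parallelize(1 to 1000000).repartition(64)

// Persist the shuffled data so both splits sample the same partition contents.
rdd2.cache()
rdd2.count() // run one action so the cache is populated before splitting

val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), 1L)

// If both splits read identical cached partitions, no element should be in both.
train.intersection(test).count()
```

Note that cache() is best effort: a cached partition that is evicted or lost is recomputed from the lineage, so this reduces the chance of inconsistent splits rather than ruling it out.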

          People

            Assignee: Unassigned
            Reporter: Gaurav Kumar (gauravkumar37)