Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-2981

PartitionStrategy: VertexID hash overflow

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.0.2
    • 1.0.3, 1.1.0, 1.2.0
    • GraphX

    Description

      In EdgePartition1D, a PartitionID is calculated by multiplying VertexId with a mixingPrime (1125899906842597L) then cast to Int, and mod numParts.

      The Long is overflowed, and when cast to Int:

      scala> (1125899906842597L*1).toInt
      res1: Int = -27

      scala> (1125899906842597L*2).toInt
      res2: Int = -54

      scala> (1125899906842597L*3).toInt
      res3: Int = -81

      As the cast produce number that are multiplies of 3, the partition is not useable when partitioning to multiples of 3.

      for example when you partition to 6 or 9 parts:

      14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: psrc Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0))
      14/08/12 09:26:21 INFO GraphXPartition: GRAPHX: pdst Array((0,4347084), (1,0), (2,0), (3,3832578), (4,0), (5,0))

      14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: psrc Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))
      14/08/12 09:21:46 INFO GraphXPartition: GRAPHX: pdst Array((0,8179662), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0))

      so the vertices are partitioned to 0,3 for 6; and 0 for 9

      I think solution is to cast after mod.

      scala> (1125899906842597L*3)
      res4: Long = 3377699720527791

      scala> (1125899906842597L*3) % 9
      res5: Long = 3

      scala> ((1125899906842597L*3) % 9).toInt
      res5: Int = 3

      Attachments

        Activity

          People

            Unassigned Unassigned
            larryxiao Di Xiao
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 1h
                1h
                Remaining:
                Remaining Estimate - 1h
                1h
                Logged:
                Time Spent - Not Specified
                Not Specified