Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-2590

DataSetUtils.zipWithUniqueID creates duplicate IDs

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 0.10.0
    • None
    • API / Scala
    • None

    Description

      The function creates IDs using the following code:

      shifter = log2(numberOfParallelSubtasks)
      id = counter << shifter + taskId;
      

      As the binary function + is executed before the bitshift <<, this results in cases where different tasks create the same ID. It essentially calculates

      counter*2^(shifter+taskId)
      

      which is 0 for counter = 0 and all values of shifter and taskID.

      Consider the following example.

      numberOfParallelSubtaks = 8
      shifter = log2(8) = 4 (maybe rename the function?)
      produces:

      start: 1, shifter: 4 taskId: 4 label: 256
      start: 2, shifter: 4 taskId: 3 label: 256
      start: 4, shifter: 4 taskId: 2 label: 256
      

      I would suggest the following:

      counter*2^(shifter)+taskId
      

      which in code is equivalent to

      shifter = log2(numberOfParallelSubtasks);
      id = (counter << shifter) + taskId;
      

      and for our example produces:

      start: 1, shifter: 4 taskId: 4 label: 20
      start: 2, shifter: 4 taskId: 3 label: 35
      start: 4, shifter: 4 taskId: 2 label: 66
      

      So we move the counter to the left and add the task id. As there is space for 2^shifter numbers, this prevents collisions.

      Attachments

        Activity

          People

            mju Martin Junghanns
            mju Martin Junghanns
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: