Samza / SAMZA-2789

Remove cap on intermediate stream partition count for stream mode

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Problem: The intermediate stream partition count inference logic caps the partition count at 256, which results in imbalanced work assignment across tasks.

      Description: The intermediate stream partition count inference logic currently uses the following algorithm.

      • partitionCount = Math.max(maxPartitionSize(inputStreams), maxPartitionSize(outputStreams))
      • cap partitionCount at MAX_INFERRED_PARTITIONS (256), which is defined in `IntermediateStreamManager`
      • apply the inferred partition count to intermediate streams whose partition count is uninitialized
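
The first two steps of the list above can be sketched as follows. This is an illustrative sketch only, not the actual `IntermediateStreamManager` code: the class name, the `maxPartitionCount` helper, and the use of plain integer lists in place of stream specs are assumptions made for the example. The third step (applying the result only to streams with an uninitialized count) is omitted.

```java
import java.util.List;

public class PartitionInferenceSketch {
    // Value quoted in the issue for MAX_INFERRED_PARTITIONS.
    static final int MAX_INFERRED_PARTITIONS = 256;

    // Hypothetical helper: largest partition count among a set of streams.
    static int maxPartitionCount(List<Integer> partitionCounts) {
        return partitionCounts.stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    // Take the larger of the input and output maxima, then apply the cap.
    static int inferPartitionCount(List<Integer> inputs, List<Integer> outputs) {
        int partitionCount = Math.max(maxPartitionCount(inputs), maxPartitionCount(outputs));
        return Math.min(partitionCount, MAX_INFERRED_PARTITIONS);
    }

    public static void main(String[] args) {
        // Below the cap: the inferred count is the largest observed count.
        System.out.println(inferPartitionCount(List.of(64, 128), List.of(32)));  // prints 128
        // Above the cap: the inferred count is clamped to 256.
        System.out.println(inferPartitionCount(List.of(512), List.of(1024)));    // prints 256
    }
}
```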

      The logic above caps every auto-created intermediate stream at 256 partitions. This can prevent the job from scaling up uniformly: when the number of tasks exceeds 256, intermediate partitions are assigned to at most 256 tasks, which leaves the remaining tasks without intermediate work and the overall assignment imbalanced.
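
To make the imbalance concrete, the sketch below counts how many tasks end up with no intermediate partition. It assumes a simplified model in which partition `p` is assigned to task `p % taskCount`; that model, the class name, and the method names are illustrative assumptions and may not match Samza's actual grouping logic.

```java
import java.util.Arrays;

public class CapImbalanceSketch {
    // Assumed model: partition p is assigned to task (p % taskCount).
    static int[] partitionsPerTask(int partitionCount, int taskCount) {
        int[] load = new int[taskCount];
        for (int p = 0; p < partitionCount; p++) {
            load[p % taskCount]++;
        }
        return load;
    }

    // Number of tasks that receive no intermediate partition at all.
    static long idleTasks(int partitionCount, int taskCount) {
        return Arrays.stream(partitionsPerTask(partitionCount, taskCount))
                .filter(n -> n == 0)
                .count();
    }

    public static void main(String[] args) {
        // 512 tasks, intermediate stream capped at 256 partitions.
        System.out.println(idleTasks(256, 512));  // prints 256
        // 512 tasks, intermediate stream with 512 partitions (no cap).
        System.out.println(idleTasks(512, 512));  // prints 0
    }
}
```

Under this model, half of a 512-task job receives no intermediate data while the cap is in place.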

      Changes:

      • Apply the cap only in batch mode; the 256 limit was introduced for batch mode, where the number of files (partitions) can be large
      • Add unit tests for `IntermediateStreamManager`
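
A minimal sketch of the proposed behavior, assuming the job's mode is available as a simple flag. The `Mode` enum, the class name, and the method signature are hypothetical and stand in for however the real code determines batch versus stream mode.

```java
public class ModeAwareCapSketch {
    // Value quoted in the issue for MAX_INFERRED_PARTITIONS.
    static final int MAX_INFERRED_PARTITIONS = 256;

    // Hypothetical stand-in for the job's execution mode.
    enum Mode { BATCH, STREAM }

    static int inferPartitionCount(int maxInputPartitions, int maxOutputPartitions, Mode mode) {
        int partitionCount = Math.max(maxInputPartitions, maxOutputPartitions);
        if (mode == Mode.BATCH) {
            // The cap is kept for batch mode, where the partition (file) count can be large.
            partitionCount = Math.min(partitionCount, MAX_INFERRED_PARTITIONS);
        }
        return partitionCount;
    }

    public static void main(String[] args) {
        System.out.println(inferPartitionCount(512, 1024, Mode.BATCH));   // prints 256
        System.out.println(inferPartitionCount(512, 1024, Mode.STREAM));  // prints 1024
    }
}
```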

      Tests: Added unit tests for the code changes

      API Changes: None

      Upgrade Instructions:

      • Jobs that temporarily worked around this constraint by setting `job.intermediate.stream.partitions` should remove that configuration so that Samza can infer and apply the partition count as described above.
      • Jobs that don't use `job.intermediate.stream.partitions` need no changes.

      Usage Instructions: Refer to the upgrade instructions.

          People

            Assignee: Bharath Kumarasubramanian (bharathkk)
            Reporter: Bharath Kumarasubramanian (bharathkk)

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 10m