Crunch (Retired) / CRUNCH-680

Kafka Source should split very large partitions

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: IO
    • Labels: None

      Description

      If a single Kafka partition holds a very large number of messages, the map task for that partition can take a long time to run, which can lead to timeouts. We should limit the number of messages assigned to each split so that the workload is balanced more evenly across map tasks.

      Based on our testing, 5 million messages is a reasonable default for the maximum split size. A configuration property (org.apache.crunch.kafka.split.max) is provided to override that value.
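
The splitting described above can be sketched roughly as follows. This is a minimal illustration, not the code committed for this issue: the class KafkaSplitSketch, its methods, and the use of java.util.Properties are hypothetical stand-ins, and only the property name and the 5 million default come from the description. It also assumes that offsets in a partition are contiguous, so that an offset range approximates a message count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch; not the actual Crunch Kafka source implementation.
class KafkaSplitSketch {
    // Property name and default value are taken from the issue description.
    static final String MAX_SPLIT_PROPERTY = "org.apache.crunch.kafka.split.max";
    static final long DEFAULT_MAX_SPLIT_SIZE = 5_000_000L;

    // Reads the optional override, falling back to the default when unset.
    static long maxSplitSize(Properties props) {
        String value = props.getProperty(MAX_SPLIT_PROPERTY);
        return value == null ? DEFAULT_MAX_SPLIT_SIZE : Long.parseLong(value.trim());
    }

    // Divides the half-open offset range [startOffset, endOffset) of one
    // partition into consecutive ranges of at most maxSplitSize offsets each.
    static List<long[]> splitOffsets(long startOffset, long endOffset, long maxSplitSize) {
        if (maxSplitSize <= 0) {
            throw new IllegalArgumentException("maxSplitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        long current = startOffset;
        while (current < endOffset) {
            // Take the smaller of the remaining range and the cap, avoiding overflow.
            long length = Math.min(endOffset - current, maxSplitSize);
            splits.add(new long[] {current, current + length});
            current += length;
        }
        return splits;
    }
}
```

With the default cap, a partition holding offsets 0 through 11,999,999 would be divided into three ranges, the last one shorter than the others. An empty range (start equal to end) yields no splits.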

              People

              • Assignee: Micah Whitacre (mkwhitacre)
              • Reporter: Andrew Olson (noslowerdna)

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 20m
