Crunch (Retired) / CRUNCH-680

Kafka Source should split very large partitions


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: IO
    • Labels: None

    Description

      If a single Kafka partition holds a very large number of messages, the map task for that partition can take a long time to run, which can lead to timeouts. We should cap the number of messages assigned to each split so that the workload is more evenly balanced across map tasks.

      Based on our testing, 5 million messages is a reasonable general-purpose default for the maximum split size. A configuration property (org.apache.crunch.kafka.split.max) is provided to optionally override that value.
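
      To illustrate the idea, here is a minimal sketch of capping split size. It is not the code committed for this issue. The class KafkaSplitSketch, the OffsetRange type, and the use of java.util.Properties as a stand-in for the job configuration are all assumptions made for the example; only the property name and the 5 million default come from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: divide one partition's offset range into bounded splits.
class KafkaSplitSketch {

    // Property name and default value as stated in the issue description.
    static final String MAX_SPLIT_PROPERTY = "org.apache.crunch.kafka.split.max";
    static final long DEFAULT_MAX_SPLIT = 5_000_000L;

    // Hypothetical value type: half-open offset range [start, end) in one partition.
    static final class OffsetRange {
        final int partition;
        final long start;
        final long end;

        OffsetRange(int partition, long start, long end) {
            this.partition = partition;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "partition " + partition + " [" + start + ", " + end + ")";
        }
    }

    // Reads the maximum split size, falling back to the default when unset.
    // Properties stands in for the real job configuration (an assumption).
    static long maxSplitSize(Properties props) {
        String value = props.getProperty(MAX_SPLIT_PROPERTY);
        return value == null ? DEFAULT_MAX_SPLIT : Long.parseLong(value);
    }

    // Cuts [start, end) into consecutive ranges of at most maxSplit messages each.
    static List<OffsetRange> split(int partition, long start, long end, long maxSplit) {
        if (maxSplit <= 0) {
            throw new IllegalArgumentException("maxSplit must be positive");
        }
        List<OffsetRange> splits = new ArrayList<>();
        long cursor = start;
        while (cursor < end) {
            // Compare the remaining count first so cursor + maxSplit cannot overflow.
            long next = (end - cursor > maxSplit) ? cursor + maxSplit : end;
            splits.add(new OffsetRange(partition, cursor, next));
            cursor = next;
        }
        return splits;
    }

    public static void main(String[] args) {
        Properties props = new Properties(); // no override, so the default applies
        long maxSplit = maxSplitSize(props);
        // A partition with 12 million messages yields three splits under the default cap.
        for (OffsetRange range : split(0, 0L, 12_000_000L, maxSplit)) {
            System.out.println(range);
        }
    }
}
```

      With the default cap, a partition of 12 million messages becomes two full splits of 5 million and one of 2 million, so no single map task processes more than the configured maximum.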


            People

              Assignee: Micah Whitacre (mkwhitacre)
              Reporter: Andrew Olson (noslowerdna)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m