  1. Giraph (Retired)
  2. GIRAPH-246

Periodic worker calls to context.progress() will prevent timeout on some Hadoop clusters during barrier waits

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: bsp

    Description

      This simple change adds a command-line configurable option to GiraphJob that controls the time between calls to context.progress(). The periodic calls let workers avoid timeouts during long data load-ins in which some workers finish reading their input splits much faster than others, and during supersteps that some workers finish faster than others. I found this allowed large-scale jobs with low memory overhead to complete where they would previously time out on a Hadoop cluster. A timeout is still possible when a worker crashes, runs out of memory, or has other legitimate GC or RPC trouble, but the change prevents unintentional task failures when the worker is actually still healthy.
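
The idea described above can be sketched roughly as follows. This is not the attached patch: the class name ProgressDuringBarrier, the method waitWithProgress, and the intervalMs parameter are hypothetical, and a CountDownLatch stands in for the barrier. The Runnable callback marks where a real Hadoop task would call context.progress().

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; these names are assumptions, not Giraph classes. */
final class ProgressDuringBarrier {

    /**
     * Waits on the barrier in slices of intervalMs. After every slice that
     * ends with the barrier still closed, the progress callback is invoked.
     * Returns how many times the callback ran.
     */
    static int waitWithProgress(CountDownLatch barrier, long intervalMs, Runnable progress)
            throws InterruptedException {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("intervalMs must be positive");
        }
        int calls = 0;
        // await() returns false when the slice elapses before the latch opens.
        while (!barrier.await(intervalMs, TimeUnit.MILLISECONDS)) {
            progress.run(); // in a real Hadoop task this would be context.progress()
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch barrier = new CountDownLatch(1);
        // Simulate a slower peer that reaches the barrier after about 250 ms.
        Thread slowPeer = new Thread(() -> {
            try {
                Thread.sleep(250);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            barrier.countDown();
        });
        slowPeer.start();
        int calls = waitWithProgress(barrier, 50, () -> System.out.println("progress reported"));
        slowPeer.join();
        System.out.println("barrier released after " + calls + " progress calls");
    }
}
```

In this sketch the callback only runs from the waiting thread itself, so a worker that is stuck or has crashed stops reporting progress and can still time out, which matches the behaviour the description asks for.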

      Attachments

        1. GIRAPH-246-NEW-FIX-2.patch
          6 kB
          Eli Reisman
        2. GIRAPH-246-NEW-FIX.patch
          6 kB
          Eli Reisman
        3. GIRAPH-246-9.patch
          7 kB
          Eli Reisman
        4. GIRAPH-246-8.patch
          7 kB
          Eli Reisman
        5. GIRAPH-246-7.patch
          11 kB
          Eli Reisman
        6. GIRAPH-246-7_rebase2.patch
          11 kB
          Eli Reisman
        7. GIRAPH-246-7_rebase1.patch
          10 kB
          Eli Reisman
        8. GIRAPH-246-6.patch
          11 kB
          Eli Reisman
        9. GIRAPH-246-5.patch
          11 kB
          Eli Reisman
        10. GIRAPH-246-4.patch
          4 kB
          Eli Reisman
        11. GIRAPH-246-3.patch
          4 kB
          Eli Reisman
        12. GIRAPH-246-2.patch
          4 kB
          Eli Reisman
        13. GIRAPH-246-11.patch
          5 kB
          Eli Reisman
        14. GIRAPH-246-10.patch
          5 kB
          Eli Reisman
        15. GIRAPH-246-1.patch
          5 kB
          Eli Reisman

            People

              Assignee: Eli Reisman (initialcontext)
              Reporter: Eli Reisman (initialcontext)
              Votes: 0
              Watchers: 6
