Apache YuniKorn / YUNIKORN-1218

Scheduler crashed with concurrent map access error in health checker

Details

    Description

      After YUNIKORN-1107, the health checker runs as a background thread at a 30s interval. We observed a few scheduler restarts in the past week that seem to be caused by this thread, because it accesses the partition context without holding the proper read lock. I have uploaded a patch to reproduce this locally, and a file with the stack trace captured when the crash happens.
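      For illustration only, here is a minimal sketch of the kind of fix the description implies: the background reader takes a read lock before touching a map that the scheduler mutates. This is not the attached patch or the actual YuniKorn code; `partitionContext`, `applications`, `addApplication`, `totalAllocations`, and `runDemo` are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// partitionContext is a hypothetical stand-in for the scheduler's partition
// state. The field and method names are illustrative, not YuniKorn's.
type partitionContext struct {
	sync.RWMutex
	applications map[string]int // app ID -> allocation count
}

func newPartitionContext() *partitionContext {
	return &partitionContext{applications: make(map[string]int)}
}

// addApplication is the scheduler-side write path; it takes the write lock.
func (pc *partitionContext) addApplication(id string, allocs int) {
	pc.Lock()
	defer pc.Unlock()
	pc.applications[id] = allocs
}

// totalAllocations is the health-checker-side read path. Holding the read
// lock while iterating keeps the Go runtime from aborting the process with
// a concurrent map access fatal error.
func (pc *partitionContext) totalAllocations() int {
	pc.RLock()
	defer pc.RUnlock()
	total := 0
	for _, n := range pc.applications {
		total += n
	}
	return total
}

// runDemo runs a writer and a reader concurrently and returns the final total.
func runDemo(apps int) int {
	pc := newPartitionContext()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // scheduler goroutine: mutates the map
		defer wg.Done()
		for i := 0; i < apps; i++ {
			pc.addApplication(fmt.Sprintf("app-%d", i), 1)
		}
	}()
	go func() { // health-check goroutine: reads the map
		defer wg.Done()
		for i := 0; i < apps; i++ {
			_ = pc.totalAllocations()
		}
	}()
	wg.Wait()
	return pc.totalAllocations()
}

func main() {
	fmt.Println(runDemo(1000))
}
```

      Removing the `RLock`/`RUnlock` pair from `totalAllocations` turns the sketch into a data race that `go run -race` should flag and that can crash the process, which is the general failure mode reported here.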

      Attachments

        1. stacktrace.log
          6 kB
          Weiwei Yang
        2. reproduce.patch
          3 kB
          Weiwei Yang

            People

              wwei Weiwei Yang
              wwei Weiwei Yang
              Votes: 0
              Watchers: 2
