  1. Apache YuniKorn
  2. YUNIKORN-1628

[Umbrella] YuniKorn application traceability


Details

    Description

      The current implementation of YuniKorn focuses on the application and its states. K8s does not and cannot provide details on what happens inside the application, which limits what we can offer at the YuniKorn level for applications.

      To increase supportability, we need to understand what happens inside the core scheduler and how we got into a certain state.

      Requirements:

      1. We want to record a stream of events in memory whenever something relevant happens to an application or a node:
        • Partition changed (nodes added / removed, capacity changed, etc.)
        • Application created / removed
        • An ask is created / removed
        • An allocation is created / removed
        • Reservation occurs
        • Placeholder is replaced, etc.
      2. The recorded events should be available from the REST interface
      3. The number of stored events can be limited by two settings: a maximum number of events or an expiration time (e.g. events from the past 5 minutes).
      4. Take advantage of Go channels to avoid any potential blocking
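
      A minimal Go sketch of requirements 1, 3 and 4, given here only to illustrate the idea. It is not the YuniKorn implementation: the names Event, Store, Publisher, NewStore and NewPublisher are hypothetical, and Event is a simplified stand-in rather than si.EventRecord. Producers hand events to a buffered channel with a non-blocking send, a single goroutine drains the channel into a store capped by a maximum count, and readers filter out expired entries.

```go
package main

import (
	"sync"
	"time"
)

// Event is a hypothetical, simplified record; it is NOT YuniKorn's si.EventRecord.
type Event struct {
	ObjectID      string // e.g. an application or node ID
	Message       string // e.g. "allocation created"
	TimestampNano int64  // creation time in Unix nanoseconds
}

// Store keeps at most maxEvents events and hides those older than maxAge.
type Store struct {
	mu        sync.Mutex
	events    []Event
	maxEvents int
	maxAge    time.Duration
}

// NewStore creates an empty store with the two limits from requirement 3.
func NewStore(maxEvents int, maxAge time.Duration) *Store {
	return &Store{maxEvents: maxEvents, maxAge: maxAge}
}

// Add appends an event and drops the oldest entries beyond maxEvents.
func (s *Store) Add(e Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, e)
	if extra := len(s.events) - s.maxEvents; extra > 0 {
		// Copy so the dropped head is no longer referenced.
		s.events = append([]Event(nil), s.events[extra:]...)
	}
}

// Snapshot returns a copy of the events that have not expired at nowNano.
func (s *Store) Snapshot(nowNano int64) []Event {
	s.mu.Lock()
	defer s.mu.Unlock()
	cutoff := nowNano - int64(s.maxAge)
	out := make([]Event, 0, len(s.events))
	for _, e := range s.events {
		if e.TimestampNano >= cutoff {
			out = append(out, e)
		}
	}
	return out
}

// Publisher decouples event producers from the store with a buffered channel.
type Publisher struct {
	ch   chan Event
	done chan struct{}
}

// NewPublisher starts one goroutine that drains the channel into the store.
func NewPublisher(s *Store, buffer int) *Publisher {
	p := &Publisher{ch: make(chan Event, buffer), done: make(chan struct{})}
	go func() {
		defer close(p.done)
		for e := range p.ch {
			s.Add(e)
		}
	}()
	return p
}

// Publish never blocks: if the buffer is full the event is dropped
// and false is returned.
func (p *Publisher) Publish(e Event) bool {
	select {
	case p.ch <- e:
		return true
	default:
		return false
	}
}

// Close stops accepting events and waits until buffered ones are stored.
func (p *Publisher) Close() {
	close(p.ch)
	<-p.done
}
```

      In this sketch a caller would create a store with NewStore(maxEvents, maxAge), wrap it with NewPublisher, call Publish from the scheduling path, and serve Snapshot(time.Now().UnixNano()) from a REST handler. Dropping an event when the buffer is full is one possible trade-off: it keeps the producer from ever waiting, at the cost of losing events under heavy load.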

       

        Issue Links

          1. Create design document | Sub-task | Closed | Wilfred Spiegelenburg
          2. Event cache: misc cleanup | Sub-task | Closed | Peter Bacsko
          3. Extend si.EventRecord type | Sub-task | Closed | Peter Bacsko
          4. Re-write event storage from map to slice | Sub-task | Closed | Peter Bacsko
          5. Create basic ringbuffer implementation | Sub-task | Closed | Peter Bacsko
          6. Remove error check from createEventRecord() | Sub-task | Closed | Peter Bacsko
          7. Extend Application event wrapper with new events | Sub-task | Closed | Rainie Li
          8. Add REST endpoint for batch event retrieval | Sub-task | Closed | Peter Bacsko
          9. Add allocation events | Sub-task | Closed | PoAn Yang
          10. Add node events | Sub-task | Closed | Peter Bacsko
          11. Add queue events | Sub-task | Closed | Peter Bacsko
          12. Rationalize event verification in partition_test.go | Sub-task | Closed | Peter Bacsko
          13. Rewrite event verification in application_test.go | Sub-task | Closed | Peter Bacsko
          14. Add new configuration entries | Sub-task | Closed | Mit Desai
          15. Add reservation/unreservation events | Sub-task | Closed | Peter Bacsko
          16. Handle config reload in the event code | Sub-task | Closed | Peter Bacsko
          17. Rename CreateAndSetEventSystem() and reduce code duplication | Sub-task | Closed | Peter Bacsko
          18. Don't send all node events to the shim | Sub-task | Closed | Peter Bacsko
          19. Create smoke test to validate application tracking via REST interface | Sub-task | Closed | Peter Bacsko
          20. Null Batch API Response when we reset ring buffer size | Sub-task | Closed | Peter Bacsko
          21. Add missing application state transition event | Sub-task | Closed | Peter Bacsko
          22. Add retrieving recent events to the REST interface | Sub-task | Closed | Peter Bacsko
          23. Extract URL query properly in getEvents() | Sub-task | Closed | Peter Bacsko
          24. Pass EventRecord as value instead of pointer | Sub-task | Closed | Peter Bacsko
          25. Handle event streaming clients with websocket | Sub-task | Closed | Peter Bacsko
          26. Add missing states Expired and Resuming | Sub-task | Closed | Peter Bacsko
          27. Display a unique id on the REST interface when returning events | Sub-task | Closed | Peter Bacsko

            People

              pbacsko Peter Bacsko