Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Bug
Description
My team needs to accurately track the resources used by Spark jobs. We currently rely on YuniKorn's app summary log, which the scheduler emits when a job completes. However, this log is inaccurate if YuniKorn restarts while the job is running, since YuniKorn tracks app resources only in memory. To address this, we created a sidecar pod that connects to YuniKorn's streaming event endpoint and writes the events to a Kafka topic so they survive a YuniKorn crash, allowing us to calculate our own app summaries.
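For context, the sidecar is essentially a small loop like the one below (a simplified sketch of our setup; the endpoint path, port, topic name, and broker address are illustrative, and reconnect/error handling is omitted):

import requests
from kafka import KafkaProducer

# Assumed values for illustration -- adjust to your deployment.
YUNIKORN_STREAM_URL = "http://yunikorn-service:9080/ws/v1/events/stream"
KAFKA_TOPIC = "yunikorn-events"

producer = KafkaProducer(bootstrap_servers="kafka:9092")

# Stream events from the scheduler and persist each one to Kafka so the
# history survives a YuniKorn crash or restart.
with requests.get(YUNIKORN_STREAM_URL, stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # each non-empty line is one JSON-encoded event record
            producer.send(KAFKA_TOPIC, value=line)
producer.flush()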
However, we have noticed that if executor pods complete while YuniKorn is down, YuniKorn never emits an allocation cancellation event for them, so we cannot determine when those executor pods stopped using resources. Using the job's last event timestamp provides an upper bound on an executor's resource usage, while ignoring the executor entirely, as YuniKorn appears to do, provides a lower bound.
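For reference, the per-allocation aggregation we run over the Kafka topic amounts to multiplying each allocation's resources by the time between its allocation and cancellation events, roughly as below (a simplified sketch; the field names are illustrative, not YuniKorn's exact event schema):

# allocations/cancellations: dicts keyed by allocation ID, each value holding
# an event timestamp in nanoseconds and the allocated resources, e.g.
# {"timestampNano": 1722380000000000000, "resources": {"memory": 4294967296, "vcore": 1000, "pods": 1}}
def aggregate_usage(allocations, cancellations):
    totals = {}
    for alloc_id, alloc in allocations.items():
        cancel = cancellations.get(alloc_id)
        if cancel is None:
            continue  # no cancellation event -- this is the gap described above
        seconds = (cancel["timestampNano"] - alloc["timestampNano"]) / 1e9
        for name, qty in alloc["resources"].items():
            totals[name] = totals.get(name, 0.0) + qty * seconds
    return totals

Multiplying by fractional seconds is consistent with the non-integer totals (e.g. pods) shown below.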
Below are the results from my testing:
Test without YuniKorn Restart
I ran a job for about 5 minutes with a driver pod creating roughly 100 executor pods. The first execution was without restarting YuniKorn.
My results calculated using the events in the Kafka topic:
Total aggregated resources usage:
memory: 126643967751900.53
pods: 9102.207251182002
vcore: 35909809.17691601
YuniKorn's App Summary Log:
2024-07-30T23:04:58.526Z INFO core.scheduler.application.usage objects/application_summary.go:60 YK_APP_SUMMARY: {ResourceUsage: TrackedResource{UNKNOWN:pods=9048,UNKNOWN:vcore=35694000,UNKNOWN:memory=125880530632704}, PreemptedResource: TrackedResource{}, PlaceholderResource: TrackedResource{}}
The difference (my value - YuniKorn app summary value):
memory: 126643967751900.53 - 125880530632704 = 763437119196.53 (my value is 0.60647% greater)
pods: 9102.207251182002 - 9048 = 54.207251182 (my value is 0.599% greater)
vcore: 35909809.17691601 - 35694000 = 215809.176916 (my value is 0.6046% greater)
My value is slightly different because I'm using the event timestamps rather than the resource timestamps (if you think the cause is something else, please share).
Test with YuniKorn Restart
I then ran the same job, but shut YuniKorn down for about 30 seconds after it had allocated resources to the driver and all executors, while the executors were nearing completion. Then I restarted YuniKorn.
Ignoring pods without cancellation events
My results calculated using the events in the Kafka topic:
Total aggregated resources usage:
memory: 13299125469337.467
pods: 945.3453441859999
vcore: 3760461.7715400006
YuniKorn's App Summary Log:
2024-07-30T23:48:41.044Z INFO core.scheduler.application.usage objects/application_summary.go:60 YK_APP_SUMMARY: {ResourceUsage: TrackedResource{UNKNOWN:memory=12561602838528,UNKNOWN:vcore=3552000,UNKNOWN:pods=893}, PreemptedResource: TrackedResource{}, PlaceholderResource: TrackedResource{}}
The difference (my value - YuniKorn app summary value):
memory: 13299125469337.467 - 12561602838528 = 737522630809 (my value is 5.87124% greater)
pods: 945.3453441859999 - 893 = 52.345344186 (my value is 5.8617% greater)
vcore: 3760461.7715400006 - 3552000 = 208461.77154 (my value is 5.8688% greater)
There's a larger discrepancy this time. Notably, the number of pods shows a significant drop. In typical runs without restarting YuniKorn, the job's pod resource usage hovers around 9k.
Using Last Event Timestamp as a Replacement
When using the last event timestamp in place of the missing allocation cancellation events to calculate resource usage, the results align more closely with expectations but remain significantly higher than YuniKorn's summary log, so they likely overestimate usage.
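Concretely, the fallback is a small variant of the aggregation sketched earlier (again simplified, with illustrative field names): when an allocation has no matching cancellation event, we substitute the application's last observed event timestamp as the end time.

def aggregate_usage_with_fallback(allocations, cancellations, last_event_ts_nano):
    totals = {}
    for alloc_id, alloc in allocations.items():
        cancel = cancellations.get(alloc_id)
        # Fall back to the app's last event timestamp when the executor
        # finished while YuniKorn was down and no cancellation was emitted.
        end_ts = cancel["timestampNano"] if cancel else last_event_ts_nano
        seconds = (end_ts - alloc["timestampNano"]) / 1e9
        for name, qty in alloc["resources"].items():
            totals[name] = totals.get(name, 0.0) + qty * seconds
    return totals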
My results calculated using the events in the Kafka topic:
Number of allocations without matching cancels: 101
Total aggregated resources usage:
memory: 159366239582373.12
pods: 11375.528615858
vcore: 45109670.74353799
The difference (my value - YuniKorn app summary value):
memory: 159366239582373.12 - 12561602838528 = 146804636743845.12 (my value is 1168.6776% greater)
pods: 11375.528615858 - 893 = 10482.5286159 (my value is 1173.85538% greater)
vcore: 45109670.74353799 - 3552000 = 41557670.7435 (my value is 1169.9794% greater)
Conclusion and Inquiry
Is this a bug in YuniKorn? Besides logging events to a Kafka topic, are there other strategies my team can employ to improve resource usage tracking?
Any insights or recommendations would be greatly appreciated.