Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
The BulkResponse object contains one Throwable per failing document in the bulk request. The connector currently loops through the failures, combines them into a single exception via suppression, and throws it. When the bulk is very large, the resulting stack trace becomes so large that serializing it causes the TaskManager (TM) to run out of memory (OOM). Attached is a heap dump visualization of a TM that OOM'ed after a failing bulk request of size 1,000.
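To illustrate the growth, here is a minimal standalone sketch, not the connector's actual code. The class name `SuppressedGrowthSketch` and its helpers are hypothetical. It chains same-cause failures with `Throwable.addSuppressed` and uses the Java-serialized size as a rough proxy for what has to be shipped; the real serialization path in the TM may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class SuppressedGrowthSketch {

    // Hypothetical stand-in for the failure handling described above:
    // every per-document failure is attached to the first one as a suppressed exception.
    // Returns null for an empty list.
    public static Throwable combineAll(List<Throwable> failures) {
        Throwable combined = null;
        for (Throwable failure : failures) {
            if (combined == null) {
                combined = failure;
            } else {
                combined.addSuppressed(failure);
            }
        }
        return combined;
    }

    // Java-serialized size in bytes; only a rough proxy for the real payload size.
    public static int serializedSize(Throwable t) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(t);
        }
        return bytes.size();
    }

    // Builds `count` distinct failures that all share an identical root cause.
    public static List<Throwable> sameCauseFailures(int count) {
        List<Throwable> failures = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            failures.add(new RuntimeException(
                    "bulk item failed", new IllegalStateException("mapping conflict")));
        }
        return failures;
    }

    public static void main(String[] args) throws IOException {
        int small = serializedSize(combineAll(sameCauseFailures(10)));
        int large = serializedSize(combineAll(sameCauseFailures(1000)));
        System.out.println("10 failures:   " + small + " bytes");
        System.out.println("1000 failures: " + large + " bytes");
    }
}
```

Each suppressed exception carries its own stack trace and cause, so the combined object grows with the number of failing documents even when every failure is identical.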
I have mitigated this issue in my local fork of the OpenSearch connector by suppressing only those exceptions from a bulk response that have unique root causes. This avoids massively nested stack traces in which the root cause of every failure is exactly the same. NOTE: this proposed fix does not cover the unlikely case in which every failing document in a very large BulkResponse has a different root cause. I believe this is acceptable given how rarely that should occur, but it is worth revisiting if it becomes a problem.
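A rough sketch of the idea, not the code from my fork: the class `UniqueRootCauseSketch` and its method names are hypothetical. It also assumes that two failures count as "the same" when their root causes share a class name and message; the actual uniqueness criterion could differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UniqueRootCauseSketch {

    // Walk the cause chain to its end; the depth cap guards against cyclic chains.
    public static Throwable rootCause(Throwable t) {
        Throwable root = t;
        for (int depth = 0; root.getCause() != null && depth < 100; depth++) {
            root = root.getCause();
        }
        return root;
    }

    // Assumption: root causes are equal when class name and message match.
    static String rootCauseKey(Throwable t) {
        Throwable root = rootCause(t);
        return root.getClass().getName() + ": " + root.getMessage();
    }

    // Keep only the first failure seen for each distinct root cause, then
    // attach the remaining ones to the first as suppressed exceptions.
    // Returns null for an empty list.
    public static Throwable combineUnique(List<Throwable> failures) {
        Map<String, Throwable> firstPerRootCause = new LinkedHashMap<>();
        for (Throwable failure : failures) {
            firstPerRootCause.putIfAbsent(rootCauseKey(failure), failure);
        }
        Throwable combined = null;
        for (Throwable failure : firstPerRootCause.values()) {
            if (combined == null) {
                combined = failure;
            } else {
                combined.addSuppressed(failure);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Throwable> failures = new ArrayList<>();
        for (int i = 0; i < 999; i++) {
            failures.add(new RuntimeException(
                    "bulk item failed", new IllegalStateException("mapping conflict")));
        }
        failures.add(new RuntimeException(
                "bulk item failed", new IllegalArgumentException("version conflict")));

        Throwable combined = combineUnique(failures);
        System.out.println("suppressed: " + combined.getSuppressed().length);
    }
}
```

With this approach the size of the thrown exception scales with the number of distinct root causes rather than with the number of failing documents, which is why the all-different-root-causes case remains unmitigated.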
@opensearch connector community: let me know if you think this fix would be valuable - I'm happy to open a PR for this upstream!