  1. Flink
  2. FLINK-35830

Large failed bulk request can result in TM OOM

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      The BulkResponse object contains one Throwable per failing document in the bulk request. The connector currently loops through the failures, combines them into a single exception via suppression, and throws it. When the bulk request is very large, the resulting stack trace grows so large that serializing it causes the TaskManager (TM) to run out of memory (OOM). Attached is the heap dump visualization of a TM that OOM'ed with a failing bulk size of 1,000.
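
      To make the failure mode concrete, here is a minimal sketch of the combine-via-suppression pattern described above. It is not the connector's actual code: the class ChainedFailures and its combine method are hypothetical names, and a plain list of Throwables stands in for the failures extracted from the BulkResponse.

```java
import java.util.List;

// Hypothetical stand-in for the connector's failure handling; not the actual connector code.
final class ChainedFailures {

    private ChainedFailures() {}

    /**
     * Folds every per-document failure into one exception: the first failure
     * becomes the primary exception and each later one is attached to it as
     * a suppressed exception. Returns null when the list is empty.
     */
    static Throwable combine(List<? extends Throwable> failures) {
        Throwable combined = null;
        for (Throwable failure : failures) {
            if (combined == null) {
                combined = failure;
            } else {
                // Every additional failure keeps its own stack trace and cause chain.
                combined.addSuppressed(failure);
            }
        }
        return combined;
    }
}
```

      With 1,000 failing documents, the exception returned by this sketch carries 999 suppressed exceptions, each with its own stack trace, even when all of them report the same underlying problem.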

      I have mitigated this issue in my local fork of the OpenSearch connector by suppressing only those exceptions in a bulk response that have unique root causes. This avoids massively nested stack traces in which the root cause of every failure is exactly the same. Note that this proposed fix does not cover the unlikely case in which every failing document in a very large BulkResponse has a different root cause. I believe this is acceptable given how rarely that should occur, but it is worth revisiting if it becomes a problem.
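
      A rough sketch of what such a mitigation could look like follows. It is an illustration only, not the code from the fork: the class UniqueRootCauseCombiner is a hypothetical name, a plain list of Throwables stands in for the failures taken from the BulkResponse, and treating two root causes as "the same" when their class name and message match is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; names and the equality rule for root causes are assumptions.
final class UniqueRootCauseCombiner {

    private UniqueRootCauseCombiner() {}

    /** Walks the cause chain down to the innermost Throwable. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Combines failures via suppression, but keeps only the first failure seen
     * for each distinct root cause. Returns null when the list is empty.
     */
    static Throwable combine(List<? extends Throwable> failures) {
        Set<String> seenRootCauses = new HashSet<>();
        Throwable combined = null;
        for (Throwable failure : failures) {
            Throwable root = rootCause(failure);
            // Assumption: class name plus message identifies a root cause.
            String key = root.getClass().getName() + ": " + root.getMessage();
            if (!seenRootCauses.add(key)) {
                continue; // same root cause already recorded, skip it
            }
            if (combined == null) {
                combined = failure;
            } else {
                combined.addSuppressed(failure);
            }
        }
        return combined;
    }
}
```

      Under this rule the combined exception grows with the number of distinct root causes rather than with the number of failing documents. If failure messages embed per-document details such as a document ID, the key would need to be coarser (for example, the class name alone) for the deduplication to have any effect.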

      @opensearch connector community: let me know if you think this fix would be valuable - I'm happy to open a PR for this upstream!

      Attachments

          People

            Assignee: Unassigned
            Reporter: Saketh Kurnool (sakkurn)
            Votes: 0
            Watchers: 1
