Apache Storm / STORM-738

Multilang needs Overflow-Control mechanism and HeartBeat timeout problem


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.10.0, 0.9.3-rc2, 0.9.4, 1.0.0
    • Fix Version/s: None
    • Component/s: storm-multilang
    • Labels: None

    Description

      Hi all,

      We have a topology with 3 components (spout -> parser -> saver); the parser is a Multilang bolt written in Python. We do not use the ACK mechanism.

      We found 2 problems with the Multilang Python script:
      1) the parser's Python script may hold too many tuples and consume too much memory;
      2) with the Multilang heartbeat mechanism described in https://issues.apache.org/jira/browse/STORM-513, the Python script always times out on the heartbeat, even when the parser bolt is working normally, which causes the supervisor to restart itself.

      ShellBolt process === Father-Process
      PythonScript process === Child-Process

      The reason is:
      1) When the topology does not use the ACK mechanism, the spout has no Overflow-Control ability; if too many tuples arrive on the stream, the spout sends all of them to the parser's ShellBolt process (Father-Process).
      2) The parser's ShellBolt process just puts the tuples into the _pendingWrites queue, which grows without bound if the _pendingWrites queue has no limit.
      3) The parser's PythonScript process (Child-Process) calls readMsg() to read a tuple from STDIN, handles the tuple, emits a new tuple to its Father-Process through STDOUT, and then calls readTaskIds() to read from STDIN. Because the Father-Process's queue already holds many other tuples, the Child-Process reads all of those tuples into pending_commands until it receives the task IDs.
      4) So the Child-Process's pending_commands may contain too many tuples and consume too much memory.
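
The buffering in steps 3) and 4) can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: `read_task_ids` and `simulate_backlog` are hypothetical stand-ins written for this example (not the actual storm.py functions), STDIN is modeled as an in-memory deque, and the Father-Process is assumed to have written every queued tuple ahead of the task-id reply.

```python
from collections import deque


def read_task_ids(stdin, pending_commands):
    """Read messages until a task-id list arrives.

    Every tuple (modeled as a dict) seen before the reply is buffered
    in pending_commands, mirroring the behavior described in step 3.
    """
    while True:
        msg = stdin.popleft()
        if isinstance(msg, list):      # a list models the task-id reply
            return msg
        pending_commands.append(msg)   # anything else is a queued tuple


def simulate_backlog(queued_tuples):
    """Return (task_ids, backlog size) after handling a single tuple."""
    # Assumption: the parent has already written all queued tuples to STDIN.
    stdin = deque({"id": i, "tuple": ["line %d" % i]} for i in range(queued_tuples))
    pending_commands = deque()

    stdin.popleft()        # the child reads and handles the first tuple
    stdin.append([4])      # the reply to its emit lands behind the backlog
    task_ids = read_task_ids(stdin, pending_commands)
    return task_ids, len(pending_commands)


if __name__ == "__main__":
    ids, backlog = simulate_backlog(1000)
    print("task ids:", ids, "buffered tuples:", backlog)
```

In this model, a single emit forces the child to pull the whole remaining backlog into memory before it sees the reply, which is the growth described in step 4.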

      As for the heartbeat: because there are so many pending_commands for the Child-Process to handle, and every emit operation by the Child-Process requires additional read operations from STDIN, it may take 10 seconds to handle one tuple. As a result, the heartbeat tuple is not handled quickly enough, and the timeout occurs.

      Even if the Father-Process's _pendingWrites has a limit, for example 1000, the Child-Process may need 1000 x 1000 read operations before it can handle the heartbeat tuple.

      Robert Joseph Evans, Jungtaek Lim: this is related to Multilang and the heartbeat; please help confirm the two problems.

      I think the Father-Process and Child-Process need an Overflow-Control protocol to control the Python script's memory usage.
      The heartbeat tuple also needs a separate queue (pending_heartbeats), so that the Child-Process handles heartbeat tuples at high priority. Jungtaek Lim, I would like to hear your opinion.
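
One possible shape for the two proposals, as a rough sketch rather than a patch: the class `CommandBuffer`, its methods, and the `max_pending` limit are hypothetical names made up for this example. It also assumes that a heartbeat tuple can be recognized by the stream name `__heartbeat`, and it leaves open how the Child-Process would tell the Father-Process to pause, since that protocol is not defined in this issue.

```python
from collections import deque

# Assumption: heartbeat tuples are identified by this stream name.
HEARTBEAT_STREAM = "__heartbeat"


class CommandBuffer:
    """Hypothetical child-side buffer with a separate heartbeat queue."""

    def __init__(self, max_pending=1000):
        self.pending_heartbeats = deque()
        self.pending_commands = deque()
        self.max_pending = max_pending

    def offer(self, msg):
        """Buffer a message; return False when the tuple backlog is full."""
        if msg.get("stream") == HEARTBEAT_STREAM:
            # Heartbeats are never rejected and never wait behind tuples.
            self.pending_heartbeats.append(msg)
            return True
        if len(self.pending_commands) >= self.max_pending:
            # Overflow: the caller should stop reading or ask the parent to pause.
            return False
        self.pending_commands.append(msg)
        return True

    def next_command(self):
        """Return the next message to handle, heartbeats first, or None."""
        if self.pending_heartbeats:
            return self.pending_heartbeats.popleft()
        if self.pending_commands:
            return self.pending_commands.popleft()
        return None


if __name__ == "__main__":
    buf = CommandBuffer(max_pending=2)
    buf.offer({"id": 1, "stream": "default"})
    buf.offer({"id": 2, "stream": "default"})
    buf.offer({"id": 3, "stream": HEARTBEAT_STREAM})
    print(buf.next_command())  # the heartbeat is handled before the backlog
```

The bounded `pending_commands` caps memory use in the script, and the separate `pending_heartbeats` queue lets a heartbeat be answered without first working through the tuple backlog.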

          People

            Assignee: Unassigned
            Reporter: DashengJu (dashengju)
