  1. Apache Storm
  2. STORM-143

Launching a process throws away standard out; can hang

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: storm-core
    • Labels: None

    Description

      https://github.com/nathanmarz/storm/issues/489

      https://github.com/nathanmarz/storm/blob/master/src/clj/backtype/storm/util.clj#L349

      When we launch a process, standard out is written to a system buffer and does not appear to be read. Also, nothing is redirected to standard in. This can have the following effects:

      • A worker can hang when initializing (e.g. UnsatisfiedLinkError looking for jzmq), and it will be unable to communicate the error as standard out is being swallowed.
      • A process that writes too much to standard out will block if the buffer fills.
      • A process that tries to read from standard in for any reason will block.

      Perhaps we can redirect standard out to an .out file, and redirect /dev/null to the standard in stream of the process?
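
The redirection proposed above could look roughly like the following. This is a minimal sketch, assuming the Java 7+ `ProcessBuilder` redirect API is available; `launch-with-redirects` and the output path are hypothetical names for illustration, not Storm's actual launcher code.

```clojure
(import '(java.io File)
        '(java.lang ProcessBuilder ProcessBuilder$Redirect))

(defn launch-with-redirects
  "Hypothetical helper (not Storm's launch-process): starts `command`
  (a seq of strings) with stdout and stderr appended to `out-file` and
  stdin read from /dev/null, so the child can neither fill an unread
  pipe buffer nor block waiting for input."
  ^Process [command ^String out-file]
  (let [builder (ProcessBuilder. ^java.util.List (vec command))]
    (doto builder
      ;; Merge stderr into stdout so both end up in the same file.
      (.redirectErrorStream true)
      ;; Append, so output from an earlier launch is not overwritten.
      (.redirectOutput (ProcessBuilder$Redirect/appendTo (File. out-file)))
      ;; Reads from stdin see end-of-file immediately instead of blocking.
      (.redirectInput (ProcessBuilder$Redirect/from (File. "/dev/null"))))
    (.start builder)))
```

For example, `(launch-with-redirects ["sh" "-c" "echo hello; cat; echo done"] "/tmp/example.out")` should return promptly, because `cat` sees end-of-file on standard in and all output goes to the file rather than to an unread pipe.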

      ----------
      nathanmarz: Storm redirects stdout to the logging system. It's worked fine for us in our topologies.

      ----------
      d2r: We see in worker.clj, in mk-worker, where there is a call to redirect-stdio-to-slf4j!. This would not seem to help in cases such as we are seeing when there is a problem launching the worker itself.

      (defn -main [storm-id assignment-id port-str worker-id]
        (let [conf1 (read-storm-config)
              login_conf_file (System/getProperty "java.security.auth.login.config")
              conf (if login_conf_file
                     (merge conf1 {"java.security.auth.login.config" login_conf_file})
                     conf1)]
          (validate-distributed-mode! conf)
          (mk-worker conf nil (java.net.URLDecoder/decode storm-id) assignment-id (Integer/parseInt port-str) worker-id)))

      If anything were to go wrong (CLASSPATH, jvm opts, misconfiguration...) before -main or before mk-worker, then any output would be lost. The symptom we saw was that the topology sat around apparently doing nothing, yet there was no log indicating that the workers were failing to start.

      Is there other redirection to logs that I'm missing?
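
To illustrate why an in-process redirection cannot recover earlier output, here is a small sketch. It assumes nothing about how `redirect-stdio-to-slf4j!` is implemented; `with-captured-stdout` is a hypothetical stand-in that swaps `System.out` for an in-memory sink.

```clojure
(import '(java.io ByteArrayOutputStream PrintStream))

(defn with-captured-stdout
  "Hypothetical stand-in for an in-process redirection: swaps System.out
  for an in-memory sink while `f` runs, restores the original stream
  afterwards, and returns the captured text. Only writes made after the
  swap are captured."
  [f]
  (let [original System/out
        sink     (ByteArrayOutputStream.)]
    (System/setOut (PrintStream. sink true))
    (try
      (f)
      (.flush System/out)
      (str sink)
      (finally
        ;; Always restore the real stdout, even if `f` throws.
        (System/setOut original)))))
```

Anything the JVM prints before such a swap is installed (a bad classpath, a rejected JVM option, a native library failing to load) still goes to the original standard out, which is the stream the launcher is not reading.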

      ----------
      xiaokang: We use bash to launch the worker process and redirect its stdout to a worker-port.out file. It helped us find the zeromq JNI problem that caused the JVM to crash without any log.
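
A bash-based launch of this kind might be assembled as follows. This is only a sketch: `bash-launch-command`, its arguments, and the log directory layout are assumptions for illustration, and the exact command xiaokang uses is not shown in this thread.

```clojure
(defn bash-launch-command
  "Hypothetical helper: wraps `java-command` (a single, already
  shell-quoted string) in a bash invocation that appends stdout and
  stderr to worker-<port>.out under `log-dir` and reads stdin from
  /dev/null. Returns the argument vector for a process launcher."
  [java-command log-dir port]
  ["bash" "-c"
   ;; `exec` replaces bash with the JVM, so no extra shell process lingers.
   (str "exec " java-command
        " >> '" log-dir "/worker-" port ".out' 2>&1 < /dev/null")])
```

The resulting vector can be handed to whatever launches the worker, for instance `(ProcessBuilder. ^java.util.List (bash-launch-command "java -version" "/tmp" 6700))`. Because the redirection happens in the shell, it captures output from the JVM itself, including crashes in native code.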

      ----------
      nathanmarz: @d2r Yea, that's all I was referring to. If we redirect stdout, will the code that redirects stdout to the logging system still take effect? This is important because we can control the size of the logfiles (via the logback config) but not the size of the redirected stdout file.

      ----------
      d2r: My hunch is that it will work as it does now, except that any messages that are getting thrown away before that point would go to a file instead. I can play with it and find out. We wouldn't want to change the redirection, just restore visibility to any output that might occur prior to the redirection. There should be some safety valve to control the size of any new .out in case something goes berserk.
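
One possible shape for such a safety valve, sketched under the assumption that the launcher itself copies the child's output into the .out file. `copy-capped` and the cap value are hypothetical and are not taken from any Storm patch.

```clojure
(import '(java.io InputStream OutputStream))

(defn copy-capped
  "Hypothetical safety valve: copies `in` to `out` until `max-bytes` have
  been written, then keeps reading and discarding so the producer never
  blocks on a full pipe. Returns the number of bytes written."
  [^InputStream in ^OutputStream out max-bytes]
  (let [buf (byte-array 4096)]
    (loop [written 0]
      (let [n (.read in buf)]
        (if (neg? n)
          written                                  ; end of stream
          (let [room (long (min n (- max-bytes written)))]
            (when (pos? room)
              (.write out buf 0 (int room)))
            (recur (+ written room))))))))
```

Run on a background thread against the child's stdout stream, this bounds the size of the file while still draining the pipe. Output past the cap is silently dropped, so a real implementation would probably want to record that truncation happened.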

      @xiaokang I see how that would work. We also need to make sure redirection continues to work as it currently does for the above reason.

      ----------
      xiaokang: @d2r @nathanmarz In our cluster, Storm's stdout redirection still works for any System.out output, while JNI errors go to the worker-port.out file. I think it would be nice to use the same worker-port.log file for the bash stdout redirection, since logback can control the log file size. But it is a little bit ugly to use bash to launch the worker Java process.

            People

              Assignee: Unassigned
              Reporter: James Xu (xumingming)