Apache Storm / STORM-1977

Leader Nimbus crashes with getClusterInfo when it doesn't have one or more replicated topology codes

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.0.0, 1.0.1
    • Fix Version/s: 2.0.0, 1.0.2, 1.1.0
    • Component/s: storm-core
    • Labels:
      None

      Description

      While investigating STORM-1976, I found that there are cases in which a nimbus does not have all of the topology code.
      Before BlobStore was introduced, only a nimbus that had all of the topology code could gain leadership; otherwise it gave up leadership immediately. This logic was removed when BlobStore was introduced.

      I don't know whether this was intended, but it allows a nimbus that doesn't have the replicated topology code to gain leadership, and that nimbus crashes when getClusterInfo is requested.
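
The pre-BlobStore rule described above (keep leadership only when every topology's code is available locally) can be sketched as a simple set comparison. This is a minimal illustration, not Storm's actual implementation: the class `LeadershipCheck` and its methods are hypothetical, and the required key suffixes are limited to the two that appear in the log in this report (`-stormcode.ser` and `-stormconf.ser`).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not part of Storm): decides whether a nimbus that has
// just won the leader lock holds every blob the active topologies need.
final class LeadershipCheck {

    // Builds the blob keys required for the given topology ids.
    // Assumption: only the two suffixes seen in this report's log are checked.
    static Set<String> requiredKeys(Iterable<String> activeTopologyIds) {
        Set<String> keys = new TreeSet<>();
        for (String id : activeTopologyIds) {
            keys.add(id + "-stormcode.ser");
            keys.add(id + "-stormconf.ser");
        }
        return keys;
    }

    // Returns the required keys that are absent from the local blob store.
    static Set<String> missingKeys(Iterable<String> activeTopologyIds,
                                   Set<String> localBlobKeys) {
        Set<String> missing = requiredKeys(activeTopologyIds);
        missing.removeAll(localBlobKeys);
        return missing;
    }

    // A nimbus should keep leadership only if nothing is missing;
    // otherwise the caller would relinquish the leader lock.
    static boolean shouldKeepLeadership(Iterable<String> activeTopologyIds,
                                        Set<String> localBlobKeys) {
        return missingKeys(activeTopologyIds, localBlobKeys).isEmpty();
    }

    public static void main(String[] args) {
        List<String> active = Arrays.asList("production-topology-2-1468745167");
        // A freshly launched nimbus on another node starts with no local blobs.
        Set<String> local = Collections.emptySet();
        System.out.println("keep leadership: " + shouldKeepLeadership(active, local));
        System.out.println("missing: " + missingKeys(active, local));
    }
}
```

In a real cluster the active topology ids would come from cluster state and the local keys from the nimbus's own blob store; both are passed in as plain collections here to keep the sketch self-contained.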

      The easiest way to reproduce:

      1. Comment out cleanup-corrupt-topologies! in nimbus.clj (it is a quick workaround for STORM-1976), and patch the Storm cluster
      2. Launch Nimbus 1 (leader)
      3. Run a topology
      4. Kill Nimbus 1
      5. Launch Nimbus 2 on a different node
      6. Nimbus 2 gains leadership
      7. getClusterInfo is requested from Nimbus 2, and Nimbus 2 crashes

      Log:

      2016-07-17 08:47:48.378 o.a.s.b.FileBlobStoreImpl [INFO] Creating new blob store based in /grid/0/hadoop/storm/blobs
      ...
      2016-07-17 08:47:48.619 o.a.s.zookeeper [INFO] Queued up for leader lock.
      2016-07-17 08:47:48.651 o.a.s.zookeeper [INFO] <node1> gained leadership
      ...
      2016-07-17 08:47:48.833 o.a.s.d.nimbus [INFO] Starting nimbus server for storm version '1.1.1-SNAPSHOT'
      2016-07-17 08:47:49.295 o.a.s.t.ProcessFunction [ERROR] Internal error processing getClusterInfo
      KeyNotFoundException(msg:production-topology-2-1468745167-stormcode.ser)
              at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
              at org.apache.storm.blobstore.LocalFsBlobStore.getBlobReplication(LocalFsBlobStore.java:268)
      ...
              at org.apache.storm.daemon.nimbus$get_blob_replication_count.invoke(nimbus.clj:498)
              at org.apache.storm.daemon.nimbus$get_cluster_info$iter__9520__9524$fn__9525.invoke(nimbus.clj:1427)
      ...
              at org.apache.storm.daemon.nimbus$get_cluster_info.invoke(nimbus.clj:1401)
              at org.apache.storm.daemon.nimbus$mk_reified_nimbus$reify__9612.getClusterInfo(nimbus.clj:1838)
              at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3724)
              at org.apache.storm.generated.Nimbus$Processor$getClusterInfo.getResult(Nimbus.java:3708)
              at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:39)
      ...
      2016-07-17 08:47:49.397 o.a.s.b.BlobStoreUtils [ERROR] Could not download blob with keyproduction-topology-2-1468745167-stormconf.ser
      2016-07-17 08:47:49.400 o.a.s.b.BlobStoreUtils [ERROR] Could not update the blob with keyproduction-topology-2-1468745167-stormconf.ser
      2016-07-17 08:47:49.402 o.a.s.d.nimbus [ERROR] Error when processing event
      KeyNotFoundException(msg:production-topology-2-1468745167-stormconf.ser)
              at org.apache.storm.blobstore.LocalFsBlobStore.getStoredBlobMeta(LocalFsBlobStore.java:149)
              at org.apache.storm.blobstore.LocalFsBlobStore.getBlob(LocalFsBlobStore.java:239)
              at org.apache.storm.blobstore.BlobStore.readBlobTo(BlobStore.java:271)
              at org.apache.storm.blobstore.BlobStore.readBlob(BlobStore.java:300)
      ...
              at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
              at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
              at org.apache.storm.daemon.nimbus$read_storm_conf_as_nimbus.invoke(nimbus.clj:548)
              at org.apache.storm.daemon.nimbus$read_topology_details.invoke(nimbus.clj:555)
              at org.apache.storm.daemon.nimbus$mk_assignments$iter__9205__9209$fn__9210.invoke(nimbus.clj:912)
      ...
              at org.apache.storm.daemon.nimbus$mk_assignments.doInvoke(nimbus.clj:911)
              at clojure.lang.RestFn.invoke(RestFn.java:410)
              at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781$fn__9782.invoke(nimbus.clj:2216)
              at org.apache.storm.daemon.nimbus$fn__9769$exec_fn__1363__auto____9770$fn__9781.invoke(nimbus.clj:2215)
              at org.apache.storm.timer$schedule_recurring$this__1732.invoke(timer.clj:105)
              at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:50)
              at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
      ...
      2016-07-17 08:47:49.408 o.a.s.util [ERROR] Halting process: ("Error when processing an event")
      java.lang.RuntimeException: ("Error when processing an event")
              at org.apache.storm.util$exit_process_BANG_.doInvoke(util.clj:341)
              at clojure.lang.RestFn.invoke(RestFn.java:423)
              at org.apache.storm.daemon.nimbus$nimbus_data$fn__8727.invoke(nimbus.clj:205)
              at org.apache.storm.timer$mk_timer$fn__1715$fn__1716.invoke(timer.clj:71)
              at org.apache.storm.timer$mk_timer$fn__1715.invoke(timer.clj:42)
              at clojure.lang.AFn.run(AFn.java:22)
              at java.lang.Thread.run(Thread.java:745)
      2016-07-17 08:47:49.410 o.a.s.d.nimbus [INFO] Shutting down master
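
The first stack trace in the log shows a `KeyNotFoundException` from the replication-count lookup propagating out of the getClusterInfo handler. As an illustration of that failure mode only, the sketch below tolerates a missing key by reporting a replication count of 0. It is not the fix applied for this issue: `ReplicationLookup` and `ClusterInfoSketch` are hypothetical names, and the `KeyNotFoundException` class here is a local stand-in for Storm's exception so the example compiles on its own.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local stand-in for Storm's KeyNotFoundException (assumption, for self-containment).
class KeyNotFoundException extends Exception {
    KeyNotFoundException(String key) {
        super(key);
    }
}

// Hypothetical minimal view of a blob store: only the lookup used below.
interface ReplicationLookup {
    int getBlobReplication(String key) throws KeyNotFoundException;
}

final class ClusterInfoSketch {

    // Returns the code-blob replication count per topology id.
    // A missing blob is reported as 0 instead of failing the whole request.
    static Map<String, Integer> codeReplication(List<String> topologyIds,
                                                ReplicationLookup store) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String id : topologyIds) {
            String key = id + "-stormcode.ser";
            try {
                result.put(id, store.getBlobReplication(key));
            } catch (KeyNotFoundException e) {
                // The local blob store has no entry for this key.
                result.put(id, 0);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulates a leader whose blob store holds none of the topology blobs.
        ReplicationLookup emptyStore = key -> {
            throw new KeyNotFoundException(key);
        };
        System.out.println(codeReplication(
                Arrays.asList("production-topology-2-1468745167"), emptyStore));
    }
}
```

Tolerating the missing key in one request handler would not cover the second trace in the log, where the same exception is raised while reading the topology conf during assignment and the process halts.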
      

              People

              • Assignee: Jungtaek Lim (kabhwan)
              • Reporter: Jungtaek Lim (kabhwan)