Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-2879

Supervisor collapse continuously when there is a expired assignment for overdue storm

    XMLWordPrintableJSON

Details

    Description

      For now, when a topology is reassigned or killed for a cluster, supervisor will delete 4 files for an overdue storm:

      • storm-code
      • storm-ser
      • storm-jar
      • LocalAssignment

      Slot.java
      static DynamicState cleanupCurrentContainer(DynamicState dynamicState, StaticState staticState, MachineState nextState) throws Exception {
      assert(dynamicState.container != null);
      assert(dynamicState.currentAssignment != null);
      assert(dynamicState.container.areAllProcessesDead());

      dynamicState.container.cleanUp();
      staticState.localizer.releaseSlotFor(dynamicState.currentAssignment, staticState.port);
      DynamicState ret = dynamicState.withCurrentAssignment(null, null);
      if (nextState != null)

      { ret = ret.withState(nextState); }

      return ret;
      }

      But we do not make a transaction to do this, if an exception occurred during deleting storm-code/ser/jar, an overdue local assignment will be left on disk.

      Then when supervisor restart from the exception above, the slots will be initial and container will be recovered from LocalAssignments, the blob store will fetch the files from Nimbus/Master, but will get a KeyNotFoundException, and supervisor collapses again.

      This will happens continuously and supervisor will never recover until we clean up all the local assignments manually.

      This is the stack:
      2017-12-27 14:15:04.434 o.a.s.l.AsyncLocalizer [INFO] Cleaning up unused topologies in /opt/meituan/storm/data/supervisor/stormdist
      2017-12-27 14:15:04.434 o.a.s.d.s.AdvancedFSOps [INFO] Deleting path /opt/meituan/storm/data/supervisor/stormdist/app_dpsr_realtime_shop_vane_allcates-14-1513685785
      2017-12-27 14:15:04.445 o.a.s.d.s.Slot [INFO] STATE EMPTY msInState: 109 -> WAITING_FOR_BASIC_LOCALIZATION msInState: 1
      2017-12-27 14:15:04.471 o.a.s.d.s.Supervisor [INFO] Starting supervisor with id 255d3fed-f3ee-4c7e-8a08-b693c9a6a072 at host gq-data-rt48.gq.sankuai.com.
      2017-12-27 14:15:04.502 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormjar.jar from blob store.
      org.apache.storm.generated.KeyNotFoundException: null
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:123) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
      at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
      2017-12-27 14:15:04.611 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormjar.jar from blob store.
      org.apache.storm.generated.KeyNotFoundException: null
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:123) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
      at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
      2017-12-27 14:15:04.718 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormcode.ser from blob store.
      org.apache.storm.generated.KeyNotFoundException: null
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:124) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
      at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
      2017-12-27 14:15:04.825 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormcode.ser from blob store.
      org.apache.storm.generated.KeyNotFoundException: null
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:124) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
      at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
      2017-12-27 14:15:04.932 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormconf.ser from blob store.
      org.apache.storm.generated.KeyNotFoundException: null
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:125) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
      at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
      2017-12-27 14:15:05.039 o.a.s.u.Utils [ERROR] An exception happened while downloading /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormconf.ser from blob store.
      org.apache.storm.generated.KeyNotFoundException: null
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26656) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result$beginBlobDownload_resultStandardScheme.read(Nimbus.java:26624) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$beginBlobDownload_result.read(Nimbus.java:26555) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.recv_beginBlobDownload(Nimbus.java:864) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.generated.Nimbus$Client.beginBlobDownload(Nimbus.java:851) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.blobstore.NimbusBlobStore.getBlob(NimbusBlobStore.java:357) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorAttempt(Utils.java:598) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisorImpl(Utils.java:582) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.Utils.downloadResourcesAsSupervisor(Utils.java:574) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.downloadBaseBlobs(AsyncLocalizer.java:125) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:148) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBaseBlobsDistributed.call(AsyncLocalizer.java:101) ~[storm-core-1.1.2-mt001.jar:?]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
      at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]
      2017-12-27 14:15:05.140 o.a.s.u.Utils [INFO] Could not extract resources from /opt/meituan/storm/data/supervisor/tmp/ca4f8174-59be-40a4-b431-dbc8b697f063/stormjar.jar
      2017-12-27 14:15:05.142 o.a.s.d.s.Slot [INFO] STATE WAITING_FOR_BASIC_LOCALIZATION msInState: 697 -> WAITING_FOR_BLOB_LOCALIZATION msInState: 0
      2017-12-27 14:15:05.142 o.a.s.l.AsyncLocalizer [WARN] Caught Exception While Downloading (rethrowing)...
      java.io.FileNotFoundException: File '/opt/meituan/storm/data/supervisor/stormdist/app_dpsr_realtime_shop_vane_allcates-14-1513685785/stormconf.ser' does not exist
      at org.apache.storm.shade.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:292) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.shade.org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1815) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.ConfigUtils.readSupervisorStormConfGivenPath(ConfigUtils.java:264) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.ConfigUtils.readSupervisorStormConfImpl(ConfigUtils.java:376) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.utils.ConfigUtils.readSupervisorStormConf(ConfigUtils.java:370) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBlobs.call(AsyncLocalizer.java:226) ~[storm-core-1.1.2-mt001.jar:?]
      at org.apache.storm.localizer.AsyncLocalizer$DownloadBlobs.call(AsyncLocalizer.java:213) ~[storm-core-1.1.2-mt001.jar:?]
      at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_76]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_76]
      at java.lang.Thread.run(Thread.java:745) [?:1.7.0_76]

      Attachments

        Issue Links

          Activity

            People

              danny0405 Danny Chen
              danny0405 Danny Chen
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1.5h
                  1.5h