  1. Apache Storm
  2. STORM-3122

FNFE due to race condition between "async localizer" and "update blob" timer thread

Details

    Description

      There is a race condition between the "async localizer" and the "update blob" timer thread.

      When a worker shuts down, the reference count for its blobs drops to 0 and the supervisor removes the actual blob files. Meanwhile, the "update blob" timer thread tries to keep blobs updated for downloaded topologies. While updating a topology, it reads some of the blob files on the assumption that they were downloaded earlier and are still on disk; that assumption is broken by the async localizer.

      Arun Mahadevan suggested an approach to fix this: "updateBlobsForTopology" can just catch the FileNotFoundException and skip updating the blobs when it can't find the stormconf. This looks like the simplest fix, so I'll provide a patch based on the suggestion.
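
The suggested fix can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the class name, the `readStormConf` helper, the boolean return value, and the temp-file layout are all assumptions made for the example; only the idea of catching FileNotFoundException inside "updateBlobsForTopology" and skipping the update comes from the description above.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BlobUpdateSketch {

    /** Hypothetical stand-in for reading a topology's stormconf from local disk. */
    static byte[] readStormConf(Path stormConf) throws IOException {
        // FileInputStream throws FileNotFoundException when the file is absent.
        try (InputStream in = new FileInputStream(stormConf.toFile())) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /**
     * Sketch of the suggested behavior: returns true if the update ran,
     * false if it was skipped because the stormconf file was missing.
     */
    static boolean updateBlobsForTopology(Path stormConf) throws IOException {
        byte[] conf;
        try {
            conf = readStormConf(stormConf);
        } catch (FileNotFoundException e) {
            // The file was removed concurrently (for example during worker
            // shutdown); skip this round instead of failing the timer thread.
            return false;
        }
        // Assumption: real code would enumerate the topology's blobs from
        // conf here and refresh them. The sketch only reports success.
        return true;
    }

    /** Runs the update once with the file present and once after deleting it. */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("storm-3122-sketch");
            Path conf = dir.resolve("stormconf.ser");
            Files.write(conf, new byte[] {1, 2, 3});
            boolean whenPresent = updateBlobsForTopology(conf);
            Files.delete(conf); // simulate the cleanup winning the race
            boolean whenMissing = updateBlobsForTopology(conf);
            Files.delete(dir);
            return whenPresent + "," + whenMissing;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only the missing-file case is swallowed here; any other IOException still propagates, so unrelated I/O failures are not hidden.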

      Btw, this doesn't apply to the master branch: there, all blobs are synced separately (no need to read stormconf to enumerate topology-related blobs), and the update logic is already fault-tolerant (it skips to the next sync when it can't pull a blob).
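
For contrast, the "skip to next sync" behavior described for master might look something like the sketch below. It is not Storm's code: the `BlobFetcher` interface, the `syncOnce` method, and the blob key names are hypothetical and exist only to show each blob being synced independently, with failures deferred rather than aborting the whole pass.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerBlobSyncSketch {

    /** Hypothetical downloader interface; not part of Storm's API. */
    interface BlobFetcher {
        void fetch(String blobKey) throws IOException;
    }

    /**
     * Tries to sync every blob independently and returns the keys that
     * could not be pulled, so a later sync pass can retry them.
     */
    static List<String> syncOnce(List<String> blobKeys, BlobFetcher fetcher) {
        List<String> deferred = new ArrayList<>();
        for (String key : blobKeys) {
            try {
                fetcher.fetch(key);
            } catch (IOException e) {
                // One failing blob does not stop the others from updating.
                deferred.add(key);
            }
        }
        return deferred;
    }

    /** Simulates one pass in which a single blob is unavailable. */
    static List<String> demo() {
        List<String> keys = Arrays.asList(
                "topo-1-stormjar.jar", "topo-1-stormconf.ser", "topo-1-stormcode.ser");
        return syncOnce(keys, key -> {
            if (key.endsWith("stormconf.ser")) {
                throw new IOException("blob unavailable: " + key);
            }
        });
    }

    public static void main(String[] args) {
        System.out.println("deferred to next sync: " + demo());
    }
}
```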

          People

            kabhwan Jungtaek Lim
            Votes: 0
            Watchers: 1

              Time Tracking

              Estimated: Not Specified
              Remaining: 0h
              Logged: 1h 20m