Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-3015

Removing broadcast in quick successions causes Akka timeout

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete CommentsDelete
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 1.1.0
    • 1.1.0
    • Spark Core
    • None
    • Standalone EC2 Spark shell

    Description

      This issue is originally reported in SPARK-2916 in the context of MLLib, but we were able to reproduce it using a simple Spark shell command:

      (1 to 10000).foreach { i => sc.parallelize(1 to 1000, 48).sum }
      

      We still do not have a full understanding of the issue, but we have gleaned the following information so far. When the driver runs a GC, it attempts to clean up all the broadcast blocks that go out of scope at once. This causes the driver to send out many blocking RemoveBroadcast messages to the executors, which in turn send out blocking UpdateBlockInfo messages back to the driver. Both of these calls block until they receive the expected responses. We suspect that the high frequency at which we send these blocking messages is the cause of either dropped messages or internal deadlock somewhere.

      Unfortunately, it is highly difficult to reproduce depending on the environment. We have been able to reproduce it on a 6-node cluster in us-west-2, but not in us-west-1, for instance.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            andrewor14 Andrew Or Assign to me
            andrewor14 Andrew Or
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment