Kafka / KAFKA-10262

StateDirectory is not thread-safe

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.6.0
    • Component/s: streams
    • Labels: None

    Description

      As explicitly stated in the StateDirectory javadocs, "This class is not thread-safe."

      Despite this, a single StateDirectory is shared among all the StreamThreads of a client. Some of the more "dangerous" methods are indeed synchronized, but others are not. For example, the innocent-sounding #directoryForTask is not thread-safe and is called in a number of places. We call it during task creation, and we call it during task closure (through StateDirectory#lock). It's not uncommon for one thread to be closing a task while another is creating it after a rebalance.

      In fact, we saw exactly that happen in our test application. This ultimately led to the following exception:

       

      org.apache.kafka.streams.errors.ProcessorStateException: task directory [/mnt/run/streams/state/stream-soak-test/1_0] doesn't exist and couldn't be created
          at org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:112)
          at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:187)
          at org.apache.kafka.streams.processor.internals.StandbyTaskCreator.createTasks(StandbyTaskCreator.java:85)
          at org.apache.kafka.streams.processor.internals.TaskManager.handleAssignment(TaskManager.java:337)
      

       

      The exception arises from this line in StateDirectory#directoryForTask:

      if (hasPersistentStores && !taskDir.exists() && !taskDir.mkdir()) 
      

      Presumably, if the taskDir did not exist when the two threads entered this method, both would see taskDir.exists() return false and both would attempt to create the directory. One of them gets there first; the other's mkdir call then returns false, because File#mkdir returns true only when it actually created the directory, and that thread ultimately throws the above ProcessorStateException.
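
      The check-then-act race described above can be closed by re-checking for the directory under a lock before calling mkdir. The following is only a minimal sketch of that idea, not the actual patch: the class TaskDirSketch, the method getOrCreateTaskDir, the CREATION_LOCK object, and the use of IllegalStateException in place of ProcessorStateException are all hypothetical names chosen for illustration.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskDirSketch {
    // Hypothetical lock object; not a field of the real StateDirectory.
    private static final Object CREATION_LOCK = new Object();

    /** Returns the task directory, creating it if needed; safe to call from several threads. */
    static File getOrCreateTaskDir(File stateDir, String taskId) {
        File taskDir = new File(stateDir, taskId);
        if (!taskDir.exists()) {
            synchronized (CREATION_LOCK) {
                // Re-check under the lock: another thread may have created the
                // directory between our first exists() call and acquiring the lock.
                if (!taskDir.exists() && !taskDir.mkdir()) {
                    // Stand-in for ProcessorStateException, to keep the sketch dependency-free.
                    throw new IllegalStateException("task directory [" + taskDir.getPath()
                        + "] doesn't exist and couldn't be created");
                }
            }
        }
        return taskDir;
    }

    public static void main(String[] args) throws Exception {
        File stateDir = Files.createTempDirectory("state-dir-sketch").toFile();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<File>> results = new ArrayList<>();
        // Several threads request the same task directory at once.
        for (int i = 0; i < 8; i++) {
            results.add(pool.submit(() -> getOrCreateTaskDir(stateDir, "1_0")));
        }
        for (Future<File> result : results) {
            result.get(); // rethrows (wrapped) if any thread failed to obtain the directory
        }
        pool.shutdown();
        System.out.println(new File(stateDir, "1_0").isDirectory());
    }
}
```

      The outer exists() check keeps the common path lock-free once the directory is present. The inner check ensures that a thread which lost the race finds the directory already there and skips mkdir rather than treating its false return value as a failure.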

      I've only confirmed that this affects 2.6 so far, but the unsafe methods are present in earlier versions. It's possible we made the problem worse somehow during "The Refactor" so that it's easier to hit this race condition.

            People

              Assignee: mjsax Matthias J. Sax
              Reporter: ableegoldman A. Sophie Blee-Goldman