Kafka / KAFKA-10262

StateDirectory is not thread-safe

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.6.0
    • Component/s: streams
    • Labels:
      None

      Description

      As explicitly stated in the StateDirectory javadocs, "This class is not thread-safe."

      Despite this, a single StateDirectory is shared among all the StreamThreads of a client. Some of the more "dangerous" methods are indeed synchronized, but others are not. For example, the innocent-sounding #directoryForTask is not thread-safe and is called in a number of places. We call it during task creation, and we call it during task closure (through StateDirectory#lock). It's not uncommon for one thread to be closing a task while another is creating it after a rebalance.

      In fact, we saw exactly that happen in our test application. This ultimately led to the following exception:

       

      org.apache.kafka.streams.errors.ProcessorStateException: task directory [/mnt/run/streams/state/stream-soak-test/1_0] doesn't exist and couldn't be created
          at org.apache.kafka.streams.processor.internals.StateDirectory.directoryForTask(StateDirectory.java:112)
          at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:187)
          at org.apache.kafka.streams.processor.internals.StandbyTaskCreator.createTasks(StandbyTaskCreator.java:85)
          at org.apache.kafka.streams.processor.internals.TaskManager.handleAssignment(TaskManager.java:337)

       

      The exception arises from this line in StateDirectory#directoryForTask:

      if (hasPersistentStores && !taskDir.exists() && !taskDir.mkdir()) 
      

      Presumably, taskDir did not exist when the two threads entered this method, so both passed the exists() check and attempted to create the directory. One of them got there first. The other's mkdir then returned false, because File#mkdir reports success only when it actually creates the directory, and that thread ultimately threw the above ProcessorStateException.
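
      To illustrate the check-then-act race, here is a minimal sketch of one way to make directory creation tolerant of a concurrent creator: after a failed mkdir, re-check whether the directory now exists before treating it as an error. The names TaskDirRace, ensureTaskDir and createConcurrently are hypothetical and are not part of Kafka, and this is not necessarily the fix that was applied for this ticket.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TaskDirRace {

    // Hypothetical helper (not Kafka code): creates the directory if needed and
    // treats "another thread created it first" as success by re-checking
    // isDirectory() after mkdir() returns false.
    static File ensureTaskDir(final File taskDir) {
        if (!taskDir.exists() && !taskDir.mkdir() && !taskDir.isDirectory()) {
            throw new IllegalStateException(
                "task directory [" + taskDir.getPath() + "] doesn't exist and couldn't be created");
        }
        return taskDir;
    }

    // Releases `threads` callers at the same moment against one path and
    // returns how many of them completed without an exception.
    static int createConcurrently(final File taskDir, final int threads) throws Exception {
        final ExecutorService pool = Executors.newFixedThreadPool(threads);
        final CountDownLatch start = new CountDownLatch(1);
        final List<Future<File>> results = new ArrayList<>();
        try {
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(() -> {
                    start.await(); // line all callers up to widen the race window
                    return ensureTaskDir(taskDir);
                }));
            }
            start.countDown();
            int succeeded = 0;
            for (final Future<File> result : results) {
                result.get(); // rethrows any failure wrapped in ExecutionException
                succeeded++;
            }
            return succeeded;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(final String[] args) throws Exception {
        final File parent = Files.createTempDirectory("state-dir-sketch").toFile();
        final File taskDir = new File(parent, "1_0");
        System.out.println(createConcurrently(taskDir, 8) + " callers succeeded");
    }
}
```

      With the extra isDirectory() check, a thread that loses the race simply observes the directory created by the winner. Synchronizing the method or using Files.createDirectories, which does not fail when the directory already exists, would be alternative approaches.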

      I've only confirmed that this affects 2.6 so far, but the unsafe methods are present in earlier versions. It's possible we made the problem worse somehow during "The Refactor" so that it's easier to hit this race condition.

              People

              • Assignee: Matthias J. Sax (mjsax)
              • Reporter: A. Sophie Blee-Goldman (ableegoldman)