SPARK-48105: Fix the data corruption issue when state store unload and snapshotting happen concurrently for the HDFS state store



    Description

      There are two race conditions between state store snapshotting and state store unloading that could result in query failure and potential data corruption.

       

      Case 1:

      1. The maintenance thread pool encounters an issue and calls stopMaintenanceTask, which in turn calls threadPool.stop. However, it does not wait for the stop operation to complete before moving on to unload and clear the state stores.
      2. The provider unload closes the state store, which clears the values in loadedMaps for the HDFS-backed state store.
      3. The not-yet-stopped maintenance thread may still be running and attempting the snapshot, but the data in the underlying `HDFSBackedStateStoreMap` has already been removed. If this snapshot completes successfully, we write corrupted data and the following batches consume that corrupted data (see the sketch after this list).
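      The sketch below is a minimal, self-contained Scala reproduction of this pattern, not the actual Spark code: `storeMap`, `loadedMaps`, and the single-thread pool are simplified stand-ins. It shows how requesting shutdown without awaiting termination lets an in-flight snapshot task race with the clear done during unload.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors}

object Case1RaceSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-ins for one store's in-memory data and for loadedMaps.
    val storeMap   = new ConcurrentHashMap[String, String]()
    val loadedMaps = new ConcurrentHashMap[Long, ConcurrentHashMap[String, String]]()
    (1 to 1000).foreach(i => storeMap.put(s"key-$i", s"value-$i"))
    loadedMaps.put(1L, storeMap)

    val maintenancePool = Executors.newSingleThreadExecutor()
    // Snapshot task: writes out whatever it currently sees in the map.
    maintenancePool.submit(new Runnable {
      override def run(): Unit = {
        Thread.sleep(50) // give the unload below a chance to win the race
        println(s"snapshot written with ${storeMap.size()} entries") // may print 0
      }
    })

    // Bug pattern: shutdown is requested but termination is never awaited,
    // so the snapshot task above may still be running when we clear the data.
    maintenancePool.shutdown()
    // Missing: maintenancePool.awaitTermination(...)

    // Provider unload: clears the active data while the snapshot may be in flight.
    storeMap.clear()
    loadedMaps.remove(1L)
  }
}
```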

      Case 2:

      1. On executor_1, the maintenance thread is about to snapshot state_store_1. It retrieves the `HDFSBackedStateStoreMap` object from loadedMaps and then releases the lock on loadedMaps.
      2. state_store_1 is loaded on another executor, e.g. executor_2.
      3. Another state store, state_store_2, is loaded on executor_1 and reportActiveStoreInstance is sent to the driver.
      4. executor_1 unloads the state stores that are no longer active, which clears the data entries in the `HDFSBackedStateStoreMap`.
      5. The snapshotting thread terminates early and uploads the incomplete snapshot to the cloud, because the iterator has no next element after the clear (see the sketch after this list).
      6. Future batches consume the corrupted data.
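      The following minimal Scala sketch (again with hypothetical stand-ins rather than the real Spark classes) illustrates case 2: the snapshotting thread keeps a reference to the map after releasing the lock on loadedMaps, the unload path clears that map concurrently, and the weakly consistent iterator simply runs out of elements, producing an incomplete snapshot.

```scala
import java.util.concurrent.ConcurrentHashMap

object Case2RaceSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-ins for loadedMaps and state_store_1's in-memory data.
    val loadedMaps = new ConcurrentHashMap[String, ConcurrentHashMap[String, String]]()
    val storeMap   = new ConcurrentHashMap[String, String]()
    (1 to 1000).foreach(i => storeMap.put(s"key-$i", s"value-$i"))
    loadedMaps.put("state_store_1", storeMap)

    // Maintenance thread: grabs a reference, then iterates without holding any lock.
    val snapshotter = new Thread(() => {
      val mapRef = loadedMaps.get("state_store_1") // reference escapes the lock scope
      var written = 0
      val it = mapRef.entrySet().iterator()
      while (it.hasNext) {        // the iterator goes empty once clear() runs
        it.next()
        written += 1
        Thread.sleep(1)           // widen the race window for the demo
      }
      println(s"uploaded snapshot with $written of 1000 entries") // likely < 1000
    })

    // Unload path on the same executor: the store is no longer active here,
    // so its data entries are cleared while the snapshot is still iterating.
    val unloader = new Thread(() => {
      Thread.sleep(20)
      storeMap.clear()
      loadedMaps.remove("state_store_1")
    })

    snapshotter.start(); unloader.start()
    snapshotter.join(); unloader.join()
  }
}
```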

       

      Proposed fix:

      • When we close the HDFS state store, we should only remove the entry from `loadedMaps` rather than actively cleaning up the data; the JVM GC will reclaim those objects for us.
      • We should wait for the maintenance thread to stop before unloading the providers (a sketch of both changes follows).
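      Below is a rough Scala sketch of the two proposed changes, with hypothetical names that do not match the actual patch: closing a store only drops its `loadedMaps` entry so that any in-flight snapshot keeps its own consistent reference and GC reclaims the map later, and provider unloading runs only after the maintenance pool has fully terminated.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

// Hypothetical sketch of the two proposed changes; names do not match the real code.
object ProposedFixSketch {
  private val loadedMaps = new ConcurrentHashMap[Long, ConcurrentHashMap[String, String]]()
  private val maintenancePool = Executors.newSingleThreadExecutor()

  // (1) Closing a store only removes the loadedMaps entry; the map itself is not
  //     cleared, so a snapshot that already holds a reference still sees consistent
  //     data, and the JVM GC reclaims the map once nothing references it.
  def closeStore(version: Long): Unit = {
    loadedMaps.remove(version) // no map.clear() here
  }

  // (2) Unloading providers first waits for the maintenance tasks to finish,
  //     so no snapshot can race with the unload.
  def stopMaintenanceAndUnload(unloadProviders: () => Unit): Unit = {
    maintenancePool.shutdown()
    if (!maintenancePool.awaitTermination(30, TimeUnit.SECONDS)) {
      maintenancePool.shutdownNow()
    }
    unloadProviders() // safe: the maintenance thread has stopped
  }

  def main(args: Array[String]): Unit = {
    closeStore(1L)
    stopMaintenanceAndUnload(() => println("providers unloaded"))
  }
}
```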

       

      Thanks anishshri-db for helping debug this issue!
