SPARK-5147 (sub-task of SPARK-5238: Improve the robustness of Spark Streaming WAL mechanism)

write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.1, 1.3.0
    • Component/s: DStreams
    • Labels:
      None

      Description

      Hi all,

      We are running a Spark Streaming application with ReliableKafkaReceiver. We have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write ahead logs (WALs) for received data are created under the receivedData/streamId folder in the checkpoint directory.

      However, old WALs are never purged as they age, although receivedBlockMetadata and checkpoint files are purged correctly. I went through the code: the WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is responsible for cleaning up old blocks. It has a cleanupOldBlocks method, which is never called by any class. The ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance, but it only calls the storeBlock method to create WALs and never calls cleanupOldBlocks to purge old ones.
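
      To illustrate the effect, here is a minimal toy model of the situation, not Spark's actual code. The class ToyWalBlockHandler, the function run_receiver, and the retain_ms parameter are all hypothetical names invented for this sketch; only the idea of a store method paired with a cleanup method that takes a time threshold mirrors the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyWalBlockHandler:
    """Toy stand-in for a WAL-backed block handler (hypothetical, not Spark's class)."""

    # Maps a log segment's creation time (ms) to the block ids written to it.
    segments: Dict[int, List[str]] = field(default_factory=dict)

    def store_block(self, block_id: str, now_ms: int) -> None:
        # Every stored block adds data to the log segment for its timestamp.
        self.segments.setdefault(now_ms, []).append(block_id)

    def cleanup_old_blocks(self, thresh_ms: int) -> int:
        # Drop every segment created strictly before thresh_ms; return how many were dropped.
        old = [t for t in self.segments if t < thresh_ms]
        for t in old:
            del self.segments[t]
        return len(old)


def run_receiver(handler: ToyWalBlockHandler, batches: int, batch_ms: int,
                 retain_ms: Optional[int] = None) -> int:
    """Store one block per batch; purge only when retain_ms is given.

    Returns the number of log segments still present at the end.
    """
    for i in range(batches):
        now_ms = i * batch_ms
        handler.store_block(f"block-{i}", now_ms)
        if retain_ms is not None:
            # The call that, per this report, no class ever makes.
            handler.cleanup_old_blocks(now_ms - retain_ms)
    return len(handler.segments)


# Without cleanup, every segment ever written is still there.
leaky = run_receiver(ToyWalBlockHandler(), batches=100, batch_ms=1000)
# With periodic cleanup, only segments inside the retention window remain.
bounded = run_receiver(ToyWalBlockHandler(), batches=100, batch_ms=1000, retain_ms=5000)
print(leaky, bounded)  # prints "100 6"
```

      In this sketch the handler that is only ever asked to store keeps all 100 segments, while the one that is also asked to clean up keeps just the 6 segments within the retention window. That is the behavior we would expect once something invokes cleanupOldBlocks periodically.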

      The size of the WAL folder on HDFS grows constantly, which prevents us from running the ReliableKafkaReceiver 24x7. Can somebody please take a look?

      Thanks,
      Max

            People

            • Assignee:
              Unassigned
              Reporter:
              superxma Max Xu
            • Votes:
              0
              Watchers:
              6
