Project: Spark
Parent issue: SPARK-18150 Spark 2.* failes to create partitions for avro files
Issue: SPARK-18156

CLONE - StreamExecution should discard unneeded metadata


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      StreamExecution maintains a write-ahead log of batch metadata so that previously in-flight batches can be repeated if the driver is restarted. StreamExecution does not garbage-collect or compact this log in any way.

      Since the log is implemented with HDFSMetadataLog, these files consume memory on the HDFS NameNode. Specifically, each log file consumes about 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the block of file contents; see https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html). An application with a 100 ms batch interval writes 864,000 batches per day, so at one log file per batch it will increase the NameNode's heap usage by about 250 MB per day.
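
      The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one log file per batch and the 300 bytes per file quoted above, and the function name is made up for illustration.

```python
# Rough NameNode heap growth caused by an uncompacted batch metadata log.
# Assumption: one log file per batch, ~300 bytes of NameNode heap per file
# (150 bytes for the inode + 150 bytes for the single block).
BYTES_PER_LOG_FILE = 150 + 150
MS_PER_DAY = 24 * 60 * 60 * 1000


def namenode_bytes_per_day(batch_interval_ms: int,
                           bytes_per_file: int = BYTES_PER_LOG_FILE) -> int:
    """Return the estimated NameNode heap consumed per day, in bytes."""
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    batches_per_day = MS_PER_DAY // batch_interval_ms
    return batches_per_day * bytes_per_file


daily_bytes = namenode_bytes_per_day(100)
print(f"{daily_bytes / 1e6:.1f} MB per day")  # prints "259.2 MB per day"
```

      With a 100 ms interval this gives 864,000 files and roughly 259 MB (about 247 MiB) per day, which matches the "about 250 MB" figure.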

      There is also the matter of recovery. StreamExecution reads its entire log when restarting. This read operation will be very expensive if the log contains millions of entries spread over millions of files.
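
      To illustrate the kind of cleanup being asked for, here is a minimal sketch of purging old batch files. It is not Spark's implementation: it works on a local directory rather than HDFS, it assumes each batch is stored in a file whose name is the batch id, and `purge_batch_log` and `min_batches_to_retain` are hypothetical names.

```python
from pathlib import Path
from typing import List


def purge_batch_log(log_dir: Path, min_batches_to_retain: int) -> List[int]:
    """Delete all but the newest batch files; return the removed batch ids.

    Assumed layout (hypothetical): one file per batch, named by its numeric
    batch id, directly under log_dir. Other files are left untouched.
    """
    if min_batches_to_retain < 1:
        raise ValueError("min_batches_to_retain must be at least 1")

    # Map batch id -> file path for every file that looks like a batch file.
    batches = {
        int(p.name): p
        for p in log_dir.iterdir()
        if p.is_file() and p.name.isdigit()
    }

    # Everything older than the newest N batches is no longer needed.
    removable = sorted(batches)[:-min_batches_to_retain]
    for batch_id in removable:
        batches[batch_id].unlink()
    return removable
```

      Running something like this after each committed batch would keep both the file count on the NameNode and the restart-time read bounded by the retention window instead of by the lifetime of the query.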

          People

            freiss Frederick Reiss
            sunilsbjoshi Sunil Kumar