Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-8360 Structured Streaming (aka Streaming DataFrames)
  3. SPARK-17372

Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.0.0
    • 2.0.1, 2.1.0
    • Structured Streaming
    • None

    Description

      When we create a filestream on a directory that has partitioned subdirs (i.e. dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which internally is a Stream[String]. This is because of this line, where a LinkedHashSet.values.toSeq returns Stream. Then when the FileStreamSource filters this Stream[String] to remove the seen files, it creates a new Stream[String], which has a filter function that has a $outer reference to the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] causes NotSerializableException. This will happened even if there is just one file in the dir.

      Its important to note that this behavior is different in Scala 2.11. There is no $outer reference to FileStreamSource, so it does not throw NotSerializableException. However, with a large sequence of files (tested with 10000 files), it throws StackOverflowError. This is because how Stream class is implemented. Its basically like a linked list, and attempting to serialize a long Stream requires recursively going through linked list, thus resulting in StackOverflowError.

      In short, across both Scala 2.10 and 2.11, serialization fails when both the following conditions are true.

      • file stream defined on a partitioned directory
      • directory has 10k+ files

      The right solution is to convert the seq to an array before writing to the log.

      Attachments

        Activity

          People

            tdas Tathagata Das
            tdas Tathagata Das
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: