Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-8360 Structured Streaming (aka Streaming DataFrames)
  3. SPARK-17372

Running a file stream on a directory with partitioned subdirs throw NotSerializableException/StackOverflowError

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: Structured Streaming
    • Labels:
      None
    • Target Version/s:

      Description

      When we create a filestream on a directory that has partitioned subdirs (i.e. dir/x=y/), then ListingFileCatalog.allFiles returns the files in the dir as Seq[String] which internally is a Stream[String]. This is because of this line, where a LinkedHashSet.values.toSeq returns Stream. Then when the FileStreamSource filters this Stream[String] to remove the seen files, it creates a new Stream[String], which has a filter function that has a $outer reference to the FileStreamSource (in Scala 2.10). Trying to serialize this Stream[String] causes NotSerializableException. This will happened even if there is just one file in the dir.

      Its important to note that this behavior is different in Scala 2.11. There is no $outer reference to FileStreamSource, so it does not throw NotSerializableException. However, with a large sequence of files (tested with 10000 files), it throws StackOverflowError. This is because how Stream class is implemented. Its basically like a linked list, and attempting to serialize a long Stream requires recursively going through linked list, thus resulting in StackOverflowError.

      In short, across both Scala 2.10 and 2.11, serialization fails when both the following conditions are true.

      • file stream defined on a partitioned directory
      • directory has 10k+ files

      The right solution is to convert the seq to an array before writing to the log.

        Attachments

          Activity

            People

            • Assignee:
              tdas Tathagata Das
              Reporter:
              tdas Tathagata Das
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: