  1. Spark
  2. SPARK-7766

KryoSerializerInstance reuse is not safe when auto-reset is disabled


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: Spark Core
    • Labels: None

    Description

      SPARK-3386 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using KryoSerializer: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling reset() between write calls. As a result, objects in one file may contain references to objects that were written to previous files, which can cause errors during deserialization.

      The fix should be simple: add reset calls at the end of serialize and serializeStream.
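
      To make the failure mode concrete, here is a minimal sketch using a toy reference-tracking writer. It is not Kryo's API: ToyWriter, readFile, writeTwoFiles and the "OBJ:"/"REF:" record format are all hypothetical names invented for illustration. The sketch only assumes that a writer with reference tracking emits a back-reference for any object it has already seen, and that each file is later read on its own.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT Kryo's API. It mimics a serializer that tracks
// references and is re-used to write several output "files".
class RefTrackingDemo {

    /** Hypothetical writer that emits a back-reference for objects it has already seen. */
    static class ToyWriter {
        private final Map<Object, Integer> seen = new IdentityHashMap<>();

        void write(Object obj, List<String> out) {
            Integer id = seen.get(obj);
            if (id != null) {
                out.add("REF:" + id);        // points at an earlier write
            } else {
                seen.put(obj, seen.size());  // remember the object under the next id
                out.add("OBJ:" + obj);
            }
        }

        /** Forgets every previously written object. */
        void reset() {
            seen.clear();
        }
    }

    /** Reads one file in isolation and fails on a reference it cannot resolve. */
    static List<Object> readFile(List<String> file) {
        List<Object> byId = new ArrayList<>();
        List<Object> result = new ArrayList<>();
        for (String record : file) {
            String payload = record.substring(4);
            if (record.startsWith("REF:")) {
                int id = Integer.parseInt(payload);
                if (id >= byId.size()) {
                    throw new IllegalStateException("dangling reference " + id);
                }
                result.add(byId.get(id));
            } else {
                byId.add(payload);
                result.add(payload);
            }
        }
        return result;
    }

    /** Writes the same object to two files with one writer, optionally resetting in between. */
    static List<List<String>> writeTwoFiles(boolean resetBetweenFiles) {
        ToyWriter writer = new ToyWriter();
        String sharedKey = "key";
        List<String> file1 = new ArrayList<>();
        List<String> file2 = new ArrayList<>();
        writer.write(sharedKey, file1);
        if (resetBetweenFiles) {
            writer.reset();
        }
        writer.write(sharedKey, file2);
        List<List<String>> files = new ArrayList<>();
        files.add(file1);
        files.add(file2);
        return files;
    }

    /** Returns true if the second file can be read without the first one. */
    static boolean canReadSecondFile(boolean resetBetweenFiles) {
        try {
            readFile(writeTwoFiles(resetBetweenFiles).get(1));
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("without reset: readable = " + canReadSecondFile(false));
        System.out.println("with reset:    readable = " + canReadSecondFile(true));
    }
}
```

      In this toy model, the second file written without a reset contains only a back-reference into the first file, so reading it on its own fails. Resetting the writer between streams makes each file self-contained, which is the effect the proposed reset calls are meant to have.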

      Thanks to John Carrino for reporting this issue on GitHub: https://github.com/apache/spark/pull/5606#issuecomment-103995103

            People

              Assignee: Josh Rosen (joshrosen)
              Reporter: Josh Rosen (joshrosen)
              Votes: 0
              Watchers: 2
