SPARK-5269: BlockManager.dataDeserialize always creates a new serializer instance


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Component: Spark Core

    Description

      BlockManager.dataDeserialize always creates a new instance of the serializer, which is pretty slow in some cases. I'm using Kryo serialization and have a custom registrator, and its register method is showing up as taking about 15% of the execution time in my profiles. This started happening after I increased the number of keys in a job with a shuffle phase by a factor of 40.

      One solution I can think of is to create a ThreadLocal SerializerInstance for the defaultSerializer, and only create a new instance when a custom serializer is passed in. AFAICT a custom serializer is passed only from DiskStore.getValues, which in turn depends on the serializer passed to ExternalSorter. I don't know how often this is used, but I think this can still be a good solution for the standard use case.
      Oh, and also - ExternalSorter already has a SerializerInstance, so if the getValues method is called from a single thread, maybe we can pass that directly?

      I'd be happy to try a patch but would probably need a confirmation from someone that this approach would indeed work (or an idea for another).
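A minimal sketch of the proposed fix, written here in Java for brevity (Spark Core itself is Scala). It caches one serializer instance per thread for the default serializer and only builds a fresh one for a custom serializer. The names `SerializerInstance`, `defaultInstance`, and `freshInstance` are illustrative and mirror, rather than reproduce, Spark's actual API:

```java
import java.util.function.Supplier;

// Illustrative sketch, not Spark source: cache one SerializerInstance per
// thread so dataDeserialize does not re-run Kryo registration on every call.
final class CachedSerializer {
    // Stand-in for Spark's SerializerInstance (not thread-safe in Spark,
    // hence the per-thread caching rather than a single shared instance).
    interface SerializerInstance { Object deserialize(byte[] bytes); }

    private final Supplier<SerializerInstance> factory;
    private final ThreadLocal<SerializerInstance> cached;

    CachedSerializer(Supplier<SerializerInstance> factory) {
        this.factory = factory;
        // Each thread lazily creates (and then reuses) its own instance.
        this.cached = ThreadLocal.withInitial(factory);
    }

    // Default-serializer path: reuse the thread-local instance.
    SerializerInstance defaultInstance() { return cached.get(); }

    // Custom-serializer path (e.g. from DiskStore.getValues): keep the
    // existing behavior and create a fresh instance.
    SerializerInstance freshInstance() { return factory.get(); }
}
```

Under this scheme the expensive registrator work runs once per thread instead of once per deserialized block, while callers that supply their own serializer are unaffected.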

              People

                Assignee: Unassigned
                Reporter: Ivan Vergiliev
