Beam / BEAM-5775

Make the Spark runner not serialize data unless Spark is spilling to disk

Details

    • Type: Improvement
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Component: runner-spark

    Description

      Currently, for storage level MEMORY_ONLY, Beam does not coder-ify the data (that is, encode it to bytes with its coders). This lets Spark keep the data in memory as plain objects, avoiding the serialization round trip. Unfortunately the logic is fairly coarse: as soon as you switch to MEMORY_AND_DISK, Beam coder-ifies the data even though Spark might have chosen to keep the data in memory, so the serialization overhead is paid regardless.
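
The coarse rule described above can be modeled in a few lines. This is only a sketch, not the runner's actual code: `must_encode_before_persist`, `persist`, and the pickle-based `encode` are hypothetical stand-ins, and treating every level other than MEMORY_ONLY as "encode first" is an assumption (the description only states it for MEMORY_AND_DISK).

```python
import pickle


def encode(value):
    """Hypothetical stand-in for a Beam coder: turn a value into bytes."""
    return pickle.dumps(value)


def must_encode_before_persist(storage_level: str) -> bool:
    """Coarse rule from the description: only MEMORY_ONLY skips encoding.

    Assumption: all other storage levels are encoded eagerly.
    """
    return storage_level != "MEMORY_ONLY"


def persist(records, storage_level: str):
    """Return what would be handed to the cache under the coarse rule."""
    if must_encode_before_persist(storage_level):
        # Every record pays the encoding cost up front, even if the
        # cache would have kept it in memory and never spilled it.
        return [encode(r) for r in records]
    return list(records)
```

Under this model, `persist(data, "MEMORY_AND_DISK")` encodes every record eagerly, which is the overhead the issue wants to avoid.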

      Ideally, Beam would serialize the data lazily, only when Spark chooses to spill to disk. This would be a change in behavior for Beam users, but luckily Spark already has a solution for folks who want data kept serialized in memory: MEMORY_AND_DISK_SER.
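
To make the proposed behavior concrete, here is a toy model of lazy serialization. It is not Spark's block manager or Beam's runner code: `LazySpillCache`, its `memory_capacity` limit, and the `serialize_in_memory` flag are all hypothetical, with the flag standing in for the difference between MEMORY_AND_DISK (objects in memory, bytes on disk) and MEMORY_AND_DISK_SER (bytes everywhere).

```python
import pickle


class LazySpillCache:
    """Toy cache: keeps objects in memory, encodes only what spills."""

    def __init__(self, memory_capacity, serialize_in_memory=False):
        self.memory_capacity = memory_capacity
        self.serialize_in_memory = serialize_in_memory
        self.memory = {}       # key -> object, or bytes if serialize_in_memory
        self.disk = {}         # key -> bytes; stands in for spilled blocks
        self.encode_calls = 0  # how many times the encoding cost was paid

    def _encode(self, value):
        self.encode_calls += 1
        return pickle.dumps(value)

    def put(self, key, value):
        if len(self.memory) < self.memory_capacity:
            # Fits in memory: encode only if the caller asked for the
            # serialized-in-memory behavior.
            self.memory[key] = (
                self._encode(value) if self.serialize_in_memory else value
            )
        else:
            # Memory is full: this entry spills, so it must be encoded.
            self.disk[key] = self._encode(value)

    def get(self, key):
        if key in self.memory:
            stored = self.memory[key]
            return pickle.loads(stored) if self.serialize_in_memory else stored
        return pickle.loads(self.disk[key])


lazy = LazySpillCache(memory_capacity=2)
for k, v in [("a", 1), ("b", 2), ("c", 3)]:
    lazy.put(k, v)
# Only the spilled entry "c" was encoded; "a" and "b" stayed as objects.
```

With `serialize_in_memory=True` the same three `put` calls encode every entry, which corresponds to what users opting into MEMORY_AND_DISK_SER would ask for.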

          People

            Assignee: Unassigned
            Reporter: Mike Kaplinskiy (mikekap)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 12h 20m