Spark / SPARK-30946

FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow type to serialize/deserialize entry



    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None


      HDFSMetadataLog and its descendants normally use JSON serialization to serialize/deserialize entries.

      While JSON is good for backward compatibility (it tolerates field addition and deletion), it carries significant overhead: it stores field names with every entry, encodes all data types as strings (at least when written to file), and works poorly for some kinds of fields, e.g. timestamps.
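To make the overhead concrete, here is a small sketch (not Spark's actual code; the entry fields are illustrative) comparing a JSON-encoded log entry with the same data in a fixed binary layout. The JSON form repeats every field name and writes numbers as decimal strings; the binary form needs only a length prefix and fixed-width longs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SerdeOverhead {
    // Returns {jsonLength, binaryLength} for one example entry.
    static int[] sizes() {
        String path = "/data/part-00000";
        long size = 4096L;
        long modificationTime = 1583000000000L;

        // JSON form: field names are repeated per entry, numbers are
        // serialized as decimal strings.
        String json = String.format(
            "{\"path\":\"%s\",\"size\":%d,\"modificationTime\":%d}",
            path, size, modificationTime);
        int jsonLen = json.getBytes(StandardCharsets.UTF_8).length;

        // Binary form: length-prefixed UTF-8 string plus two fixed-width
        // longs, no field names at all.
        byte[] pathBytes = path.getBytes(StandardCharsets.UTF_8);
        int binaryLen = 4 + pathBytes.length + 8 + 8;

        return new int[] { jsonLen, binaryLen };
    }

    public static void main(String[] args) {
        int[] r = sizes();
        System.out.println("json bytes: " + r[0]);
        System.out.println("binary bytes: " + r[1]);
    }
}
```

For this single entry the binary layout is already half the size of the JSON, and the gap widens with every extra field name and every timestamp rendered as text.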

      This overhead hits CompactibleFileStreamLog hardest, because the "compact" operation must load all entries, run transformation/filtering over them (though I haven't seen any effective operation implemented), and store them all together in a single file. This is one of the major reasons the metadata file grows so huge and the "compact" operation brings unacceptable latency.
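The shape of that "compact" step can be sketched roughly as follows; the names (`compact`, `Entry`, the `deleted` flag) are hypothetical, not Spark's API. The point is that the cost is proportional to every entry ever logged, so per-entry serde overhead dominates the latency.

```java
import java.util.ArrayList;
import java.util.List;

public class CompactSketch {
    // Hypothetical log entry; the real entry classes carry more fields.
    record Entry(String path, boolean deleted) {}

    // Merge all prior batches with the newest one, filter, and return the
    // combined list to be written out as one big file.
    static List<Entry> compact(List<List<Entry>> priorBatches, List<Entry> latest) {
        List<Entry> all = new ArrayList<>();
        for (List<Entry> batch : priorBatches) {
            all.addAll(batch); // every historical entry is deserialized here
        }
        all.addAll(latest);
        // Example of the transformation/filtering pass mentioned above.
        return all.stream().filter(e -> !e.deleted()).toList();
    }
}
```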

      Fortunately, the entry classes for both FileStreamSourceLog (FileEntry) and FileStreamSinkLog (SinkFileStatus) haven't been modified in over 3 years; the latest modification of both classes was in 2016. We can call them "stable", which gives us more options to optimize serde.

      One easy but quite effective approach to optimizing serde is converting each entry to UnsafeRow and storing it the same way we do in HDFSBackedStateStoreProvider, and vice versa on read. That mechanism has been running throughout the 2.x versions, so the approach can be considered safe.
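A minimal sketch of the idea, assuming the schema stays frozen: store each entry in a compact fixed-layout binary form (in the spirit of UnsafeRow, though this is not the actual UnsafeRow format) and read it back positionally. Because there are no field names in the payload, this only works while the entry classes remain "stable" as argued above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryEntrySerde {
    // Illustrative stand-in for the sink log entry; field set is assumed.
    record SinkFileStatus(String path, long size, long modificationTime) {}

    // Fixed layout: [path length][path bytes][size][modificationTime].
    static byte[] serialize(SinkFileStatus s) {
        byte[] p = s.path().getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + p.length + 8 + 8);
        buf.putInt(p.length).put(p).putLong(s.size()).putLong(s.modificationTime());
        return buf.array();
    }

    // Fields are read back purely by position, so any schema change would
    // require explicit versioning.
    static SinkFileStatus deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] p = new byte[buf.getInt()];
        buf.get(p);
        return new SinkFileStatus(new String(p, StandardCharsets.UTF_8),
            buf.getLong(), buf.getLong());
    }
}
```

The trade-off is exactly the one the description names: we give up JSON's tolerance for field addition/deletion in exchange for a much smaller, faster serde path.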





              Assignee: Unassigned
              Reporter: Jungtaek Lim (kabhwan)