Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-30946

FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow type to serialize/deserialize entry

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • 3.1.0
    • None
    • Structured Streaming
    • None

    Description

      HDFSMetadataLog and its descendants are normally using JSON serialization to serialize/deserialize entries.

      While it's good to support backward compatibility (like field addition and field deletion), it tends to take bunch of overhead as it adds field names, and should store all data types to string (at least when it's being written to file), works badly for some kind of fields - e.g. timestamp.

      The major overhead is heavily affecting to CompactibleFileStreamLog, as "compact" operation requires to load all entities and do the transformation/filtering (I haven't seen any effective operation being implemented though), and store them altogether into one file. This is the one of major reason why the metadata file is too huge and it brings unacceptable latency on "compact" operation.

      Fortunately, the entity class for both FileStreamSourceLog (FileEntry) and FileStreamSinkLog (SinkFileStatus) haven't been modified for over 3 years. The latest modification of both classes were year 2016. We can call it "stable" - and then we have more option to optimize serde.

      One easy but pretty effective approach on optimizing serde is converting to UnsafeRow and storing it on the same way we do in HDFSBackedStateStoreProvider, and vice versa. It has being running for 2.x versions, so the approach is considered as safe.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              kabhwan Jungtaek Lim
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: