[SPARK-30946] FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow type to serialize/deserialize entry - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: Structured Streaming
Labels: None

Description

HDFSMetadataLog and its descendants normally use JSON to serialize and deserialize entries.

While JSON is good for backward compatibility (it tolerates field addition and field deletion), it carries a lot of overhead: it writes the field names with every entry, it has to store every data type as a string (at least when the entry is written to a file), and it works badly for some kinds of fields, e.g. timestamps.
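To make the overhead concrete, here is a minimal sketch comparing the JSON encoding of one entry with a simple length-prefixed binary encoding. The entry fields and the binary layout are hypothetical, chosen only for illustration; this is not the UnsafeRow layout and not the code in the linked pull request.

```python
import json
import struct

# Hypothetical entry, loosely modeled on a file source log entry.
entry = {
    "path": "hdfs://nn/data/part-00000.parquet",
    "timestamp": 1582638660000,
    "batchId": 42,
}

def to_json_bytes(e):
    # JSON repeats every field name and writes numbers as decimal text.
    return json.dumps(e, separators=(",", ":")).encode("utf-8")

def to_binary_bytes(e):
    # Assumed layout: 4-byte path length, path bytes,
    # 8-byte timestamp, 4-byte batch id (all big-endian).
    path = e["path"].encode("utf-8")
    return struct.pack(">i", len(path)) + path + struct.pack(">qi", e["timestamp"], e["batchId"])

def from_binary_bytes(buf):
    # Inverse of to_binary_bytes: field names come from the fixed schema.
    (n,) = struct.unpack_from(">i", buf, 0)
    path = buf[4:4 + n].decode("utf-8")
    timestamp, batch_id = struct.unpack_from(">qi", buf, 4 + n)
    return {"path": path, "timestamp": timestamp, "batchId": batch_id}

json_size = len(to_json_bytes(entry))
binary_size = len(to_binary_bytes(entry))
```

In this sketch the binary form is smaller because the schema is fixed in code rather than written into every entry, which is also why it is harder to evolve than JSON.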

This overhead hits CompactibleFileStreamLog especially hard, because the "compact" operation has to load all entries, apply the transformation/filtering (I haven't seen any effective operation being implemented though), and store them all together in one file. This is one of the major reasons why the metadata file grows too large and the "compact" operation incurs unacceptable latency.
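The shape of that work can be sketched as follows. This is a simplified, in-memory illustration and not the CompactibleFileStreamLog implementation; the `compact` function, the `keep` predicate, and the sample entries are hypothetical.

```python
import json

def compact(batch_logs, keep):
    """Merge the entries of all batches into one list, dropping those rejected by `keep`.

    batch_logs maps a batch id to the JSON lines written for that batch.
    """
    compacted = []
    for batch_id in sorted(batch_logs):
        for line in batch_logs[batch_id]:
            entry = json.loads(line)          # every entry is deserialized...
            if keep(entry):
                compacted.append(json.dumps(entry))  # ...and serialized again
    return compacted

# Hypothetical sink-style entries carrying an "action" field.
logs = {
    0: ['{"path":"b","action":"delete"}', '{"path":"c","action":"add"}'],
    1: ['{"path":"a","action":"add"}'],
}
result = compact(logs, lambda e: e["action"] == "add")
```

Since every surviving entry passes through both deserialization and serialization, the per-entry serde cost is multiplied by the total number of entries in the log on each compaction.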

Fortunately, the entry classes for FileStreamSourceLog (FileEntry) and FileStreamSinkLog (SinkFileStatus) have not been modified for over 3 years; the latest modification to either class was in 2016. We can call them "stable", which gives us more options for optimizing serde.

One easy but pretty effective way to optimize serde is to convert each entry to UnsafeRow and store it the same way HDFSBackedStateStoreProvider does, converting back when reading. That storage approach has been in use throughout the 2.x versions, so it can be considered safe.
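As a rough sketch of the storage side, the following writes opaque binary records with a length prefix and an end-of-stream marker, and reads them back. The framing (4-byte big-endian size, -1 as terminator) is an assumption made for this illustration, similar in spirit to a length-prefixed state store file; the function names are hypothetical and the record bytes stand in for serialized rows.

```python
import io
import struct

def write_records(out, records):
    # Each record: 4-byte big-endian length followed by the raw bytes.
    for rec in records:
        out.write(struct.pack(">i", len(rec)))
        out.write(rec)
    # A length of -1 marks the end of the stream (assumed convention).
    out.write(struct.pack(">i", -1))

def read_records(inp):
    records = []
    while True:
        header = inp.read(4)
        if len(header) < 4:
            raise EOFError("stream ended before the end-of-stream marker")
        (size,) = struct.unpack(">i", header)
        if size == -1:
            return records
        body = inp.read(size)
        if len(body) < size:
            raise EOFError("truncated record")
        records.append(body)

buf = io.BytesIO()
write_records(buf, [b"row-1", b"", b"row-3"])
buf.seek(0)
restored = read_records(buf)
```

The explicit terminator lets a reader distinguish a complete file from one that was cut off mid-write, which matters for a log that is read back after failures.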

Issue Links

links to

GitHub Pull Request #27694

People

Assignee: Unassigned

Reporter: Jungtaek Lim

Votes: 0

Watchers: 2

Dates

Created: 25/Feb/20 13:51

Updated: 11/Dec/20 04:47

Resolved: 11/Dec/20 04:47