[SPARK-12197] Kryo's Avro Serializer add support for dynamic schemas using SchemaRepository - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.5.0
Fix Version/s: None
Component/s: Spark Core
Labels:
- avro
- kryo
- schema
- serialization

Description

The original problem: Serializing GenericRecords in Spark Core incurs very high overhead because the schema is serialized with every record. (In the actual input data on HDFS, by contrast, the schema is stored once per file.)
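To make the overhead concrete, here is a rough sketch comparing the two layouts. It uses JSON as a stand-in for Avro's binary encoding, and the schema, record, and function names are all invented for illustration; the absolute numbers mean nothing, only the growth pattern does.

```python
import json

# Hypothetical schema and record, invented for this illustration only.
SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}
RECORD = {"id": 1, "name": "a"}


def encode(obj):
    """Compact JSON bytes; a stand-in for Avro binary encoding (assumption)."""
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


def size_schema_per_record(n):
    """Bytes written when every record carries its own copy of the schema."""
    return n * (len(encode(SCHEMA)) + len(encode(RECORD)))


def size_schema_once(n):
    """Bytes written when the schema is stored once, as in a file on HDFS."""
    return len(encode(SCHEMA)) + n * len(encode(RECORD))
```

With one record the two layouts cost the same; with many records the per-record layout pays for the schema again on every record.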

The extended problem: Spark 1.5 introduced the ability to register Avro schemas ahead of time using SparkConf. This solution is only partial, because some applications cannot know ahead of time exactly which schemas they are going to read.

Extended solution:
Add a schema repository to the serializer. Assuming each generic record carries a schemaId, the id can be extracted dynamically from the records being read, so that only the schemaId is serialized.
Upon deserialization, the schema repo is queried again to resolve the schemaId back to its schema.

The local caching mechanism will remain intact, so in practice each Task will query the schema repo only once per schemaId.
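The once-per-schemaId behaviour can be sketched as a small cache in front of the repository. This is only an illustration: `CountingRepo`, `CachedSchemaLookup`, and `get_schema` are hypothetical names, not the classes from the pull requests.

```python
class CountingRepo:
    """Hypothetical stand-in for a remote schema repository."""

    def __init__(self, schemas):
        self._schemas = dict(schemas)
        self.calls = 0  # number of lookups that actually reached the repo

    def get_schema(self, schema_id):
        self.calls += 1
        return self._schemas.get(schema_id)


class CachedSchemaLookup:
    """Task-local cache: each known schemaId reaches the repo at most once."""

    def __init__(self, repo):
        self._repo = repo
        self._cache = {}

    def get(self, schema_id):
        if schema_id not in self._cache:
            schema = self._repo.get_schema(schema_id)
            if schema is None:
                return None  # unknown id: not cached in this sketch
            self._cache[schema_id] = schema
        return self._cache[schema_id]
```

Repeated calls to `get` with the same known id return the cached schema without touching the repository again.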

The existing static registration of schemas will remain in place, as it is more efficient when the schemas are known ahead of time.

New flow for serializing a generic record:
1) Check the pre-registered schema list. If the schema is found, serialize only its fingerprint.
2) If it is not found and a schema repo has been set, attempt to extract the schemaId from the record and check whether the repo contains that id. If so, serialize only the schemaId.
3) If no schema repo has been set, or the schemaId was not found in the repo, compress and send the entire schema.
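The three steps above can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not the code from the pull requests: the record is a plain dict with an optional `schemaId` key, the repo is a dict keyed by id, and `fingerprint` is a hypothetical SHA-256-based helper rather than whatever fingerprint Spark actually uses.

```python
import hashlib
import zlib


def fingerprint(schema_json):
    """Hypothetical 8-byte fingerprint; the real fingerprint function may differ."""
    return hashlib.sha256(schema_json.encode("utf-8")).digest()[:8]


def choose_encoding(schema_json, record, registered, repo=None):
    """Return (kind, payload) describing how the schema would be written.

    registered: set of fingerprints of pre-registered schemas
    repo:       optional mapping of schemaId -> schema (assumed shape)
    """
    # Step 1: pre-registered schema, send only its fingerprint.
    fp = fingerprint(schema_json)
    if fp in registered:
        return ("fingerprint", fp)

    # Step 2: repo configured and it knows the record's schemaId.
    if repo is not None:
        schema_id = record.get("schemaId")
        if schema_id is not None and schema_id in repo:
            return ("schema_id", schema_id)

    # Step 3: fall back to compressing and sending the whole schema.
    return ("full_schema", zlib.compress(schema_json.encode("utf-8")))
```

The ordering mirrors the description: the static list is checked first because it is the cheapest path, and the full compressed schema is only sent when neither shortcut applies.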

Issue Links

links to

[Github] Pull Request #10190 (RotemShaul)

[Github] Pull Request #10625 (RotemShaul)

[Github] Pull Request #13761 (RotemShaul)

People

Assignee: Unassigned
Reporter: Rotem Shaul
Votes: 0
Watchers: 3

Dates

Created: 08/Dec/15 07:14
Updated: 18/Jun/16 12:56
Resolved: 16/Jun/16 15:20

Time Tracking

Estimated: 72h
Remaining: 72h
Logged: Not Specified