[AVRO-1704] Standardized format for encoding messages with Avro - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.9.0, 1.8.2
Component/s: java, spec
Labels: None

Description

I'm currently using the Datafile format for encoding messages that are written to Kafka and Cassandra. This seems rather wasteful:

1. I only encode a single record at a time, so there's no need for sync markers and other metadata related to multi-record files.
2. The entire schema is inlined every time.

However, the Datafile format is the only one that has been standardized, which means I can read and write data with minimal effort across the various languages in use in my organization. If there were a standardized format for encoding single values that was optimized for out-of-band schema transfer, I would much rather use that.

I think the necessary pieces of the format would be:

1. A format version number.
2. A schema fingerprint type identifier, e.g. Rabin, MD5, SHA256, etc.
3. The actual schema fingerprint (according to the type).
4. Optional metadata map.
5. The encoded datum.
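
To make the five pieces above concrete, here is a minimal sketch of one possible byte layout. Everything specific in it is an assumption made for illustration: the numeric type identifiers, the one-byte version field, the length-prefixed metadata map, and the use of a hash over the raw schema text are not taken from this issue or from the Avro specification. The datum is treated as already-encoded bytes.

```python
import hashlib
import struct

FORMAT_VERSION = 1  # hypothetical version number

# Hypothetical type identifiers mapped to (hashlib name, digest size in bytes).
# The issue does not assign numeric codes; Rabin is omitted for brevity.
FINGERPRINT_TYPES = {
    1: ("md5", 16),
    2: ("sha256", 32),
}


def fingerprint(schema_json: str, type_id: int) -> bytes:
    """Hash the raw schema text (assumption: no canonicalization step)."""
    algo, _size = FINGERPRINT_TYPES[type_id]
    return hashlib.new(algo, schema_json.encode("utf-8")).digest()


def _pack_bytes(value: bytes) -> bytes:
    """Prefix a byte string with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(value)) + value


def encode_message(schema_json: str, type_id: int, metadata: dict, datum: bytes) -> bytes:
    """Lay out: version, fingerprint type, fingerprint, metadata map, datum."""
    out = bytearray()
    out += struct.pack(">BB", FORMAT_VERSION, type_id)   # pieces 1 and 2
    out += fingerprint(schema_json, type_id)             # piece 3
    out += struct.pack(">I", len(metadata))              # piece 4: entry count
    for key, value in sorted(metadata.items()):
        out += _pack_bytes(key.encode("utf-8"))
        out += _pack_bytes(value)
    out += datum                                         # piece 5
    return bytes(out)


def decode_message(buf: bytes):
    """Split a message back into (type_id, fingerprint, metadata, datum)."""
    version, type_id = struct.unpack_from(">BB", buf, 0)
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version: {version}")
    _algo, size = FINGERPRINT_TYPES[type_id]
    pos = 2
    fp = buf[pos:pos + size]
    pos += size
    (count,) = struct.unpack_from(">I", buf, pos)
    pos += 4
    metadata = {}
    for _ in range(count):
        (key_len,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len].decode("utf-8")
        pos += key_len
        (value_len,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        metadata[key] = buf[pos:pos + value_len]
        pos += value_len
    return type_id, fp, metadata, buf[pos:]
```

Because the fingerprint type fixes the fingerprint length, the reader needs no separate length field for piece 3; an empty metadata map costs four bytes in this layout.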

The language libraries would implement a MessageWriter that would encode datums in this format, as well as a MessageReader that, given a SchemaStore, would be able to decode datums. The reader would decode the fingerprint and ask its SchemaStore to return the corresponding writer's schema.

The idea is that SchemaStore would be an abstract interface that allows library users to inject custom backends. A simple, file-system-based one could be provided out of the box.
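
The interaction between MessageWriter, MessageReader, and SchemaStore described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the API of any Avro library: the method names (`find_by_fingerprint`, `add`, `encode`, `decode`), the `DirectorySchemaStore` class, and the SHA-256 fingerprint over raw schema text are all hypothetical, and JSON stands in for the Avro binary encoding of the datum so the example stays self-contained.

```python
import hashlib
import json
from abc import ABC, abstractmethod
from pathlib import Path

FINGERPRINT_SIZE = 32  # SHA-256 digest length in bytes


def fingerprint(schema_json: str) -> bytes:
    """Assumption: fingerprint is SHA-256 over the raw schema text."""
    return hashlib.sha256(schema_json.encode("utf-8")).digest()


class SchemaStore(ABC):
    """Abstract lookup from fingerprint to writer's schema."""

    @abstractmethod
    def find_by_fingerprint(self, fp: bytes) -> str:
        """Return the schema text for fp, or raise KeyError if unknown."""


class DirectorySchemaStore(SchemaStore):
    """Hypothetical file-system backend: one file per schema, named by hex fingerprint."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, fp: bytes) -> Path:
        return self.root / (fp.hex() + ".avsc")

    def add(self, schema_json: str) -> bytes:
        fp = fingerprint(schema_json)
        self._path(fp).write_text(schema_json, encoding="utf-8")
        return fp

    def find_by_fingerprint(self, fp: bytes) -> str:
        path = self._path(fp)
        if not path.is_file():
            raise KeyError(f"unknown schema fingerprint: {fp.hex()}")
        return path.read_text(encoding="utf-8")


class MessageWriter:
    """Prefixes every datum with the fingerprint of the writer's schema."""

    def __init__(self, schema_json: str):
        self._prefix = fingerprint(schema_json)

    def encode(self, datum) -> bytes:
        # JSON is a stand-in for Avro binary encoding in this sketch.
        return self._prefix + json.dumps(datum).encode("utf-8")


class MessageReader:
    """Reads the fingerprint, resolves the writer's schema, then decodes."""

    def __init__(self, store: SchemaStore):
        self._store = store

    def decode(self, message: bytes):
        fp = message[:FINGERPRINT_SIZE]
        body = message[FINGERPRINT_SIZE:]
        writer_schema = self._store.find_by_fingerprint(fp)
        return writer_schema, json.loads(body.decode("utf-8"))
```

A caller would register the schema once with the store, hand the same store to every reader, and ship only the fingerprint plus the encoded datum in each message. Swapping `DirectorySchemaStore` for a backend that queries a schema registry would not change the reader.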

Attachments

AVRO-1704.3.patch (4 kB), Ryan Blue, 24/Jul/16 22:50
AVRO-1704.4.patch (4 kB), Ryan Blue, 24/Jul/16 22:56
AVRO-1704-20160410.patch (38 kB), Niels Basjes, 10/Apr/16 21:54
AVRO-1704-2016-05-03-Unfinished.patch (59 kB), Niels Basjes, 03/May/16 06:47

Issue Links

blocks: AVRO-1885 Release 1.8.2 (Resolved)

incorporates: AVRO-1888 Java: Single-record encoding marker bytes check is incorrect (Resolved)

links to: GitHub Pull Request #103

People

Assignee: Niels Basjes

Reporter: Daniel Schierbeck

Votes: 1

Watchers: 16

Dates

Created: 16/Jul/15 08:38

Updated: 01/Jun/17 14:36

Resolved: 04/Sep/16 20:45