[AVRO-2128] Schema parsing in the Java library is more permissive than the C implementation or the JSON specification - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: java
Labels: None

Description

When parsing schemas, the Java library accepts C-style comments (which are forbidden in JSON) and ignores trailing garbage (parsing stops as soon as it reaches the end of the JSON structure).

In the C library, however, comments and trailing whitespace cause an error.

If a schema is accepted by one language binding, it should be accepted by the other as well. The schema should also be valid JSON. The Java library is the one that fails to enforce this, being more permissive than it should be, so it seems that the Java implementation should be changed. However, we must also consider whether making the Java library stricter at this point would make any existing data unreadable.

Fortunately, the schema that is written in the data files themselves is always valid JSON, even if it is based on a non-JSON-conformant schema. The reason for this is that the Java library parses the schema, builds an in-memory representation and then reserializes that, thereby removing comments and trailing garbage. So existing data files are not affected, only user-supplied schemas. These can be manually updated (unlike existing data files).
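
To find user-supplied schemas that would need such a manual update, something like the following pre-check could be run over the schema text. This is only a rough sketch and not part of Avro: the class name StrictSchemaCheck and the method findProblem are hypothetical, it uses only the JDK, and it is not a JSON validator. It assumes the top-level schema is a JSON object, array or string, and it only looks for the two problems described above (C-style comments outside string literals, and non-whitespace content after the end of the top-level value). It treats trailing whitespace as acceptable, as JSON does; that would have to be tightened if the C library really rejects it.

```java
// Hypothetical helper, not part of the Avro API.
final class StrictSchemaCheck {

    private StrictSchemaCheck() {}

    /**
     * Returns a short description of the first problem found, or null if
     * neither a comment nor trailing content was detected.
     */
    static String findProblem(String text) {
        int depth = 0;            // current nesting of {} and []
        boolean inString = false; // inside a JSON string literal?
        boolean done = false;     // has the top-level value already ended?

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);

            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                    if (depth == 0) {
                        done = true; // top-level string such as "int"
                    }
                }
                continue;
            }

            // JSON whitespace only: space, tab, line feed, carriage return.
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
                continue;
            }

            // "/*" or "//" outside a string literal is a C-style comment.
            if (c == '/' && i + 1 < text.length()
                    && (text.charAt(i + 1) == '*' || text.charAt(i + 1) == '/')) {
                return "comment at offset " + i;
            }

            // Anything else after the top-level value is trailing garbage.
            if (done) {
                return "trailing content at offset " + i;
            }

            if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
                if (depth == 0) {
                    done = true;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "{\"type\": \"string\"}",
            "{\"type\": \"string\" /* a comment */}",
            "{\"type\": \"string\"} garbage"
        };
        for (String s : samples) {
            String problem = findProblem(s);
            System.out.println(s + " -> " + (problem == null ? "no problem found" : problem));
        }
    }
}
```

A schema flagged by this check is one that the Java library would reportedly accept but that is not valid JSON, so it is a candidate for manual cleanup before being shared with other language bindings.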

The real-world use-case where this discrepancy causes problems is Hive-Impala interaction. Users can create tables in Hive by supplying an Avro schema. That schema will be associated with the whole table by getting saved in the Hive metastore. Impala also consults this metadata when accessing the table and that causes an error in the Avro C library that Impala uses. This is detailed in IMPALA-1024. In particular, this comment contains a lot of relevant information.