Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
Description
When parsing schemas, the Java library accepts C-style comments (which are forbidden in JSON) and is unaffected by trailing garbage (parsing stops as soon as it reaches the end of the JSON structure).
In the C library, however, comments and trailing whitespace cause an error.
If a schema is accepted by one language binding, it should be accepted by the other as well, and it should also be valid JSON. The Java library fails to enforce this because it is more permissive than it should be, so it is the Java implementation that should change. However, we must also consider whether making the Java library stricter at this point would make any existing data unreadable.
Fortunately, the schema written into the data files themselves is always valid JSON, even if it is based on a non-JSON-conformant schema. This is because the Java library parses the schema, builds an in-memory representation and then reserializes it, thereby removing comments and trailing garbage. So existing data files are not affected, only user-supplied schemas, and unlike existing data files these can be updated manually.
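To illustrate what updating a user-supplied schema amounts to, here is a minimal sketch in plain Java. It is not the Avro parser and does not use any Avro API; the class name SchemaCleaner and the method clean are hypothetical. It assumes the schema's top-level value is a JSON object, array or string, and that the only non-JSON content is /* ... */ or // comments outside string literals plus anything following the end of the top-level value.

```java
// Hypothetical helper, not part of Avro: removes comments and trailing
// garbage from a schema text so that a strict JSON parser can read it.
final class SchemaCleaner {
    private SchemaCleaner() {}

    static String clean(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0;            // nesting level of {} and []
        boolean inString = false; // true while inside a JSON string literal
        int n = text.length();
        int i = 0;
        while (i < n) {
            char c = text.charAt(i);
            if (inString) {
                out.append(c);
                if (c == '\\' && i + 1 < n) {
                    // Copy the escaped character verbatim (e.g. \" or \\).
                    out.append(text.charAt(i + 1));
                    i += 2;
                    continue;
                }
                if (c == '"') {
                    inString = false;
                    if (depth == 0) {
                        break; // top-level string schema such as "string"
                    }
                }
                i++;
                continue;
            }
            if (c == '/' && i + 1 < n && text.charAt(i + 1) == '*') {
                // Block comment: skip to the closing */ (or to end of input).
                int end = text.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
                continue;
            }
            if (c == '/' && i + 1 < n && text.charAt(i + 1) == '/') {
                // Line comment: skip to the end of the line.
                int end = text.indexOf('\n', i + 2);
                i = (end < 0) ? n : end;
                continue;
            }
            out.append(c);
            if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
                if (depth == 0) {
                    break; // top-level value is complete; drop the rest
                }
            }
            i++;
        }
        return out.toString().trim();
    }
}
```

The helper does not validate the result; it only drops the two kinds of content described above, so the output should still be checked with a strict JSON parser before it is stored in the metastore.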
The real-world use-case where this discrepancy causes problems is Hive-Impala interaction. Users can create tables in Hive by supplying an Avro schema. That schema will be associated with the whole table by getting saved in the Hive metastore. Impala also consults this metadata when accessing the table and that causes an error in the Avro C library that Impala uses. This is detailed in IMPALA-1024. In particular, this comment contains a lot of relevant information.
Issue Links
- blocks: IMPALA-1024 Impala BE cannot parse Avro schema that contains a trailing semi-colon (Open)