The parser assumes that names are defined before they are used, although this is not required by the specification. We recommend that the spec be changed to agree with the impl (i.e., require that names are defined before they are used).
I think we can fix the parser. I have been thinking about how to implement Schema/Field/Protocol as immutable data structures and a requirement for that is to prevent cyclic references in the Schema objects, which requires storing name references and a name based schema registry – the same tools needed for such a parser.
When a schema name is redefined, the parser throws an exception, even if the two definitions of the name are the same. This is contrary to the spec, which says "A schema may only contain multiple definitions of a fullname if the definitions are equivalent." We recommend that the spec be changed to agree with the implementation (i.e., disallow multiple definitions of the same name, even if the def's are the same).
I think this is a decent spec change, especiallyl since "if the definitions are equivalent" is insufficiently defined currently – equivalent at what level? Including all metadata, even 'doc'? It is probably best if a single schema definition does not re-define any named schema elements.
The parser calls validateName on the symbols of an enumeration, restricting the syntax of enumeration symbols. The spec does not call for such a restriction. We recommend that the spec be changed to conform to the implementation (i.e., restrict symbols the same way we restrict names). This helps in cannonicalization (don't have to worry about Unicode normalization).
I wonder how various language implementations deal with this currently, it would not surprise me if more than Java already have an implicit restriction beyond the spec. We should change the spec to restrict symbols to the same restriction as field names.
Schema.validateName uses Character.isLetter and Character.isLetterOrDigit to test characters. These accept all Unicode characters (except supplemental ones). The Avro spec says that names should be restricted to ASCII letters. We think this is an implementation bug and should be fixed. (Again, nice to avoid Unicode normalization.)
From the spec:
Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace. Equality of names is defined on the fullname.
The name portion of a fullname, record field names, and enum symbols must:
- start with [A-Za-z_]
- subsequently contain only [A-Za-z0-9_]
A namespace is a dot-separated sequence of such names.
When the parser descends into a named schema, the default namespace in the names variable is stored into the local variable savedSpace, which is restored on exit. However, if the routine exits abruptly (with an exception), this restoration does not occur. This is probably a bug, and restoration should be in a finally. (In Parser.parse, the flag validateNames is restored in a finally clause.)
Sounds like a bug, a patch containing a reproducible test case that fails and a fix for another ticket would be great!