Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.0
    • Component/s: java
    • Labels:

      Description

      Add a function that returns a standardized, 64-bit fingerprint for schemas. Fingerprints are designed such that the chance of a collision is very, very low.

      1. AVRO-1006.patch
        60 kB
        Raymie Stata
      2. AVRO-1006.patch
        60 kB
        Raymie Stata
      3. AVRO-1006.patch
        59 kB
        Raymie Stata
      4. AVRO-1006.patch
        59 kB
        Raymie Stata
      5. AVRO-1006.patch
        58 kB
        Raymie Stata
      6. AVRO-1006-prelim.patch
        2 kB
        Raymie Stata
      7. schema-fingerprinting.html
        32 kB
        Raymie Stata
      8. schema-fingerprinting.html
        31 kB
        Raymie Stata
      9. schema-fingerprinting.html
        30 kB
        Raymie Stata

        Issue Links

          Activity

          Raymie Stata added a comment -

          Detailed description of the proposed feature.

          Raymie Stata added a comment -

          An Avro schema fingerprint is a hash of an Avro schema. Within a collection of even a million schemas, the probability of a collision is still around 0.001%. Thus, fingerprints can be used in place of schemas.

          One motivating use-case for fingerprints is a pub/sub message bus. On a pub/sub bus, since multiple writers can publish to the same topic using different schemas, each message must be associated with its schema. Rather than include the actual schema with every message, one can instead include the fingerprint of the schema, which would be smaller. When a reader encounters a fingerprint it hasn't seen before, it can look it up and cache it. (The attached document describes possible lookup mechanisms.)
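          The reader-side lookup-and-cache step can be sketched as follows. This is an illustrative sketch only; the SchemaCache class and the repository lookup function are hypothetical, not part of the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical reader-side cache of schemas keyed by 64-bit fingerprint. */
public class SchemaCache {
  private final Map<Long, String> cache = new HashMap<>(); // fingerprint -> schema text
  private final LongFunction<String> repository;           // hypothetical lookup service

  public SchemaCache(LongFunction<String> repository) {
    this.repository = repository;
  }

  /** Return the schema for a fingerprint, consulting the repository only on a miss. */
  public String resolve(long fingerprint) {
    return cache.computeIfAbsent(fingerprint, repository::apply);
  }
}
```

          A reader would call resolve() with each incoming message's fingerprint; after the first miss, subsequent messages with the same schema cost only a map lookup.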

          The proposed approach to fingerprinting is pretty straightforward. First, we convert Avro schemas into a canonical form. Two equivalent schemas always have the same canonical form. Once we have the canonical form, we simply take a 64-bit "Rabin fingerprint" (a CRC) of that text.
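          A table-driven 64-bit CRC over the canonical text can be sketched as below. The constant used here matches the value later adopted by the Avro spec for its 64-bit Rabin fingerprint; treat the code as an illustration of the approach, not as the normative algorithm.

```java
/** Sketch of a 64-bit Rabin-style (CRC) fingerprint over canonical schema text. */
public class Fingerprint64 {
  static final long EMPTY = 0xc15d213aa4d7a795L; // fingerprint of the empty string
  static final long[] TABLE = new long[256];

  static {
    // Precompute the per-byte CRC table from the polynomial encoded in EMPTY.
    for (int i = 0; i < 256; i++) {
      long fp = i;
      for (int j = 0; j < 8; j++)
        fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
      TABLE[i] = fp;
    }
  }

  /** Fingerprint of the given bytes (e.g., the UTF-8 canonical text of a schema). */
  public static long fingerprint(byte[] buf) {
    long fp = EMPTY;
    for (byte b : buf)
      fp = (fp >>> 1) ^ TABLE[(int) (fp ^ b) & 0xff];
    return fp;
  }
}
```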

          Raymie Stata added a comment -

          In implementing canonical forms, we identified some bugs (both spec and impl bugs) that it would be nice to resolve:

          • The parser assumes that names are defined before they are used, although this is not required by the specification. We recommend that the spec be changed to agree with the impl (i.e., require that names are defined before they are used).
          • When a schema name is redefined, the parser throws an exception, even if the two definitions of the name are the same. This is contrary to the spec, which says "A schema may only contain multiple definitions of a fullname if the definitions are equivalent." We recommend that the spec be changed to agree with the implementation (i.e., disallow multiple definitions of the same name, even if the def's are the same).
          • The parser calls validateName on the symbols of an enumeration, restricting the syntax of enumeration symbols. The spec does not call for such a restriction. We recommend that the spec be changed to conform to the implementation (i.e., restrict symbols the same way we restrict names). This helps in canonicalization (we don't have to worry about Unicode normalization).
          • Schema.validateName uses Character.isLetter and Character.isLetterOrDigit to test characters. These accept all Unicode characters (except supplemental ones). The Avro spec says that names should be restricted to ASCII letters. We think this is an implementation bug and should be fixed. (Again, nice to avoid Unicode normalization.)
          • When the parser descends into a named schema, the default namespace in the names variable is stored into the local variable savedSpace, which is restored on exit. However, if the routine exits abruptly (with an exception), this restoration does not occur. This is probably a bug, and restoration should be in a finally. (In Parser.parse, the flag validateNames is restored in a finally clause.)
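          The last point, restoring saved state in a finally clause, looks like this in miniature. The class and field names here are simplified stand-ins for the parser's actual Names/namespace handling, not the real code:

```java
/** Sketch of exception-safe save/restore of a default namespace. */
public class NamespaceRestore {
  private String space = ""; // stands in for the parser's default namespace

  public String space() { return space; }

  /** Run body under a temporary default namespace, restoring it even on abrupt exit. */
  public void withSpace(String newSpace, Runnable body) {
    String savedSpace = space;
    try {
      space = newSpace;
      body.run();
    } finally {
      space = savedSpace; // restored even if body throws
    }
  }
}
```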

          Before submitting fixes, I'll file separate JIRAs, but I'd like to get any feedback here first.

          Sam Pullara added a comment -

          This would be great. I ended up using the text representation + SHA256 in Avrobase for my schema keys. I would love to see a standard way that worked across languages, no matter how mangled the rendered schema JSON was. On the 0.001% collision rate, that seems high to me - would a 128-bit hash be a better choice?

          Raymie Stata added a comment -

          Updated design doc based on feedback:

          • Fixed typos and clarified some text.
          • Added suggestion that the spec be explicit about case-sensitivity.
          • Most significantly, fixed error in stated probability of collisions, and added a bunch more text around that issue.
          Raymie Stata added a comment -

          Sam writes:

          On the 0.001% collision rate, that seems high to me - would a 128-bit hash be a better choice?

          Thanks for pointing this out. It turns out the 0.001% is a bug in the writeup; the actual probabilities are quite a bit lower: 3E-8 (0.000003%) for a million-item cache, 3E-10 for 100K items, and 3E-12 for 10K items (I'd love to have someone check my math). Assuming an insertion per minute into a fixed-size table (i.e., random eviction), you'd expect a collision every year with the 1M-item cache, every century with 100K items, and every millennium with 10K items. This seems acceptable, especially since I expect these caches to be closer to 10K items than a million (there's a bit of a discussion on this point in the updated writeup). So are you happier now with 64 bits?

          (The doc defines a canonical text for schemas, and fingerprints based on that text. The patch will contain a function for returning the canonical text. This approach implicitly standardizes how one would take an MD5 or SHA-xxx fingerprint of a schema, but perhaps I can be explicit on this point.)

          Raymie Stata added a comment -

          Sigh. My math is wrong. I think things are better than stated, but I need to think about this more.

          Raymie Stata added a comment -

          Major change: yet another fix to the text discussing the adequacy of 64-bit fingerprints. Also a bunch of editorial fixes.

          Raymie Stata added a comment -

          I fixed the math bug in the writeup (I misapplied the "birthday probability" where I should've applied a much more favorable probability). The bottom line is that 64 bits seem adequate for most use-cases. For use-cases that want longer fingerprints, the spec now explicitly recommends applying MD5 or a SHA-family hash to the canonical text of a schema.
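          For intuition, the two probability models can be compared numerically. This back-of-envelope sketch is mine, not the writeup's exact formulas:

```java
/** Back-of-envelope comparison of the two collision models for 64-bit fingerprints. */
public class CollisionOdds {
  static final double SPACE = Math.pow(2, 64); // size of the 64-bit fingerprint space

  /** Birthday model: probability that ANY pair among n schemas collides, ~ n(n-1)/2 / 2^64. */
  public static double birthday(double n) {
    return n * (n - 1) / 2 / SPACE;
  }

  /** Membership model: probability that ONE new schema collides with n cached ones, ~ n / 2^64. */
  public static double membership(double n) {
    return n / SPACE;
  }
}
```

          For a million-item cache the birthday model gives roughly 3E-8 (the figure quoted earlier in this thread), while the membership model gives around 5E-14, which is the "much more favorable" probability.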

          I'm happy to continue discussing the overall proposal, but my next step here is to file patches for the "surrounding issues" (e.g., the spec fixes).

          Sam Pullara added a comment -

          I think the specification should really separate the canonicalization and the hashing stuff. Enumerating various ways of hashing the result of the canonical form doesn't really seem necessary since in any of the cases you have listed you could just as easily use a monotonically increasing number for the schema – as long as your central repository or other schema provider lets you look up the number with the canonical schema text. This was the design I ultimately went with for Avrobase on top of MySQL since it is easy to lookup a row by a value of a column.

          Raymie Stata added a comment -

          We believe that, in pub/sub and other scenarios, schema repositories will become a standard part of the Avro infrastructure. Given that, to promote an ecosystem with interoperable but alternative implementations, we believe that "common" (let's not call them "standard") ways to generate good, hash-based fingerprints will be useful. In particular, while standard, cryptographic hashes are good enough for fingerprints of size greater than 128 bits, we believe that 64-bit hashes represent a very useful point in the size/diversity tradeoff space, and there should be a common way in the Avro ecosystem for generating 64-bit hashes.

          Does this make sense? If so, I think I'll re-write the explanation around this line of reasoning, which is indeed closer to our original thought process.

          All this being said, your point about schema numbers is a good one; I'll add a note about that technique in the section about schema repositories. It seems like the right protocol here is for the "inserter" of a schema to insert the schema and get an identifier back from the repository before using the schema. Is that how you do it?

          Sam Pullara added a comment -

          It seems that the protocol for any system, even one using hashes, would have to follow that same protocol even if the "inserter" also acts as the repository. The only critical thing is that there is some way to generate a unique identifier and store a schema with that as the key that can be looked up by any receiver of the message. Only after that is done can you send a message using that identifier (and expect it to be understood).

          One advantage that using hashes has over my sequence-number suggestion is that you can easily shard, federate, and merge hash-based repositories, qualities that could be useful in practice. Also, looking up a schema by its canonical text rather than determining its existence in the repository through hash lookup would likely be more efficient.

          Doug Cutting added a comment -

          Some notes:

          • Primitive types may have attributes, e.g., {"type":"int", "java-class":"java.lang.Short"}, so only primitives without any attributes may be represented by their name alone.

          • Attributes within JSON objects are not ordered. A correct JSON parser need not preserve ordering. Relying on order-preservation may require some implementations to write their own JSON libraries.
          • With multiple Avro implementations, the chance of an inconsistent canonicalization implementation is significant. Creating an adequate test suite and validating all implementations would require significant effort.

          Given the above, I'd be hesitant to build a system that depends on consistent canonical schemas for correct operation. Folks who build systems that use Avro would thus be wise to design them to gracefully handle inconsistent canonicalization. For example, Avro's RPC handshake currently uses a fingerprint-like approach without requiring canonicalization. Two implementations that represent a schema using the same string will have more efficient handshakes, but implementations that produce different strings for equivalent schemas will still interoperate correctly. So a standard, recommended canonical form could be useful, but folks should perhaps not assume that every implementation is correct.

          I like the idea of a schema repository. A related idea I've had is to use something like a URL shortener. Instead of mapping url->url, it could map url->schema. One would register one's schema with the shortener, then hand out references. A shortener would, as an optimization, return the same ID for equivalent schemas. The shortener would need to rely on only a single canonicalization implementation: its own.

          Scott Carey added a comment -

          More notes:

          • Schema equivalence has a few variations:
            • Serialization equivalence – attribute metadata is irrelevant, so {"type":"int", "java-class":"java.lang.Short"} is equal to {"int"}. Defaults and doc are also irrelevant for this case.
            • Serialization and metadata equivalence, where the above two are not equivalent.
            • Reversible transformation equivalence, e.g. ["int", "string"] equals ["string", "int"], or records with pure field reordering.
          • Other schema relationships that are related to equivalence but cannot satisfy associativity and transitivity:
            • Alias equivalence is not transitive, but is associative.
            • Schema resolution and transformation is often neither transitive nor associative.

          All three equivalence variations above may be useful for different purposes, especially the first two. Serialization equivalence is important for long-term storage. Full equivalence with metadata is often needed for internal state. But we may want to let users specify which optional components are included (attributes, defaults, doc). Doug's point about JSON being an unordered format is important and limits using the JSON string as the fingerprint.
          Perhaps we can complete the Avro Schema for schemas (AVRO-251), which can define field order and equivalence unambiguously and which all implementations should be able to support. The output bytes from the Avro binary serialization of the schema can then be used to feed a hash algorithm.

          Eric Hauser added a comment -

          I believe LinkedIn has already written a schema service for Avro that they were interested in open-sourcing at one point. Someone with contacts on the Kafka team might want to contact them and see if they are able to do so.

          Thiruvalluvan M. G. added a comment -

          Doug's point about JSON being an unordered format is important and limits using the json string as the fingerprint.
          Perhaps we can complete the Avro Schema for schemas (AVRO-251) which can define field order and equivalence unambiguously and all implementations should be able to support. The output bytes from the Avro binary serialization of the schema can be used to feed a hash algorithm.

          While representing the canonical schema as Avro data reduces ambiguity (compared to the JSON representation), it does not eliminate it. Non-empty arrays (and maps) can be represented in Avro in more than one way.

          Doug's observation implies that we cannot use a third-party JSON library to generate the canonical representation. For fingerprinting to work, we need some canonical representation (which by definition is not ambiguous). Either we restrict (by removing ambiguities) an existing standard or invent a new one.

          I think Raymie's canonicalization rules are simple, and given that we'll have only US-ASCII characters in the canonical representation, writing a JSON generator in any language will not be hard. And it will be parsable (with no new code) and human-readable.
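          To illustrate how small such a generator can be, here is a sketch of a canonical-text emitter for a record schema with a fixed attribute order and no whitespace. The chosen attribute order and the simplifications (no namespace resolution, defaults, or doc handling) are assumptions for illustration, not the rules from the attached design doc:

```java
/** Sketch of a tiny canonical-text emitter for a record schema (simplified). */
public class CanonicalText {
  /**
   * Emit {"name":...,"type":"record","fields":[...]} with a fixed key order and
   * no whitespace. Each field is given as {name, primitiveTypeName}.
   */
  public static String record(String fullName, String[][] fields) {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"name\":\"").append(fullName)
      .append("\",\"type\":\"record\",\"fields\":[");
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) sb.append(',');
      sb.append("{\"name\":\"").append(fields[i][0])
        .append("\",\"type\":\"").append(fields[i][1]).append("\"}");
    }
    return sb.append("]}").toString();
  }
}
```

          Because the emitter controls key order and spacing itself, the output is byte-for-byte deterministic, which is exactly what a general-purpose JSON library does not promise.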

          Scott Carey added a comment -

          The parser assumes that names are defined before they are used, although this is not required by the specification. We recommend that the spec be changed to agree with the impl (i.e., require that names are defined before they are used).

          I think we can fix the parser. I have been thinking about how to implement Schema/Field/Protocol as immutable data structures and a requirement for that is to prevent cyclic references in the Schema objects, which requires storing name references and a name based schema registry – the same tools needed for such a parser.

          When a schema name is redefined, the parser throws an exception, even if the two definitions of the name are the same. This is contrary to the spec, which says "A schema may only contain multiple definitions of a fullname if the definitions are equivalent." We recommend that the spec be changed to agree with the implementation (i.e., disallow multiple definitions of the same name, even if the def's are the same).

          I think this is a decent spec change, especially since "if the definitions are equivalent" is insufficiently defined currently – equivalent at what level? Including all metadata, even 'doc'? It is probably best if a single schema definition does not re-define any named schema elements.

          The parser calls validateName on the symbols of an enumeration, restricting the syntax of enumeration symbols. The spec does not call for such a restriction. We recommend that the spec be changed to conform to the implementation (i.e., restrict symbols the same way we restrict names). This helps in cannonicalization (don't have to worry about Unicode normalization).

          I wonder how various language implementations deal with this currently; it would not surprise me if implementations other than Java already have an implicit restriction beyond the spec. We should change the spec to restrict symbols to the same restriction as field names.

          Schema.validateName uses Character.isLetter and Character.isLetterOrDigit to test characters. These accept all Unicode characters (except supplemental ones). The Avro spec says that names should be restricted to ASCII letters. We think this is an implementation bug and should be fixed. (Again, nice to avoid Unicode normalization.)

          I agree.

          From the spec:

          Names

          Record, enums and fixed are named types. Each has a fullname that is composed of two parts; a name and a namespace. Equality of names is defined on the fullname.

          The name portion of a fullname, record field names, and enum symbols must:

          • start with [A-Za-z_]
          • subsequently contain only [A-Za-z0-9_]

          A namespace is a dot-separated sequence of such names.
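          The divergence between the quoted grammar and a Character.isLetter-based check can be seen directly. The two methods below are sketches for comparison, not the actual Schema.validateName code:

```java
import java.util.regex.Pattern;

/** Sketch contrasting the spec's ASCII name grammar with Character.isLetter checking. */
public class NameCheck {
  // The grammar quoted above: start with [A-Za-z_], then [A-Za-z0-9_].
  static final Pattern ASCII_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  /** Validity under the spec's ASCII grammar. */
  public static boolean asciiValid(String name) {
    return ASCII_NAME.matcher(name).matches();
  }

  /** Roughly the Character.isLetter approach: any Unicode letter passes. */
  public static boolean unicodeValid(String name) {
    if (name.isEmpty()) return false;
    char first = name.charAt(0);
    if (!(Character.isLetter(first) || first == '_')) return false;
    for (int i = 1; i < name.length(); i++) {
      char c = name.charAt(i);
      if (!(Character.isLetterOrDigit(c) || c == '_')) return false;
    }
    return true;
  }
}
```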

          When the parser descends into a named schema, the default namespace in the names variable is stored into the local variable savedSpace, which is restored on exit. However, if the routine exits abruptly (with an exception), this restoration does not occur. This is probably a bug, and restoration should be in a finally. (In Parser.parse, the flag validateNames is restored in a finally clause.)

          Sounds like a bug; a patch for a separate ticket containing a reproducible test case that fails, along with a fix, would be great!

          Scott Carey added a comment -

          I think we can fix the parser. I have been thinking about how to implement Schema/Field/Protocol as immutable data structures and a requirement for that is to prevent cyclic references in the Schema objects, which requires storing name references and a name based schema registry – the same tools needed for such a parser.

          Although we could make the parser work with use-before-define, it is probably a good spec simplification to avoid it that will be easier to build and test across languages.

          Scott Carey added a comment -

          While representing the canonical schema as Avro data reduces it (compared to Json representation) it does not eliminate ambiguity. Non-empty arrays (and maps) can be represented in Avro in more than one way.

          Can you provide an example? I am having trouble thinking of an example that doesn't fall under the other disambiguations in the document, e.g.

          {"type":"int"}

          == "int". Can we have an avro serialized canonical form without the ambiguity? The document describes two forms of normailzation: the Avro normalization, and the JSON normalization. I don't think that something that has undergone the Avro normalization can have ambiguous array definition. If so that would break both the JSON string fingerprint case as well as the avro binary fingerprint. I am suggesting we avoid the JSON normalization and its dependency on JSON serializers that support ordering by using an avro binary representation for the input to a hash. Both cases require the Avro normalization component.

          It might be useful to ALWAYS store schemas in memory in normalized form – with attributes and doc represented and attached separately. The PRIMITIVES, FULLNAME, and MINIMIZE optimizations can always be applied, and the STRIP optimization can be trivial by using a shared canonical schema. For example, a Schema can have a CanonicalSchema member variable, plus attributes and doc. Two Schema.Fixed that only vary on their doc or attributes would share the same CanonicalSchema. Is it ever useful to know that two schemas differ only due to FULLNAME, MINIMIZE, or PRIMITIVE expansion?

          Thiruvalluvan M. G. added a comment -

          Can you provide an example?

          A three string array can be encoded in Avro binary as:
          <1><string1><1><string2><1><string3><0> or <3><string1><string2><string3><0>

          So, when you encode the fields of a record as an array, the binary version of the schema can have more than one valid representation; the representations are equivalent but not equal. The fingerprints could thus be different.
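The ambiguity is easy to see with a small sketch in plain Java (no Avro dependency; the class and method names here are illustrative, not library API). Per the Avro binary encoding, array items may be split into blocks arbitrarily: each block is a zig-zag varint count followed by that many items, and a count of 0 terminates the array.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Illustrative sketch: the same logical array has multiple valid Avro
// binary encodings, so hashing raw binary yields unstable fingerprints.
public class ArrayEncodingAmbiguity {
  // zig-zag + variable-length long encoding, as in the Avro binary spec
  static void writeLong(ByteArrayOutputStream out, long n) {
    n = (n << 1) ^ (n >> 63);               // zig-zag
    while ((n & ~0x7FL) != 0) {
      out.write((int) ((n & 0x7F) | 0x80));
      n >>>= 7;
    }
    out.write((int) n);
  }

  static byte[] singleBlock(long[] items) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeLong(out, items.length);           // one block holding everything
    for (long i : items) writeLong(out, i);
    writeLong(out, 0);                      // end of array
    return out.toByteArray();
  }

  static byte[] oneItemPerBlock(long[] items) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (long i : items) {                  // N blocks of one item each
      writeLong(out, 1);
      writeLong(out, i);
    }
    writeLong(out, 0);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    long[] a = {10, 20, 30};
    // Same logical array, different bytes -> different fingerprints.
    System.out.println(Arrays.equals(singleBlock(a), oneItemPerBlock(a))); // false
  }
}
```

For `{10, 20, 30}` the single-block form is 5 bytes and the one-item-per-block form is 7 bytes, which is why the discussion below converges on either mandating the single-block form or hashing a canonical string instead.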

          Scott Carey added a comment -

          A three string array can be encoded in Avro binary as:
          <1><string1><1><string2><1><string3><0> or <3><string1><string2><string3><0>

          Oh, I see. You mean that the binary representation of an avro array has multiple forms. That is an issue. We could require that the single block form for arrays/maps is used. I believe this is easier than implementing and testing a conforming json stringification in every language.

          Raymie Stata added a comment -

          Separating out the specification issues raised above.

          Raymie Stata added a comment -

          A clarification, which addresses issues raised by Doug and Scott. The need I'm solving for is to capture that part of a writer's schema which a reader needs to read data. This is a relatively straight-forward notion of "equivalence," and a very useful one. And the good news is that this notion of equivalence allows us to ignore many aspects of schemas (e.g., attributes, aliases, default values).

          I've attached code that generates the canonical string as defined in the design document. The attached code is untested, but I included it so folks have a rough idea of what it would look like. Leaving aside my APL-ish style, I don't think it's terribly long or complicated.

          Raymie Stata added a comment -

          Having slept on it, I don't think that the Schema class is a good place for this code. Rather, I'll create a o.a.avro.util.SchemaFP class, which will contain a static method to canonicalize schemas (i.e., the code in the current patch), and will also contain static methods for generating fingerprints of various lengths (64-bit CRC fingerprints, 128-bit MD5 fingerprints, etc). I'll try to post something over the next few days. But in the meantime, please continue commenting as if that's the plan of record.
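For the 64-bit CRC fingerprint mentioned here, the algorithm that ultimately appears in the Avro spec is a Rabin-style CRC-64 seeded with a distinguished "empty" value. A self-contained sketch of that table-based algorithm follows (the class name is illustrative, not the committed API):

```java
import java.nio.charset.StandardCharsets;

// Sketch of the 64-bit schema fingerprint ("CRC-64-AVRO") as given in the
// Avro specification; the class name is illustrative.
public class Fingerprint64 {
  // Fingerprint of the empty string, per the spec.
  static final long EMPTY = 0xc15d213aa4d7a795L;
  static long[] FP_TABLE = null;

  static long fingerprint64(byte[] buf) {
    if (FP_TABLE == null) initFPTable();
    long fp = EMPTY;
    for (int i = 0; i < buf.length; i++)
      fp = (fp >>> 8) ^ FP_TABLE[(int) (fp ^ buf[i]) & 0xff];
    return fp;
  }

  static void initFPTable() {
    FP_TABLE = new long[256];
    for (int i = 0; i < 256; i++) {
      long fp = i;
      for (int j = 0; j < 8; j++)
        fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
      FP_TABLE[i] = fp;
    }
  }

  public static void main(String[] args) {
    // "int" in Parsing Canonical Form is the quoted JSON string "int".
    String canonical = "\"int\"";
    long fp = fingerprint64(canonical.getBytes(StandardCharsets.UTF_8));
    System.out.println(Long.toHexString(fp));
  }
}
```

The fingerprint is computed over the UTF-8 bytes of the canonical string, so it is stable across languages as long as the canonicalization agrees.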

          Raymie Stata added a comment -

          I've uploaded my patch. I've put the canonicalization and fingerprinting code in a new class, util.SchemaFP, rather than extending the already too-long Schema class.

          Doug Cutting added a comment -

          This looks generally good. A few nits:

          • In class and method names, should we abbreviate 'fp' or spell out 'fingerprint'? FP means "floating point" to my eye.
          • Might we instead put this in org.apache.avro.SchemaFingerprint, rather than in util? Right now things in the util package depend only on the JDK, not on other parts of Avro.
          • Public methods and classes need javadoc comments.
          • The changes to the spec are not correctly processed by Forrest 0.8 for me.
          Raymie Stata added a comment -

          > ['fp' or 'fingerprint']

          I renamed classes to "fingerprint," but left methods "fp." SchemaFingerprint.fingerprint seems unnecessarily long...

          > [put in o.a.avro]

          Good idea – I moved it.

          > [missing javadoc]

          Added

          > [breaks Forrest]

          I can't get Forrest running on my machine – can you send me the errors? (Or maybe just fix them?)

          graham sanderson added a comment -

          "A clarification, which addresses issues raised by Doug and Scott. The need I'm solving for is to capture that part of a writer's schema which a reader needs to read data. This is a relatively straight-forward notion of "equivalence," and a very useful one. And the good news is that this notion of equivalence allows us to ignore many aspects of schemas (e.g., attributes, aliases, default values)."

          Perhaps this should be made clearer (when naming the class/method), I came across this feature because of a desire to hash/fingerprint avro schemas for messaging, and was seeing if there was already a util to do it. In my case I potentially might use custom properties on fields in the schema to indicate they are being transmitted using a certain named dictionary and thus in my case they affect the ability to interpret the message, so I'd rather stick with something that I can reliably use on the producer end to encode the entire state of the schema, rather than a particular well defined sub-set of the schema.

          Note that (thanks to someone making Props a LinkedHashMap since the code base I'm using) and the particular implementation of Jackson, schema.toString() in the Java impl appears like it will be fine for my purposes, and if another language implementation happens to produce a different hash value I'm cool with that, as long as it is relatively stable; for example:

          SchemaInstance1 -> toJson -> string x
          string x -> fromJson -> SchemaInstance2 -> toJson -> string y
          string x and string y being equal seems a reasonable enough guarantee for me

          graham sanderson added a comment -

          Note, more succinctly: given that I and other people in this issue have been discussing or inferring schema repositories, I think it makes sense, in general or at least as a separate issue, to deal with a digest of the schema in its entirety, even if that results in a number of hash values mapping to equivalent schemas. Generally I think the use cases don't involve insane quantities of schemas in the first place (although I won't say that for certain, because the ability to define an anonymous schema on the fly for certain applications is rather appealing).

          Raymie Stata added a comment -

          On the one hand, the "Parsing Canonical Form" as I've defined it has important uses and I think should be part of the Avro system.

          On the other hand, it's a good point that there may be additional canonical forms that satisfy different needs in the future. The class- and method-names of the previous patch implicitly assumed a single canonical form. I've modified the patch with new names that will better support the addition of new canonical forms in the future. (I think I also fixed the Forrest problems, although I haven't been able to verify.)

          graham sanderson added a comment -

          Raymie, thank you; I think we're in agreement. However, the latest patch has class SchemaNormalization with static toParsingForm and static build methods;

          the other static methods reference these specific versions directly.

          Perhaps it would be best (since the guts of the parsing form are in the current build method) to leave SchemaNormalization as a class of static methods, but pass a Normalizer instance to most of the static methods... in this case the current build method would basically be the guts of the ParsingFormNormalizer class

          Note the common Normalizer instances would likely be sensibly named singletons in the SchemaNormalizer class

          Doug Cutting added a comment -

          Raymie> SchemaFingerprint.fingerprint seems unnecessarily long...

          Now this becomes SchemaNormalization.fp(SchemaNormalization.toParsingForm(schema)). The 'fp' might better be spelled out as 'fingerprint'. Also a utility method like SchemaNormalization.parsingFingerprint(schema) might be useful.

          Graham> pass a Normalizer instance...

          With the latest API, someone can already call SchemaNormalization.fingerprint() with a differently normalized schema, so I don't see the need for this. As we add more normalizers to Avro we can add new methods, so I'm not (yet) seeing the advantage of adding a Normalization interface.

          Raymie Stata added a comment -

          Renamed methods as suggested by Doug (convenience method was already in there).

          Doug Cutting added a comment -

          A few minor nits:

          • the 'href' in one of the documentation links is missing its 'h'
          • the WHITESPACE comment should perhaps read, "Eliminate all whitespace in JSON outside of string literals."
          • we might define a nested FingerprintAlgorithm Enum for the implemented fingerprint algorithm names.
          • SchemaNormalization should probably have a private constructor, e.g., 'private SchemaNormalization() {}'
          • the #fingerprint link in the class documentation is broken.

          Otherwise I'm +1 and look forward to committing this early next week unless there are objections.

          Raymie Stata added a comment -

          The string passed to the fingerprint method is in turn passed through to MessageDigest.getInstance. Using an enum here would prevent folks from accessing digest algorithms in the underlying library. I suggest leaving that as-is.

          Otherwise I'll fix the rest of these issues.
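The pass-through design being discussed can be sketched in a few lines of plain JDK code: the algorithm name goes straight to MessageDigest.getInstance, so any digest the JDK provides (MD5, SHA-1, SHA-256, ...) can fingerprint a canonical form. The class and method names below are illustrative, not the committed API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of a digest-based fingerprint keyed by JDK algorithm name.
public class DigestFingerprint {
  static byte[] fingerprint(String algorithm, String canonicalForm) {
    try {
      MessageDigest md = MessageDigest.getInstance(algorithm);
      // Digest the UTF-8 bytes of the canonical schema string.
      return md.digest(canonicalForm.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // unknown algorithm name
    }
  }

  public static void main(String[] args) {
    // "int" in canonical form is the quoted JSON string "int".
    System.out.println(fingerprint("MD5", "\"int\"").length);     // 16
    System.out.println(fingerprint("SHA-256", "\"int\"").length); // 32
  }
}
```

Because the name is forwarded verbatim, any provider-registered digest works without changes to this code, which is the flexibility the enum would have lost.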

          Douglas Kaminsky added a comment -

          Is cross language support in scope for this feature? The trickiest part is getting the output to match between implementations. That's where I see most of the value in this being incorporated into the core framework instead of implemented by individual users (as we do now)

          Raymie Stata added a comment -

          Yes, this is meant to be a cross-language feature, and the definition of "parsing normal form" is in the Avro spec as a result.

          To aid in assuring interoperability, this patch includes file-based test data that should be used across languages. Test inputs and outputs can be found in share/test/data/schema-tests.txt

          Doug Cutting added a comment -

          Patch looks good. +1 I'll commit this unless there are objections.

          Thiruvalluvan M. G. added a comment -

          +1 Looks good to me.

          Doug Cutting added a comment -

          I just committed this. Thanks, Raymie!


            People

            • Assignee:
              Raymie Stata
              Reporter:
              Raymie Stata
            • Votes:
              0
              Watchers:
              13