Details

    • Type: New Feature
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java
    • Labels:
      None

      Description

      A schema for schemas would permit schemas to be written in binary.

      1. AVRO-251.patch
        17 kB
        Doug Cutting
      2. AVRO-251.patch
        19 kB
        Doug Cutting
      3. AVRO-251.patch
        5 kB
        Doug Cutting

        Issue Links

          Activity

          Doug Cutting added a comment -

          Here's an early version of this. It defines org.apache.avro.meta.SchemaDef. We still need a method that converts this to an org.apache.avro.Schema, which should not be hard to write.

          Alternately, we could rely on ValidatingEncoder and ValidatingDecoder, and add methods directly to Schema.java to read/write a schema in binary according to this schema declaration.

          Doug Cutting added a comment -

          Here's a new, complete version of this. It directly encodes and decodes schemas.

          JsonNode#equals() doesn't seem to do what I expect, so I had to modify Schema.Field.equals() to use string comparison of default values. I wonder if we even need to serialize default values. The typical application of this code is to efficiently store the schema used to write data with that data. In this case, the default value is never used and just adds baggage. (Default values are only used when reading data written by a schema that lacks that value.) This patch would also be simplified somewhat if we didn't need to read/write arbitrary json for default values, although it might be a shame to discard this json-in-avro implementation.
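           The string-comparison workaround described above can be sketched as follows. This is a minimal illustration, not Avro's actual Schema.Field code; the class and field names are hypothetical:

```java
import java.util.Objects;

// Hypothetical sketch: compare JSON default values by their string form,
// since JsonNode#equals() did not behave as expected in this patch.
class FieldDefault {
    final String name;
    final String defaultJson; // JSON text of the default value; may be null

    FieldDefault(String name, String defaultJson) {
        this.name = name;
        this.defaultJson = defaultJson;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FieldDefault)) return false;
        FieldDefault that = (FieldDefault) o;
        return name.equals(that.name)
            // compare defaults as strings rather than via JsonNode#equals()
            && Objects.equals(defaultJson, that.defaultJson);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, defaultJson);
    }
}
```

           String comparison is stricter than JSON equality (e.g. `1.0` vs `1.00` differ as strings), which is an acceptable trade-off for an equals() used in tests.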

          Thiruvalluvan M. G. added a comment -

           It's unfortunate that if you serialize an Avro schema using the JSON encoder and the schema for schemas, it won't match the standard JSON version of the schema. I don't think it'll be possible to fix the JSON encoder to match the standard JSON version, but we can do the other way around. Though it's desirable to have this property, it'll cause incompatibility with the existing schemas, and I think the schema will become harder for humans to read.

          Since the "order" field of the record is optional, it should be a union of null and Enum in the schema.

          Philip Zeyliger added a comment -

          I haven't had a chance to fully grok the changes to Schema.java yet (particularly the writeSchema() code), but here's what I've been reviewing so far...

          I wonder if we even need to serialize default values.

          It depends on what the goal of a binary schema is. If we want to preserve the entire nuance of the writing schema, we should keep the defaults. In protocol buffers, it's sometimes the case that the default values aren't materialized, because a missing value gets read as the default (this assumes same read and write schema). That's not the case here: it's actually valid to evolve a schema by changing the default value (this is not really valid in PBs).

          Schema.Field.equals() to use string comparison of default values

          This deserves a quick comment in the code. It's nuanced enough that someone might trip up over it later.

          names.space(savedSpace); // restore space

          I suspect this was a bug before (not popping back the previous space). I didn't see a test for it.

          avro/Schema.m4

          I should finish off AVRO-152 so there could be comments in here.

          One thing I noticed is that you don't store "other" fields. For example, in AVRO-152 I introduce adding "doc" to records and protocols. And, of course, you've been using the ability to write arbitrary extra fields to implement the reflection API. I could see either storing or not storing extra stuff; I think it's quite similar to the question of whether or not to store defaults.

          Do you think it would be useful to have a hook at the schema type to have a way to evolve it? Should it be a union?

          src/schemata/org/json/Json.avsc: {"type": "record", "name": "Field",

           I think you should call this "object" to follow the JSON terminology. See http://www.json.org/fatfree.html.

          I'm also not sure about putting this into "org.json". "org.apache.avro.json", perhaps?

          //System.out.println(new String(data, "UTF-8"));

          Ditch it?

          Philip Zeyliger added a comment -

          More thoughts:

          My main reservation here is using the Decoder/Encoder API (instead of a higher-level one) for decoding and encoding the binary form.

          If you're curious, Protocol Buffer's version of the PB that describes PBs is at http://code.google.com/p/protobuf/source/browse/trunk/src/google/protobuf/descriptor.proto

          Default values

           Thinking about default values once more, I think you could encode them in binary: store them in a "bytes" type, and force users to decode them according to the schema that the default value corresponds to. More compact, and no JSON conceptual dependency. (This pattern is sometimes useful more generally: "dynamic schemas" within an Avro file.)
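           To make the size argument concrete: Avro encodes ints as zig-zag varints, so a binary-encoded small default occupies only a byte or two, versus several bytes of JSON text. The helper below is a standalone sketch of that encoding, not Avro's API:

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch of Avro's int encoding: zig-zag the value so small
// magnitudes get small codes, then write it as a base-128 varint.
class DefaultBytes {
    static byte[] encodeInt(int n) {
        long z = ((n << 1) ^ (n >> 31)) & 0xFFFFFFFFL; // zig-zag, as unsigned 32-bit
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {                 // more than 7 bits remain
            out.write((int) ((z & 0x7F) | 0x80));   // set continuation bit
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }
}
```

           Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ..., so a default of 0 costs exactly one byte.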

          public abstract void writeIndex(int unionIndex)

           Is Encoder.java a public API? Is it too late to rename writeIndex to writeUnionTag? I had to scratch my head there until I looked at the code.

          private static final int NAME_INDEX = Type.values().length;

          From a maintainability perspective, it scares me that Type.values() has to be kept in sync with the order of the types in the union in Schema.m4. The "string" that corresponds to NAME_INDEX in Schema.m4 deserves a comment; it's a pretty different category than the others. It seems that instead of string, it could be a record (that contains a string) called "reference", or just use a comment.

          writeSchema...

          This method uses the Encoder API to write an AVRO record. Is there a reason not to use the Specific or Generic APIs instead?

          I think the same thing about readSchema().

          readJson()

          Is this method decoding AVRO's json schema into JSON objects? You should spell that out more clearly; there are a lot of different things json could refer to in this class.

          Doug Cutting added a comment -

          Thiru> it won't match the standard JSON version of the schema [ ... ]

          Yes, but I don't expect folks to use this for anything but binary (which I should better document). This schema representation is really only useful when schemas are written out frequently, and should then be written more efficiently than they can be with JSON. It's not aesthetically ideal, but still useful, and I fear forcing the JSON encodings to correspond would be impractical for reasons you indicate.

          Thiru> Since the "order" field of the record is optional, it should be a union of null and Enum in the schema.

          It's not optional. It has a default value in the Java API, but it's always specified for every field. If it were such a union then its binary encoding would be larger.

          Philip> If we want to preserve the entire nuance of the writing schema, we should keep the defaults.

          Note that this currently does not preserve every nuance, e.g., user properties. So my vote is to remove default values as well.

          A container that stores lots of binary schemas would be wise to include one copy of the JSON schema-for-schemas, so that, if that meta-schema changes, the data can still be processed. This representation should contain only what's required to decode data, and its documentation should note that. It should only contain things that are required of "actual" schemas, and applications can interpret with an "expected" schema that contains default values, user properties, etc.

          Philip> I suspect this was a bug before (not popping back the previous space). I didn't see a test for it.

          Good point. Yes, it is a pre-existing bug triggered here by the inclusion of the Json schema from a different namespace in the Schema schema, so this patch does add a test, but not an explicit one. I've filed AVRO-255 for this. (This schema also identified AVRO-256.)

          Philip> Do you think it would be useful to have a hook at the schema type to have a way to evolve it? Should it be a union?

           It is in fact a union, wrapped in a record to give it a name. But, no, I think the best way to evolve it is for containers to keep a copy of the meta schema, as with any other schema, and build on Avro's evolution mechanisms. Worst case, if an application neglects to store a copy and this schema changes in a subsequent release, the application can always go back to the prior release and retrieve it and use it as the "actual" schema while using the new release's version as "expected".

          Philip> I think you should call this "object" to follow the JSON terminology.

          It's not a JSON object but a "key value pair" within an object. So perhaps we should call it KeyValuePair? Note though that if we drop default values we don't need this schema here anyway. This code could however still be useful for a command-line tool that takes arbitrary Json input and encodes it in an Avro data file, and then for MapReduce programs that process Json-format generic data. It's nice to see how little code is required to incorporate full JSON data into Avro.

          Philip> From a maintainability perspective, it scares me that Type.values() has to be kept in sync with the order of the types in the union in Schema.m4

          There are two things that protect us here:

           • We have unit tests that attempt to exhaustively test this encoding/decoding code. So in that sense it's as safe as our DatumReader/DatumWriter implementations. But if this were our only protection, it would not be a programming style we'd recommend for applications. Except,
          • We have ValidatingEncoder and ValidatingDecoder. Folks who use this programming style who do not develop exhaustive test suites should be strongly advised to only use it with ValidatingEncoder and ValidatingDecoder. When I initially wrote this code there were a few errors (I forgot to read/write null values) that were caught by the validator.

          This "event based" programming style requires only a bit more coding than wrapper classes, but saves a level of redirection and/or copies. It also works well for applications whose serialized data does not correspond directly to in-memory structures (e.g., a streaming application with array values too large to fit in memory). The validators make it practical.

          Philip> It seems that instead of string, it could be a record (that contains a string) called "reference"

          That's a good idea. Using the event-style, the code need not actually change, nor would the serialized representation, only the schema.

          Philip Zeyliger added a comment -

          Does the serialization and deserialization to binary schemas belong in Schema.java or does it belong in a nearby class? I think the usecase for it (I know you have one in mind, and we're hinting at it in this JIRA) ought to be spelled out in the JavaDoc for the appropriate methods.

          Note that this currently does not preserve every nuance, e.g., user properties. So my vote is to remove default values as well.

          If you're not preserving user properties, I'm +1 for killing the defaults. This leaves us in a place where we have representations of schemas that, without other representations, we can't read data with. (The way I think of it, we always need two schemas: the schema the data was written with, and the schema the data is being read with. We can use the binary version for the former, but not the latter. Is that right? Do we have names for these two schemas?)

          If you were inclined towards keeping the defaults, I would keep pushing for storing them as avro-encoded binary bytes.

          It's nice to see how little code is required to incorporate full JSON data into Avro.

           Yes, it's reassuring that JSON itself has a small schema. I'm +1 for taking this out of this patch, but separately producing a tool to represent "binary JSON" in Avro.

          Just to be sure we've thought of it, one alternative is to ditch the whole binary representation and store the original schema in Avro-encoded binary JSON. I actually prefer schemas to be typed.

          This "event based" programming style requires only a bit more coding than wrapper classes, but saves a level of redirection and/or copies.

           I appreciate that with ValidatingEncoder we get a sense of security. But I have a hard time buying the performance argument here. I think you would agree that using either the specific (my preference) API or the generic API would be clearer from a code perspective. If the performance of the specific API is crap, then we need to measure it and fix it: after all, that is the API Avro recommends people use. Considering that the set of schemas in a program should have small cardinality, and the binary representation could be cached, speed doesn't seem paramount here.

          I agree that event-based models are very useful for things that, say, don't fit into memory readily. Schemas pretty much have to fit into memory readily, so I don't think the case applies here.

          Doug Cutting added a comment -

          > I think the usecase for it (I know you have one in mind, and we're hinting at it in this JIRA)

          I think you're referring to AVRO-160. That's indeed what instigated this last week, but over the weekend I've since had second thoughts about actually using it there. I still believe this is useful for some applications, but, if folks agree with me about AVRO-160, then committing this is not urgent.

          > one alternative is to ditch the whole binary representation and store the original schema in Avro-encoded binary JSON.

           I like that idea. It'd be bigger than with the schema here, since all of the Avro keywords will be included, but it will still be considerably smaller and faster than textual JSON. Plus the specification for JSON is much less likely to change, so its schema is likely to be much more stable, and hence it's less risky to assume that schema as a system constant.

          > I think you would agree that using either the specific (my preference) API or the generic API would be clearer from a code perspective.

          Perhaps a bit, but not much. It adds an intermediate representation, which has some cognitive overhead, which this code does not. This code instead requires some understanding of Avro's encoder/decoder API. I don't think that would reduce the code size by more than perhaps 10%, and I don't think it would be much more robust. Efficiently mapping the union branch classes to Schema subclasses would require something like a Map<Class,Schema.Type>. This table could be built by processing the schema, rather than as this patch does by assuming that the Schema.Type enum is sync'd with the union. But we could change this patch to build that mapping from the schema too if we are particularly concerned about that.
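           The schema-derived mapping Doug describes could be sketched like this. The names are hypothetical; the real patch would map union branch classes to Schema.Type, but the principle — deriving indices from the schema's declared order rather than trusting enum ordinals — is the same:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: build the branch-name-to-index table by walking the
// union's declared order in the schema, instead of assuming it stays in
// sync with an enum declaration elsewhere.
class BranchIndex {
    private final Map<String, Integer> byName = new HashMap<>();

    BranchIndex(List<String> unionBranchNamesInSchemaOrder) {
        for (int i = 0; i < unionBranchNamesInSchemaOrder.size(); i++)
            byName.put(unionBranchNamesInSchemaOrder.get(i), i);
    }

    int indexOf(String branchName) {
        Integer i = byName.get(branchName);
        if (i == null)
            throw new IllegalArgumentException("unknown branch: " + branchName);
        return i; // remains correct even if an enum's order changes
    }
}
```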

          I actually generated the specific code first, and considered writing it that way, but it felt like more work to me.

          Scott Carey added a comment -

          I am using DataFileReader/Writer and the header is about 5K in size because the whole schema is in text.

           I'm not sure if the approach in this ticket is best for the file format, but some way to persist a schema in a compact form would be useful. A binary format would be smaller, but every field and type would still have to be there in text. Maybe, for the data file, we could just store the schema as the string, deflate compressed. That might be computationally more expensive than a compact schema representation, but it could be clean in general: if the first character in a byte[] that represents a schema is a special marker value (one that is invalid in JSON), then the remaining bytes are compressed JSON; otherwise it's UTF-8 JSON.

          My largest schema is 6.3k as a string including whitespace 'pretty printed', and 4.9k without whitespace as printed by Schema.toString().
          It is 1.3k compressed by gzip -5 or higher, and 1.5k by gzip -1.
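           The marker-byte idea above can be sketched with java.util.zip. This is a hypothetical layout, not Avro's file format; it relies on the fact that 0x00 can never begin valid JSON text, so it is safe as a compression flag:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical layout: a leading 0x00 byte flags deflate-compressed schema
// bytes; anything else is treated as plain UTF-8 JSON.
class SchemaBytes {
    static final byte COMPRESSED_MARKER = 0x00;

    static byte[] encode(String schemaJson, boolean compress) {
        byte[] utf8 = schemaJson.getBytes(StandardCharsets.UTF_8);
        if (!compress) return utf8;
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(utf8);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(COMPRESSED_MARKER);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        if (data.length == 0 || data[0] != COMPRESSED_MARKER)
            return new String(data, StandardCharsets.UTF_8); // plain JSON
        Inflater inflater = new Inflater();
        inflater.setInput(data, 1, data.length - 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf));
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed schema", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

           Since decode() accepts both forms, a reader could handle old plain-text headers and new compressed ones transparently.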

          Doug Cutting added a comment -

          Might we instead add a metadata field like, "avro.schema.codec"?

          Show
          Doug Cutting added a comment - Might we instead add a metadata field like, "avro.schema.codec"?
          Scott Carey added a comment -

          Yeah, in the file format case that is simpler.

          In the general case of serializing a schema compactly, compression may be an option. Schema could have a constructor or factory that takes a byte[], and a corresponding serializer method that returns the schema as a set of compressed bytes.
          Nick Palmer added a comment -

          While it is certainly nice to be able to serialize to binary, I suspect that having complete metadata will enable a lot more applications than you would expect. I am working on one in particular where I need this schema, but I need it to be complete; I need the "other" fields mentioned above. As it stands, this patch omits things my application needs:

          Record: name, namespace, and doc fields.
          Field: doc.
          Enum: namespace.
          Order is not a union with null. (Already mentioned)

          For what I am writing, I need these in the metadata to drive things, and it would be nice if they were offered as officially supported metadata, so that they are maintained as Avro evolves and my application can adapt to new versions via this metadata when I drop in a new Avro jar, instead of having to maintain it myself.
          Noble Paul added a comment -

          We have a use case where we send messages (over a message bus) using the Avro format. Because these are one-way messages, a handshake is not an option. With the binary encoding we should be able to dramatically reduce the payload.
          Doug Cutting added a comment -

          Paul, in the current patch the binary schema format does not preserve all schema attributes (e.g., documentation strings), keeping the representation smaller and faster. If all attributes are required, one could use the read/writeJson() methods in this class to serialize schemas, or we could enhance the patch.

          Another approach to one-way messaging is to have senders' schemas registered in a shared database and refer to them by a numeric ID in each message sent. The receiver can then look up the ID in the database. Access to the database is cached.

          Nick, the current patch does not preserve documentation, but it does preserve namespace and order correctly. The patch is out of date with trunk, however.
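          The shared-database approach above can be sketched as a cached lookup table. This is an in-memory stand-in with hypothetical names; a real deployment would back both maps with an external store shared by senders and receivers, with receivers caching lookup results locally.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy in-memory schema registry: senders register a schema once and embed
 *  only its numeric ID in each message; receivers resolve the ID on demand. */
public class SchemaRegistry {
  private final Map<Integer, String> byId = new ConcurrentHashMap<>();
  private final Map<String, Integer> byJson = new ConcurrentHashMap<>();
  private final AtomicInteger nextId = new AtomicInteger();

  /** Register a schema (by its JSON text), returning a stable numeric ID. */
  public int register(String schemaJson) {
    return byJson.computeIfAbsent(schemaJson, json -> {
      int id = nextId.incrementAndGet();
      byId.put(id, json);
      return id;
    });
  }

  /** Resolve an ID back to schema JSON; receivers would cache this result. */
  public String lookup(int id) {
    String json = byId.get(id);
    if (json == null) throw new IllegalArgumentException("unknown schema id " + id);
    return json;
  }
}
```

          Each message then carries a few bytes of ID instead of a multi-kilobyte schema, at the cost of the registry being a shared dependency, which is exactly the deployment concern raised in the next comment.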
          Noble Paul added a comment -

          I see that the current patch is not complete. The most important requirement is to keep it as compact as possible. All we need to deserialize a message is the field names and types.

          Another approach to one-way messaging is to have senders' schemas registered in a shared database..

          We are exploring that approach. The problem is that it adds an extra element to the deployment (which can be another failure point in the system). Moreover, it is error-prone if we rely on manually keeping the repository up to date.
          Doug Cutting added a comment -

          I think the patch is/was complete; it just no longer applies cleanly to trunk. If it looks like it meets your needs, I could work to update it.
          Noble Paul added a comment -

          It would be helpful if this patch could be brought up to date with trunk.
          Doug Cutting added a comment -

          Here's an updated version that applies to trunk. It fails two tests: one in testRecord, which is expected, since it doesn't preserve record properties. The other is in testComplexUnions, and I've not yet debugged that. But it's probably still good enough for you to evaluate.
          Noble Paul added a comment -

          Thanks Doug, I'll test this and post my findings.

            People

            • Assignee: Doug Cutting
            • Reporter: Doug Cutting
            • Votes: 0
            • Watchers: 7