Details

    • Type: New Feature New Feature
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.6
    • Fix Version/s: None
    • Component/s: spec
    • Labels:
      None
    • Release Note:
      -Cyclic reference support

      Description

      This is a proposed implementation to add cycle reference support to Avro. It basically introduce a new type named Cycle. Cycles contains a string representing the path to the other reference.

      For example if we have an object of type Message that have a member named previous with type Message too. If we have have this hierarchy:

      message
      previous : message2

      message2
      previous : message2

      When serializing the cycle path for "message2.previous" will be "previous".

      The implementation depend on ANTLR to evaluate those cycle at read time to resolve them. I used ANTLR 3.2. This dependency is not mandated; I just used ANTLR to speed thing up. I kept in this implementation the generated code from ANTLR though this should not be the case as this should be generated during the build. I only updated the Java code.

      I did not make full unit testing but you can find "avrotest.Main" class that can be used a preliminary test.

      Please do not hesitate to contact me for further clarification if this seems interresting.

      Best regards,
      Moustapha Cherri

      1. avro-1.4.1-cycle.patch.gz
        41 kB
        Moustapha Cherri
      2. avro-1.4.1-cycle.patch.gz
        43 kB
        Moustapha Cherri
      3. avro_circular_references.zip
        4 kB
        Sachin Goyal
      4. avro_circular_refs_2014_06_14.zip
        7 kB
        Sachin Goyal
      5. circular_refs_and_nonstring_map_keys_2014_06_25.zip
        11 kB
        Sachin Goyal

        Activity

        Hide
        Moustapha Cherri added a comment -

        implementation patch from version 1.4.1

        Show
        Moustapha Cherri added a comment - implementation patch from version 1.4.1
        Hide
        Xiaolu Ye added a comment -

        Moustapha,

        Our project need cycle reference support. Thanks for submitting the patch. When trying to use it, I noticed a bug in GenericData.Array. It needs to override AbstractList.set(int, T) as below section of GenericDatumReader.interpretCycles() uses set call. Currently, it throws UnsupportedOperationException at line 175

        if (record instanceof List)

        { ((List) record).set(Integer.parseInt(field), result); // line 175 }

        else

        { Array.set(record, Integer.parseInt(field), result); }

        Thanks,

        Xiaolu

        Show
        Xiaolu Ye added a comment - Moustapha, Our project need cycle reference support. Thanks for submitting the patch. When trying to use it, I noticed a bug in GenericData.Array. It needs to override AbstractList.set(int, T) as below section of GenericDatumReader.interpretCycles() uses set call. Currently, it throws UnsupportedOperationException at line 175 if (record instanceof List) { ((List) record).set(Integer.parseInt(field), result); // line 175 } else { Array.set(record, Integer.parseInt(field), result); } Thanks, Xiaolu
        Hide
        Moustapha Cherri added a comment -

        Dear Xiaolu,

        Please find the fix to the issue you mentioned. I fixed more issues related to GenericDatumReader too.

        Thanks for your reply and interest in this.

        Best regards,
        Moustapha Cherri

        Show
        Moustapha Cherri added a comment - Dear Xiaolu, Please find the fix to the issue you mentioned. I fixed more issues related to GenericDatumReader too. Thanks for your reply and interest in this. Best regards, Moustapha Cherri
        Hide
        Doug Cutting added a comment -

        The current patch is a specification change, since it adds a new schema type, "cycle". It is back-compatible, but not forward-compatible: implementations that do not implement cycles would be unable to read data that contains cycle schemas.

        Instead, a schema for a cycle reference might be defined. For example, one could define a org.apache.avro.cycles.CycleReference record containing a single integer field. CycleReference would only be used in unions with types that it may refer to. The DatumWriter would keep a IdentityHashMap<Object,Integer> of records, maps and arrays written, adding an entry the first time an instance is seen, and writing a CycleReference for subsequent occurrences in appropriate unions. The DatumReader would then keep an array of all records, maps and arrays that have been read and, when it reads a CycleReference in a union, return a pointer to the indicated element of that array.

        Adding cycles should not slow applications that do not require this feature. They could be implemented in newly defined CycleDatumReader/Writer, or perhaps GenericDatumReader and GenericDatumWriter could be modified to optionally handle such cycles.

        Show
        Doug Cutting added a comment - The current patch is a specification change, since it adds a new schema type, "cycle". It is back-compatible, but not forward-compatible: implementations that do not implement cycles would be unable to read data that contains cycle schemas. Instead, a schema for a cycle reference might be defined. For example, one could define a org.apache.avro.cycles.CycleReference record containing a single integer field. CycleReference would only be used in unions with types that it may refer to. The DatumWriter would keep a IdentityHashMap<Object,Integer> of records, maps and arrays written, adding an entry the first time an instance is seen, and writing a CycleReference for subsequent occurrences in appropriate unions. The DatumReader would then keep an array of all records, maps and arrays that have been read and, when it reads a CycleReference in a union, return a pointer to the indicated element of that array. Adding cycles should not slow applications that do not require this feature. They could be implemented in newly defined CycleDatumReader/Writer, or perhaps GenericDatumReader and GenericDatumWriter could be modified to optionally handle such cycles.
        Hide
        Moustapha Cherri added a comment -

        Dear Doug,

        Thanks for your valuable comments. I agree with you. I will implement them soon. I will build a wrapper around GenericDatumReader and GenericDatumWriter so that applications can inject this aspect as needed. Wrapping will allow application using for example ReflectDatumReader/ ReflectDatumWriter to use this feature without implementing a new special class for them.

        Best regards,
        Moustapha Cherri

        Show
        Moustapha Cherri added a comment - Dear Doug, Thanks for your valuable comments. I agree with you. I will implement them soon. I will build a wrapper around GenericDatumReader and GenericDatumWriter so that applications can inject this aspect as needed. Wrapping will allow application using for example ReflectDatumReader/ ReflectDatumWriter to use this feature without implementing a new special class for them. Best regards, Moustapha Cherri
        Hide
        Sachin Goyal added a comment -

        Support for circular references without modifying the grammar.
        Doug Cutting, this patch takes above recommendations into account.

        Example of a simple circular reference (very common in Java and Hibernate world):
        -------------------
        class Parent

        { String name; Child child; }

        ;

        class Child

        { Parent parent; String city; }

        ;
        -------------------
        Without this fix, Avro correctly generates schema for the above but fails with StackOverflow error when trying to serialize actual objects with circular reference.

        This fix requires following steps for using it:
        1) Allow clients to choose name of the extra-field which every entity can use to store its ID.
        ReflectData rdata = ReflectData.AllowNull.get();
        rdata.setCircularRefIdPrefix("__crefId");

        2) One must pass the same rdata when serializing objects.
        DatumWriter<T> datumWriter = new ReflectDatumWriter (T.class, rdata);

        3) MUST use the same field-name when reading circular-reference objects.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> ();
        GenericData gdata = datumReader.getData();
        gdata.setCircularRefIdPrefix("__crefId");
        If field-name is different, there may be no error but circular references will not be resolved.

        4) Choose if the circular references should be fully resolved or not.
        datumReader.setResolveCircularRefs (false);
        If this is set to true, circular references will point to actual objects forming a cycle.
        If this is set to false, circular references will point to dummy objects having only the circular-reference-ID. This option allows Avro to work everywhere else without running into infinite loops (like GenericRecord.toString()) while providing users enough information to
        form true circular references themselves if required.

        Test-case provided for circular references in List and Maps as well.
        Happy to make modifications if any.

        Show
        Sachin Goyal added a comment - Support for circular references without modifying the grammar. Doug Cutting , this patch takes above recommendations into account. Example of a simple circular reference (very common in Java and Hibernate world): ------------------- class Parent { String name; Child child; } ; class Child { Parent parent; String city; } ; ------------------- Without this fix, Avro correctly generates schema for the above but fails with StackOverflow error when trying to serialize actual objects with circular reference. This fix requires following steps for using it: 1) Allow clients to choose name of the extra-field which every entity can use to store its ID. ReflectData rdata = ReflectData.AllowNull.get(); rdata.setCircularRefIdPrefix("__crefId"); 2) One must pass the same rdata when serializing objects. DatumWriter<T> datumWriter = new ReflectDatumWriter (T.class, rdata); 3) MUST use the same field-name when reading circular-reference objects. GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (); GenericData gdata = datumReader.getData(); gdata.setCircularRefIdPrefix("__crefId"); If field-name is different, there may be no error but circular references will not be resolved. 4) Choose if the circular references should be fully resolved or not. datumReader.setResolveCircularRefs (false); If this is set to true, circular references will point to actual objects forming a cycle. If this is set to false, circular references will point to dummy objects having only the circular-reference-ID. This option allows Avro to work everywhere else without running into infinite loops (like GenericRecord.toString()) while providing users enough information to form true circular references themselves if required. Test-case provided for circular references in List and Maps as well. Happy to make modifications if any.
        Hide
        Sachin Goyal added a comment -

        Avro-Circular-References-1.7.6

        Show
        Sachin Goyal added a comment - Avro-Circular-References-1.7.6
        Hide
        Sachin Goyal added a comment -

        For designs with no circular reference, the above options need not be used and those clients would not be affected thus.

        Show
        Sachin Goyal added a comment - For designs with no circular reference, the above options need not be used and those clients would not be affected thus.
        Hide
        Doug Cutting added a comment -

        This does not appear to write data that other implementations can read. Am I missing something? It also appears to require that, to support circularity, the first field in a record must be a string.

        Rather, I think we must explicitly add the reference to the schema, e.g.:

        {"type":"record", "name":"CircularListOfStrings", "fields":[
          {"name":"value", "type":"string"},
          {"name":"next", "type":["CircularListOfStrings","int"]}
        ]}
        

        A circular instance would appear to existing Avro implementations as a non-circular list of strings terminating with an integer.

        Show
        Doug Cutting added a comment - This does not appear to write data that other implementations can read. Am I missing something? It also appears to require that, to support circularity, the first field in a record must be a string. Rather, I think we must explicitly add the reference to the schema, e.g.: { "type" : "record" , "name" : "CircularListOfStrings" , "fields" :[ { "name" : "value" , "type" : "string" }, { "name" : "next" , "type" :[ "CircularListOfStrings" , " int " ]} ]} A circular instance would appear to existing Avro implementations as a non-circular list of strings terminating with an integer.
        Hide
        Sachin Goyal added a comment -

        For other applications to read, they must set in the field-name used for circular reference:

        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> ();
        GenericData gdata = datumReader.getData();
        gdata.setCircularRefIdPrefix("__crefId");
        

        If they use the above for the reader, then the output would be:

        {
           "__crefId":"0",
           "name":"John Sr",
           "child":{
              "__crefId":"1",
              "name":"John Jr",
              "parent":{
                 "__crefId":"__crefId0",
                 "name":null,
                 "child":null
              }
           }
        }
        

        And if they set the flag:

        datumReader.setResolveCircularRefs (true);
        

        Then the reference to parent is actually replaced with the reference to the actual parent.
        The reason I have made the actual reference optional is because some tools might not be prepared to handle circular reference yet.
        For example, GenericRecord.toString() fails with StackOverflow if the reference is replaced with an actual one.

        Default value for datumReader.setResolveCircularRefs is false, so the output record's type is consistent with the schema and it has enough information for clients to help them deduce circular reference.

        As per this fix, the first field in a record must be a string to support circularity.

        Show
        Sachin Goyal added a comment - For other applications to read, they must set in the field-name used for circular reference: GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (); GenericData gdata = datumReader.getData(); gdata.setCircularRefIdPrefix( "__crefId" ); If they use the above for the reader, then the output would be: { "__crefId":"0", "name":"John Sr", "child":{ "__crefId":"1", "name":"John Jr", "parent":{ "__crefId":"__crefId0", "name": null , "child": null } } } And if they set the flag: datumReader.setResolveCircularRefs ( true ); Then the reference to parent is actually replaced with the reference to the actual parent. The reason I have made the actual reference optional is because some tools might not be prepared to handle circular reference yet. For example, GenericRecord.toString() fails with StackOverflow if the reference is replaced with an actual one. Default value for datumReader.setResolveCircularRefs is false, so the output record's type is consistent with the schema and it has enough information for clients to help them deduce circular reference. As per this fix, the first field in a record must be a string to support circularity.
        Hide
        Sachin Goyal added a comment -

        Doug Cutting, I could not get why other applications would not be able to read it.
        Could you provide an example?
        If they use the following, Avro reader can read the data and so other applications can also.

        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> ();
        GenericData gdata = datumReader.getData();
        gdata.setCircularRefIdPrefix("<id>");
        

        I am actually trying to follow the Jackson way of doing it (We have used this approach for Jackson in several places and it worked well).
        http://java.dzone.com/articles/circular-dependencies-jackson

        Show
        Sachin Goyal added a comment - Doug Cutting , I could not get why other applications would not be able to read it. Could you provide an example? If they use the following, Avro reader can read the data and so other applications can also. GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (); GenericData gdata = datumReader.getData(); gdata.setCircularRefIdPrefix( "<id>" ); I am actually trying to follow the Jackson way of doing it (We have used this approach for Jackson in several places and it worked well). http://java.dzone.com/articles/circular-dependencies-jackson
        Hide
        Doug Cutting added a comment -

        By other implementations I mean C, C++, Python, Perl, C#, Ruby, & PHP. Existing versions of these should be able to read a data file that contains circular references written by Java.

        Perhaps I'm missing something, but it looks like once the reference ID is read the rest of the schema is skipped. If that's correct, then, in order for an existing implementation to be able to read such data it must declare a union of the id type and the record type, to accurately describe what's written.

        Show
        Doug Cutting added a comment - By other implementations I mean C, C++, Python, Perl, C#, Ruby, & PHP. Existing versions of these should be able to read a data file that contains circular references written by Java. Perhaps I'm missing something, but it looks like once the reference ID is read the rest of the schema is skipped. If that's correct, then, in order for an existing implementation to be able to read such data it must declare a union of the id type and the record type, to accurately describe what's written.
        Hide
        Sachin Goyal added a comment -

        Doug Cutting, excellent suggestion.
        I have made the suggested changes in the new patch:

        1) If

        ReflectData.setCircularRefIdPrefix("some-field-name")

        is set, Avro converts each RECORD schema into a UNION schema such that it can either be a record or a string. This should make circular references readable by other implementations of Avro like c, c++, php, ruby etc.
        (AllowNull also works with this).

        2) Added some thread-safe constructs like ThreadLocal for hash-maps used in circular reference code.

        3) Enhanced the test-case (with detailed comments) to test regular deserialization and circular deserialization for a circular-ref serialized Avro.

        Show
        Sachin Goyal added a comment - Doug Cutting , excellent suggestion. I have made the suggested changes in the new patch: 1) If ReflectData.setCircularRefIdPrefix( "some-field-name" ) is set, Avro converts each RECORD schema into a UNION schema such that it can either be a record or a string. This should make circular references readable by other implementations of Avro like c, c++, php, ruby etc. (AllowNull also works with this). 2) Added some thread-safe constructs like ThreadLocal for hash-maps used in circular reference code. 3) Enhanced the test-case (with detailed comments) to test regular deserialization and circular deserialization for a circular-ref serialized Avro.
        Hide
        Sachin Goyal added a comment -

        Circular References

        ------------------------------------

        Serialization
        Extra API required (not optional):

        ReflectData.setCircularRefIdPrefix("some-field-name")

        If the above is set, following happens:

        1. During serialization, each record contains the extra field specified above. The value for this field is just a monotonically increasing number meant to uniquely identify each record in one particular serialization.
        2. While writing schema, each RECORD schema is converted into a UNION schema such that it can either be a record or a string. During object serialization, if a record is seen before, it is not written as a record. Rather it is written as a string in the format: "some-field-name" + "ID-generated-in #1 above". With this structure, reading applications have enough information to restore the circular reference if they want. This structure is also usable by languages not supporting circular reference because they will read that circular-reference as a normal string.
          (AllowNull also works with this).
        3. Above field-name is included in each record as a property. This allows the readers to become aware of this field-name so that the clients do not have to specify this just to populate the circular references. Basically it makes the schema self-sufficient.

        Deserialization
        Extra API required (optional):

        GenericDatumReader.setResolveCircularRefs(boolean)

        Based on #3 above, GenericDatumReader becomes circular-reference-aware.
        But since all GenericDatumReaders share a common GenericData instance, they are provided with another flag "resolveCircularRefs" to control whether they want to resolve circular references or not.
        If this flag is set and the serialized schema has non-null value for circular-reference-field, GenericDatumReader does the following:

        1. If any record has circular-ref-field, store its value and the corresponding record in a map.
        2. Look for unions which can be serialized as a record as well as a string. On finding such a record serialized as a string, replace the string with the record retreived from the map created in #1

        Non-string map-keys

        ----------------------------------------

        Serialization
        No extra API required.

        Without this patch, Avro throws an exception for non-string map-keys.
        This patch converts such maps into an array of records where each record has two fields: key and value. Example:
        Map<ObjX, ObjY> is converted to [{"key":

        {ObjX}

        , "value":{ObjY}}]
        To do this, following is done:

        1. In ReflectData.java, create schema for key as well as value in the non-string hash-map.
          Encapsulate these two schemas into a record schema and create an array schema of such records.
          Set property NS_MAP_FLAG to "1" and store the actual class of the map as a CLASS_PROP
        1. While writing out a non-string map field, if NS_MAP_FLAG is set, convert map to array of records using map.entrySet()

        Deserialization
        No extra API required.

        Deserialization for non-string map-keys is pretty simple since data and the schema match exactly.
        So it just deserializes automatically.
        To create an actual map (like when using ReflectDatumReader with actual-class type-parameter), map is instantiated using CLASS_PROP if the property NS_MAP_FLAG is set to "1"

        Testcases included

        ------------------------------------
        The unit tests cover the following:

        1. Circular references at multiple levels of hierarchy
        2. Circular references within Collections and Maps.
        3. Circular and non-circular deserialization of circularly serialized objects.
        4. Non-string map-keys having circular references.
        5. Non-string map-keys with nested maps.
        Show
        Sachin Goyal added a comment - Circular References ------------------------------------ Serialization Extra API required (not optional): ReflectData.setCircularRefIdPrefix( "some-field-name" ) If the above is set, following happens: During serialization, each record contains the extra field specified above. The value for this field is just a monotonically increasing number meant to uniquely identify each record in one particular serialization. While writing schema, each RECORD schema is converted into a UNION schema such that it can either be a record or a string. During object serialization, if a record is seen before, it is not written as a record. Rather it is written as a string in the format: "some-field-name" + "ID-generated-in #1 above". With this structure, reading applications have enough information to restore the circular reference if they want. This structure is also usable by languages not supporting circular reference because they will read that circular-reference as a normal string. (AllowNull also works with this). Above field-name is included in each record as a property. This allows the readers to become aware of this field-name so that the clients do not have to specify this just to populate the circular references. Basically it makes the schema self-sufficient. Deserialization Extra API required (optional): GenericDatumReader.setResolveCircularRefs( boolean ) Based on #3 above, GenericDatumReader becomes circular-reference-aware. But since all GenericDatumReaders share a common GenericData instance, they are provided with another flag "resolveCircularRefs" to control whether they want to resolve circular references or not. If this flag is set and the serialized schema has non-null value for circular-reference-field, GenericDatumReader does the following: If any record has circular-ref-field, store its value and the corresponding record in a map. Look for unions which can be serialized as a record as well as a string. On finding such a record serialized as a string, replace the string with the record retreived from the map created in #1 Non-string map-keys ---------------------------------------- Serialization No extra API required. Without this patch, Avro throws an exception for non-string map-keys. This patch converts such maps into an array of records where each record has two fields: key and value. Example: Map<ObjX, ObjY> is converted to [{"key": {ObjX} , "value":{ObjY}}] To do this, following is done: In ReflectData.java, create schema for key as well as value in the non-string hash-map. Encapsulate these two schemas into a record schema and create an array schema of such records. Set property NS_MAP_FLAG to "1" and store the actual class of the map as a CLASS_PROP While writing out a non-string map field, if NS_MAP_FLAG is set, convert map to array of records using map.entrySet() Deserialization No extra API required. Deserialization for non-string map-keys is pretty simple since data and the schema match exactly. So it just deserializes automatically. To create an actual map (like when using ReflectDatumReader with actual-class type-parameter), map is instantiated using CLASS_PROP if the property NS_MAP_FLAG is set to "1" Testcases included ------------------------------------ The unit tests cover the following: Circular references at multiple levels of hierarchy Circular references within Collections and Maps. Circular and non-circular deserialization of circularly serialized objects. Non-string map-keys having circular references. Non-string map-keys with nested maps.
        Hide
        Doug Cutting added a comment -

        For some reason I cannot apply the patch file you've generated, so it's hard for me to analyze it in detail. What tool are you using to generate this patch?

        I'd prefer we use an explicit type rather than overload string for this purpose. A union with a record like:

        {"type":"record","name":"org.apache.avro.CircularRef", "fields":[{"name":"ref", "type":int}]}
        

        We don't ever have to create or define such a record. Rather, a CustomEncoding can be used to directly resolve such references at read time. (If circular references are not enabled then a GenericRecord would be read.) No string prefixing, etc. would then be required.

        Rather than modifying ReflectData to support this, might we instead create a subclass of ReflectData that supports circular references?

        Non-string map key support in reflection should be addressed in a separate issue.

        Show
        Doug Cutting added a comment - For some reason I cannot apply the patch file you've generated, so it's hard for me to analyze it in detail. What tool are you using to generate this patch? I'd prefer we use an explicit type rather than overload string for this purpose. A union with a record like: { "type" : "record" , "name" : "org.apache.avro.CircularRef" , "fields" :[{ "name" : "ref" , "type" : int }]} We don't ever have to create or define such a record. Rather, a CustomEncoding can be used to directly resolve such references at read time. (If circular references are not enabled then a GenericRecord would be read.) No string prefixing, etc. would then be required. Rather than modifying ReflectData to support this, might we instead create a subclass of ReflectData that supports circular references? Non-string map key support in reflection should be addressed in a separate issue.
        Hide
        Sachin Goyal added a comment -

        Thanks Doug.

        I created the patch using diff -r and it can be patched using patch -i <patch-file>

        For non-string map-keys, I have submitted a separate patch at https://issues.apache.org/jira/browse/AVRO-680

        Could you please explain your comment regarding circular references in more detail?
        I will change my patch accordingly.

        For example, here is a circular reference schema generated using Avro:

        {
          "type" : "record",
          "name" : "SimpleParent",
          "namespace" : "org.apache.avro.generic",
          "fields" : [ {
            "name" : "parentName",
            "type" : [ "null", "string" ],
            "default" : null
          }, {
            "name" : "child",
            "type" : [ "null", {
              "type" : "record",
              "name" : "SimpleChild",
              "fields" : [ {
                "name" : "childName",
                "type" : [ "null", "string" ],
                "default" : null
              }, {
                "name" : "parent",
                "type" : [ "null", "SimpleParent"],
                "default" : null
              } ]
            }, "string" ],
            "default" : null
          } ]
        }
        

        The current code converts it to the following:

        {
          "type" : "record",
          "name" : "SimpleParent",
          "namespace" : "org.apache.avro.generic",
          "fields" : [ {
            "name" : "__crefId",
            "type" : "string"
          }, {
            "name" : "parentName",
            "type" : [ "null", "string" ],
            "default" : null
          }, {
            "name" : "child",
            "type" : [ "null", {
              "type" : "record",
              "name" : "SimpleChild",
              "fields" : [ {
                "name" : "__crefId",
                "type" : "string"
              }, {
                "name" : "childName",
                "type" : [ "null", "string" ],
                "default" : null
              }, {
                "name" : "parent",
                "type" : [ "null", "SimpleParent", "string" ],
                "default" : null
              } ],
              "circularRefIdPrefix" : "__crefId"
            }, "string" ],
            "default" : null
          } ],
          "circularRefIdPrefix" : "__crefId"
        }
        

        Can you please apply your comments above to this example?
        It will help me in understanding how it would be different from the above solution.

        As per my understanding, circular references can come in any record-type element.
        So CustomeEncoder approach would need to write it as a record sometimes or a string/int sometimes.
        Can a CustomEncoder pass the control to regular Avro Encoder to write a record normally?

        Show
        Sachin Goyal added a comment - Thanks Doug. I created the patch using diff -r and it can be patched using patch -i <patch-file> For non-string map-keys, I have submitted a separate patch at https://issues.apache.org/jira/browse/AVRO-680 Could you please explain your comment regarding circular references in more detail? I will change my patch accordingly. For example, here is a circular reference schema generated using Avro: { "type" : "record", "name" : "SimpleParent", "namespace" : "org.apache.avro.generic", "fields" : [ { "name" : "parentName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "child", "type" : [ " null ", { "type" : "record", "name" : "SimpleChild", "fields" : [ { "name" : "childName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "parent", "type" : [ " null ", "SimpleParent"], " default " : null } ] }, "string" ], " default " : null } ] } The current code converts it to the following: { "type" : "record", "name" : "SimpleParent", "namespace" : "org.apache.avro.generic", "fields" : [ { "name" : "__crefId", "type" : "string" }, { "name" : "parentName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "child", "type" : [ " null ", { "type" : "record", "name" : "SimpleChild", "fields" : [ { "name" : "__crefId", "type" : "string" }, { "name" : "childName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "parent", "type" : [ " null ", "SimpleParent", "string" ], " default " : null } ], "circularRefIdPrefix" : "__crefId" }, "string" ], " default " : null } ], "circularRefIdPrefix" : "__crefId" } Can you please apply your comments above to this example? It will help me in understanding how it would be different from the above solution. As per my understanding, circular references can come in any record-type element. So CustomeEncoder approach would need to write it as a record sometimes or a string/int sometimes. Can a CustomEncoder pass the control to regular Avro Encoder to write a record normally?
        Hide
        Martin Kleppmann added a comment -

        I don't mean to be grouchy, but I'm not convinced this is a feature that Avro really needs to have. It adds complexity to all the language implementations of Avro that need to support it, and it creates difficulties for anyone who needs to map between Avro's data model and another data model (e.g. representing Avro data in JSON format, or accessing Avro data in Pig through Pig's type system).

        For an application that needs to model an arbitrary graph of references between objects in Avro, I would argue that it's actually better for the object IDs to be explicitly visible in the data model, rather than being automatically resolved and thus hidden.

        If there's some really awesome use case for circular references that I'm missing, I'm open to being convinced. But I don't currently see it.

        Show
        Martin Kleppmann added a comment - I don't mean to be grouchy, but I'm not convinced this is a feature that Avro really needs to have. It adds complexity to all the language implementations of Avro that need to support it, and it creates difficulties for anyone who needs to map between Avro's data model and another data model (e.g. representing Avro data in JSON format, or accessing Avro data in Pig through Pig's type system). For an application that needs to model an arbitrary graph of references between objects in Avro, I would argue that it's actually better for the object IDs to be explicitly visible in the data model, rather than being automatically resolved and thus hidden. If there's some really awesome use case for circular references that I'm missing, I'm open to being convinced. But I don't currently see it.
        Hide
        Doug Cutting added a comment -

        I imagine a schema like the following, to permit potentially circular linked lists:

        {"type":"record", "name":"List", "fields":[
          {"name":"nodeName", "type":"string"},
          {"name":"next", 
           "type": ["null", "List", 
                   {"type":"record","name":"org.apache.avro.CircularRef","fields":[{"name":"id","type":"int"}] }] }] }
        

        The writer would add an entry to an IdentityHashMap<Object,Integer> for every sub-record it writes. Whenever it encounters a previously-written record, it writes a ref instead. Similarly, the reader would add each records it reads to an array, and when a ref is read, return the corresponding element of the array.

        Such functionality should entirely be contained in CircularData (subclass of ReflectData) and CircularDatumReader and CircularDatumWriter, with no alterations to existing classes (except perhaps to expose functionality to subclasses if needed).

        Show
        Doug Cutting added a comment - I imagine a schema like the following, to permit potentially circular linked lists: { "type" : "record" , "name" : "List" , "fields" :[ { "name" : "nodeName" , "type" : "string" }, { "name" : "next" , "type" : [ " null " , "List" , { "type" : "record" , "name" : "org.apache.avro.CircularRef" , "fields" :[{ "name" : "id" , "type" : " int " }] }] }] } The writer would add an entry to an IdentityHashMap<Object,Integer> for every sub-record it writes. Whenever it encounters a previously-written record, it writes a ref instead. Similarly, the reader would add each records it reads to an array, and when a ref is read, return the corresponding element of the array. Such functionality should entirely be contained in CircularData (subclass of ReflectData) and CircularDatumReader and CircularDatumWriter, with no alterations to existing classes (except perhaps to expose functionality to subclasses if needed).

          People

          • Assignee:
            Unassigned
            Reporter:
            Moustapha Cherri
          • Votes:
            2 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 672h
              672h
              Remaining:
              Remaining Estimate - 672h
              672h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development