Details

    • Type: New Feature New Feature
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7.6
    • Fix Version/s: None
    • Component/s: spec
    • Labels:
      None
    • Release Note:
      -Cyclic reference support

      Description

      This is a proposed implementation to add cycle reference support to Avro. It basically introduce a new type named Cycle. Cycles contains a string representing the path to the other reference.

      For example if we have an object of type Message that have a member named previous with type Message too. If we have have this hierarchy:

      message
      previous : message2

      message2
      previous : message2

      When serializing the cycle path for "message2.previous" will be "previous".

      The implementation depend on ANTLR to evaluate those cycle at read time to resolve them. I used ANTLR 3.2. This dependency is not mandated; I just used ANTLR to speed thing up. I kept in this implementation the generated code from ANTLR though this should not be the case as this should be generated during the build. I only updated the Java code.

      I did not make full unit testing but you can find "avrotest.Main" class that can be used a preliminary test.

      Please do not hesitate to contact me for further clarification if this seems interresting.

      Best regards,
      Moustapha Cherri

      1. avro-1.4.1-cycle.patch.gz
        41 kB
        Moustapha Cherri
      2. avro-1.4.1-cycle.patch.gz
        43 kB
        Moustapha Cherri
      3. avro_circular_references.zip
        4 kB
        Sachin Goyal
      4. avro_circular_refs_2014_06_14.zip
        7 kB
        Sachin Goyal
      5. circular_refs_and_nonstring_map_keys_2014_06_25.zip
        11 kB
        Sachin Goyal
      6. avro_circular_refs6.patch
        40 kB
        Sachin Goyal
      7. PERF_8000_cycles.zip
        7 kB
        Sachin Goyal
      8. avro_circular_refs7.patch
        38 kB
        Sachin Goyal
      9. AVRO-695.patch
        37 kB
        Doug Cutting
      10. AVRO-695.patch
        35 kB
        Sachin Goyal

        Activity

        Hide
        Moustapha Cherri added a comment -

        implementation patch from version 1.4.1

        Show
        Moustapha Cherri added a comment - implementation patch from version 1.4.1
        Hide
        Xiaolu Ye added a comment -

        Moustapha,

        Our project need cycle reference support. Thanks for submitting the patch. When trying to use it, I noticed a bug in GenericData.Array. It needs to override AbstractList.set(int, T) as below section of GenericDatumReader.interpretCycles() uses set call. Currently, it throws UnsupportedOperationException at line 175

        if (record instanceof List)

        { ((List) record).set(Integer.parseInt(field), result); // line 175 }

        else

        { Array.set(record, Integer.parseInt(field), result); }

        Thanks,

        Xiaolu

        Show
        Xiaolu Ye added a comment - Moustapha, Our project need cycle reference support. Thanks for submitting the patch. When trying to use it, I noticed a bug in GenericData.Array. It needs to override AbstractList.set(int, T) as below section of GenericDatumReader.interpretCycles() uses set call. Currently, it throws UnsupportedOperationException at line 175 if (record instanceof List) { ((List) record).set(Integer.parseInt(field), result); // line 175 } else { Array.set(record, Integer.parseInt(field), result); } Thanks, Xiaolu
        Hide
        Moustapha Cherri added a comment -

        Dear Xiaolu,

        Please find the fix to the issue you mentioned. I fixed more issues related to GenericDatumReader too.

        Thanks for your reply and interest in this.

        Best regards,
        Moustapha Cherri

        Show
        Moustapha Cherri added a comment - Dear Xiaolu, Please find the fix to the issue you mentioned. I fixed more issues related to GenericDatumReader too. Thanks for your reply and interest in this. Best regards, Moustapha Cherri
        Hide
        Doug Cutting added a comment -

        The current patch is a specification change, since it adds a new schema type, "cycle". It is back-compatible, but not forward-compatible: implementations that do not implement cycles would be unable to read data that contains cycle schemas.

        Instead, a schema for a cycle reference might be defined. For example, one could define a org.apache.avro.cycles.CycleReference record containing a single integer field. CycleReference would only be used in unions with types that it may refer to. The DatumWriter would keep a IdentityHashMap<Object,Integer> of records, maps and arrays written, adding an entry the first time an instance is seen, and writing a CycleReference for subsequent occurrences in appropriate unions. The DatumReader would then keep an array of all records, maps and arrays that have been read and, when it reads a CycleReference in a union, return a pointer to the indicated element of that array.

        Adding cycles should not slow applications that do not require this feature. They could be implemented in newly defined CycleDatumReader/Writer, or perhaps GenericDatumReader and GenericDatumWriter could be modified to optionally handle such cycles.

        Show
        Doug Cutting added a comment - The current patch is a specification change, since it adds a new schema type, "cycle". It is back-compatible, but not forward-compatible: implementations that do not implement cycles would be unable to read data that contains cycle schemas. Instead, a schema for a cycle reference might be defined. For example, one could define a org.apache.avro.cycles.CycleReference record containing a single integer field. CycleReference would only be used in unions with types that it may refer to. The DatumWriter would keep a IdentityHashMap<Object,Integer> of records, maps and arrays written, adding an entry the first time an instance is seen, and writing a CycleReference for subsequent occurrences in appropriate unions. The DatumReader would then keep an array of all records, maps and arrays that have been read and, when it reads a CycleReference in a union, return a pointer to the indicated element of that array. Adding cycles should not slow applications that do not require this feature. They could be implemented in newly defined CycleDatumReader/Writer, or perhaps GenericDatumReader and GenericDatumWriter could be modified to optionally handle such cycles.
        Hide
        Moustapha Cherri added a comment -

        Dear Doug,

        Thanks for your valuable comments. I agree with you. I will implement them soon. I will build a wrapper around GenericDatumReader and GenericDatumWriter so that applications can inject this aspect as needed. Wrapping will allow application using for example ReflectDatumReader/ ReflectDatumWriter to use this feature without implementing a new special class for them.

        Best regards,
        Moustapha Cherri

        Show
        Moustapha Cherri added a comment - Dear Doug, Thanks for your valuable comments. I agree with you. I will implement them soon. I will build a wrapper around GenericDatumReader and GenericDatumWriter so that applications can inject this aspect as needed. Wrapping will allow application using for example ReflectDatumReader/ ReflectDatumWriter to use this feature without implementing a new special class for them. Best regards, Moustapha Cherri
        Hide
        Sachin Goyal added a comment -

        Support for circular references without modifying the grammar.
        Doug Cutting, this patch takes above recommendations into account.

        Example of a simple circular reference (very common in Java and Hibernate world):
        -------------------
        class Parent

        { String name; Child child; }

        ;

        class Child

        { Parent parent; String city; }

        ;
        -------------------
        Without this fix, Avro correctly generates schema for the above but fails with StackOverflow error when trying to serialize actual objects with circular reference.

        This fix requires following steps for using it:
        1) Allow clients to choose name of the extra-field which every entity can use to store its ID.
        ReflectData rdata = ReflectData.AllowNull.get();
        rdata.setCircularRefIdPrefix("__crefId");

        2) One must pass the same rdata when serializing objects.
        DatumWriter<T> datumWriter = new ReflectDatumWriter (T.class, rdata);

        3) MUST use the same field-name when reading circular-reference objects.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> ();
        GenericData gdata = datumReader.getData();
        gdata.setCircularRefIdPrefix("__crefId");
        If field-name is different, there may be no error but circular references will not be resolved.

        4) Choose if the circular references should be fully resolved or not.
        datumReader.setResolveCircularRefs (false);
        If this is set to true, circular references will point to actual objects forming a cycle.
        If this is set to false, circular references will point to dummy objects having only the circular-reference-ID. This option allows Avro to work everywhere else without running into infinite loops (like GenericRecord.toString()) while providing users enough information to
        form true circular references themselves if required.

        Test-case provided for circular references in List and Maps as well.
        Happy to make modifications if any.

        Show
        Sachin Goyal added a comment - Support for circular references without modifying the grammar. Doug Cutting , this patch takes above recommendations into account. Example of a simple circular reference (very common in Java and Hibernate world): ------------------- class Parent { String name; Child child; } ; class Child { Parent parent; String city; } ; ------------------- Without this fix, Avro correctly generates schema for the above but fails with StackOverflow error when trying to serialize actual objects with circular reference. This fix requires following steps for using it: 1) Allow clients to choose name of the extra-field which every entity can use to store its ID. ReflectData rdata = ReflectData.AllowNull.get(); rdata.setCircularRefIdPrefix("__crefId"); 2) One must pass the same rdata when serializing objects. DatumWriter<T> datumWriter = new ReflectDatumWriter (T.class, rdata); 3) MUST use the same field-name when reading circular-reference objects. GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (); GenericData gdata = datumReader.getData(); gdata.setCircularRefIdPrefix("__crefId"); If field-name is different, there may be no error but circular references will not be resolved. 4) Choose if the circular references should be fully resolved or not. datumReader.setResolveCircularRefs (false); If this is set to true, circular references will point to actual objects forming a cycle. If this is set to false, circular references will point to dummy objects having only the circular-reference-ID. This option allows Avro to work everywhere else without running into infinite loops (like GenericRecord.toString()) while providing users enough information to form true circular references themselves if required. Test-case provided for circular references in List and Maps as well. Happy to make modifications if any.
        Hide
        Sachin Goyal added a comment -

        Avro-Circular-References-1.7.6

        Show
        Sachin Goyal added a comment - Avro-Circular-References-1.7.6
        Hide
        Sachin Goyal added a comment -

        For designs with no circular reference, the above options need not be used and those clients would not be affected thus.

        Show
        Sachin Goyal added a comment - For designs with no circular reference, the above options need not be used and those clients would not be affected thus.
        Hide
        Doug Cutting added a comment -

        This does not appear to write data that other implementations can read. Am I missing something? It also appears to require that, to support circularity, the first field in a record must be a string.

        Rather, I think we must explicitly add the reference to the schema, e.g.:

        {"type":"record", "name":"CircularListOfStrings", "fields":[
          {"name":"value", "type":"string"},
          {"name":"next", "type":["CircularListOfStrings","int"]}
        ]}
        

        A circular instance would appear to existing Avro implementations as a non-circular list of strings terminating with an integer.

        Show
        Doug Cutting added a comment - This does not appear to write data that other implementations can read. Am I missing something? It also appears to require that, to support circularity, the first field in a record must be a string. Rather, I think we must explicitly add the reference to the schema, e.g.: { "type" : "record" , "name" : "CircularListOfStrings" , "fields" :[ { "name" : "value" , "type" : "string" }, { "name" : "next" , "type" :[ "CircularListOfStrings" , " int " ]} ]} A circular instance would appear to existing Avro implementations as a non-circular list of strings terminating with an integer.
        Hide
        Sachin Goyal added a comment -

        For other applications to read, they must set in the field-name used for circular reference:

        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> ();
        GenericData gdata = datumReader.getData();
        gdata.setCircularRefIdPrefix("__crefId");
        

        If they use the above for the reader, then the output would be:

        {
           "__crefId":"0",
           "name":"John Sr",
           "child":{
              "__crefId":"1",
              "name":"John Jr",
              "parent":{
                 "__crefId":"__crefId0",
                 "name":null,
                 "child":null
              }
           }
        }
        

        And if they set the flag:

        datumReader.setResolveCircularRefs (true);
        

        Then the reference to parent is actually replaced with the reference to the actual parent.
        The reason I have made the actual reference optional is because some tools might not be prepared to handle circular reference yet.
        For example, GenericRecord.toString() fails with StackOverflow if the reference is replaced with an actual one.

        Default value for datumReader.setResolveCircularRefs is false, so the output record's type is consistent with the schema and it has enough information for clients to help them deduce circular reference.

        As per this fix, the first field in a record must be a string to support circularity.

        Show
        Sachin Goyal added a comment - For other applications to read, they must set in the field-name used for circular reference: GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (); GenericData gdata = datumReader.getData(); gdata.setCircularRefIdPrefix( "__crefId" ); If they use the above for the reader, then the output would be: { "__crefId":"0", "name":"John Sr", "child":{ "__crefId":"1", "name":"John Jr", "parent":{ "__crefId":"__crefId0", "name": null , "child": null } } } And if they set the flag: datumReader.setResolveCircularRefs ( true ); Then the reference to parent is actually replaced with the reference to the actual parent. The reason I have made the actual reference optional is because some tools might not be prepared to handle circular reference yet. For example, GenericRecord.toString() fails with StackOverflow if the reference is replaced with an actual one. Default value for datumReader.setResolveCircularRefs is false, so the output record's type is consistent with the schema and it has enough information for clients to help them deduce circular reference. As per this fix, the first field in a record must be a string to support circularity.
        Hide
        Sachin Goyal added a comment -

        Doug Cutting, I could not get why other applications would not be able to read it.
        Could you provide an example?
        If they use the following, Avro reader can read the data and so other applications can also.

        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> ();
        GenericData gdata = datumReader.getData();
        gdata.setCircularRefIdPrefix("<id>");
        

        I am actually trying to follow the Jackson way of doing it (We have used this approach for Jackson in several places and it worked well).
        http://java.dzone.com/articles/circular-dependencies-jackson

        Show
        Sachin Goyal added a comment - Doug Cutting , I could not get why other applications would not be able to read it. Could you provide an example? If they use the following, Avro reader can read the data and so other applications can also. GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord> (); GenericData gdata = datumReader.getData(); gdata.setCircularRefIdPrefix( "<id>" ); I am actually trying to follow the Jackson way of doing it (We have used this approach for Jackson in several places and it worked well). http://java.dzone.com/articles/circular-dependencies-jackson
        Hide
        Doug Cutting added a comment -

        By other implementations I mean C, C++, Python, Perl, C#, Ruby, & PHP. Existing versions of these should be able to read a data file that contains circular references written by Java.

        Perhaps I'm missing something, but it looks like once the reference ID is read the rest of the schema is skipped. If that's correct, then, in order for an existing implementation to be able to read such data it must declare a union of the id type and the record type, to accurately describe what's written.

        Show
        Doug Cutting added a comment - By other implementations I mean C, C++, Python, Perl, C#, Ruby, & PHP. Existing versions of these should be able to read a data file that contains circular references written by Java. Perhaps I'm missing something, but it looks like once the reference ID is read the rest of the schema is skipped. If that's correct, then, in order for an existing implementation to be able to read such data it must declare a union of the id type and the record type, to accurately describe what's written.
        Hide
        Sachin Goyal added a comment -

        Doug Cutting, excellent suggestion.
        I have made the suggested changes in the new patch:

        1) If

        ReflectData.setCircularRefIdPrefix("some-field-name")

        is set, Avro converts each RECORD schema into a UNION schema such that it can either be a record or a string. This should make circular references readable by other implementations of Avro like c, c++, php, ruby etc.
        (AllowNull also works with this).

        2) Added some thread-safe constructs like ThreadLocal for hash-maps used in circular reference code.

        3) Enhanced the test-case (with detailed comments) to test regular deserialization and circular deserialization for a circular-ref serialized Avro.

        Show
        Sachin Goyal added a comment - Doug Cutting , excellent suggestion. I have made the suggested changes in the new patch: 1) If ReflectData.setCircularRefIdPrefix( "some-field-name" ) is set, Avro converts each RECORD schema into a UNION schema such that it can either be a record or a string. This should make circular references readable by other implementations of Avro like c, c++, php, ruby etc. (AllowNull also works with this). 2) Added some thread-safe constructs like ThreadLocal for hash-maps used in circular reference code. 3) Enhanced the test-case (with detailed comments) to test regular deserialization and circular deserialization for a circular-ref serialized Avro.
        Hide
        Sachin Goyal added a comment -

        Circular References

        ------------------------------------

        Serialization
        Extra API required (not optional):

        ReflectData.setCircularRefIdPrefix("some-field-name")

        If the above is set, following happens:

        1. During serialization, each record contains the extra field specified above. The value for this field is just a monotonically increasing number meant to uniquely identify each record in one particular serialization.
        2. While writing schema, each RECORD schema is converted into a UNION schema such that it can either be a record or a string. During object serialization, if a record is seen before, it is not written as a record. Rather it is written as a string in the format: "some-field-name" + "ID-generated-in #1 above". With this structure, reading applications have enough information to restore the circular reference if they want. This structure is also usable by languages not supporting circular reference because they will read that circular-reference as a normal string.
          (AllowNull also works with this).
        3. Above field-name is included in each record as a property. This allows the readers to become aware of this field-name so that the clients do not have to specify this just to populate the circular references. Basically it makes the schema self-sufficient.

        Deserialization
        Extra API required (optional):

        GenericDatumReader.setResolveCircularRefs(boolean)

        Based on #3 above, GenericDatumReader becomes circular-reference-aware.
        But since all GenericDatumReaders share a common GenericData instance, they are provided with another flag "resolveCircularRefs" to control whether they want to resolve circular references or not.
        If this flag is set and the serialized schema has non-null value for circular-reference-field, GenericDatumReader does the following:

        1. If any record has circular-ref-field, store its value and the corresponding record in a map.
        2. Look for unions which can be serialized as a record as well as a string. On finding such a record serialized as a string, replace the string with the record retreived from the map created in #1

        Non-string map-keys

        ----------------------------------------

        Serialization
        No extra API required.

        Without this patch, Avro throws an exception for non-string map-keys.
        This patch converts such maps into an array of records where each record has two fields: key and value. Example:
        Map<ObjX, ObjY> is converted to [{"key":

        {ObjX}

        , "value":{ObjY}}]
        To do this, following is done:

        1. In ReflectData.java, create schema for key as well as value in the non-string hash-map.
          Encapsulate these two schemas into a record schema and create an array schema of such records.
          Set property NS_MAP_FLAG to "1" and store the actual class of the map as a CLASS_PROP
        1. While writing out a non-string map field, if NS_MAP_FLAG is set, convert map to array of records using map.entrySet()

        Deserialization
        No extra API required.

        Deserialization for non-string map-keys is pretty simple since data and the schema match exactly.
        So it just deserializes automatically.
        To create an actual map (like when using ReflectDatumReader with actual-class type-parameter), map is instantiated using CLASS_PROP if the property NS_MAP_FLAG is set to "1"

        Testcases included

        ------------------------------------
        The unit tests cover the following:

        1. Circular references at multiple levels of hierarchy
        2. Circular references within Collections and Maps.
        3. Circular and non-circular deserialization of circularly serialized objects.
        4. Non-string map-keys having circular references.
        5. Non-string map-keys with nested maps.
        Show
        Sachin Goyal added a comment - Circular References ------------------------------------ Serialization Extra API required (not optional): ReflectData.setCircularRefIdPrefix( "some-field-name" ) If the above is set, following happens: During serialization, each record contains the extra field specified above. The value for this field is just a monotonically increasing number meant to uniquely identify each record in one particular serialization. While writing schema, each RECORD schema is converted into a UNION schema such that it can either be a record or a string. During object serialization, if a record is seen before, it is not written as a record. Rather it is written as a string in the format: "some-field-name" + "ID-generated-in #1 above". With this structure, reading applications have enough information to restore the circular reference if they want. This structure is also usable by languages not supporting circular reference because they will read that circular-reference as a normal string. (AllowNull also works with this). Above field-name is included in each record as a property. This allows the readers to become aware of this field-name so that the clients do not have to specify this just to populate the circular references. Basically it makes the schema self-sufficient. Deserialization Extra API required (optional): GenericDatumReader.setResolveCircularRefs( boolean ) Based on #3 above, GenericDatumReader becomes circular-reference-aware. But since all GenericDatumReaders share a common GenericData instance, they are provided with another flag "resolveCircularRefs" to control whether they want to resolve circular references or not. If this flag is set and the serialized schema has non-null value for circular-reference-field, GenericDatumReader does the following: If any record has circular-ref-field, store its value and the corresponding record in a map. Look for unions which can be serialized as a record as well as a string. On finding such a record serialized as a string, replace the string with the record retreived from the map created in #1 Non-string map-keys ---------------------------------------- Serialization No extra API required. Without this patch, Avro throws an exception for non-string map-keys. This patch converts such maps into an array of records where each record has two fields: key and value. Example: Map<ObjX, ObjY> is converted to [{"key": {ObjX} , "value":{ObjY}}] To do this, following is done: In ReflectData.java, create schema for key as well as value in the non-string hash-map. Encapsulate these two schemas into a record schema and create an array schema of such records. Set property NS_MAP_FLAG to "1" and store the actual class of the map as a CLASS_PROP While writing out a non-string map field, if NS_MAP_FLAG is set, convert map to array of records using map.entrySet() Deserialization No extra API required. Deserialization for non-string map-keys is pretty simple since data and the schema match exactly. So it just deserializes automatically. To create an actual map (like when using ReflectDatumReader with actual-class type-parameter), map is instantiated using CLASS_PROP if the property NS_MAP_FLAG is set to "1" Testcases included ------------------------------------ The unit tests cover the following: Circular references at multiple levels of hierarchy Circular references within Collections and Maps. Circular and non-circular deserialization of circularly serialized objects. Non-string map-keys having circular references. Non-string map-keys with nested maps.
        Hide
        Doug Cutting added a comment -

        For some reason I cannot apply the patch file you've generated, so it's hard for me to analyze it in detail. What tool are you using to generate this patch?

        I'd prefer we use an explicit type rather than overload string for this purpose. A union with a record like:

        {"type":"record","name":"org.apache.avro.CircularRef", "fields":[{"name":"ref", "type":int}]}
        

        We don't ever have to create or define such a record. Rather, a CustomEncoding can be used to directly resolve such references at read time. (If circular references are not enabled then a GenericRecord would be read.) No string prefixing, etc. would then be required.

        Rather than modifying ReflectData to support this, might we instead create a subclass of ReflectData that supports circular references?

        Non-string map key support in reflection should be addressed in a separate issue.

        Show
        Doug Cutting added a comment - For some reason I cannot apply the patch file you've generated, so it's hard for me to analyze it in detail. What tool are you using to generate this patch? I'd prefer we use an explicit type rather than overload string for this purpose. A union with a record like: { "type" : "record" , "name" : "org.apache.avro.CircularRef" , "fields" :[{ "name" : "ref" , "type" : int }]} We don't ever have to create or define such a record. Rather, a CustomEncoding can be used to directly resolve such references at read time. (If circular references are not enabled then a GenericRecord would be read.) No string prefixing, etc. would then be required. Rather than modifying ReflectData to support this, might we instead create a subclass of ReflectData that supports circular references? Non-string map key support in reflection should be addressed in a separate issue.
        Hide
        Sachin Goyal added a comment -

        Thanks Doug.

        I created the patch using diff -r and it can be patched using patch -i <patch-file>

        For non-string map-keys, I have submitted a separate patch at https://issues.apache.org/jira/browse/AVRO-680

        Could you please explain your comment regarding circular references in more detail?
        I will change my patch accordingly.

        For example, here is a circular reference schema generated using Avro:

        {
          "type" : "record",
          "name" : "SimpleParent",
          "namespace" : "org.apache.avro.generic",
          "fields" : [ {
            "name" : "parentName",
            "type" : [ "null", "string" ],
            "default" : null
          }, {
            "name" : "child",
            "type" : [ "null", {
              "type" : "record",
              "name" : "SimpleChild",
              "fields" : [ {
                "name" : "childName",
                "type" : [ "null", "string" ],
                "default" : null
              }, {
                "name" : "parent",
                "type" : [ "null", "SimpleParent"],
                "default" : null
              } ]
            }, "string" ],
            "default" : null
          } ]
        }
        

        The current code converts it to the following:

        {
          "type" : "record",
          "name" : "SimpleParent",
          "namespace" : "org.apache.avro.generic",
          "fields" : [ {
            "name" : "__crefId",
            "type" : "string"
          }, {
            "name" : "parentName",
            "type" : [ "null", "string" ],
            "default" : null
          }, {
            "name" : "child",
            "type" : [ "null", {
              "type" : "record",
              "name" : "SimpleChild",
              "fields" : [ {
                "name" : "__crefId",
                "type" : "string"
              }, {
                "name" : "childName",
                "type" : [ "null", "string" ],
                "default" : null
              }, {
                "name" : "parent",
                "type" : [ "null", "SimpleParent", "string" ],
                "default" : null
              } ],
              "circularRefIdPrefix" : "__crefId"
            }, "string" ],
            "default" : null
          } ],
          "circularRefIdPrefix" : "__crefId"
        }
        

        Can you please apply your comments above to this example?
        It will help me in understanding how it would be different from the above solution.

        As per my understanding, circular references can come in any record-type element.
        So CustomeEncoder approach would need to write it as a record sometimes or a string/int sometimes.
        Can a CustomEncoder pass the control to regular Avro Encoder to write a record normally?

        Show
        Sachin Goyal added a comment - Thanks Doug. I created the patch using diff -r and it can be patched using patch -i <patch-file> For non-string map-keys, I have submitted a separate patch at https://issues.apache.org/jira/browse/AVRO-680 Could you please explain your comment regarding circular references in more detail? I will change my patch accordingly. For example, here is a circular reference schema generated using Avro: { "type" : "record", "name" : "SimpleParent", "namespace" : "org.apache.avro.generic", "fields" : [ { "name" : "parentName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "child", "type" : [ " null ", { "type" : "record", "name" : "SimpleChild", "fields" : [ { "name" : "childName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "parent", "type" : [ " null ", "SimpleParent"], " default " : null } ] }, "string" ], " default " : null } ] } The current code converts it to the following: { "type" : "record", "name" : "SimpleParent", "namespace" : "org.apache.avro.generic", "fields" : [ { "name" : "__crefId", "type" : "string" }, { "name" : "parentName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "child", "type" : [ " null ", { "type" : "record", "name" : "SimpleChild", "fields" : [ { "name" : "__crefId", "type" : "string" }, { "name" : "childName", "type" : [ " null ", "string" ], " default " : null }, { "name" : "parent", "type" : [ " null ", "SimpleParent", "string" ], " default " : null } ], "circularRefIdPrefix" : "__crefId" }, "string" ], " default " : null } ], "circularRefIdPrefix" : "__crefId" } Can you please apply your comments above to this example? It will help me in understanding how it would be different from the above solution. As per my understanding, circular references can come in any record-type element. So CustomeEncoder approach would need to write it as a record sometimes or a string/int sometimes. Can a CustomEncoder pass the control to regular Avro Encoder to write a record normally?
        Hide
        Martin Kleppmann added a comment -

        I don't mean to be grouchy, but I'm not convinced this is a feature that Avro really needs to have. It adds complexity to all the language implementations of Avro that need to support it, and it creates difficulties for anyone who needs to map between Avro's data model and another data model (e.g. representing Avro data in JSON format, or accessing Avro data in Pig through Pig's type system).

        For an application that needs to model an arbitrary graph of references between objects in Avro, I would argue that it's actually better for the object IDs to be explicitly visible in the data model, rather than being automatically resolved and thus hidden.

        If there's some really awesome use case for circular references that I'm missing, I'm open to being convinced. But I don't currently see it.

        Show
        Martin Kleppmann added a comment - I don't mean to be grouchy, but I'm not convinced this is a feature that Avro really needs to have. It adds complexity to all the language implementations of Avro that need to support it, and it creates difficulties for anyone who needs to map between Avro's data model and another data model (e.g. representing Avro data in JSON format, or accessing Avro data in Pig through Pig's type system). For an application that needs to model an arbitrary graph of references between objects in Avro, I would argue that it's actually better for the object IDs to be explicitly visible in the data model, rather than being automatically resolved and thus hidden. If there's some really awesome use case for circular references that I'm missing, I'm open to being convinced. But I don't currently see it.
        Hide
        Doug Cutting added a comment -

        I imagine a schema like the following, to permit potentially circular linked lists:

        {"type":"record", "name":"List", "fields":[
          {"name":"nodeName", "type":"string"},
          {"name":"next", 
           "type": ["null", "List", 
                   {"type":"record","name":"org.apache.avro.CircularRef","fields":[{"name":"id","type":"int"}] }] }] }
        

        The writer would add an entry to an IdentityHashMap<Object,Integer> for every sub-record it writes. Whenever it encounters a previously-written record, it writes a ref instead. Similarly, the reader would add each records it reads to an array, and when a ref is read, return the corresponding element of the array.

        Such functionality should entirely be contained in CircularData (subclass of ReflectData) and CircularDatumReader and CircularDatumWriter, with no alterations to existing classes (except perhaps to expose functionality to subclasses if needed).

        Show
        Doug Cutting added a comment - I imagine a schema like the following, to permit potentially circular linked lists: { "type" : "record" , "name" : "List" , "fields" :[ { "name" : "nodeName" , "type" : "string" }, { "name" : "next" , "type" : [ " null " , "List" , { "type" : "record" , "name" : "org.apache.avro.CircularRef" , "fields" :[{ "name" : "id" , "type" : " int " }] }] }] } The writer would add an entry to an IdentityHashMap<Object,Integer> for every sub-record it writes. Whenever it encounters a previously-written record, it writes a ref instead. Similarly, the reader would add each records it reads to an array, and when a ref is read, return the corresponding element of the array. Such functionality should entirely be contained in CircularData (subclass of ReflectData) and CircularDatumReader and CircularDatumWriter, with no alterations to existing classes (except perhaps to expose functionality to subclasses if needed).
        Hide
        Sachin Goyal added a comment -

        The writer would add an entry to an IdentityHashMap<Object,Integer> for every sub-record it writes. Whenever it encounters a previously-written record, it writes a ref instead. Similarly, the reader would add each records it reads to an array, and when a ref is read, return the corresponding element of the array.

        The current fix does use an IdentityHashMap to do this. Reference code in patch:

        1. GenericDatumWriter.java, line 40 and
        2. GenericDatumReader.java, line 46

        Please correct me if I am wrong, but it appears the schema generated for a circular list should look somewhat like this:

        {
          "type" : "record",
          "name" : "CircularList",
          "namespace" : "org.apache.avro.generic",
          "fields" : [ {
            "name" : "__crefId",
            "type" : "string"
          }, {
            "name" : "nodeData",
            "type" : [ "null", "string" ],
            "default" : null
          }, {
            "name" : "next",
            "type" : [ "null", "CircularList", "string" ],
            "default" : null
          } ],
          "circularRefIdPrefix" : "__crefId"
        }
        

        (This is generated using current patch)



        Circular references could be just anywhere in the code.
        For example, in a family-tree involving grandparents, uncles, aunts, cousins, children, grandchildren etc. circular references could be encountered for many branches outgoing from a single node.

        Since we do not know which outgoing link would reveal itself as an already-traversed-node, the __crefId field needs to be written in advance for each and every record. Hence the need for a separate field in each record.

        "fields" : [ {
            "name" : "__crefId",
            "type" : "string"
          }, ....
        

        Now, when we do encounter an already-traversed-node, the node must be written as an ID. Hence every record's type must be a union with string:

            "type" : [ "null", "CircularList", "string" ]
        

        I would be happy to consider other options if the above seems incorrect.
        If it seems correct, I will submit a patch without non-string map-keys.




        Martin Kleppmann, Currently Avro supports circular references in schema.
        So supporting circular references in data should be a natural extension of the same.

        Also, circular references are very common in ORM (like Hibernate/JPA) and Java based programs in general.
        http://stackoverflow.com/questions/11007247/are-circular-references-in-jpa-an-antipattern

        And parsers like Gson and Jackson support this feature too.

        The serialized data from the above patch should work with all language implementations and also with Hive/Pig (because we are breaking the circular reference by changing it to an ID).
        Please share if you think otherwise.

        Show
        Sachin Goyal added a comment - The writer would add an entry to an IdentityHashMap<Object,Integer> for every sub-record it writes. Whenever it encounters a previously-written record, it writes a ref instead. Similarly, the reader would add each records it reads to an array, and when a ref is read, return the corresponding element of the array. The current fix does use an IdentityHashMap to do this. Reference code in patch: GenericDatumWriter.java, line 40 and GenericDatumReader.java, line 46 Please correct me if I am wrong, but it appears the schema generated for a circular list should look somewhat like this: { "type" : "record", "name" : "CircularList", "namespace" : "org.apache.avro.generic", "fields" : [ { "name" : "__crefId", "type" : "string" }, { "name" : "nodeData", "type" : [ " null ", "string" ], " default " : null }, { "name" : "next", "type" : [ " null ", "CircularList", "string" ], " default " : null } ], "circularRefIdPrefix" : "__crefId" } (This is generated using current patch) Circular references could be just anywhere in the code. For example, in a family-tree involving grandparents, uncles, aunts, cousins, children, grandchildren etc. circular references could be encountered for many branches outgoing from a single node. Since we do not know which outgoing link would reveal itself as an already-traversed-node, the __crefId field needs to be written in advance for each and every record. Hence the need for a separate field in each record. "fields" : [ { "name" : "__crefId", "type" : "string" }, .... Now, when we do encounter an already-traversed-node, the node must be written as an ID. Hence every record's type must be a union with string: "type" : [ " null ", "CircularList", "string" ] I would be happy to consider other options if the above seems incorrect. If it seems correct, I will submit a patch without non-string map-keys . Martin Kleppmann , Currently Avro supports circular references in schema. So supporting circular references in data should be a natural extension of the same. Also, circular references are very common in ORM (like Hibernate/JPA) and Java based programs in general. http://stackoverflow.com/questions/11007247/are-circular-references-in-jpa-an-antipattern And parsers like Gson and Jackson support this feature too. The serialized data from the above patch should work with all language implementations and also with Hive/Pig (because we are breaking the circular reference by changing it to an ID). Please share if you think otherwise.
        Hide
        Doug Cutting added a comment -

        I realize this is what you've implemented, however I think the implementation can be improved.

        I would prefer that the circular reference not be a string, but rather be a unique type, e.g., a record named CircularRef that contains an integer id field.

        I agree that all object references in a schema need to be changed to unions containing possible circular references. However I don't see that a target id field needs to be added to all records. Rather the target id can be implicit, based on the order that sub-objects are written, as follows.

        In each call to DatumWriter#write, an IdentityHashMap<Object,Integer> table is created. In #resolveUnion, when a union containing a record and a CircularRef is passed a record, it looks for the record in the table. If the record already exists in the table, then the CircularRef branch is returned, writing a reference containing the integer value from the table. If the record does not exist in the table, then an entry for it is added to the table whose value is the current size of the table, and the record branch of the union is returned and written.

        Then, each call to DatumReader#read creates a Vector<Object>. If #readRecord reads a CircularRef, then it returns the item indicated by its id in the vector, otherwise it adds a reference to each record read to the vector.

        I believe this can work, is more strongly-typed, and avoids adding new fields to schemas, only inserting unions for record references.

        Show
        Doug Cutting added a comment - I realize this is what you've implemented, however I think the implementation can be improved. I would prefer that the circular reference not be a string, but rather be a unique type, e.g., a record named CircularRef that contains an integer id field. I agree that all object references in a schema need to be changed to unions containing possible circular references. However I don't see that a target id field needs to be added to all records. Rather the target id can be implicit, based on the order that sub-objects are written, as follows. In each call to DatumWriter#write, an IdentityHashMap<Object,Integer> table is created. In #resolveUnion, when a union containing a record and a CircularRef is passed a record, it looks for the record in the table. If the record already exists in the table, then the CircularRef branch is returned, writing a reference containing the integer value from the table. If the record does not exist in the table, then an entry for it is added to the table whose value is the current size of the table, and the record branch of the union is returned and written. Then, each call to DatumReader#read creates a Vector<Object>. If #readRecord reads a CircularRef, then it returns the item indicated by its id in the vector, otherwise it adds a reference to each record read to the vector. I believe this can work, is more strongly-typed, and avoids adding new fields to schemas, only inserting unions for record references.
        Hide
        Sachin Goyal added a comment -

        Thanks Doug Cutting.
        I agree that this will eliminate the need for another field in each record.
        Will make this change and submit another patch shortly.

        Show
        Sachin Goyal added a comment - Thanks Doug Cutting . I agree that this will eliminate the need for another field in each record. Will make this change and submit another patch shortly.
        Hide
        Sachin Goyal added a comment -

        avro_circular_refs6.patch is created using

        lang/java/avro% svn diff src/
        

        It eliminates the need to add an extra field per record and relies on the order of record-writing and reading.

        In this patch, there is an equals check at 3 places of the form:

        schema.getFullName().equals(CircularRef.class.getName())
        

        I think this could be improved by adding a boolean-property in the schema for CircularRef or somehow comparing the class.
        Comments are welcome.

        Show
        Sachin Goyal added a comment - avro_circular_refs6.patch is created using lang/java/avro% svn diff src/ It eliminates the need to add an extra field per record and relies on the order of record-writing and reading. In this patch, there is an equals check at 3 places of the form: schema.getFullName().equals(CircularRef.class.getName()) I think this could be improved by adding a boolean-property in the schema for CircularRef or somehow comparing the class. Comments are welcome.
        Hide
        Doug Cutting added a comment -

        Some comments on the latest patch:

        • the CircularRef class is not used in generic. in fact, it probably doesn't need to exist at all, we could always use a GenericData.Record. instead we perhaps need constants with the CircularRef schema, it's name, etc.
        • it might be more efficient to use IndexedRecord#get(int) to access the circular ref id.
        • we should benchmark performance with and without this change with circular references disabled using Perf.java, to make sure that changes to read & write methods, clearing state etc., don't significantly affect performance.
        Show
        Doug Cutting added a comment - Some comments on the latest patch: the CircularRef class is not used in generic. in fact, it probably doesn't need to exist at all, we could always use a GenericData.Record. instead we perhaps need constants with the CircularRef schema, it's name, etc. it might be more efficient to use IndexedRecord#get(int) to access the circular ref id. we should benchmark performance with and without this change with circular references disabled using Perf.java, to make sure that changes to read & write methods, clearing state etc., don't significantly affect performance.
        Hide
        Sachin Goyal added a comment -

        avro_circular_refs7.patch fixes the above.

        Performance has been checked by running Perf.java for 8k cycles (PERF_8000_cycles.zip contains a small Tcl utility to compare two outputs of Perf.java)

        % tclsh perf_processor.tcl all/orig.txt all/circular.txt
        Finding tests where time is 1.05 times more than the original
                                       orig.txt    circular.txt
        GenericWrite:                  34552 ms     37010 ms
        ReflectNestedObjectArrayWrite: 51701 ms     54562 ms
        LongRead:                      12508 ms     16907 ms
        
        % tclsh perf_processor.tcl reduced/orig.txt reduced/circular.txt
        Finding tests where time is 1.05 times more than the original
                    orig.txt  circular.txt
        GenericWrite: 33359 ms   35552 ms
        
        % tclsh perf_processor.tcl reduced_further/orig.txt reduced_further/circular.txt
        Finding tests where time is 1.05 times more than the original
        

        So the Perf.java shows irregular performance degradation and that too varies only in small numbers.

        Show
        Sachin Goyal added a comment - avro_circular_refs7.patch fixes the above. Performance has been checked by running Perf.java for 8k cycles (PERF_8000_cycles.zip contains a small Tcl utility to compare two outputs of Perf.java) % tclsh perf_processor.tcl all/orig.txt all/circular.txt Finding tests where time is 1.05 times more than the original orig.txt circular.txt GenericWrite: 34552 ms 37010 ms ReflectNestedObjectArrayWrite: 51701 ms 54562 ms LongRead: 12508 ms 16907 ms % tclsh perf_processor.tcl reduced/orig.txt reduced/circular.txt Finding tests where time is 1.05 times more than the original orig.txt circular.txt GenericWrite: 33359 ms 35552 ms % tclsh perf_processor.tcl reduced_further/orig.txt reduced_further/circular.txt Finding tests where time is 1.05 times more than the original So the Perf.java shows irregular performance degradation and that too varies only in small numbers.
        Hide
        Doug Cutting added a comment -

        Here's a modified version of the patch. It moves all significant changes to reflect. Since reflect is the only implementation that can generate an appropriate schema, changes should be confined to there.

        The tests need to be updated, as they still assume generic.

        Show
        Doug Cutting added a comment - Here's a modified version of the patch. It moves all significant changes to reflect. Since reflect is the only implementation that can generate an appropriate schema, changes should be confined to there. The tests need to be updated, as they still assume generic.
        Hide
        Sachin Goyal added a comment -

        Here is a patch with the updated test-cases.
        I also confirm that all my changes are there in the patch.

        Show
        Sachin Goyal added a comment - Here is a patch with the updated test-cases. I also confirm that all my changes are there in the patch.
        Hide
        Sachin Goyal added a comment -

        On a related note, Hive AvroSerDe will also be supporting circular references from now on.
        Associated JIRA: HIVE-7653
        Pull request: Hive code changes

        Show
        Sachin Goyal added a comment - On a related note, Hive AvroSerDe will also be supporting circular references from now on. Associated JIRA: HIVE-7653 Pull request: Hive code changes
        Hide
        ASF GitHub Bot added a comment -

        GitHub user sachingsachin opened a pull request:

        https://github.com/apache/avro/pull/23

        AVRO-695: Support for circular references.

        All objects are put into a temporary thread-local hash-map whose key is the object and value is an integer ID.
        If any object is seen again while serializaing, its ID is taken from the hash-map, wrapped into 'CircularRef' class and the CircularRef wrapper is serialized instead.

        On deserializing, if the CircularRef is encountered, we know that it has to be the ID of a previously seen object.
        And so we restore the same.

        On the schema side, we create unions of all classes with CircularRef if the user suspects circular references in his code. This union makes sure the above writers are able to write a CircularRef instead of the actual object.

        Note that this strategy is perfectly safe in other languages' deserialization of a circularly referenced data.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/sachingsachin/avro AVRO-695

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/avro/pull/23.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #23


        commit e3f9295474a3f45c022850cb4ac2ba84a8ac31f4
        Author: Sachin Goyal <sgoyal@walmart.com>
        Date: 2015-03-09T22:18:01Z

        AVRO-695: Support for circular references.

        All objects are put into a temporary thread-local hash-map whose key is the object and value is an integer ID.
        If any object is seen again while serializaing, its ID is taken from the hash-map,
        wrapped into 'CircularRef' class and the CircularRef wrapper is serialized instead.
        On deserializing, if the CircularRef is encountered, we know that it has to be the ID of a previously seen object.
        And so we restore the same.

        On the schema side, we create unions of all classes with CircularRef if the user suspects circular references in his code.
        This union makes sure the above writers are able to write a CircularRef instead of the actual object.

        Note that this strategy is perfectly safe in other languages' deserialziation of a circularly referenced data.


        Show
        ASF GitHub Bot added a comment - GitHub user sachingsachin opened a pull request: https://github.com/apache/avro/pull/23 AVRO-695 : Support for circular references. All objects are put into a temporary thread-local hash-map whose key is the object and value is an integer ID. If any object is seen again while serializaing, its ID is taken from the hash-map, wrapped into 'CircularRef' class and the CircularRef wrapper is serialized instead. On deserializing, if the CircularRef is encountered, we know that it has to be the ID of a previously seen object. And so we restore the same. On the schema side, we create unions of all classes with CircularRef if the user suspects circular references in his code. This union makes sure the above writers are able to write a CircularRef instead of the actual object. Note that this strategy is perfectly safe in other languages' deserialization of a circularly referenced data. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachingsachin/avro AVRO-695 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/avro/pull/23.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23 commit e3f9295474a3f45c022850cb4ac2ba84a8ac31f4 Author: Sachin Goyal <sgoyal@walmart.com> Date: 2015-03-09T22:18:01Z AVRO-695 : Support for circular references. All objects are put into a temporary thread-local hash-map whose key is the object and value is an integer ID. If any object is seen again while serializaing, its ID is taken from the hash-map, wrapped into 'CircularRef' class and the CircularRef wrapper is serialized instead. On deserializing, if the CircularRef is encountered, we know that it has to be the ID of a previously seen object. And so we restore the same. On the schema side, we create unions of all classes with CircularRef if the user suspects circular references in his code. This union makes sure the above writers are able to write a CircularRef instead of the actual object. Note that this strategy is perfectly safe in other languages' deserialziation of a circularly referenced data.
        Hide
        Sachin Goyal added a comment -

        Ryan Blue, Doug Cutting, Circular references are very common in Hibernate related code and in Java.
        So I would really appreciate your help in getting this one merged.

        Show
        Sachin Goyal added a comment - Ryan Blue , Doug Cutting , Circular references are very common in Hibernate related code and in Java. So I would really appreciate your help in getting this one merged.
        Hide
        Sachin Goyal added a comment -

        Updated the pull request with the merged AVRO-680 changes in trunk.
        Also, added a test for circular references in the keys and values of a non-string map.


        Example:

        Map <Foo, Bar> abc;
        

        The above code can now be a non-string map with circular references for Foo as well as Bar.
        This can help support very complex Java objects with ease.

        Show
        Sachin Goyal added a comment - Updated the pull request with the merged AVRO-680 changes in trunk. Also, added a test for circular references in the keys and values of a non-string map. Example: Map <Foo, Bar> abc; The above code can now be a non-string map with circular references for Foo as well as Bar. This can help support very complex Java objects with ease.

          People

          • Assignee:
            Unassigned
            Reporter:
            Moustapha Cherri
          • Votes:
            4 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 672h
              672h
              Remaining:
              Remaining Estimate - 672h
              672h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development