Details

    • Type: Bug Bug
    • Status: Patch Available
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 1.6.1, 1.6.2
    • Fix Version/s: None
    • Component/s: python
    • Labels:

      Description

      Python's union does not respect the order in which type is specified.

      For following schema:

      {"type":"map","values":["int","long","float","double","string","boolean"]}

      , an integer value is written as double, but it should respect the order in which types have been specified.

      Fixed Code (io.py):

      def write_union(self, writers_schema, datum, encoder):
      """
      A union is encoded by first writing a long value indicating
      the zero-based position within the union of the schema of its value.
      The value is then encoded per the indicated schema within the union.
      """

      1. resolve union
        index_of_schema = -1
        for i, candidate_schema in enumerate(writers_schema.schemas):
        if validate(candidate_schema, datum):
        index_of_schema = i
        break // XXX Add break statement here XXX//
        if index_of_schema < 0: raise AvroTypeException(writers_schema, datum)
      1. AVRO-973-patch-1.patch
        7 kB
        Marcio Silva
      2. AVRO-973-patch-2.patch
        6 kB
        Marcio Silva
      3. AVRO-973-patch-3.patch
        6 kB
        Marcio Silva
      4. AVRO-973-wrapper.patch
        5 kB
        Douglas Kaminsky
      5. AVRO-973-wrapper.patch
        3 kB
        Douglas Kaminsky
      6. test_unions.py
        2 kB
        Marcio Silva

        Issue Links

          Activity

          Hide
          Marcio Silva added a comment -

          Attached is a unit test that illustrates this issue.

          Show
          Marcio Silva added a comment - Attached is a unit test that illustrates this issue.
          Hide
          Marcio Silva added a comment -

          Attached is a patch that addresses this issue. At root, the
          current python implementation doesn't always enforce strict types for numeric data.

          The union resolution code chooses the first type that's valid for the given input, and the current validate logic allows int values in long fields etc.

          Show
          Marcio Silva added a comment - Attached is a patch that addresses this issue. At root, the current python implementation doesn't always enforce strict types for numeric data. The union resolution code chooses the first type that's valid for the given input, and the current validate logic allows int values in long fields etc.
          Hide
          Marcio Silva added a comment -

          patch that adds unit test coverage and attempts to resolve the issue.

          Show
          Marcio Silva added a comment - patch that adds unit test coverage and attempts to resolve the issue.
          Hide
          Marcio Silva added a comment -

          Updated patch. Rewrote the validate functionality to keep the original functions
          semantics, but still allow for hinting at union's typing for int,long,float
          by checking exact type before allowing loose matches.

          Show
          Marcio Silva added a comment - Updated patch. Rewrote the validate functionality to keep the original functions semantics, but still allow for hinting at union's typing for int,long,float by checking exact type before allowing loose matches.
          Hide
          Doug Cutting added a comment -

          I'll commit this tomorrow unless there are objections.

          Show
          Doug Cutting added a comment - I'll commit this tomorrow unless there are objections.
          Hide
          Marcio Silva added a comment -

          Upon further reflection, there's one issue that the current patch doesn't address, and that's union's containing ["float","double"].

          As these are both represented as python float's there's currently no way to hint to the validate logic that a given python float is meant to represent a double. A possible solution would be to create a double class whose sole purpose would be to serve as that hint.

          I'll generate a third patch that takes this approach, but I'm open to any other implementation suggestions (I'm not crazy about introducing an avro.io double class).

          Show
          Marcio Silva added a comment - Upon further reflection, there's one issue that the current patch doesn't address, and that's union's containing ["float","double"] . As these are both represented as python float's there's currently no way to hint to the validate logic that a given python float is meant to represent a double. A possible solution would be to create a double class whose sole purpose would be to serve as that hint. I'll generate a third patch that takes this approach, but I'm open to any other implementation suggestions (I'm not crazy about introducing an avro.io double class).
          Hide
          Marcio Silva added a comment -

          Modification of patch-2 to introduce a double class to resolve the final ambiguity for ["float","double"] unions.

          Show
          Marcio Silva added a comment - Modification of patch-2 to introduce a double class to resolve the final ambiguity for ["float","double"] unions.
          Hide
          Douglas Kaminsky added a comment -

          While a good start for primitives specifically, tese union hint semantics are insufficient to solve the larger underlying problem. Union hint semantics need to be available for all types.

          I have the same issue where we have several message types that extend from one another on the Java side, but in Python it's impossible to serialize correctly. This is generalized to any two types appearing in a single union whose "data domain" intersects.

          e.g. Java Code contains two classes: com.foo.Foo and com.foo.Bar
          Avro schema specifies record type "Message" with field "event" : ["null", "com.foo.Foo", "com.foo.Bar"]

          When serializing "event" field of type "Message":
          Does this validate against a NullSchema? False - index=-1
          Does this validate against a "com.foo.Foo"? True - index=1, BREAK

          It then serializes as a Foo and all Bar-unique fields are lost

          There's no simple solution:

          • If you keep the break, this problem occurs
          • If you reverse the order of union traversal, you couple the behaviors in an inappropriate way
          • If you remove the break, you introduce an extremely inefficient (up to 255 validations) process to serialization (BTW, this process is already pretty inefficient)

          The best I came up with was to add union hints in the form of a wrapper class and extension to the datum writer (attachment to follow). This mimics the Java behavior of coupling the datum and its schema.

          Show
          Douglas Kaminsky added a comment - While a good start for primitives specifically, tese union hint semantics are insufficient to solve the larger underlying problem. Union hint semantics need to be available for all types. I have the same issue where we have several message types that extend from one another on the Java side, but in Python it's impossible to serialize correctly. This is generalized to any two types appearing in a single union whose "data domain" intersects. e.g. Java Code contains two classes: com.foo.Foo and com.foo.Bar Avro schema specifies record type "Message" with field "event" : ["null", "com.foo.Foo", "com.foo.Bar"] When serializing "event" field of type "Message": Does this validate against a NullSchema? False - index=-1 Does this validate against a "com.foo.Foo"? True - index=1, BREAK It then serializes as a Foo and all Bar-unique fields are lost There's no simple solution: If you keep the break, this problem occurs If you reverse the order of union traversal, you couple the behaviors in an inappropriate way If you remove the break, you introduce an extremely inefficient (up to 255 validations) process to serialization (BTW, this process is already pretty inefficient) The best I came up with was to add union hints in the form of a wrapper class and extension to the datum writer (attachment to follow). This mimics the Java behavior of coupling the datum and its schema.
          Hide
          Douglas Kaminsky added a comment -

          This defines a lightweight AvroRecord wrapper class and an extension to DatumWriter that recognizes records wrapped by this class and properly.

          Note: There's a whole bunch of "utility" methods below - you need not use them to use the general idea presented

          Show
          Douglas Kaminsky added a comment - This defines a lightweight AvroRecord wrapper class and an extension to DatumWriter that recognizes records wrapped by this class and properly. Note: There's a whole bunch of "utility" methods below - you need not use them to use the general idea presented
          Hide
          Douglas Kaminsky added a comment -

          Please ignore/delete first version of attachment

          Show
          Douglas Kaminsky added a comment - Please ignore/delete first version of attachment
          Hide
          Douglas Kaminsky added a comment -

          Alternatively, you could solve for record types by creating a reserved name such as "__hint" or some such thing, with the only caveat being you would no longer be able to create user fields with that name in python.

          Show
          Douglas Kaminsky added a comment - Alternatively, you could solve for record types by creating a reserved name such as "__hint" or some such thing, with the only caveat being you would no longer be able to create user fields with that name in python.
          Hide
          Douglas Kaminsky added a comment -

          And to expound on my previous comment:

          There is no change you can make to the current validation-based mechanic that guarantees correctness for record types - for example, consider that you could have complex numeric types that are similar in structure but distinct in meaning:

          Amount

          { "mantissa" : "string", "exp" : "string" }

          Money { "mantissa" : "string", "exp" : "string", "currency" :

          {"type" : "string", "default" : "USD"}

          }

          This is a trivial example, but believe me when I tell you that we have 209 types in our schema and several build on each other.

          I contend that to be correct, the implementation should work correctly regardless of union order, ie. Serializing against ["null", "Amount", "Money"] should yield the same result as ["null", "Money", "Amount"]

          Now suppose I serialize datum:

          { "mantissa" : "314159", "exp" : "-5" }
          • If you validate without the break, this will serialize as "Money" against ["null", "Amount", "Money"] but "Amount" against ["null", "Money", "Amount"]
          • With the break, this will serialize as "Amount" against ["null", "Amount", "Money"] but "Money" against ["null", "Money", "Amount"]

          Either way, the intention of the message sender is lost.

          Show
          Douglas Kaminsky added a comment - And to expound on my previous comment: There is no change you can make to the current validation-based mechanic that guarantees correctness for record types - for example, consider that you could have complex numeric types that are similar in structure but distinct in meaning: Amount { "mantissa" : "string", "exp" : "string" } Money { "mantissa" : "string", "exp" : "string", "currency" : {"type" : "string", "default" : "USD"} } This is a trivial example, but believe me when I tell you that we have 209 types in our schema and several build on each other. I contend that to be correct, the implementation should work correctly regardless of union order, ie. Serializing against ["null", "Amount", "Money"] should yield the same result as ["null", "Money", "Amount"] Now suppose I serialize datum: { "mantissa" : "314159", "exp" : "-5" } If you validate without the break, this will serialize as "Money" against ["null", "Amount", "Money"] but "Amount" against ["null", "Money", "Amount"] With the break, this will serialize as "Amount" against ["null", "Amount", "Money"] but "Money" against ["null", "Money", "Amount"] Either way, the intention of the message sender is lost.
          Hide
          Doug Cutting added a comment -

          > There is no change you can make to the current validation-based mechanic that guarantees correctness for record types

          That's right.

          Full recursive field validation is not required for unions. See AVRO-654 and http://avro.apache.org/docs/current/spec.html#Unions

          The object's schema name should be checked against the names of the schemas in the union. If the fields don't match but the names are the same then a runtime error should be generated.

          This is a longstanding misfeature of the Python implementation.

          Show
          Doug Cutting added a comment - > There is no change you can make to the current validation-based mechanic that guarantees correctness for record types That's right. Full recursive field validation is not required for unions. See AVRO-654 and http://avro.apache.org/docs/current/spec.html#Unions The object's schema name should be checked against the names of the schemas in the union. If the fields don't match but the names are the same then a runtime error should be generated. This is a longstanding misfeature of the Python implementation.
          Hide
          Douglas Kaminsky added a comment -

          @Mr. Cutting

          AVRO-656 is more directly related, but yeah I see your point and it's clear this has been around a long time.

          The question is how can we fix it and can we get it fixed in the official distro?

          I can just use my wrapper business locally but I'd rather have a more long-term supported solution.

          Show
          Douglas Kaminsky added a comment - @Mr. Cutting AVRO-656 is more directly related, but yeah I see your point and it's clear this has been around a long time. The question is how can we fix it and can we get it fixed in the official distro? I can just use my wrapper business locally but I'd rather have a more long-term supported solution.
          Hide
          Doug Cutting added a comment -

          Ultimately I think the fix is AVRO-283. Python would use objects to represent records, not dictionaries. This was recently discussed on the dev list where Marcio expressed interest in working on it (http://s.apache.org/deA). Perhaps you can collaborate there?

          Show
          Doug Cutting added a comment - Ultimately I think the fix is AVRO-283 . Python would use objects to represent records, not dictionaries. This was recently discussed on the dev list where Marcio expressed interest in working on it ( http://s.apache.org/deA ). Perhaps you can collaborate there?
          Hide
          Marcio Silva added a comment -

          I agree, a full solution that resolves this issue for records would require record instances that provided access to their schema. I've started stubbing out some code for AVRO-283 that uses meta-classes to create specific records, but maybe we need the equivalent of the Java GenericContainer interface on the python side.

          In order to preserve some backwards compatibility with the existing code-base, we should keep a dict-like record option in addition to the specific record implementation. Both should probably provide access to the schema information in a common way.

          Show
          Marcio Silva added a comment - I agree, a full solution that resolves this issue for records would require record instances that provided access to their schema. I've started stubbing out some code for AVRO-283 that uses meta-classes to create specific records, but maybe we need the equivalent of the Java GenericContainer interface on the python side. In order to preserve some backwards compatibility with the existing code-base, we should keep a dict-like record option in addition to the specific record implementation. Both should probably provide access to the schema information in a common way.
          Hide
          Douglas Kaminsky added a comment -

          Take a look at the patch I attached to AVRO-283 (minus the broken __setattr__ implementation)

          Show
          Douglas Kaminsky added a comment - Take a look at the patch I attached to AVRO-283 (minus the broken __ setattr __ implementation)

            People

            • Assignee:
              Unassigned
              Reporter:
              Gaurav Nanda
            • Votes:
              2 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated:
                Original Estimate - 0.25h
                0.25h
                Remaining:
                Remaining Estimate - 0.25h
                0.25h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development