Thrift
  1. Thrift
  2. THRIFT-735

Required field checking is broken when the field type is a Union struct

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.2
    • Fix Version/s: None
    • Labels:
      None

      Description

      The validate() method on generated structs verifies that required fields are set after validation. However, if the type of the field is a Union struct, then just checking that the field isn't null is not a valid check. The value may be a non-null union, but have an unset field. (We encountered this when deserializing a type that had a union for a field, and the union's set value was an enum value that had been removed from the definition, making it a skip.)

      In order to perform the correct validation, if the value is a Union, then we must also check that the set field and value are non-null.

      1. thrift-735.patch
        1 kB
        Bryan Duxbury
      2. thrift-735-v2.patch
        1 kB
        Nathan Marz
      3. unknown_enum_tests.zip
        3 kB
        Jeff DeCew

        Activity

        Hide
        Bryan Duxbury added a comment -

        This appears to fix the problem.

        Show
        Bryan Duxbury added a comment - This appears to fix the problem.
        Hide
        David Reiss added a comment -

        Wait, so does this mean that if I write something like

        union Bar {
          whatever
        }
        
        struct Foo {
          required Bar bar;
        }
        

        and bar is present but in a "null union" state because it contains a field we don't know about, then Foo will fail validation?

        Show
        David Reiss added a comment - Wait, so does this mean that if I write something like union Bar { whatever } struct Foo { required Bar bar; } and bar is present but in a "null union" state because it contains a field we don't know about, then Foo will fail validation?
        Hide
        Bryan Duxbury added a comment -

        With this patch committed, yes, that is the behavior. Note that this essentially the behavior we have today for non-union types - if the field is required and not present after deserialization for any reason, then it will fail validation. It just happens that in the union case, the field can appear set when in fact there is no value present.

        Show
        Bryan Duxbury added a comment - With this patch committed, yes, that is the behavior. Note that this essentially the behavior we have today for non-union types - if the field is required and not present after deserialization for any reason, then it will fail validation. It just happens that in the union case, the field can appear set when in fact there is no value present.
        Hide
        David Reiss added a comment -

        But I would argue that bar is present and completely valid in this case. It just happens to hold a value that you don't know how to interpret.

        Show
        David Reiss added a comment - But I would argue that bar is present and completely valid in this case. It just happens to hold a value that you don't know how to interpret.
        Hide
        Bryan Duxbury added a comment -

        The idea of a present value that you happen to not know how to interpret is not represented anywhere else in the Thrift data model. Thrift's solution to unrecognizable fields is to skip them. Also, since we don't allow unset unions (or structs that contain unset unions) to be serialized, deserializing one of these structs gets you something that already cannot be serialized straight out of the box. I think we need this change in order for Thrift to continue to behave consistently.

        Show
        Bryan Duxbury added a comment - The idea of a present value that you happen to not know how to interpret is not represented anywhere else in the Thrift data model. Thrift's solution to unrecognizable fields is to skip them. Also, since we don't allow unset unions (or structs that contain unset unions) to be serialized, deserializing one of these structs gets you something that already cannot be serialized straight out of the box. I think we need this change in order for Thrift to continue to behave consistently.
        Hide
        David Reiss added a comment -

        > Also, since we don't allow unset unions (or structs that contain unset unions) to be serialized,
        That's compelling, but before I give in, consider this:

        If Bar were not a union, but a struct containing all optional fields (which was originally supposed to be how unions are interpreted), and bar contained only fields that the receiver didn't know about, that would not cause bar to be skipped. The Foo would just contain an empty Bar.

        Do you remember why decided not to allow serializing empty unions?

        Show
        David Reiss added a comment - > Also, since we don't allow unset unions (or structs that contain unset unions) to be serialized, That's compelling, but before I give in, consider this: If Bar were not a union, but a struct containing all optional fields (which was originally supposed to be how unions are interpreted), and bar contained only fields that the receiver didn't know about, that would not cause bar to be skipped. The Foo would just contain an empty Bar. Do you remember why decided not to allow serializing empty unions?
        Hide
        Bryan Duxbury added a comment -

        If Bar was a struct, then yes, this behavior would be exactly right. That's because a struct is a collection of fields, all or some of which may be unset. However, if it was a struct, and a required field was unset, it would be invalid when deserialized under the current behavior.

        On the flip side, we've defined a union as a struct where exactly one field must be set at all times. If we didn't enforce this, then we'd essentially be introducing a null type to Thrift, which is something I think we actually want to discourage pretty actively. The fact that a union and a struct are implemented the same on the wire is only an artifact of our desire to extend union support to other languages in a backwards-compatible fashion.

        Show
        Bryan Duxbury added a comment - If Bar was a struct, then yes, this behavior would be exactly right. That's because a struct is a collection of fields, all or some of which may be unset. However, if it was a struct, and a required field was unset, it would be invalid when deserialized under the current behavior. On the flip side, we've defined a union as a struct where exactly one field must be set at all times. If we didn't enforce this, then we'd essentially be introducing a null type to Thrift, which is something I think we actually want to discourage pretty actively. The fact that a union and a struct are implemented the same on the wire is only an artifact of our desire to extend union support to other languages in a backwards-compatible fashion.
        Hide
        David Reiss added a comment -

        Then how come we don't fail validation if a structure has an optional field that contains an empty union? Or a list contains an empty union? What I don't like is that information about the context in which the union appears affects the decision about whether or not it is valid.

        Show
        David Reiss added a comment - Then how come we don't fail validation if a structure has an optional field that contains an empty union? Or a list contains an empty union? What I don't like is that information about the context in which the union appears affects the decision about whether or not it is valid.
        Hide
        Bryan Duxbury added a comment -

        Actually, we do fail validation in the cases you described during serialization. During deserialization, we don't fail validation in optional fields or collections simply because values in those fields are aren't required.

        As a side note, something I would consider totally valid is just not "setting" fields and collection elements when the contents of a Union aren't deserializable. This would just mean expanding the detection and treatment of unset unions in a bunch of other places. This would probably be a bit more consistent.

        Show
        Bryan Duxbury added a comment - Actually, we do fail validation in the cases you described during serialization. During deserialization, we don't fail validation in optional fields or collections simply because values in those fields are aren't required. As a side note, something I would consider totally valid is just not "setting" fields and collection elements when the contents of a Union aren't deserializable. This would just mean expanding the detection and treatment of unset unions in a bunch of other places. This would probably be a bit more consistent.
        Hide
        David Reiss added a comment -

        Man, that's a can of worms. What if a list contains only empty unions? Do we just skip the list? What if the value in a map is an empty union? Do we discard the key? What if a union contains an empty union?

        Show
        David Reiss added a comment - Man, that's a can of worms. What if a list contains only empty unions? Do we just skip the list? What if the value in a map is an empty union? Do we discard the key? What if a union contains an empty union?
        Hide
        David Reiss added a comment -

        Here's a higher level question. If you have a struct with a required field that is a union, is there a way to safely evolve the schema of the union?

        Show
        David Reiss added a comment - Here's a higher level question. If you have a struct with a required field that is a union, is there a way to safely evolve the schema of the union?
        Hide
        Bryan Duxbury added a comment -

        I don't think it's that much of a can of worms. Imagine that instead of a union type, the list contains enums, and you remove an enum value from the type. Those values would be dropped silently. If a struct has a required field of type enum and a value goes away, again, it will be treated as an unset field. If this field was required, then it would not pass deserialization validation.

        If you have a struct with a required field that's a union, you can add or rename fields with impunity. If you remove a union field, and you can't get a valid value in that struct field anymore, and you've marked the struct field as required, then you've literally made any serialized data containing those now-defunct values invalid. The exact same argument applies cleanly to enumerated types.

        Show
        Bryan Duxbury added a comment - I don't think it's that much of a can of worms. Imagine that instead of a union type, the list contains enums, and you remove an enum value from the type. Those values would be dropped silently. If a struct has a required field of type enum and a value goes away, again, it will be treated as an unset field. If this field was required, then it would not pass deserialization validation. If you have a struct with a required field that's a union, you can add or rename fields with impunity. If you remove a union field, and you can't get a valid value in that struct field anymore, and you've marked the struct field as required, then you've literally made any serialized data containing those now-defunct values invalid. The exact same argument applies cleanly to enumerated types.
        Hide
        David Reiss added a comment -

        > If a struct has a required field of type enum and a value goes away, again, it will be treated as an unset field. If this field was required, then it would not pass deserialization validation.
        Yeah, I don't think we should do that. It means that once you use some type as a required field, you've limited the flexiblity of that type. I think that adding a required field should only limit the flexibility of the type that has the required field.

        > If you have a struct with a required field that's a union, you can add or rename fields with impunity. If you remove a union field, and you can't get a valid value in that struct field anymore
        Ah, but you're showing your bias toward offline processing, where it's always the old code during the writing and the new code doing the reading. My bias is toward RPC, where clients and servers both read and both write, so upgrades can go in either direction. Adding a field on the write side is the same as removing a field on the read side.

        Show
        David Reiss added a comment - > If a struct has a required field of type enum and a value goes away, again, it will be treated as an unset field. If this field was required, then it would not pass deserialization validation. Yeah, I don't think we should do that. It means that once you use some type as a required field, you've limited the flexiblity of that type. I think that adding a required field should only limit the flexibility of the type that has the required field. > If you have a struct with a required field that's a union, you can add or rename fields with impunity. If you remove a union field, and you can't get a valid value in that struct field anymore Ah, but you're showing your bias toward offline processing, where it's always the old code during the writing and the new code doing the reading. My bias is toward RPC, where clients and servers both read and both write, so upgrades can go in either direction. Adding a field on the write side is the same as removing a field on the read side.
        Hide
        Bryan Duxbury added a comment -

        I don't think that using a type in a required field limits the flexibility of that type. It certainly limits the flexibility of the struct that uses that type - but that's fundamentally intentional.

        While you're right that adding a field in the RPC context is the same as "deleting" a field, again, this is only an issue if you are using required fields. If you say that a field is required, you are stating unequivocally that without some well-understood contents in that field, you can't work with the entire struct. If you plan to evolve your union fields or enum values, then you must not use required fields. I don't see how a well-meaning IDL coder could specify that a field is required and then be OK getting a null in some spot because you want to evolve elegantly.

        Let's look at an RPC example.

        client_side.thrift
        
        union MyResponse {
         1: string response_type_1;
        }
        
        service MyService {
          MyResponse someMethod(string blah);
        }
        
        server_side.thrift
        
        union MyResponse {
         1: string response_type_1;
         2: string response_type_2;
        }
        
        service MyService {
          MyResponse someMethod(string blah);
        }
        

        With these two thrift files, how does it help the client to get back an unset union instead of a validation error? As far as the client is concerned, the result is unintelligible either way. At least if the validation works like I describe, they won't have to implement arbitrarily complex app-level validation to determine that they can't interpret the message.

        Show
        Bryan Duxbury added a comment - I don't think that using a type in a required field limits the flexibility of that type. It certainly limits the flexibility of the struct that uses that type - but that's fundamentally intentional. While you're right that adding a field in the RPC context is the same as "deleting" a field, again, this is only an issue if you are using required fields. If you say that a field is required, you are stating unequivocally that without some well-understood contents in that field, you can't work with the entire struct. If you plan to evolve your union fields or enum values, then you must not use required fields . I don't see how a well-meaning IDL coder could specify that a field is required and then be OK getting a null in some spot because you want to evolve elegantly. Let's look at an RPC example. client_side.thrift union MyResponse { 1: string response_type_1; } service MyService { MyResponse someMethod(string blah); } server_side.thrift union MyResponse { 1: string response_type_1; 2: string response_type_2; } service MyService { MyResponse someMethod(string blah); } With these two thrift files, how does it help the client to get back an unset union instead of a validation error? As far as the client is concerned, the result is unintelligible either way. At least if the validation works like I describe, they won't have to implement arbitrarily complex app-level validation to determine that they can't interpret the message.
        Hide
        David Reiss added a comment -

        > I don't think that using a type in a required field limits the flexibility of that type. It certainly limits the flexibility of the struct that uses that type - but that's fundamentally intentional.

        I disagree. Compare...

        struct Contained {
          whatever
        }
        struct Outer {
          1: required Contained contents;
        }
        

        fields can be added or removed from Contained with no problems. If the recipient gets an Outer with a Contained that has no intelligible fields, it's no problem.

        union Contained {
          whatever
        }
        struct Outer {
          1: required Contained contents;
        }
        

        Now, if you add something to Contained, suddenly Outers can fail validation.

        I definitely sympathize with the need no be able to easily check a deeply-nested structure, but I don't think "require all unions to be non-empty" should be the only option. What do you think of a parameter to validate to control whether an empty union should be considered a failure. It would apply to unions in all contexts, not just required fields.

        Show
        David Reiss added a comment - > I don't think that using a type in a required field limits the flexibility of that type. It certainly limits the flexibility of the struct that uses that type - but that's fundamentally intentional. I disagree. Compare... struct Contained { whatever } struct Outer { 1: required Contained contents; } fields can be added or removed from Contained with no problems. If the recipient gets an Outer with a Contained that has no intelligible fields, it's no problem. union Contained { whatever } struct Outer { 1: required Contained contents; } Now, if you add something to Contained, suddenly Outers can fail validation. I definitely sympathize with the need no be able to easily check a deeply-nested structure, but I don't think "require all unions to be non-empty" should be the only option. What do you think of a parameter to validate to control whether an empty union should be considered a failure. It would apply to unions in all contexts, not just required fields.
        Hide
        Bryan Duxbury added a comment -

        I think the example you've laid out perfectly describes my desired behavior. You have to think of a union as a struct with exactly one required field that can take on many different values/types/names. Cast that way, it would be as though your Contained struct had a required field with an unset value for any reason. I think we can all agree in that case it should fail validation.

        As long as we continue to define a valid union as containing exactly one field, I don't see any other options that make it behave consistently. We could discuss making unions contain zero or one field, but as previously mentioned, that essentially puts a literal null into the thrift vocabulary, which I'm against. I'm also not crazy about a parameter to change this semantic, since it's probably a more context-specific thing than a global thing.

        I do think that existing required/optional semantics apply just fine to these situations and give the schema designer the flexibility to say what they need. If you think about how we treat enums in the same kind of situation, then the behavior would be exactly the same.

        Show
        Bryan Duxbury added a comment - I think the example you've laid out perfectly describes my desired behavior. You have to think of a union as a struct with exactly one required field that can take on many different values/types/names. Cast that way, it would be as though your Contained struct had a required field with an unset value for any reason. I think we can all agree in that case it should fail validation. As long as we continue to define a valid union as containing exactly one field, I don't see any other options that make it behave consistently. We could discuss making unions contain zero or one field, but as previously mentioned, that essentially puts a literal null into the thrift vocabulary, which I'm against. I'm also not crazy about a parameter to change this semantic, since it's probably a more context-specific thing than a global thing. I do think that existing required/optional semantics apply just fine to these situations and give the schema designer the flexibility to say what they need. If you think about how we treat enums in the same kind of situation, then the behavior would be exactly the same.
        Hide
        David Reiss added a comment -

        I think of a union as a struct with many optional fields, with the assumption that only one will be present.

        Your last paragraph make it seem like adding values to enums is potentially dangerous?

        Show
        David Reiss added a comment - I think of a union as a struct with many optional fields, with the assumption that only one will be present. Your last paragraph make it seem like adding values to enums is potentially dangerous?
        Hide
        Bryan Duxbury added a comment -

        It's not an assumption - it's a guarantee. Only when dealing with languages that don't implement unions directly is that an "assumption".

        Maybe part of the problem is that we're not in agreement about the nature of unions. In my use case, I definitely want "exactly one" in a required-like fashion. It sounds like you want to offer "zero or one". This sounds like the difference between an "optional" union and a "required" union. We currently don't support this distinction - rather, we just have "required" unions.

        If this formulation is correct, then I'd suggest that the way to achieve this is to let the field's required or optional modifier specify the "zero" possibility (through unset fields) and required specify the "exactly one". I think this will most support users' expectations.

        Indeed, adding a new value to an enum IS dangerous. If two different consumers have different conceptions about the valid contents of an enum, then one side can reject a value as invalid, leading to an identical situation. I don't think this is a problem - Thrift will skip it in the optional case and complain in the required case. Again, the onus is on the schema designer to specify, through optional/required, whether they can tolerate missing values in given fields.

        Show
        Bryan Duxbury added a comment - It's not an assumption - it's a guarantee. Only when dealing with languages that don't implement unions directly is that an "assumption". Maybe part of the problem is that we're not in agreement about the nature of unions. In my use case, I definitely want "exactly one" in a required-like fashion. It sounds like you want to offer "zero or one". This sounds like the difference between an "optional" union and a "required" union. We currently don't support this distinction - rather, we just have "required" unions. If this formulation is correct, then I'd suggest that the way to achieve this is to let the field's required or optional modifier specify the "zero" possibility (through unset fields) and required specify the "exactly one". I think this will most support users' expectations. Indeed, adding a new value to an enum IS dangerous. If two different consumers have different conceptions about the valid contents of an enum, then one side can reject a value as invalid, leading to an identical situation. I don't think this is a problem - Thrift will skip it in the optional case and complain in the required case. Again, the onus is on the schema designer to specify, through optional/required, whether they can tolerate missing values in given fields.
        Hide
        David Reiss added a comment -

        > If this formulation is correct, then I'd suggest that the way to achieve this is to let the field's required or optional modifier specify the "zero" possibility (through unset fields) and required specify the "exactly one". I think this will most support users' expectations.

        > Again, the onus is on the schema designer to specify, through optional/required, whether they can tolerate missing values in given fields.

        I general, I don't like the idea of showing a field as having been absent when it was present but contained and unexpected value. How does that work with containers? I think the decision of whether a field is set of unset has to be completely independent of any attempts to interpret it.

        The idea of an annotation on enum and union types that indicates whether an unexpected value should trigger a validation error is appealing to me.

        Show
        David Reiss added a comment - > If this formulation is correct, then I'd suggest that the way to achieve this is to let the field's required or optional modifier specify the "zero" possibility (through unset fields) and required specify the "exactly one". I think this will most support users' expectations. > Again, the onus is on the schema designer to specify, through optional/required, whether they can tolerate missing values in given fields. I general, I don't like the idea of showing a field as having been absent when it was present but contained and unexpected value. How does that work with containers? I think the decision of whether a field is set of unset has to be completely independent of any attempts to interpret it. The idea of an annotation on enum and union types that indicates whether an unexpected value should trigger a validation error is appealing to me.
        Hide
        Bryan Duxbury added a comment -

        How does that work with containers?

        That depends on the container. Generally, we can always read the list itself, since we read the type header and size. However, if some of the values are invalid, things get a little icky. In a list, you'd get one or more null fields. In a set, you'd get exactly one null field, thus reducing the size reported on the wire. In a map, if the key was null, then the entire entry would go away (at least in java, which doesn't allow null in HashMap keys); if the value was null, only that part of the entry would be null.

        The idea of an annotation on enum and union types that indicates whether an unexpected value should trigger a validation error is appealing to me.

        We already have this annotation - it's called "required". If it's an unexpected value, you'll get null back, and if that field is required, that's a validation error. If you don't want the validation error, don't make the field required. Can you imagine a scenario where you'd want a field of type enum to be required yet not care if the value is recognizable?

        I don't think that "unset" and "unexpected value" have a meaningful difference to any user of Thrift, since in either case the "value" will just appear to be a null. If we treat them differently, then perhaps "unset" could be determined whether or not the value is null. However, I don't really know how you would make use of this new feature other than in debugging a null value that occurred unexpectedly. Also, trying to achieve this duality would eliminate some optimizations we've made to the memory footprint of structs.

        I should also point out that in Java, our enums are actual enums, not ints. This means that if we don't recognize the value's id on the wire, we literally have no way other than null to represent that it was unrecognized.

        Show
        Bryan Duxbury added a comment - How does that work with containers? That depends on the container. Generally, we can always read the list itself, since we read the type header and size. However, if some of the values are invalid, things get a little icky. In a list, you'd get one or more null fields. In a set, you'd get exactly one null field, thus reducing the size reported on the wire. In a map, if the key was null, then the entire entry would go away (at least in java, which doesn't allow null in HashMap keys); if the value was null, only that part of the entry would be null. The idea of an annotation on enum and union types that indicates whether an unexpected value should trigger a validation error is appealing to me. We already have this annotation - it's called "required". If it's an unexpected value, you'll get null back, and if that field is required, that's a validation error. If you don't want the validation error, don't make the field required. Can you imagine a scenario where you'd want a field of type enum to be required yet not care if the value is recognizable? I don't think that "unset" and "unexpected value" have a meaningful difference to any user of Thrift, since in either case the "value" will just appear to be a null. If we treat them differently, then perhaps "unset" could be determined whether or not the value is null. However, I don't really know how you would make use of this new feature other than in debugging a null value that occurred unexpectedly. Also, trying to achieve this duality would eliminate some optimizations we've made to the memory footprint of structs. I should also point out that in Java, our enums are actual enums, not ints. This means that if we don't recognize the value's id on the wire, we literally have no way other than null to represent that it was unrecognized.
        Hide
        Bryan Duxbury added a comment -

        I general, I don't like the idea of showing a field as having been absent when it was present but contained an unexpected value

        Forgot to mention, we are already doing this for enums values. We're also doing this when we encounter a value of an unexpected type (for instance, string on the wire but field is supposed to be an int). All this ticket is aiming to do is bring union treatment in line with the rest of the types.

        Show
        Bryan Duxbury added a comment - I general, I don't like the idea of showing a field as having been absent when it was present but contained an unexpected value Forgot to mention, we are already doing this for enums values. We're also doing this when we encounter a value of an unexpected type (for instance, string on the wire but field is supposed to be an int). All this ticket is aiming to do is bring union treatment in line with the rest of the types.
        Hide
        David Reiss added a comment -

        Unexpected type is different, because that's an error on the level of the top struct, so it makes sense that it affects the top struct.

        What do you do if you see an unexpected enum value in a list?

        Show
        David Reiss added a comment - Unexpected type is different, because that's an error on the level of the top struct, so it makes sense that it affects the top struct. What do you do if you see an unexpected enum value in a list?
        Hide
        Bryan Duxbury added a comment -

        All unexpected enum values turn into null. In a list, that will lead to one or more null instances in the list.

        Show
        Bryan Duxbury added a comment - All unexpected enum values turn into null. In a list, that will lead to one or more null instances in the list.
        Hide
        David Reiss added a comment -

        Oh. I didn't know Java enums were nullable.

        Show
        David Reiss added a comment - Oh. I didn't know Java enums were nullable.
        Hide
        David Reiss added a comment -

        Sorry, I totally missed your 3:14 comment.

        > We already have this annotation - it's called "required".
        I disagree. That's an annotation on the field, not the type. There is no way to say that a list member is required, for example. Basically, I think the decision of whether an enum or union is valid should only affect the enum or union itself. It should not leak up into the containing context.

        I think it's definitely important to distinguish between "unset" and "set, but I couldn't understand it". It's the difference between "someone sent me bad data" and "Someone sent me new data that I don't know about".

        Show
        David Reiss added a comment - Sorry, I totally missed your 3:14 comment. > We already have this annotation - it's called "required". I disagree. That's an annotation on the field, not the type. There is no way to say that a list member is required, for example. Basically, I think the decision of whether an enum or union is valid should only affect the enum or union itself. It should not leak up into the containing context. I think it's definitely important to distinguish between "unset" and "set, but I couldn't understand it". It's the difference between "someone sent me bad data" and "Someone sent me new data that I don't know about".
        Hide
        Bryan Duxbury added a comment -

        Basically, I think the decision of whether an enum or union is valid should only affect the enum or union itself. It should not leak up into the containing context.

        I'm essentially saying the inverse of this. The context is what defines whether it's OK for the field to be absent or not.

        There is no way to say that a list member is required

        I think this is a red herring. The fundamental contract of a list is that it contains zero or more items. Saying that a list must have a certain number of elements is something totally separate that I don't think we'd want to support.

        I think it's definitely important to distinguish between "unset" and "set, but I couldn't understand it"

        We don't do this anywhere else, at least in Java and Ruby. If it was "set" but got skipped, then it's just as if it was unset. This is because we use the null-ness of the value to indicate unset-ness. It is possible, but not necessarily desirable, to have a separate way of detecting set-ness differently from value. Do other languages do this dfiferently? If so, with what tradeoffs?

        Show
        Bryan Duxbury added a comment - Basically, I think the decision of whether an enum or union is valid should only affect the enum or union itself. It should not leak up into the containing context. I'm essentially saying the inverse of this. The context is what defines whether it's OK for the field to be absent or not. There is no way to say that a list member is required I think this is a red herring. The fundamental contract of a list is that it contains zero or more items. Saying that a list must have a certain number of elements is something totally separate that I don't think we'd want to support. I think it's definitely important to distinguish between "unset" and "set, but I couldn't understand it" We don't do this anywhere else, at least in Java and Ruby. If it was "set" but got skipped, then it's just as if it was unset. This is because we use the null-ness of the value to indicate unset-ness. It is possible, but not necessarily desirable, to have a separate way of detecting set-ness differently from value. Do other languages do this dfiferently? If so, with what tradeoffs?
        Hide
        David Reiss added a comment -

        >> Basically, I think the decision of whether an enum or union is valid should only affect the enum or union itself. It should not leak up into the containing context.
        > I'm essentially saying the inverse of this. The context is what defines whether it's OK for the field to be absent or not.
        But it's not absent. It's present and valid. The code just isn't equipped to understand its value.

        >> I think it's definitely important to distinguish between "unset" and "set, but I couldn't understand it"
        > We don't do this anywhere else, at least in Java and Ruby. If it was "set" but got skipped, then it's just as if it was unset.
        The only time you would skip a set field is if the type were wrong. That's an error in the structure of the outer struct, so it makes sense that it affects the outer struct. In the case of the enum or union, its an error (not even a real error) in the internal structure of the enum/union, so I don't think that should bubble up into the outer struct.

        If you had something like

        union U1 { whatever }
        union U2 { 1: U1 u1 }
        union U3 { 1: U2 u2 }
        union U4 { 1: U4 u3 }
        struct S { 1: U4 u4 }
        

        and the U1 contained an unrecognized value, would you expect the top-level S to show u4 as unset?

        Show
        David Reiss added a comment - >> Basically, I think the decision of whether an enum or union is valid should only affect the enum or union itself. It should not leak up into the containing context. > I'm essentially saying the inverse of this. The context is what defines whether it's OK for the field to be absent or not. But it's not absent. It's present and valid. The code just isn't equipped to understand its value. >> I think it's definitely important to distinguish between "unset" and "set, but I couldn't understand it" > We don't do this anywhere else, at least in Java and Ruby. If it was "set" but got skipped, then it's just as if it was unset. The only time you would skip a set field is if the type were wrong. That's an error in the structure of the outer struct, so it makes sense that it affects the outer struct. In the case of the enum or union, its an error (not even a real error) in the internal structure of the enum/union, so I don't think that should bubble up into the outer struct. If you had something like union U1 { whatever } union U2 { 1: U1 u1 } union U3 { 1: U2 u2 } union U4 { 1: U4 u3 } struct S { 1: U4 u4 } and the U1 contained an unrecognized value, would you expect the top-level S to show u4 as unset?
        Hide
        Bryan Duxbury added a comment -

        I think that's a bit of an extreme example, but yes. The U1 doesn't meet the definition of a set union, which causes U2 not to, etc. The alternative would be to read in the whole stack of unions with the bottom one unset, which could then neither be used nor serialized again. You could argue that just the names of the fields that were supposed to be set above U1 would be valuable, if for nothing else than debugging.

        Also note that in your example it wouldn't be a problem for S, since the field isn't required.

        Show
        Bryan Duxbury added a comment - I think that's a bit of an extreme example, but yes. The U1 doesn't meet the definition of a set union, which causes U2 not to, etc. The alternative would be to read in the whole stack of unions with the bottom one unset, which could then neither be used nor serialized again. You could argue that just the names of the fields that were supposed to be set above U1 would be valuable, if for nothing else than debugging. Also note that in your example it wouldn't be a problem for S, since the field isn't required.
        Hide
        David Reiss added a comment -

        I think it is still a problem that you cannot distinguish absence from a newer schema.

        Show
        David Reiss added a comment - I think it is still a problem that you cannot distinguish absence from a newer schema.
        Hide
        Bryan Duxbury added a comment -

        I think it is still a problem that you cannot distinguish absence from a newer schema.

        Even if we accept that, that is a completely separate issue than the one being discussed.

        Show
        Bryan Duxbury added a comment - I think it is still a problem that you cannot distinguish absence from a newer schema. Even if we accept that, that is a completely separate issue than the one being discussed.
        Hide
        David Reiss added a comment -

        Really? It seems like it's a direct cause the decision to treat "present but uninterpretable" and "absent" as the same.

        Show
        David Reiss added a comment - Really? It seems like it's a direct cause the decision to treat "present but uninterpretable" and "absent" as the same.
        Hide
        Bryan Duxbury added a comment -

        I'm saying that regardless of whether unions are considered, it's not possible to tell the two cases apart with the way the libraries are currently structured. I feel like trying to offer this capability is worth discussion, but in my view, it's a new requirement.

        Show
        Bryan Duxbury added a comment - I'm saying that regardless of whether unions are considered, it's not possible to tell the two cases apart with the way the libraries are currently structured. I feel like trying to offer this capability is worth discussion, but in my view, it's a new requirement.
        Hide
        David Reiss added a comment -

        It is definitely possible. I would say that "the on-the-wire type of this field is different from what we expect" is an error in the structure, rather than the value. In all cases where there is a problem with the value, it will still appear set.

        Show
        David Reiss added a comment - It is definitely possible. I would say that "the on-the-wire type of this field is different from what we expect" is an error in the structure , rather than the value. In all cases where there is a problem with the value , it will still appear set.
        Hide
        Nathan Marz added a comment -

        I just ran into this issue. I agree that the behavior should be to throw a deserialization error. Part of the usefulness of unions is the guarantee that they always have exactly one field in all cases. If you want the behavior that David describes, where the object is empty if the field is missing, then you should use a struct instead of a union IMO. Alternatively, there could be a "nullable-union" alternative to union that is allowed to be empty.

        Show
        Nathan Marz added a comment - I just ran into this issue. I agree that the behavior should be to throw a deserialization error. Part of the usefulness of unions is the guarantee that they always have exactly one field in all cases. If you want the behavior that David describes, where the object is empty if the field is missing, then you should use a struct instead of a union IMO. Alternatively, there could be a "nullable-union" alternative to union that is allowed to be empty.
        Hide
        Nathan Marz added a comment -

        That said, I don't understand this patch. Shouldn't the checking be happening in the union code somewhere? For example, what if I'm deserializing a union directly that has no field set b/c of this issue?

        Show
        Nathan Marz added a comment - That said, I don't understand this patch. Shouldn't the checking be happening in the union code somewhere? For example, what if I'm deserializing a union directly that has no field set b/c of this issue?
        Hide
        Nathan Marz added a comment -

        it's also worth noting that with the current behavior of Thrift (without this issue fixed), you can fail to serialize something that you successfully deserialized. That should never be possible.

        Show
        Nathan Marz added a comment - it's also worth noting that with the current behavior of Thrift (without this issue fixed), you can fail to serialize something that you successfully deserialized. That should never be possible.
        Hide
        Nathan Marz added a comment -

        Here is a patch that will cause unions without a valid set field on the wire to throw an exception during deserialization.

        Show
        Nathan Marz added a comment - Here is a patch that will cause unions without a valid set field on the wire to throw an exception during deserialization.
        Hide
        Jeff DeCew added a comment -

        This seems odd / bad to me. When deserializing an enum with an unrecognized value, as long as the field is optional it is set to null rather than causing an exception. The same should be true of unions, which have the same "exactly one of" semantic. So a union with an unrecognized value should be null, rather than empty or always failing. Then, it is up to the context to decide if null is an acceptable value.

        It sounds like this patch is analogous to making MyEnum.findByValue() throw an exception rather than return null for an unrecognized value. This means that even if I declare my union optional, I will get protocol exceptions when I try to expand the union to include new types, rather than just having the field set to null. If we preserve the "exactly one" guarantee, we need to also ensure that unions are still able to be used as flexible types. The logical way to do this (in Java) is to set the field of type Union to null when optional, and throw a protocol exception only when the union field is required (or when the context says that null values are not allowed).

        If unions were read with a static method, like enums are, then this would probably be more intuitive.

        Reading a Struct
        if (field.type == TType.STRUCT) {
          this.myField = new MyStruct();
          this.myField.read(iprot);
        } else { 
          TProtocolUtil.skip(iprot, field.type);
        }
        
        Reading an Enum
        if (field.type == TType.I32) {
          this.myField = MyEnum.findByValue(iprot.readI32());
        } else { 
          TProtocolUtil.skip(iprot, field.type);
        }
        
        Proposed Reading a Union
        if (field.type == TType.STRUCT) {
          this.myField = MyUnion.staticRead(iprot);
        } else { 
          TProtocolUtil.skip(iprot, field.type);
        }
        

        I don't know how this ports to reading into lists, sets, or maps or to reading return values, but it makes sense that the behavior should match exactly with enums for all those cases.
        Once enums and unions are handled symmetrically, changes to their behavior should be considered together.

        Show
        Jeff DeCew added a comment - This seems odd / bad to me. When deserializing an enum with an unrecognized value, as long as the field is optional it is set to null rather than causing an exception. The same should be true of unions, which have the same "exactly one of" semantic. So a union with an unrecognized value should be null, rather than empty or always failing. Then, it is up to the context to decide if null is an acceptable value. It sounds like this patch is analogous to making MyEnum.findByValue() throw an exception rather than return null for an unrecognized value. This means that even if I declare my union optional, I will get protocol exceptions when I try to expand the union to include new types, rather than just having the field set to null. If we preserve the "exactly one" guarantee, we need to also ensure that unions are still able to be used as flexible types. The logical way to do this (in Java) is to set the field of type Union to null when optional, and throw a protocol exception only when the union field is required (or when the context says that null values are not allowed). If unions were read with a static method, like enums are, then this would probably be more intuitive. Reading a Struct if (field.type == TType.STRUCT) { this .myField = new MyStruct(); this .myField.read(iprot); } else { TProtocolUtil.skip(iprot, field.type); } Reading an Enum if (field.type == TType.I32) { this .myField = MyEnum.findByValue(iprot.readI32()); } else { TProtocolUtil.skip(iprot, field.type); } Proposed Reading a Union if (field.type == TType.STRUCT) { this .myField = MyUnion.staticRead(iprot); } else { TProtocolUtil.skip(iprot, field.type); } I don't know how this ports to reading into lists, sets, or maps or to reading return values, but it makes sense that the behavior should match exactly with enums for all those cases. Once enums and unions are handled symmetrically, changes to their behavior should be considered together.
        Hide
        Nathan Marz added a comment -

        That's a good point, I agree with what you've said. So we need a more complicated patch.

        Show
        Nathan Marz added a comment - That's a good point, I agree with what you've said. So we need a more complicated patch.
        Hide
        David Reiss added a comment -

        FWIW, I like the patch you posted better than the proposed solution. In my mind, the "required" annotation means that we expect it to be present, not that we expect it to be fully understood. My feeling is that the processing of a union or enum should be the same whether it is appearing in the context of a required field, optional field, union, or container.

        Show
        David Reiss added a comment - FWIW, I like the patch you posted better than the proposed solution. In my mind, the "required" annotation means that we expect it to be present, not that we expect it to be fully understood. My feeling is that the processing of a union or enum should be the same whether it is appearing in the context of a required field, optional field, union, or container.
        Hide
        Nathan Marz added a comment -

        That's a fair point. Does it make sense to open a new issue called "Deserializer should throw exception if union is missing field", apply my patch to fix that, and keep the discussion open on this issue on what the semantics for an invalid union on an optional field means?

        Show
        Nathan Marz added a comment - That's a fair point. Does it make sense to open a new issue called "Deserializer should throw exception if union is missing field", apply my patch to fix that, and keep the discussion open on this issue on what the semantics for an invalid union on an optional field means?
        Hide
        Bryan Duxbury added a comment -

        I like Jeff's suggestion. Then we actually wouldn't have to worry about changing the way that unions are treated in required fields - they'd just turn out null.

        If we read a null union into a list, it'll just make multiple null occurrences. In a set, you'll probably get an NPE, since null isn't hashable. In a map key, you'll get the same NPE, but in a value, it'd just be null.

        Show
        Bryan Duxbury added a comment - I like Jeff's suggestion. Then we actually wouldn't have to worry about changing the way that unions are treated in required fields - they'd just turn out null. If we read a null union into a list, it'll just make multiple null occurrences. In a set, you'll probably get an NPE, since null isn't hashable. In a map key, you'll get the same NPE, but in a value, it'd just be null.
        Hide
        David Reiss added a comment -

        What about Nathan's point about not being able to serialize a list with nulls in it? Shouldn't container members be considered required?

        Show
        David Reiss added a comment - What about Nathan's point about not being able to serialize a list with nulls in it? Shouldn't container members be considered required?
        Hide
        Bryan Duxbury added a comment -

        That's a fair point. I could see it going either way. Would we rather deserialize whatever we could and shrink the list appropriately, or maintain the count but allow nulls in the list? The former guarantees you can serialize again immediately; the latter preserves the most information, albeit not very usefully.

        Show
        Bryan Duxbury added a comment - That's a fair point. I could see it going either way. Would we rather deserialize whatever we could and shrink the list appropriately, or maintain the count but allow nulls in the list? The former guarantees you can serialize again immediately; the latter preserves the most information, albeit not very usefully.
        Hide
        David Reiss added a comment -

        I think shrinking the list is really bad, because the list indices might be significant.

        Show
        David Reiss added a comment - I think shrinking the list is really bad, because the list indices might be significant.
        Hide
        Jeff DeCew added a comment -

        I think it is a great point, but I ask: what do we do in the same case with enums?
        If we can find satisfactory solutions to all these problems for enums, the same solutions will work just fine for Unions.
        If we can't find the solutions for enums then maybe enums should be stored as integers, and unions should always be allowed to be empty.

        Getting back to my question, it looks like the deserialization leaves unknown enums values as null in lists. This means that the data can't be immediately reserialized. Certainly, this is preferable to receiving an exception on deserialization and not even getting the list. I agree with David, leaving null entries is better than shortening the list. However, if we did shorten the list, we could also (intuitively) shorten sets and maps (rather than throwing NPE).

        But all this complexity is weighed against simply changing the definition of a union such that it is allowed to be serialized and deserialized with no value at all. What was the argument against that again? We don't want a Null type in thrift? Isn't that what an empty struct is? I still think it's worth trying to make unions like enums rather than structs, but if things get much messier, I might have to change my mind.

        Show
        Jeff DeCew added a comment - I think it is a great point, but I ask: what do we do in the same case with enums? If we can find satisfactory solutions to all these problems for enums, the same solutions will work just fine for Unions. If we can't find the solutions for enums then maybe enums should be stored as integers, and unions should always be allowed to be empty. Getting back to my question, it looks like the deserialization leaves unknown enums values as null in lists. This means that the data can't be immediately reserialized. Certainly, this is preferable to receiving an exception on deserialization and not even getting the list. I agree with David, leaving null entries is better than shortening the list. However, if we did shorten the list, we could also (intuitively) shorten sets and maps (rather than throwing NPE). But all this complexity is weighed against simply changing the definition of a union such that it is allowed to be serialized and deserialized with no value at all. What was the argument against that again? We don't want a Null type in thrift? Isn't that what an empty struct is? I still think it's worth trying to make unions like enums rather than structs, but if things get much messier, I might have to change my mind.
        Hide
        Jeff DeCew added a comment -

        I did some tests, and thought I would follow up with how unknown values for enums are treated on when reading (in Java with 0.3.0-rc4).

        Bottom line: unknown enum values are interpreted as null when being deserialized, and all other code operates as you might expect it would when reading a 'null' off the wire.
        Result: Objects successfully deserialize that cannot be reserailized

        • In Collections
          • All unknown enum values in a set will be replaced with a single 'null' entry in the set. Size of the set may change.
          • All unknown enum keys in a map will be replaced with a single 'null' key in the map. Size of the map may change. The value for that key will be the last value for any unknown enum key in that map that was sent over the wire.
          • All unknown enum values in a list will be replaced with 'null', but the ordering and size of the list will not change.
          • All unknown enum values in a map will be replaced with 'null', but the existence of the key will be unaltered. Size of the map won't change.
        • In Structs
          • An unknown value for a standard (optional) enum field will result in a null (unset) field (even overwriting a default value)
          • An unknown value for a required enum field will result in a validation failure during the read (TProtocolException)
        • In Service Calls
          • An unknown (to client) enum value as a return value results in a TApplicationException (like "getEnum failed: unknown result") on the client
          • An unknown (to server) enum value as a standard (optional) method argument still invokes the method, but with the value of 'null' for the enum
          • An unknown (to server) enum value as a required method argument results in a TApplicationException (like "Required field 'value' is unset! Struct:fromEnumRequired_args(value:null)") on the client

        Based on what I gleaned from the discussion and some peeking at the code, it should be quite possible (if there were consensus to do so) to make sure Unions behave the same way.

        Show
        Jeff DeCew added a comment - I did some tests, and thought I would follow up with how unknown values for enums are treated on when reading (in Java with 0.3.0-rc4). Bottom line: unknown enum values are interpreted as null when being deserialized, and all other code operates as you might expect it would when reading a 'null' off the wire. Result: Objects successfully deserialize that cannot be reserailized In Collections All unknown enum values in a set will be replaced with a single 'null' entry in the set. Size of the set may change. All unknown enum keys in a map will be replaced with a single 'null' key in the map. Size of the map may change. The value for that key will be the last value for any unknown enum key in that map that was sent over the wire. All unknown enum values in a list will be replaced with 'null', but the ordering and size of the list will not change. All unknown enum values in a map will be replaced with 'null', but the existence of the key will be unaltered. Size of the map won't change. In Structs An unknown value for a standard (optional) enum field will result in a null (unset) field (even overwriting a default value) An unknown value for a required enum field will result in a validation failure during the read (TProtocolException) In Service Calls An unknown (to client) enum value as a return value results in a TApplicationException (like "getEnum failed: unknown result") on the client An unknown (to server) enum value as a standard (optional) method argument still invokes the method, but with the value of 'null' for the enum An unknown (to server) enum value as a required method argument results in a TApplicationException (like "Required field 'value' is unset! Struct:fromEnumRequired_args(value:null)") on the client Based on what I gleaned from the discussion and some peeking at the code, it should be quite possible (if there were consensus to do so) to make sure Unions behave the same way.
        Hide
        Bryan Duxbury added a comment -

        An unknown (to server) enum value as a required method argument results in a TApplicationException (like "Required field 'value' is unset! Struct:fromEnumRequired_args(value:null)") on the client

        Right. So what I propose is that when we are checking if a required field is set, if it's a union field, we also check if the union itself is set. This would make it behave similarly to enum values.

        Thoughts?

        Show
        Bryan Duxbury added a comment - An unknown (to server) enum value as a required method argument results in a TApplicationException (like "Required field 'value' is unset! Struct:fromEnumRequired_args(value:null)") on the client Right. So what I propose is that when we are checking if a required field is set, if it's a union field, we also check if the union itself is set. This would make it behave similarly to enum values. Thoughts?
        Hide
        Jeff DeCew added a comment -

        I agree.

        How do you feel about using the MyUnion.staticRead(TProtocol) pattern (which would return null for an unrecognized value) to accomplish this? And the same logic would hold for all structs, not just method_args structs?

        Show
        Jeff DeCew added a comment - I agree. How do you feel about using the MyUnion.staticRead(TProtocol) pattern (which would return null for an unrecognized value) to accomplish this? And the same logic would hold for all structs, not just method_args structs?

          People

          • Assignee:
            Bryan Duxbury
            Reporter:
            Bryan Duxbury
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development