Issue Details (XML | Word | Printable)

Key: THRIFT-409
Type: New Feature New Feature
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: Bryan Duxbury
Reporter: Bryan Duxbury
Votes: 0
Watchers: 6
Operations

If you were logged in you would be able to see more operations.
Thrift

Add "union" to Thrift

Created: 26/Mar/09 06:06 PM   Updated: 01/Sep/09 11:04 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.2

Time Tracking:
Not Specified

File Attachments:
  Size
File Licensed for inclusion in ASF works one_rule.diff 2009-04-15 09:49 PM David Reiss 2 kB
Text File Licensed for inclusion in ASF works thrift-409-v2.patch 2009-07-29 06:12 PM Bryan Duxbury 33 kB
Text File Licensed for inclusion in ASF works thrift-409-v3.patch 2009-07-29 06:44 PM Bryan Duxbury 31 kB
Text File Licensed for inclusion in ASF works thrift-409-v4.patch 2009-08-05 07:20 PM Bryan Duxbury 32 kB
Text File Licensed for inclusion in ASF works thrift-409-v5.patch 2009-08-08 12:48 AM Bryan Duxbury 34 kB
Text File Licensed for inclusion in ASF works thrift-409-v6.patch 2009-08-11 10:52 PM Bryan Duxbury 35 kB
Text File Licensed for inclusion in ASF works thrift-409-v7.patch 2009-08-12 05:01 PM Bryan Duxbury 36 kB
Text File Licensed for inclusion in ASF works thrift-409-v8.patch 2009-08-26 08:01 PM Bryan Duxbury 36 kB
Text File Licensed for inclusion in ASF works thrift-409-v9.patch 2009-09-01 10:08 PM Bryan Duxbury 36 kB
Text File Licensed for inclusion in ASF works thrift-409.patch 2009-04-15 08:37 PM Bryan Duxbury 31 kB

Patch Info: Patch Available
Resolution Date: 01/Sep/09 11:04 PM


 Description  « Hide
It would be very helpful to have a "union" construct in Thrift. Let's decide on the design and then break up into sub-issues to add this feature.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Bryan Duxbury added a comment - 26/Mar/09 06:23 PM
For the IDL portion, I think we should do:
union <union name> {
  <field id>: <type> <field name> <optional line ender>
}

An example would look like:

union MyUnion {
  1: string a_string;
  2: i32 an_i32;
}

Bryan Duxbury added a comment - 26/Mar/09 06:35 PM
Some features of unions:
  • May contain exactly one of the fields in its definition. (Should we go so far as to say that the wire protocol only allows for one field in unions?)
  • When reading, if it encounters a field it doesn't recognize, it should skip the entire field, essentially reporting the field as unset.
  • It should be impossible to have a union with no set field. For nullable types, nulls should not be allowed.
  • When you are dealing with a union in-memory, it should be able to tell you programmatically which of its fields are set.

Mathias Herberts added a comment - 26/Mar/09 06:46 PM
I can't really view the semantics and uses of this. If we take the semantics of a C union, wouldn't

union MyUnion { 1: string a_string, 2: i32 an_i32, }

be roughly equivalent to

struct MyUnion { 1: optional string a_string, 2: optional i32 an_i32, }

with exclusion constraints enforced at the app level?


Bryan Duxbury added a comment - 26/Mar/09 08:02 PM
Maybe it's not really a union. It's kind of more like a "one-of" struct. I should have pointed out as an additional feature that unlike a true union, you should NOT be able to use this structure to "convert" between one datatype and another. Whatever type it is when you create it, it's that type until you set it to another value.

As such, it may not actually be implemented in c/cpp as a true union. It might need to be class wrapped around a union.


David Reiss added a comment - 26/Mar/09 08:03 PM

May contain exactly one of the fields in its definition. (Should we go so far as to say that the wire protocol only allows for one field in unions?)

I say "yes and no" to this. Union-aware readers should probably throw an exception if more than one field is set, but I think that we should not create a special wire-format for unions. Just use the struct format.

When reading, if it encounters a field it doesn't recognize, it should skip the entire field, essentially reporting the field as unset.

That doesn't really work in containers, and I think it actually makes the code quite a bit trickier in structs. Maybe we could have a special setting for each union indicating that none of the fields are set.

It should be impossible to have a union with no set field. For nullable types, nulls should not be allowed.

I think I disagree with this. I think we have to have this type of representation in order to be able to read a union with no set field. Therefore, we might as well give apps the ability to use it.

When you are dealing with a union in-memory, it should be able to tell you programmatically which of its fields are set.

Do you mean "which of its fields is set"? If so, I agree. Probably just an int equal to the id of the field that was read, which will make it faster than checking isset for each individual field.

I can't really view the semantics and uses of this.

Your description of the semantics is exactly correct. In addition, we can make it more efficient to determine which field is set. I'll let Bryan comment on the uses.

Also, there is no need to add any new syntax for this while we are experimenting with it. We can just make it an annotation.


David Reiss added a comment - 26/Mar/09 08:04 PM
I think a class wrapped around a boost::variant is more likely. You can't put strings and vectors in unions.

Bryan Duxbury added a comment - 26/Mar/09 08:18 PM

Just use the struct format.

I thought of this as an option, but I think that will lead to suboptimal wire performance. For instance, we definitely do not need a whole extra stop byte at the end of the union just to say there's nothing more coming. There should only ever be one field, so I say let's tailor it accordingly.

That doesn't really work in containers, and I think it actually makes the code quite a bit trickier in structs. Maybe we could have a special setting for each union indicating that none of the fields are set.

I'll admit I'm not sure on how the container thing might work, but I don't think it will make the struct code that much more complicated. I was thinking that our generated union classes would have a static read() method that could return null if an invalid union was read from the wire. If a null union is unacceptable in any circumstance, I think we should just use some constant "invalid" union or something to indicate there was a problem. However, I don't think an empty union should be serializable.

Do you mean "which of its fields is set"

Yep.

Also, there is no need to add any new syntax for this while we are experimenting with it. We can just make it an annotation.

Maybe. I think we'd kind of have to do a lot of working around the built in generator behavior to do this, and if we ultimately decide to have this feature, it'd be silly to use annotations for it.


David Reiss added a comment - 26/Mar/09 10:32 PM

I thought of this as an option, but I think that will lead to suboptimal wire performance.

It's one byte. Using a different wire format makes it impossible to implement this incrementally. Every language must be able to skip unions before you can start writing them. Plus it makes it impossible to transparently upgrade from current "all-optional" structures.

I'll admit I'm not sure on how the container thing might work, but I don't think it will make the struct code that much more complicated.

I think it is important to figure the containers out, and I disagree about the struct case. If you have a state that represents a union whose content could not be read, then the code for reading a union looks exactly the same as the code for reading a structure, when viewed from the perspective of the caller.


Bryan Duxbury added a comment - 26/Mar/09 10:47 PM
I see what you're saying about incremental roll-out. At the cost of one additional byte, it might be worth it to keep the same wire format. To make the code efficient, though, I don't think that a union should include the logic for "skipping" subsequent fields. That is, I don't want to write a loop at all in the read() method, so it might just look like, read field id, read type, read value, read stop, if not stop, throw protocol exception.

David Reiss added a comment - 26/Mar/09 11:21 PM
Sounds good.

Bryan Duxbury added a comment - 15/Apr/09 08:37 PM
Here's my first bit of work on this issue. This patch includes some changes to the lexer and parser to add the "union" construct, some changes to the compiler to support unions, and work on the java library and generator to support generating unions.

There's a small test for the java unions, but it needs to be greatly expanded upon.


David Reiss added a comment - 15/Apr/09 09:23 PM
I think this should probably be done with less code duplication. Specifically:
  • In the parser, make it accept struct or union as the beginning of the definition, rather than creating a whole separate rule for unions.
  • Rather than "add_union" in programs and "generate_union" in generators, just make unions a special structure with a flag (or annotation!) set. That way all languages will have instant interoperability with unions (with a little help from the application).

Bryan Duxbury added a comment - 15/Apr/09 09:39 PM

In the parser, make it accept struct or union as the beginning of the definition, rather than creating a whole separate rule for unions.

I'd be open to this change, but I'm not exactly clear on how all the parser stuff works. I did more of a copy and paste approach than a genuine change.

... just make unions a special structure ...

I was sort of going for this, though I can see how it would make sense to refine it a little. My overall intent was to diverge the code generators quite a bit between structs and unions, since I'm expecting to be able to do a lot of simplifying things for unions. (Take a look at the TUnion base class that's included in the patch - there doesn't have to be as much generated code in unions as structs.)


David Reiss added a comment - 15/Apr/09 09:49 PM
Well, the easiest way to parse it would be to just add it as an annotation.
This patch (on top of yours) should do the trick. It uses add_struct for both structs and unions.
You might have to delete trailing whitespace to apply it.

Chris Goffinet added a comment - 30/Jun/09 04:20 PM
What's the status of this? Can we get someone to review and commit?

Bryan Duxbury added a comment - 30/Jun/09 04:27 PM
This issue is still very much work in progress. There isn't even one language that implements union as something other than a struct. We could conceivably commit the general compiler stuff first and work on languages later, but I'm not in any hurry to do that.

Bryan Duxbury added a comment - 29/Jul/09 06:12 PM
Ok, here is an incremental improvement on the patch I put up so long ago. It was a lot further along than I remembered! This now includes David's patch.

I'd love to get some comments on what needs improvement, both from the thrift code and user-facing code standpoint, if people have input to give.


David Reiss added a comment - 29/Jul/09 06:22 PM
You added some whitespace to a blank line at the end of the parser (.yy). Also, I think a lot of the stuff you did under src/parse can be reverted since my version of the parser change just treats unions as structs with a special flag set. I haven't looked at the Java-specific stuff yet.

Bryan Duxbury added a comment - 29/Jul/09 06:44 PM
OK, I removed some of the unnecessary parser changes. I also reorganized some of the generated code, and generated a field metadata map so that it looks like other TBase objects.

Bryan Duxbury added a comment - 05/Aug/09 07:20 PM
Here's the latest patch that includes a compareTo for compatible unions.

David Goldblatt added a comment - 06/Aug/09 07:19 PM
I read through thrift-409-v4.patch and saw a couple pointer versus value issues:

187: value_ = other.value_
781: getFieldValue() == other.getFieldValue()

The todo on 237 is probably worth doing, since a default constructor is declared.


Bryan Duxbury added a comment - 08/Aug/09 12:48 AM
This patch addresses David's issues, and as a side effect cleans up some confusing stuff in TBase.java.

Bryan Duxbury added a comment - 11/Aug/09 10:52 PM
This fixes a bug in the equals method and switches all the shorts to ints.

Bryan Duxbury added a comment - 12/Aug/09 05:01 PM
This version includes some minor bug fixes, as well as helper "constructors" so that you can instantiate a union of a given field type in a java-type-safe manner.

Bryan Duxbury added a comment - 12/Aug/09 05:06 PM
I would love for some more reviews of this code. I think it's getting pretty darn close to ready to commit. Rapleaf is already gearing up to make use of it in near-production. Anyone have time to take a look?

Piotr Kozikowski added a comment - 14/Aug/09 02:52 PM
In the Java-specific stuff, things seem a bit too type-unsafe to me. A lot of it has to do with limitations inherent to Java generics, but in some places type info is lost where it mustn't. For instance, the deep copy constructor uses generic code in TUnion.java instead of custom generated code like a normal struct would do. The constructor could instead check for a valid value of setField_, and invoke the corresponding (generated) type-safe code (I see there is a checkType() method, but it doesn't work for containers).

Likewise, instead of having just one constructor like this:

TUnion(int setField, Object value)

there could be one with an explicit type for each field, with the corresponding check that the appropriate value for setField is used according to the field's type. This way user code attempting to assign an object of invalid type to a field would not even compile.


Bryan Duxbury added a comment - 14/Aug/09 04:16 PM
The generic constructor is intentionally type-unsafe for flexibility. If you want type safety, you can use the FooUnion.field_name() style static helper constructors. Those are type safe and guarantee that you are getting the field id you expected.

I made the choice not to use a ton of generated code in TUnion because it greatly simplifies the generator and makes the code much easier to follow.


Piotr Kozikowski added a comment - 14/Aug/09 04:56 PM

I made the choice not to use a ton of generated code in TUnion because it greatly simplifies the generator and makes the code much easier to follow.

I would imagine most of the required code is already there in the generator since normal structs use it.


Jonathan Ellis added a comment - 20/Aug/09 05:46 PM
this patch only adds actual union code generation for Java, right?

Bryan Duxbury added a comment - 21/Aug/09 04:16 PM
That's correct. It's intended to serve as a reference implementation for other language developers. As a bonus, languages that don't yet generate unions specifically can just generate a struct, and everything will continue to work.

Bryan Duxbury added a comment - 26/Aug/09 08:01 PM
Updated to work with the latest trunk.

Bryan Duxbury added a comment - 31/Aug/09 11:51 PM
Does anyone object to me committing this? It has been available for comment for quite some time, and there aren't any real outstanding bugs in the implementation. I believe it is ready to be used, and that improvements can be made in separate tickets after this initial issue is committed. If no one chimes in to the contrary, I will commit this patch tomorrow evening.

David Reiss added a comment - 01/Sep/09 09:08 PM
Undo changes to t_type.h, and this hunk:
  • } else if (type->is_struct() || type->is_xception()) { + } else if (type->is_struct() || type->is_xception() || type->is_union()) {

Unions are a subset of structures, so there is no need to have t_type be aware of their existence, or check for them specially.

Should deepCopyObject be used for TStruct as well?

+ protected TUnion(int setField, Object value) {
should call setFieldValue

It looks like read doesn't verify that the on-the-wire type is the same as the type that is expected. All of the struct implementations skip the field when this is the case.

Unless I'm missing something, the biggest problem is that you do not call writeFieldStop in write. This means that a union-unaware recipient would be unable to skip over a union that was sent to them, and their deserialization would desynchronize. I think this is a blocker. See https://issues.apache.org/jira/browse/THRIFT-409?focusedCommentId=12689715&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12689715


David Reiss added a comment - 01/Sep/09 09:23 PM
Never mind about verifying on-the-wire types. I see it now. Also, you are missing a newline in the gen-code at the end of the single-field named constructors.

Nathan Marz added a comment - 01/Sep/09 09:25 PM
Printing of byte[] in Java implementation is broken. It prints as [B@6e0142c.

Bryan Duxbury added a comment - 01/Sep/09 09:35 PM

Unless I'm missing something, the biggest problem is that you do not call writeFieldStop in write...

Wow, yeah, this is definitely an oversight on my part. Is it just me or is writeFieldStop actually misnamed? Isn't it really the end of the struct, not a field?


David Reiss added a comment - 01/Sep/09 09:49 PM
It is a special field header that signifies the end of the fields. Misnamed? No. Unintuitive? Yes.

David Reiss added a comment - 01/Sep/09 09:51 PM
I think the simplest test for the skip functionality is to write your StructWithAUnion to a buffer, then try to read that buffer as a "Emtpy" (defined in the same .thrift file).

Bryan Duxbury added a comment - 01/Sep/09 10:08 PM
OK, this patch fixes the problems mentioned in the comments. Can you let me know if this looks ok? I included the mentioned Empty skip test, which passes now.

David Reiss added a comment - 01/Sep/09 10:56 PM
This seems good. I'm pretty sure there is one more bug, but it is not a blocker for me if it is not a blocker for you. As you are aware, you don't verify that that the second field header is actually the stop field. As a result, if someone sends you a struct with more than one field (like, a C++ client that has non-nullable fields and currently ignores the union/struct distinction) that you expect to be a union, you will desynchronize. I think the simplest fix would be to check it and throw an exception if it is not the stop field. You could also

There is probably a similar problem if you attempt to deserialize a struct with no fields.


David Reiss added a comment - 01/Sep/09 11:00 PM
sigh. You could also just skip the rest of the fields and say the first one wins. Or the last one wins. Or a random one wins, to make sure we don't depend on coincidental but repeatable behavior. Seriously, though, it is probably best to check and throw.

Bryan Duxbury added a comment - 01/Sep/09 11:04 PM
I just committed this.