Currently, LazySimpleSerDe is used to send data to the user transformation functions. It would be useful to let the user specify the format of that data.
Specifically, it would be easy and useful to accommodate typed bytes:
(cut and paste from Venky's mail)
Here's the typedbytes stuff that Dumbo uses.
Timings for an IP count program on 300 GB of weblogs:
Java: 8 minutes
Dumbo with typed bytes: 10 minutes
Hive: 13 minutes
Dumbo without typed bytes: 16 minutes
They also have a fast decoder for Python for this, which is apparently 25% faster than the plain Python version.
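
For readers unfamiliar with typed bytes, here is a minimal sketch of the encoding, based on the layout documented for Hadoop's typedbytes package: each value is a one-byte type code followed by a big-endian payload. Only two of the type codes are covered (3 for a 32-bit integer, 7 for a length-prefixed UTF-8 string), and the function names encode and decode are hypothetical helpers for illustration, not part of Dumbo or Hive.

```python
import struct

# Type codes from the Hadoop typedbytes package documentation (subset only).
TYPE_INT = 3      # payload: 32-bit signed integer, big-endian
TYPE_STRING = 7   # payload: 32-bit length, then that many UTF-8 bytes


def encode(value):
    """Encode an int or str as one typed bytes sequence (illustrative subset)."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; this sketch does not handle it.
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        return struct.pack(">Bi", TYPE_INT, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return struct.pack(">Bi", TYPE_STRING, len(data)) + data
    raise TypeError("unsupported type: %r" % type(value))


def decode(buf, offset=0):
    """Decode one value starting at offset; return (value, next_offset)."""
    code = buf[offset]
    if code == TYPE_INT:
        (value,) = struct.unpack_from(">i", buf, offset + 1)
        return value, offset + 5
    if code == TYPE_STRING:
        (length,) = struct.unpack_from(">i", buf, offset + 1)
        start = offset + 5
        return buf[start:start + length].decode("utf-8"), start + length
    raise ValueError("unsupported type code: %d" % code)


# A key/value pair such as an IP count job might emit: (ip address, count).
record = encode("10.0.0.1") + encode(7)
key, pos = decode(record)
count, end = decode(record, pos)
```

Because every value carries its own type and length, a transformation script can read records without parsing delimited text, which is presumably where the speedup in the timings above comes from.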