Affects Version/s: None
Fix Version/s: None
In short, I'd like a way to get a Kryo serializer for "private" use that has the same configuration used by storm for serializing inter-worker tuples. Ideally, that would be a version of SerializationFactory.getKryo() that returned a Kryo object with the same configuration as storm's, but that doesn't exist.
Obviously, we can pass the topology configuration Map to getKryo(), but there's no way for a library, whose internal workings should be opaque to storm components, to get access to that Map.
We've worked around this by adding an initialization call in our components' open/prepare methods, but that's just icky. It would be much cleaner if the library could handle this on its own, without bothering component developers who shouldn't have to deal with it.
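For illustration, the workaround looks roughly like the sketch below. This is not our actual code: StormConfHolder is a hypothetical name, and the Map is typed as Map<String, Object> here for convenience. Every component has to remember to call init() from its open/prepare method, which is exactly the part we'd like to get rid of.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder class living inside the library; not part of storm.
final class StormConfHolder {
    private static final AtomicReference<Map<String, Object>> CONF = new AtomicReference<>();

    private StormConfHolder() {}

    // Must be called by each component from open()/prepare().
    // Only the first call in a worker takes effect; later calls are ignored.
    static void init(Map<String, Object> stormConf) {
        CONF.compareAndSet(null, Collections.unmodifiableMap(new HashMap<>(stormConf)));
    }

    // Used by library internals when they need to configure their own Kryo.
    static Map<String, Object> get() {
        Map<String, Object> conf = CONF.get();
        if (conf == null) {
            throw new IllegalStateException(
                "StormConfHolder.init(conf) was not called from open()/prepare()");
        }
        return conf;
    }
}
```

If a component developer forgets the init() call, the library only finds out at the first lazy field access, which is a poor failure mode.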
I can think of several solutions, any of which would be acceptable. Some would probably be useful for cases other than ours.
- As mentioned above, a variant of SerializationFactory.getKryo() that returned a Kryo object with the same configuration as storm's.
- An API that could be called anywhere that returned the Map passed to components' open/prepare methods.
- A mechanism to allow registering an initialization function to be called on worker startup. It would be passed the above-mentioned Map, and all initializers would be called before the worker started any components. (I kind of like this one best. Seems most flexible.)
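To make the third option concrete, here is a rough sketch of what such a registry could look like. Everything in it is hypothetical (WorkerInitializers, register, runAll); nothing like this is claimed to exist in storm, and how initializers get registered before worker startup (a config entry, a service file, etc.) is left open.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical API sketch; not an existing storm class.
final class WorkerInitializers {
    private static final List<Consumer<Map<String, Object>>> HOOKS =
        new CopyOnWriteArrayList<>();

    private WorkerInitializers() {}

    // A library registers a callback that wants to see the topology configuration.
    static void register(Consumer<Map<String, Object>> hook) {
        HOOKS.add(hook);
    }

    // The worker would call this once, before starting any spout or bolt,
    // passing the same Map that components receive in open()/prepare().
    static void runAll(Map<String, Object> stormConf) {
        for (Consumer<Map<String, Object>> hook : HOOKS) {
            hook.accept(stormConf);
        }
    }
}
```

With something like this, our library could register a hook that builds its private Kryo from the Map, and component developers would never see it.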
We have a custom Kryo serializer for our events that implements lazy deserialization. Most tuples have just a single event object. On serialization, some fields of the event are serialized using Kryo, others with a more primitive method. But on deserialization on receipt in a worker, no fields are actually deserialized; fields are only deserialized when referenced by the receiving bolt. On re-serialization for output, only fields modified within the worker are serialized. Since a large majority of fields in our events never change as they flow through multiple bolts, this saves considerable CPU in serialization/deserialization.
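A simplified sketch of the per-field idea, for illustration only. LazyField is a hypothetical class, and the decoder/encoder functions stand in for whatever actually does the work (Kryo for some fields, the more primitive method for others).

```java
import java.util.function.Function;

// Hypothetical illustration of one lazily deserialized field.
final class LazyField<T> {
    private final Function<byte[], T> decoder;  // stand-in for Kryo or a primitive decoder
    private final Function<T, byte[]> encoder;  // stand-in for the matching encoder
    private byte[] bytes;     // wire form, as received from the upstream worker
    private T value;          // decoded form, filled in on first get()
    private boolean decoded;  // true once value is valid
    private boolean dirty;    // true if value changed since bytes was produced

    LazyField(byte[] bytes, Function<byte[], T> decoder, Function<T, byte[]> encoder) {
        this.bytes = bytes;
        this.decoder = decoder;
        this.encoder = encoder;
    }

    // Deserialization happens here, when a bolt first references the field.
    T get() {
        if (!decoded) {
            value = decoder.apply(bytes);
            decoded = true;
        }
        return value;
    }

    void set(T newValue) {
        value = newValue;
        decoded = true;
        dirty = true;
    }

    // Re-serialization: reuse the received bytes unless the field was modified.
    byte[] toBytes() {
        if (dirty) {
            bytes = encoder.apply(value);
            dirty = false;
        }
        return bytes;
    }

    boolean isDecoded() {
        return decoded;
    }
}
```

The decoder runs inside get(), long after storm's own deserialization has finished, which is why the library needs a Kryo of its own at that point.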
The issue is: When the event is serialized by storm, we use storm's Kryo serializer. But deserialization of a field may happen when a bolt references a serialized field, and at that point we don't have storm's Kryo, only one we created ourselves. Ensuring the two Kryos are configured the same requires access to the storm configuration Map.
Like I say, we've hacked around this issue, but would prefer a cleaner solution.