Description
Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
public int resolveUnion(Schema union, Object datum) {
  int i = 0;
  for (Schema type : union.getTypes()) {
    if (instanceOf(type, datum))
      return i;
    i++;
  }
  throw new UnresolvedUnionException(union, datum);
}
it showed up as a hotspot when we were doing some serialization performance analysis. A simple optimization is to keep a map from each member schema to its index within the UnionSchema object (this could even be a perfect hash map, since the possible keys are known in advance). The gain is most noticeable when a union within the schema contains many types (more than 40 in some of our use cases). In that scenario, we observed a 25% improvement by using an identity hash map.
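To illustrate the idea without depending on Avro, here is a minimal, self-contained sketch. TypeTag is a hypothetical stand-in for Schema, and the class and method names are invented for illustration; none of this is Avro's API. The linear scan compares by identity only to keep the sketch short, whereas the real loop calls instanceOf(type, datum).

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: TypeTag stands in for org.apache.avro.Schema (assumption). */
final class UnionIndexSketch {

    /** Hypothetical stand-in for one branch schema of a union. */
    static final class TypeTag {
        final String name;
        TypeTag(String name) { this.name = name; }
    }

    /** Sequential lookup with the same shape as the loop in resolveUnion(). */
    static int linearIndexOf(List<TypeTag> branches, TypeTag wanted) {
        int i = 0;
        for (TypeTag branch : branches) {
            if (branch == wanted) return i; // the real code calls instanceOf(type, datum)
            i++;
        }
        return -1;
    }

    /** Builds the identity-keyed index once, e.g. when the union is created. */
    static Map<TypeTag, Integer> buildIndex(List<TypeTag> branches) {
        Map<TypeTag, Integer> index = new IdentityHashMap<>();
        for (int i = 0; i < branches.size(); i++) {
            index.put(branches.get(i), i);
        }
        return index;
    }

    /** Single map lookup; -1 means "not cached, fall back to the scan". */
    static int cachedIndexOf(Map<TypeTag, Integer> index, TypeTag wanted) {
        Integer i = index.get(wanted);
        return i == null ? -1 : i;
    }

    public static void main(String[] args) {
        List<TypeTag> branches = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            branches.add(new TypeTag("Type" + i));
        }
        Map<TypeTag, Integer> index = buildIndex(branches);
        TypeTag last = branches.get(39);
        System.out.println(linearIndexOf(branches, last)); // walks all 40 branches
        System.out.println(cachedIndexOf(index, last));    // one hash lookup
    }
}
```

Both lookups return the same index; the difference is that the scan cost grows with the number of branches while the map lookup does not. Because the map is keyed on identity, a branch object that is equal to but distinct from the cached one misses the cache, so a caller would still need the sequential scan as a fallback.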
Although the identity map provides a significant boost, we observed a further improvement (an extra 15% on top of that in some cases) by using a perfect hash map keyed on the schema names, which also removes some of the restrictions of relying on object identity. Unfortunately, we cannot contribute that implementation at this point, but we thought it would be a good idea to let users plug in alternative implementations of the indexing behavior, for example by adding the following static method to Schema:
public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory) {
  unionIndexCacheFactory = factory;
}
This is what the interface and identity hash map-based implementation would look like:
/**
 * A factory interface for creating UnionIndexCache instances.
 */
public static interface UnionIndexCacheFactory {
  UnionIndexCache createUnionIndexCache(List<Schema> types);

  /**
   * Used for caching schema indices within a union.
   */
  public static interface UnionIndexCache {
    void setTypeIndex(Schema schema, int index);

    /** Returns the cached index of the schema, or -1 if it is not cached. */
    int getTypeIndex(Schema schema);
  }
}

private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory {
  @Override
  public UnionIndexCache createUnionIndexCache(List<Schema> types) {
    return new UnionIndexCache() {
      private final IdentityHashMap<Schema, Integer> schemaToIndex =
          new IdentityHashMap<Schema, Integer>();

      @Override
      public void setTypeIndex(Schema schema, int index) {
        schemaToIndex.put(schema, index);
      }

      @Override
      public int getTypeIndex(Schema schema) {
        Integer index = schemaToIndex.get(schema);
        return index == null ? -1 : index;
      }
    };
  }
}
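As a rough illustration of what an alternative, user-supplied implementation might look like, the sketch below keys the cache on the type's full name instead of object identity. It is not the perfect-hash implementation mentioned above: it uses an ordinary HashMap, and NamedType, IndexCache, IndexCacheFactory and populate are hypothetical stand-ins that only mirror the shape of the proposed interfaces so the example compiles without Avro.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: all types here are stand-ins, not Avro classes (assumption). */
final class NameIndexSketch {

    /** Hypothetical stand-in for Schema; only the full name matters here. */
    static final class NamedType {
        final String fullName;
        NamedType(String fullName) { this.fullName = fullName; }
    }

    /** Same shape as the proposed UnionIndexCache, over the stand-in type. */
    interface IndexCache {
        void setTypeIndex(NamedType type, int index);
        int getTypeIndex(NamedType type);
    }

    /** Same shape as the proposed UnionIndexCacheFactory. */
    interface IndexCacheFactory {
        IndexCache createUnionIndexCache(List<NamedType> types);
    }

    /** Keys on the full name with a plain HashMap (not a perfect hash). */
    static final class NameMapIndexCacheFactory implements IndexCacheFactory {
        @Override
        public IndexCache createUnionIndexCache(List<NamedType> types) {
            final Map<String, Integer> nameToIndex = new HashMap<>();
            return new IndexCache() {
                @Override
                public void setTypeIndex(NamedType type, int index) {
                    nameToIndex.put(type.fullName, index);
                }

                @Override
                public int getTypeIndex(NamedType type) {
                    Integer index = nameToIndex.get(type.fullName);
                    return index == null ? -1 : index;
                }
            };
        }
    }

    /** Fills a cache with one entry per branch, as a union constructor might. */
    static IndexCache populate(IndexCacheFactory factory, List<NamedType> types) {
        IndexCache cache = factory.createUnionIndexCache(types);
        for (int i = 0; i < types.size(); i++) {
            cache.setTypeIndex(types.get(i), i);
        }
        return cache;
    }

    public static void main(String[] args) {
        List<NamedType> types = Arrays.asList(
                new NamedType("com.example.A"), new NamedType("com.example.B"));
        IndexCache cache = populate(new NameMapIndexCacheFactory(), types);
        // A distinct instance with the same name still resolves to its branch.
        System.out.println(cache.getTypeIndex(new NamedType("com.example.B")));
    }
}
```

Keying on the name means a separately parsed copy of the same schema still hits the cache, which is the identity restriction the name-based approach is meant to lift. The trade-off in this sketch is that each lookup hashes a string rather than using the object's identity hash.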
I will attach a patch later today or early tomorrow.
Thanks in advance,
Hernan Otero