Upon further analysis, the poor performance is due to (slowly) expanding the CAS heaps. If the XMI CAS contains hundreds of thousands to millions of 8-, 16-, or 64-bit feature values that are added roughly one (or a few) at a time during XMI deserialization, then the current implementation incurs a high CAS expansion cost.
The existing byte/short/long heap expansion algorithm is defined in CommonAuxHeap. The idea is that the heap grows exponentially until a threshold (DEFAULT_HEAP_MULT_LIMIT) is reached, after which exponential growth is replaced with linear growth. All that sounds reasonable, but DEFAULT_HEAP_MULT_LIMIT is only defined to be 1024. Once the array grows to 1024 entries, the algorithm only expands by 1024 entries at a time.
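To make the growth rule described above concrete, here is a minimal sketch. It is not the actual CommonAuxHeap code: the class name, the method name, and the starting capacity of 16 are assumptions made for illustration, and overflow handling is left out.

```java
// Illustrative sketch of the growth rule described above; NOT the real CommonAuxHeap code.
final class AuxHeapGrowthSketch {

    // Double the capacity while it is below the limit, then grow by the limit each time.
    static int nextCapacity(int current, int multLimit) {
        return (current < multLimit) ? current * 2 : current + multLimit;
    }

    public static void main(String[] args) {
        int capacity = 16; // assumed starting capacity, for illustration only
        for (int i = 0; i < 10; i++) {
            System.out.println(capacity);
            capacity = nextCapacity(capacity, 1024);
        }
    }
}
```

With a limit of 1024, the capacity doubles up to 1024 and from then on grows by only 1024 entries per expansion (2048, 3072, 4096, and so on).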
In our case, the XMI deserialization adds only a few feature values at a time, so the heap slowly expands to 1.5 million entries over about 18 seconds (for 5 million entries, it takes over 2 minutes).
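A rough way to see why this is slow is to count how many expansions, and how many copied entries, the growth rule implies for a given target size. The sketch below is a back-of-the-envelope simulation, not a measurement: the class and method names are hypothetical, the starting capacity of 16 is an assumption, and it assumes that every expansion copies the entire old array.

```java
// Back-of-the-envelope simulation; names and the starting capacity are assumptions.
final class ExpansionCostSketch {

    // Same rule as described above: double below the limit, then grow by the limit.
    static int nextCapacity(int current, int multLimit) {
        return (current < multLimit) ? current * 2 : current + multLimit;
    }

    // Returns {number of expansions, total entries copied} needed to reach 'target'.
    static long[] simulate(int target, int initialCapacity, int multLimit) {
        long expansions = 0;
        long copied = 0;
        int capacity = initialCapacity;
        while (capacity < target) {
            copied += capacity; // assume the full old array is copied on each expansion
            capacity = nextCapacity(capacity, multLimit);
            expansions++;
        }
        return new long[] {expansions, copied};
    }

    public static void main(String[] args) {
        int target = 1_500_000;
        for (int limit : new int[] {1024, 512 * 1024}) {
            long[] r = simulate(target, 16, limit);
            System.out.println("limit=" + limit
                + " expansions=" + r[0] + " entriesCopied=" + r[1]);
        }
    }
}
```

Under these assumptions, reaching 1.5 million entries takes well over a thousand expansions with a limit of 1024, versus fewer than twenty with a limit of 512K, and the number of copied entries grows quadratically with the target size during the linear phase.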
Furthermore, the byte/short/long heap seeding values (dealing with CAS expansion) are not currently configurable or exposed through the CAS (CASImpl) or CAS creation utility.
I think, minimally, we should consider increasing DEFAULT_HEAP_MULT_LIMIT to something a bit larger to allow for quicker expansion. The regular (32-bit) Heap uses a default size of 500,000. For what it's worth, I changed DEFAULT_HEAP_MULT_LIMIT to 512K, and that cut the deserialization time by 2/3 or more (from 18 seconds down to 6 seconds, and from 140 seconds down to 20 seconds). From a memory footprint perspective, that would allow exponential expansion up to 4 MB for the long heap (512K*8), and then linear expansion from there, which seems reasonable.
We could also consider exposing DEFAULT_HEAP_MULT_LIMIT as a settable property (something analogous to CAS_INITIAL_HEAP_SIZE), and allow the CAS creation utility to honor the requested limit.
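As a sketch of what such a setting might look like, the snippet below resolves the limit from a java.util.Properties object and falls back to the current default. The property key, class name, and method name are all hypothetical; nothing here is an existing UIMA API, and how the value would be passed through to the heaps is left open.

```java
import java.util.Properties;

// Hypothetical sketch only; the property key below is NOT an existing UIMA setting.
final class HeapLimitConfigSketch {

    static final String HEAP_MULT_LIMIT_KEY = "cas_aux_heap_mult_limit"; // assumed name
    static final int DEFAULT_HEAP_MULT_LIMIT = 1024; // current default, per the analysis above

    // Returns the requested limit if it is a positive integer, otherwise the default.
    static int resolveMultLimit(Properties settings) {
        if (settings == null) {
            return DEFAULT_HEAP_MULT_LIMIT;
        }
        String raw = settings.getProperty(HEAP_MULT_LIMIT_KEY);
        if (raw == null) {
            return DEFAULT_HEAP_MULT_LIMIT;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_HEAP_MULT_LIMIT;
        } catch (NumberFormatException e) {
            return DEFAULT_HEAP_MULT_LIMIT; // ignore malformed values
        }
    }

    public static void main(String[] args) {
        Properties settings = new Properties();
        settings.setProperty(HEAP_MULT_LIMIT_KEY, "524288");
        System.out.println(resolveMultLimit(settings));
    }
}
```

Falling back to the existing default on missing or malformed values would keep current behavior unchanged for callers that never set the property.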
It should also be noted that binary CAS deserialization does not suffer this same fate, as the total heap sizes are self-described within the binary blob, so they are known before the (heap) arrays are allocated.
And it's also worth noting that this overhead may or may not be problematic, depending on the various types of use cases.