Right now GroupStep is defined as:
Look at reduceTraversal. It takes a Collection<V> of "values" and reduces them to a "reduction" R. Why are we using Collection<V>? Why is this not:
Now, when a new K is created (and reduce is defined), we clone reduceTraversal. Thus, each key has its own reduceTraversal (identical clones) that operates in a stream-like fashion on V to yield R. This lets us remove the Collection<V> (a memory hog) and define GroupCountStep in terms of GroupStep without (or with only limited) computational cost. HOWEVER, this changes the API, as people who did this:
would now have to do this:
It's very minor, given the speedup we would gain and the ability to do "groupCount" efficiently on arbitrary values, not just bulks (e.g. sacks).
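
To make the per-key streaming idea concrete, here is a minimal sketch in plain Java. It is not TinkerPop code: Reducer, CountReducer and StreamingGroup are hypothetical names made up for illustration, and the Supplier only stands in for "clone reduceTraversal when a new K shows up". The point is that each key holds one small reducer rather than a Collection<V>, so a count-style reduction never stores the values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class StreamingGroupSketch {

    /** Hypothetical streaming reducer: consumes V one at a time and yields an R. */
    interface Reducer<V, R> {
        void accept(V value);
        R result();
    }

    /** Counts values as they arrive; no value is retained. */
    static final class CountReducer<V> implements Reducer<V, Long> {
        private long count = 0L;

        @Override
        public void accept(V value) {
            count++;
        }

        @Override
        public Long result() {
            return count;
        }
    }

    /** Groups (key, value) pairs while keeping only one reducer per key. */
    static final class StreamingGroup<K, V, R> {
        // Assumption: the supplier plays the role of cloning the reduce traversal.
        private final Supplier<Reducer<V, R>> prototype;
        private final Map<K, Reducer<V, R>> reducers = new HashMap<>();

        StreamingGroup(Supplier<Reducer<V, R>> prototype) {
            this.prototype = prototype;
        }

        /** Creates a fresh reducer the first time a key is seen, then streams the value in. */
        void add(K key, V value) {
            reducers.computeIfAbsent(key, k -> prototype.get()).accept(value);
        }

        /** Finalizes every per-key reducer into its reduction R. */
        Map<K, R> results() {
            Map<K, R> out = new HashMap<>();
            reducers.forEach((k, r) -> out.put(k, r.result()));
            return out;
        }
    }

    public static void main(String[] args) {
        StreamingGroup<String, String, Long> group =
                new StreamingGroup<String, String, Long>(CountReducer::new);
        group.add("person", "marko");
        group.add("software", "lop");
        group.add("person", "josh");
        System.out.println(group.results());
    }
}
```

In this sketch, a "groupCount" is just a group whose reducer is a counter, which mirrors the idea of defining GroupCountStep in terms of GroupStep. Memory use grows with the number of keys, not with the number of values.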