Yes, this is pretty much the top-n nodes reordering that I did, albeit without any optimization of how many n to take (the hardcoded magic constants should probably be justified somehow? Or replaced by a default in FST somewhere?).
Deciding how many nodes to reorder is I think hard – I failed to provide any sensible fast heuristic for that and I simply do a simulated annealing to find a local optimum.
Yeah the algo is simplistic now... and it does force caller to pick the params (though minInCountDeref=3 and maxDerefNodes=Inf are probably pretty good)... we can and should make it more sophisticated over time. We have at least one spare bit to still use in the arc flags
One thing I was wondering is why you decided to integrate the packer with the fst – wouldn't it be cleaner to separate packing from construction? Granted, this requires a double pass over the fst nodes and more intermediate memory but it wouldn't add any more complexity to the builder which is already, ehm, a bit complex . You can compare this design in Morfologik:
Well... it's tricky. First, I decided to change node targets to ords so that pack could use an array to (relatively!) efficiently hold node data, eg inCount, newAddress, etc. That required making the first pass FST "modal" (deref'd or not). If we didn't do this then the packer would have to use even less RAM efficient data structures (eg Map<Int,X>) I think?
Second, the format written by the packer is tightly coupled with the FST reading, ie there are sizable differences when reading packed vs unpacked FST.
But I agree it'd be cleaner if we could move packing out (eg Util.pack), and more strongly decouple packing from FST format...
One thing I'd really like to somehow do is create different classes for FST writing vs reading, and different classes for each format. We now have write-ord, write-non-ord, read-packed, read-unpacked (and even the two readers have 3 different modes depending on INPUT_TYPE).