The difference in build time is surprising to me. Any theory why SynonymFilterFactory takes so much more time to build?
Yes, it's the n^2 portion, where you have a synonym entry like this: a, b, c, d
In reality this creates entries like this:
a -> a
a -> b
a -> c
a -> d
b -> a
b -> b
... (and so on, 16 entries in total for these 4 terms)
In the current impl, this is done using some inefficient data structures (like nested CharArrayMaps with Token),
as well as calling merge().
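To make the n^2 growth concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names are made up for illustration, and it only shows how one synonym group expands into every input -> output pair.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not part of Lucene or Solr.
final class SynonymExpansion {
    // Expand one comma-separated synonym group into every (input, output) pair,
    // including the identity pairs such as a -> a.
    static List<String[]> expand(String rule) {
        String[] terms = rule.trim().split("\\s*,\\s*");
        List<String[]> pairs = new ArrayList<>();
        for (String in : terms) {
            for (String out : terms) {
                pairs.add(new String[] {in, out});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String[]> pairs = expand("a, b, c, d");
        System.out.println(pairs.size() + " entries"); // 4 terms -> 16 entries
        for (String[] p : pairs) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

A group of n terms yields n * n pairs, so any per-pair overhead at build time is multiplied accordingly.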
In the FST impl, we don't use any nested structures (instead, input and output entries are just phrases), and we explicitly
deduplicate both inputs and outputs during construction. The FST output is just a
List<Integer>, basically pointing to ords in the deduplicated BytesRefHash.
So during construction, when you add(), it's just a HashMap lookup on the input phrase, a BytesRefHash get/put (via UTF16toUTF8WithHash)
to get the output ord, and an append to an ArrayList.
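Roughly, the add() path described above could be sketched like this. It is only an illustration using plain java.util collections: a HashMap stands in for the BytesRefHash, String phrases stand in for the UTF-8 bytes, and the class and method names are hypothetical rather than the actual builder API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only, not the real Lucene builder.
final class DedupSynonymBuilder {
    // Stand-in for the deduplicating BytesRefHash: each distinct output phrase gets one ord.
    private final Map<String, Integer> outputOrds = new HashMap<>();
    private final List<String> outputsByOrd = new ArrayList<>();
    // One entry per distinct input phrase; the value is the list of output ords.
    private final Map<String, List<Integer>> entries = new HashMap<>();

    void add(String input, String output) {
        // get/put on the output table: reuse the ord if this phrase was seen before.
        Integer ord = outputOrds.get(output);
        if (ord == null) {
            ord = outputsByOrd.size();
            outputOrds.put(output, ord);
            outputsByOrd.add(output);
        }
        // Hash lookup on the input phrase, then a plain append of the ord.
        entries.computeIfAbsent(input, k -> new ArrayList<>()).add(ord);
    }

    List<Integer> ordsFor(String input) {
        return entries.getOrDefault(input, Collections.<Integer>emptyList());
    }

    String output(int ord) {
        return outputsByOrd.get(ord);
    }

    int distinctOutputs() {
        return outputsByOrd.size();
    }
}
```

Each output phrase is stored once no matter how many inputs map to it, so the per-input data is only a short list of integers instead of nested maps of tokens.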
This code isn't really optimized right now, and we can definitely speed it up even more in the future. But the main thing
right now is to ensure the filter performance is good.