I think the branch is ready to land... I'll post an applyable patch
In Mode.SEARCH the tokenizer produces the same tokens as current
The only real end-user visible change is the addition of
Mode.SEARCH_WITH_COMPOUNDS, which can produce two paths (compound
token + its segmentation). This mode uses the new
PositionLengthAttribute to record how "long" the compound token is.
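
To make the "how long" idea concrete, here is a minimal standalone sketch in plain Java, with no Lucene dependency. The Token class, the romanized terms, and the token order are all assumptions made up for illustration; they are not the tokenizer's actual output. Each token advances the position by its position increment and spans as many positions as its position length says:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionLengthSketch {
    // Hypothetical stand-in for a token carrying a position increment
    // and a position length (not a Lucene class).
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    // Assumed example stream: the compound sits at the same position as its
    // first part (increment 0) and spans all three parts (length 3).
    static List<Token> example() {
        return Arrays.asList(
            new Token("kansai", 1, 1),
            new Token("kansaikokusaikuukou", 0, 3),
            new Token("kokusai", 1, 1),
            new Token("kuukou", 1, 1));
    }

    // Renders each token as "term[start->end]" in position space.
    static List<String> spans(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            int end = position + t.positionLength;
            out.add(t.term + "[" + position + "->" + end + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : spans(example())) {
            System.out.println(s);
        }
    }
}
```

In this sketch the compound covers positions 0 to 3 while its three parts cover 0 to 1, 1 to 2 and 2 to 3, so both paths start and end at the same positions.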
In this mode, the Viterbi search first runs without penalties, but
then, if a too-long token (a token where the penalty would have been >
0) is in the best path, we effectively re-run the Viterbi under that
compound token, this time with penalties included. If this results in
a different backtrace, we add that into the output tokens as well.
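
The two-pass idea can be sketched roughly as follows. This is a toy under stated assumptions, not the branch's code: it uses a unigram cost model with no connection costs, so it does not model the "in context" constraint from the neighboring tokens, and the dictionary, LENGTH_THRESHOLD and PENALTY_PER_CHAR are hypothetical values picked only so the example splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class TwoPassViterbiSketch {
    static final int LENGTH_THRESHOLD = 8;   // hypothetical: longer tokens are "too long"
    static final int PENALTY_PER_CHAR = 10;  // hypothetical cost per extra character

    static int penalty(String word) {
        return Math.max(0, word.length() - LENGTH_THRESHOLD) * PENALTY_PER_CHAR;
    }

    // Hypothetical toy dictionary: word -> cost (lower is better).
    static Map<String, Integer> exampleDict() {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("kansai", 30);
        dict.put("kokusai", 30);
        dict.put("kuukou", 30);
        dict.put("kansaikokusaikuukou", 50);
        dict.put("ni", 10);
        return dict;
    }

    // Minimum-cost segmentation of text[start, end) using dictionary words only.
    static List<String> viterbi(String text, int start, int end,
                                Map<String, Integer> dict, boolean usePenalty) {
        int n = end - start;
        int[] best = new int[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int i = 0; i < n; i++) {
            if (best[i] == Integer.MAX_VALUE) continue;  // unreachable offset
            for (int j = i + 1; j <= n; j++) {
                String w = text.substring(start + i, start + j);
                Integer c = dict.get(w);
                if (c == null) continue;
                int cost = best[i] + c + (usePenalty ? penalty(w) : 0);
                if (cost < best[j]) {
                    best[j] = cost;
                    back[j] = i;
                }
            }
        }
        if (best[n] == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("no segmentation found");
        }
        LinkedList<String> path = new LinkedList<>();
        for (int j = n; j > 0; j = back[j]) {
            path.addFirst(text.substring(start + back[j], start + j));
        }
        return path;
    }

    // Each entry holds a best-path token, followed by its 2nd segmentation
    // when the penalized re-run under that token gives a different backtrace.
    static List<List<String>> segment(String text, Map<String, Integer> dict) {
        List<List<String>> out = new ArrayList<>();
        int offset = 0;
        for (String token : viterbi(text, 0, text.length(), dict, false)) {
            List<String> entry = new ArrayList<>();
            entry.add(token);
            if (penalty(token) > 0) {
                List<String> second =
                    viterbi(text, offset, offset + token.length(), dict, true);
                if (!second.equals(Collections.singletonList(token))) {
                    entry.addAll(second);
                }
            }
            out.add(entry);
            offset += token.length();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("kansaikokusaikuukouni", exampleDict()));
    }
}
```

With the toy dictionary above, the first pass keeps the compound (cost 50 vs 90 for the split), and the penalized re-run under it prefers the three parts (90 vs 160), so both are emitted.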
Note that this will not produce the same results as Mode.SEARCH,
because the 2nd segmentation runs "in context" of the best path,
meaning the chosen best wordIDs before and after the compound token are
"enforced" in the 2nd segmentation. Sometimes this means the 2nd pass
still picks only the compound token where trunk today would have split it
up. From TestQuality, the total number of edits was 4418 vs trunk's
I didn't explore this, but we may want to use harsher penalties in
SEARCH_WITH_COMPOUNDS mode, i.e., since we're going to output the
compound anyway, we may as well "try harder" to produce the 2nd best
I left the default mode as Mode.SEARCH... maybe if we can somehow
run some relevance tests we can make the default SEARCH_WITH_COMPOUNDS.
But it'd also be tricky at query time...
It looks like the rolling Viterbi is a bit faster (~16%: 1700
bytes/msec vs 1460 bytes/msec before, on TestQuality.testSingleText).