Here's a patch to speed up the spellchecker build.
- I wired the default RamMB to IWConfig's default
- I didn't mess with the mergefactor for now (because the default is still to optimize)
- but I added an additional 'optimize' parameter so you can update your spellcheck index without re-optimizing.
- when updating, I changed the exists() check to work per-segment, so it's reasonable even if the index isn't optimized.
- the exists() check now bypasses the term dictionary cache, which is pointless for this lookup and just slows it down.
- we skip all of the exists() logic if the index is empty (I think this is the case for Solr, which completely rebuilds
and doesn't do an incremental update)
- the startXXX, endXXX, and word fields can only contain one term per document. I turned off norms, positions,
and tf for these.
- the gramXXX field is unchanged, I didn't want to change spellchecker scoring in any way. But in the future we could
reasonably omit norms here too, since I think the field is going to be very short.
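
To illustrate the per-segment exists() idea and the empty-index shortcut from the list above, here is a minimal sketch in plain Java. It does not use the Lucene API: each segment is modeled as a plain Set<String>, and the class and method names are made up for illustration, not taken from the patch.

```java
import java.util.List;
import java.util.Set;

public class PerSegmentExists {
    // Hypothetical stand-in: each "segment" is just the set of words it holds.
    static boolean exists(List<Set<String>> segments, String word) {
        // An empty index cannot contain the word, so skip the lookups entirely.
        if (segments.isEmpty()) {
            return false;
        }
        // Probe each segment on its own and stop at the first hit,
        // so the check does not require a single optimized segment.
        for (Set<String> segment : segments) {
            if (segment.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> segments = List.of(
                Set.of("apple", "banana"),
                Set.of("cherry"));
        System.out.println(exists(segments, "cherry"));  // found in the second segment
        System.out.println(exists(segments, "durian"));  // not in any segment
        System.out.println(exists(List.of(), "apple"));  // empty index, no lookups done
    }
}
```

The early return on an empty list mirrors the "index is empty" case: a full rebuild never pays for the existence checks.
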
before the patch:
scratch build time: 229,803ms
index size: 214,322,200 bytes
no-op update time (updating but there are no new terms to add): 4,619ms

with the patch:
scratch build time: 99,214ms
index size: 177,781,273 bytes
no-op update time: 2,504ms
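
For a quick sense of scale, the two sets of numbers above can be turned into ratios. This small sketch assumes the first set was measured before the patch and the second set with the patch applied; the class and method names are only illustrative.

```java
public class SpellcheckBuildNumbers {
    // Ratio of the old time to the new time (2.0 means twice as fast).
    static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    // Fraction of the original size that was saved (0.25 means 25% smaller).
    static double reduction(long beforeBytes, long afterBytes) {
        return (double) (beforeBytes - afterBytes) / beforeBytes;
    }

    public static void main(String[] args) {
        // Figures copied from the measurements above.
        System.out.printf("scratch build speedup: %.2fx%n", speedup(229_803, 99_214));
        System.out.printf("no-op update speedup: %.2fx%n", speedup(4_619, 2_504));
        System.out.printf("index size reduction: %.1f%%%n",
                100 * reduction(214_322_200L, 177_781_273L));
    }
}
```

Under that assumption the scratch build is roughly 2.3x faster, the no-op update roughly 1.8x faster, and the index about 17% smaller.
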
I still left the optimize default on, but really I think most users (e.g. Solr) should set
mergefactor to something a bit more reasonable and set optimize to false. The scratch build
is then much faster (60,000ms), but the no-op update time is heavier (e.g. 16,000ms). Still,
if you are rebuilding on every commit for smallish updates, something like 20-30 seconds
is a lot better than 100 seconds. For now I kept the defaults as is (optimizing every time).