However, given how the array grows, it's unlikely to land on 2.1B exactly, so it would probably error out sometime before that.
Actually ArrayUtil.grow is careful about this limit: on that final
grow() it'll go right up to Java's max allowed array size.
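To illustrate the behavior described above, here's a minimal sketch of capped geometric growth in the spirit of ArrayUtil.grow; the growth factor and headroom constant here are illustrative, not Lucene's exact values:

```java
// Sketch of geometric array growth that caps at the JVM's max array size,
// modeled on the behavior described above (not Lucene's actual code).
public class CappedGrowth {
  // JVMs typically refuse array lengths right at Integer.MAX_VALUE;
  // a small headroom is left for the array object header.
  static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

  static int oversize(int minSize) {
    int newSize = minSize + (minSize >> 1); // grow by ~1.5x
    if (newSize < 0 || newSize > MAX_ARRAY_LENGTH) {
      // On that final grow, go right up to the max allowed size
      // (newSize < 0 means the multiplication overflowed int).
      return MAX_ARRAY_LENGTH;
    }
    return newSize;
  }

  public static void main(String[] args) {
    System.out.println(oversize(1000));          // normal 1.5x growth
    System.out.println(oversize(2_000_000_000)); // capped at the limit
  }
}
```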
I'm also building up a 2B terms index (using Test2BTerms), and then I'll compare patch/3.x on that index.
OK this finished – the test passed with the patch (good news!).
With 3.x, IR.open takes 43.69 seconds and uses 2955 MB of heap.
With the patch, IR.open takes 9.94 seconds (4.4X faster) and uses 505
MB of heap (5.9X less): AWESOME!
The test then does a lookup of a random set of terms. 3.x does this
in 51.2 sec; patch does it in 48.5 sec, good! (Same set of terms).
I can back port PagedBytes instead if you think it's really needed.
I think we should cutover to PagedBytes. Today the number of terms we
can support is 2.1B times the index interval (default 128), so ~274.9 B total terms.
But with the current patch, we can roughly estimate the bytes per indexed term:
- 15 bytes for term UTF8 bytes (non-English content)
- 1 byte for docFreq (vast majority of terms are < 128 df)
- 1 byte for skipOffset (vast majority of terms have no skip).
- 4 bytes for indexToTerms entry
So total ~37 bytes per indexed term, which means ~58.0 M indexed terms
can fit in the 2.1B byte limit, or 7.4 B total terms at the default
128 index interval. This makes me a little nervous... we've already
seen apps that are well over 2.1 B terms.
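The capacity arithmetic above is easy to check; this little back-of-the-envelope program redoes it, using Integer.MAX_VALUE as the single-array byte cap:

```java
// Sanity-check of the estimate above: with ~37 bytes per indexed term,
// how many terms fit under a single byte[]'s ~2.1B length cap?
public class TermCapacity {
  public static void main(String[] args) {
    long byteLimit = Integer.MAX_VALUE; // ~2.1B, max size of one byte[]
    int bytesPerIndexedTerm = 37;       // estimate from the list above
    int indexInterval = 128;            // default index interval

    long indexedTerms = byteLimit / bytesPerIndexedTerm;
    long totalTerms = indexedTerms * indexInterval;

    System.out.println(indexedTerms); // ~58.0 M indexed terms
    System.out.println(totalTerms);   // ~7.4 B total terms
  }
}
```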
Even before the 2.1B limit, it makes me nervous relying on the JRE to
allocate such a large contiguous chunk of RAM.
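For anyone following along, the paged approach avoids exactly that contiguous allocation. This is a minimal sketch of the idea only – class and method names here are made up, and it is not Lucene's actual PagedBytes API:

```java
// Sketch of the paged-bytes idea: bytes live in fixed-size pages and are
// addressed by a long offset, so no single contiguous 2.1B byte[] is ever
// allocated and total capacity can exceed Integer.MAX_VALUE bytes.
import java.util.ArrayList;
import java.util.List;

public class SimplePagedBytes {
  private final int pageBits; // page size = 1 << pageBits
  private final int pageMask;
  private final List<byte[]> pages = new ArrayList<>();
  private long size = 0;

  public SimplePagedBytes(int pageBits) {
    this.pageBits = pageBits;
    this.pageMask = (1 << pageBits) - 1;
  }

  // Append one byte, allocating a new page only when the current one fills.
  public long append(byte b) {
    int offsetInPage = (int) (size & pageMask);
    if (offsetInPage == 0) {
      pages.add(new byte[1 << pageBits]);
    }
    pages.get(pages.size() - 1)[offsetInPage] = b;
    return size++;
  }

  // Random access by long offset: pick the page, then the slot within it.
  public byte get(long offset) {
    return pages.get((int) (offset >>> pageBits))[(int) (offset & pageMask)];
  }

  public long size() {
    return size;
  }
}
```

The page and mask arithmetic (shift/and instead of divide/modulo) keeps lookups cheap, which matters on the term-lookup hot path.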
A couple other random things I noticed:
- When we estimate the initial size of the byte[] (based on the .tii
  file size), I think we should divide by indexDivisor?
- We should conditionally write the skipOffset, only when docFreq is
>= skipInterval. Since most terms won't have skip data, we can
save 1 byte for them...
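Something like this, roughly – a hypothetical sketch of the conditional write; the class, method, and field names are illustrative and the byte-wide writes stand in for the real vInt encoding:

```java
// Hypothetical sketch of the proposed change: only write skipOffset when
// the term actually has skip data (docFreq >= skipInterval). Names here
// are made up for illustration, not Lucene's real writer API.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TermInfoWriterSketch {
  final int skipInterval;

  TermInfoWriterSketch(int skipInterval) {
    this.skipInterval = skipInterval;
  }

  void writeTerm(DataOutputStream out, int docFreq, int skipOffset)
      throws IOException {
    out.writeByte(docFreq);        // stand-in for a real vInt encoding
    if (docFreq >= skipInterval) {
      out.writeByte(skipOffset);   // written only when skip data exists
    }
    // ...other per-term fields would follow here
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    TermInfoWriterSketch w = new TermInfoWriterSketch(16);
    w.writeTerm(new DataOutputStream(buf), 3, 0);  // no skip data: 1 byte
    w.writeTerm(new DataOutputStream(buf), 40, 7); // has skip data: 2 bytes
    System.out.println(buf.size()); // 3
  }
}
```

The reader side would apply the same docFreq >= skipInterval check to know whether a skipOffset follows.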