Hmm so we also copy-on-write a given byte block? Is
this because JMM can't make the guarantees we need about other
threads reading the bytes written?
Correct. The example of where everything could go wrong is the
rewriting of a byte slice forwarding address while a reader is
traversing the same slice. The forwarding address could be
half-written, and suddenly we're bowling in lane 6 when we
should be in lane 9. By making a [read-only] ref copy of the
bytes we're ensuring that the bytes are in a consistent
state while being read.
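A minimal sketch of that fix (class and method names here are hypothetical, not Lucene's actual code): the writer rewrites forwarding addresses in place, but a reader only ever traverses a read-only copy taken while the slice is in a consistent state, so it can never observe a half-written address.

```java
import java.util.Arrays;

// Hypothetical sketch: readers never traverse the mutable slice
// directly; they get an immutable copy taken under the write lock,
// so a half-rewritten forwarding address is never visible.
class ByteSliceSnapshot {
    private final byte[] slice;        // mutable, owned by the writer
    private final Object writeLock = new Object();

    ByteSliceSnapshot(int size) {
        this.slice = new byte[size];
    }

    // Writer: rewrite a 4-byte forwarding address in place.
    void writeForwardingAddress(int offset, int address) {
        synchronized (writeLock) {
            slice[offset]     = (byte) (address >>> 24);
            slice[offset + 1] = (byte) (address >>> 16);
            slice[offset + 2] = (byte) (address >>> 8);
            slice[offset + 3] = (byte)  address;
        }
    }

    // Reader: take a consistent, read-only copy that can be traversed
    // freely with no risk of a torn read.
    byte[] snapshot() {
        synchronized (writeLock) {
            return Arrays.copyOf(slice, slice.length);
        }
    }
}
```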
So I'm using a boolean to tell the writer whether it needs to
make a copy of the byte block; the boolean also tells the writer
if it's already made a copy. Whereas in IndexReader.clone we're
keeping ref counts of the norms bytes, decrementing each
time a clone is closed until finally it's 0, and then we give it to
the GC (here we'd do the same, or give it back to the allocator).
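Roughly, the two bookkeeping schemes side by side (a sketch with hypothetical names, not the actual IndexReader.clone code): a per-block copied flag for the writer, plus ref counts that tell us when the block can go to the GC or back to the allocator.

```java
// Hypothetical sketch of the two schemes above: the writer consults a
// per-block flag to decide whether copy-on-write is still needed, while
// ref counts track readers; the block is released (to the GC, or back
// to an allocator) once the last reference is dropped.
class RefCountedBlock {
    final byte[] bytes;
    private int refCount = 1;        // the writer's own reference
    private boolean copied = false;  // has the writer already copied this block?

    RefCountedBlock(byte[] bytes) { this.bytes = bytes; }

    // Writer side: copy the block at most once per generation; later
    // writes can mutate the private copy directly.
    synchronized RefCountedBlock copyForWrite() {
        if (copied) return this;     // already private, safe to mutate
        RefCountedBlock copy = new RefCountedBlock(bytes.clone());
        copy.copied = true;
        return copy;
    }

    synchronized void incRef() { refCount++; }

    // Returns true when no reference remains, i.e. the block can be
    // handed to the GC or checked back into the allocator.
    synchronized boolean decRef() {
        return --refCount == 0;
    }
}
```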
But even if we do reuse, we will cause tons of garbage
until the still-open readers are closed? Ie we cannot re-use the
bytes being "held open" by any NRT reader that's still
referencing the in-RAM segment after that segment has been
flushed to disk.
If we do pool, it won't be very difficult to implement: we have
a single point of check-in/check-out of the bytes in the allocator.
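That single check-in/check-out point could look something like this (a hedged sketch; `BlockAllocator` and its methods are made up for illustration):

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Hypothetical byte-block allocator: pooling stays simple because every
// block passes through one check-out / check-in point.  Checked-in
// blocks are reused; otherwise a fresh block is allocated.
class BlockAllocator {
    private final int blockSize;
    private final ArrayDeque<byte[]> pool = new ArrayDeque<>();

    BlockAllocator(int blockSize) { this.blockSize = blockSize; }

    synchronized byte[] checkOut() {
        byte[] b = pool.poll();
        return b != null ? b : new byte[blockSize];
    }

    synchronized void checkIn(byte[] block) {
        Arrays.fill(block, (byte) 0);  // scrub before reuse
        pool.push(block);
    }

    synchronized int pooled() { return pool.size(); }
}
```

Blocks still referenced by an open NRT reader simply aren't checked back in until that reader closes, which is exactly the garbage/reuse trade-off above.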
In terms of the first implementation, by all means we should
minimize "tricky" areas of the code by not implementing skip
lists and byte pooling.
It's not like 3.x's situation with FieldCache or terms
dict index, for example....
What's the GC issue with FieldCache and terms dict?
BTW I'm assuming IW will now be modal? Ie caller must
tell IW up front if NRT readers will be used? Because non-NRT
users shouldn't have to pay all this added RAM cost?
At present it's still all on demand. Skip lists will require
going modal because we need to build them upfront (well, we
could go back and build them on demand; that'd be fun). There's
also the term-freq parallel array; however, if getReader is never
called, it's a single additional array that's essentially
innocuous, if useful.
Hmm you're right that each reader needs a private copy,
to remain truly "point in time". This (4 bytes per unique term X
number of readers reading that term) is a non-trivial addition.
PagedInt time? However even that's not going to help much if, in
between getReader calls, 10,000s of terms were seen; we could
have updated 1000s of pages. AtomicIntegerArray does not help
because concurrency isn't the issue, it's point-in-timeness
that's required. Still I guess PagedInt won't hurt, and in the
case of minimal term-freq changes, we'd still be potentially
saving RAM. Is there some other data structure we could pull out
of a hat and use?
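For concreteness, here's one way the PagedInt idea could work (a sketch under my own assumptions; `PagedInts` and its API are hypothetical): getReader hands out a point-in-time view that shares pages with the writer, and the writer copies a page only when it first touches one a reader can see.

```java
import java.util.Arrays;

// Hypothetical PagedInt sketch: the term-freq array is split into
// pages, and a reader view shares unmodified pages with the writer.
// Only pages touched after the view was taken are copied, so with few
// term-freq changes most pages stay shared and RAM is saved.
class PagedInts {
    static final int PAGE_SIZE = 4;  // tiny, just for the example
    private int[][] pages;
    private boolean[] shared;        // is this page visible to a reader view?

    PagedInts(int size) {
        int n = (size + PAGE_SIZE - 1) / PAGE_SIZE;
        pages = new int[n][PAGE_SIZE];
        shared = new boolean[n];
    }

    void set(int index, int value) {
        int p = index / PAGE_SIZE;
        if (shared[p]) {                 // copy-on-write: never mutate a
            pages[p] = pages[p].clone(); // page a point-in-time view sees
            shared[p] = false;
        }
        pages[p][index % PAGE_SIZE] = value;
    }

    int get(int index) {
        return pages[index / PAGE_SIZE][index % PAGE_SIZE];
    }

    // Point-in-time view: shares every page with the writer; from now
    // on the writer must copy a page before modifying it.
    PagedInts getReaderView() {
        PagedInts view = new PagedInts(0);
        view.pages = pages.clone();      // shallow copy: pages are shared
        view.shared = new boolean[pages.length];
        Arrays.fill(shared, true);
        return view;
    }
}
```

This matches the worry above: if 10,000s of terms changed between getReader calls, nearly every page gets copied anyway, so the sharing only pays off when few term freqs change.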