TermVectorsTermsWriter has the same issue.
You're right: with "irregular" sized documents coming through, you can
end up with PerDoc instances that waste space, because the RAMFile has
buffers allocated from past huge docs that the latest tiny docs don't need.
Note that the number of outstanding PerDoc instances is a function of
how "out of order" the docs are being indexed, because the PerDoc
holds any state only until that doc can be written to the store files
(stored fields, term vectors). It's transient.
EG with a single thread, there will only be one PerDoc – it's written
immediately. With 2 threads, if you have a massive doc (which thread
1 gets stuck indexing) and then zillions of tiny docs (which thread 2
burns through, while thread 1 is busy), then you can get a large
number of PerDocs created, waiting for their turn because thread 1
hasn't finished yet.
But this process won't use unbounded RAM – the RAM used by the
RAMFiles is accounted for, and once it gets too high (10% of the RAM
buffer size), we forcefully idle the incoming threads until the "out
of orderness" is resolved. EG in this case, thread 2 will stall until
thread 1 has finished its doc. That byte accounting does include the
bytes that are allocated but not yet used inside each RAMFile.
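A minimal sketch of that stall logic, with illustrative names (ramBufferSizeBytes, waitingPerDocBytes, and the class itself are assumptions for this sketch, not DocumentsWriter's actual fields):

```java
// Hypothetical sketch of the wait-queue accounting: once the PerDocs
// waiting for their turn hold more than 10% of the RAM buffer, incoming
// threads are stalled until the out-of-orderness resolves.
public class StallSketch {
    static final double WAIT_QUEUE_FRACTION = 0.10; // 10% of the RAM buffer

    final long ramBufferSizeBytes;
    long waitingPerDocBytes; // RAM held by out-of-order PerDocs (incl. unused buffer tails)

    StallSketch(long ramBufferSizeBytes) {
        this.ramBufferSizeBytes = ramBufferSizeBytes;
    }

    // True when an incoming thread should idle until the waiting
    // PerDocs have been written to the store files.
    boolean shouldStall() {
        return waitingPerDocBytes > WAIT_QUEUE_FRACTION * ramBufferSizeBytes;
    }
}
```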
So... this is not really a memory leak. But it is a potential
starvation issue, in that if your PerDoc instances all grow to large
RAMFiles over time (as each has had to service a very large document),
then it can mean the amount of concurrency that DW allows will become
"pinched". Especially if these docs are large relative to your
ram buffer size.
Are you hitting this issue? Ie seeing poor concurrency during
indexing despite using many threads, because DW is forcefully idling
the threads? It should only happen if you sometimes index docs
that are larger than RAMBufferSize/10/numberOfIndexingThreads.
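To put a concrete (illustrative) number on that threshold:

```java
public class ThresholdExample {
    // Approximate per-doc size above which forced idling can start:
    // ramBufferSize / 10 / numberOfIndexingThreads
    static long stallThresholdBytes(long ramBufferSizeBytes, int numThreads) {
        return ramBufferSizeBytes / 10 / numThreads;
    }

    public static void main(String[] args) {
        // With a 16 MB RAM buffer and 4 indexing threads, docs larger
        // than roughly 410 KB can start triggering stalls:
        System.out.println(stallThresholdBytes(16L * 1024 * 1024, 4)); // 419430 bytes
    }
}
```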
I'll work out a fix. I think we should fix RAMFile.reset to trim its
buffers using ArrayUtil.getShrinkSize.
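Something along these lines, as a rough self-contained sketch (the real RAMFile internals differ, and this getShrinkSize only mimics the spirit of ArrayUtil's sizing logic):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed fix: reset() keeps the file reusable but trims
// excess buffers, so buffers allocated for a past huge doc are released.
public class RAMFileSketch {
    static final int BUFFER_SIZE = 1024;
    final List<byte[]> buffers = new ArrayList<>();
    long length; // logical bytes in use

    // Allocate buffers as content is appended (content itself elided).
    void append(int numBytes) {
        length += numBytes;
        while ((long) buffers.size() * BUFFER_SIZE < length) {
            buffers.add(new byte[BUFFER_SIZE]);
        }
    }

    // Only shrink when the target is well below what's allocated, to
    // avoid thrashing on alternating big/small docs.
    static int getShrinkSize(int currentSize, int targetSize) {
        return targetSize < currentSize / 2 ? targetSize : currentSize;
    }

    // Before the fix, reset() would keep every allocated buffer forever.
    void reset() {
        length = 0;
        int newSize = getShrinkSize(buffers.size(), 1); // shrink toward one buffer
        while (buffers.size() > Math.max(newSize, 1)) {
            buffers.remove(buffers.size() - 1);
        }
    }
}
```

With this, a PerDoc that once serviced a huge doc no longer pins all that RAM across subsequent tiny docs.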