At a minimum, I think we should make the SegmentWriteState accessible before committing.
OK. Will that be the subject of a new Jira?
No, I mean we shouldn't commit this patch until SegmentWriteState is
accessible when creating the FuzzySet. Couldn't we just pass it to
BloomFilterFactory.getSetForField? That way, if the app knows it's a
PK field, it can use maxDoc to size an appropriate bit set up front.
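For illustration, the sizing math that a maxDoc-aware factory could apply might look like this. The class and method names here are hypothetical stand-ins for the sketch, not the actual patch's API; only the standard bloom-filter formula m = -n·ln(p)/(ln 2)² is assumed:

```java
// Hypothetical sketch: sizing a bloom filter's bit set up front for a
// primary-key field, where unique terms == maxDoc (one PK per document).
public class BloomSizing {

    /** Bits needed for n entries at false-positive rate p: m = -n*ln(p)/(ln 2)^2. */
    static long optimalBitCount(long numEntries, double falsePositiveRate) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-numEntries * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    /** Round up to a power of two, as bit sets are commonly sized. */
    static long nextPowerOfTwo(long v) {
        long p = 1;
        while (p < v) p <<= 1;
        return p;
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000;  // stand-in for SegmentWriteState.maxDoc
        long bits = nextPowerOfTwo(optimalBitCount(maxDoc, 0.01));
        System.out.println("bits=" + bits + " (" + (bits / 8 / 1024) + " KB)");
    }
}
```

With maxDoc known, the set is sized once and never silently under- or over-allocated.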
I think we are in agreement on the broad principles. The fundamental question here, though, is: do you want to treat an index's choice of hash algorithm as something that requires a new SPI-registered PostingsFormat to decode, or can it be handled, as I have done here, with a general-purpose SPI framework for hashing algorithms?
+1, that's exactly the question.
Ie, where to draw the line between "config of an existing PF" and "a new PF that must be registered with SPI".
But I guess swapping in a different hash impl should be seen as a simple
config change, so I think using SPI to find it at read time is OK.
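A read-time SPI lookup with a hardwired fallback might look roughly like this. The HashFunction interface, the "fnv1a" default, and the resolver are illustrative stand-ins of my own, not Lucene classes:

```java
import java.util.ServiceLoader;

// Sketch of resolving a hash implementation by name via Java SPI at read
// time, falling back to a built-in default when nothing is registered.
public class HashResolver {

    public interface HashFunction {
        String name();
        int hash(byte[] bytes);
    }

    /** Built-in default: 32-bit FNV-1a (a simple, well-known hash). */
    static final HashFunction DEFAULT = new HashFunction() {
        public String name() { return "fnv1a"; }
        public int hash(byte[] bytes) {
            int h = 0x811c9dc5;
            for (byte b : bytes) {
                h ^= (b & 0xff);
                h *= 0x01000193;
            }
            return h;
        }
    };

    /** Scan SPI-registered providers for a matching name; fall back to the default. */
    static HashFunction resolve(String name) {
        for (HashFunction f : ServiceLoader.load(HashFunction.class)) {
            if (f.name().equals(name)) return f;
        }
        return DEFAULT;  // no provider registered under that name
    }

    public static void main(String[] args) {
        // With no META-INF/services registration, this falls back to the default.
        System.out.println("resolved: " + resolve("murmurhash2").name());
    }
}
```

The index would record only the hash's name; decoding stays within the same PostingsFormat.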
I still don't like how trappy this approach is: the hardwired default size
(8 MB) can be way too big (silently slowing down your NRT reopens,
especially if you bloom all fields) or way too small (silently turning
off the bloom filter for fields with too many unique terms).
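To put numbers on the trap: assuming a single hash function (my simplifying assumption for illustration), the false-positive rate for n unique terms in an m-bit set is roughly 1 - e^(-n/m), so a fixed 8 MB set is simultaneously wasteful for small fields and useless for huge ones:

```java
// Back-of-envelope check of a fixed 8 MB bit set, assuming one hash function.
public class FixedSizeTrap {

    /** Approximate false-positive rate with a single hash: 1 - e^(-n/m). */
    static double falsePositiveRate(long numTerms, long numBits) {
        return 1.0 - Math.exp(-(double) numTerms / numBits);
    }

    public static void main(String[] args) {
        long bits = 8L * 1024 * 1024 * 8;  // 8 MB hardwired default, in bits
        // Tiny field: 8 MB of heap per segment buys a vanishingly small fpp.
        System.out.printf("1K terms:   fpp=%.2e%n", falsePositiveRate(1_000, bits));
        // Huge field: the set saturates and the filter stops helping at all.
        System.out.printf("200M terms: fpp=%.3f%n", falsePositiveRate(200_000_000, bits));
    }
}
```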
I also don't think this PF should be per-field: we have
PerFieldPostingsFormat for that, and if there are limitations in PFPF,
we should address them rather than making all future PFs
handle per-field-ness themselves. This PF should really handle just one
configuration, applied to whatever fields it is given.
But I don't think these issues need to hold up committing (except for
making SegmentWriteState accessible)... we can improve over time. I
think we may eventually want to simply fold this into the terms dict somehow.
Can you add @lucene.experimental to all the new APIs?