but any facet accumulation which would rely on document scores would be hit by the second as the scores
That's a great point, Gilad. We need a test which covers that with the random sampling collector.
Is there a reason to add more randomness to one test?
It depends. I have a problem with numDocs=10,000 and the percentage being 10% .. it creates too-perfect numbers, if you know what I mean. I prefer a random number of documents, to add some spice to the test. And since we're testing a random sampler, I don't think it makes sense to test it with a fixed seed (0xdeadbeef) ... this collector is all about randomness, so we should stress the randomness done there. Given our test framework, randomness is not a big deal at all: once we get a test failure, we can deterministically reproduce it (when there is no multi-threading). So I say YES, in this test I think we should have randomness.
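A minimal sketch of the idea in plain Java, not the actual test framework – the seed property name and the `8000 + nextInt(2000)` range are my assumptions, standing in for what LuceneTestCase's `atLeast(8000)` and seed logging give us:

```java
import java.util.Random;

public class ReproducibleRandomness {
    public static void main(String[] args) {
        // Pick a fresh seed per run, but log it: a failing run can then be
        // replayed deterministically by passing the logged seed back in.
        long seed = Long.getLong("tests.seed", System.nanoTime());
        System.out.println("seed=" + seed);
        Random random = new Random(seed);
        // Randomized doc count instead of a fixed 10,000.
        int numDocs = 8000 + random.nextInt(2000);
        System.out.println("numDocs=" + numDocs);
    }
}
```

The key property is that the same seed always reproduces the same "random" run, so randomizing the test costs us nothing in debuggability.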
But e.g. when you add a test which ensures the collector works well w/ sampled docs and scores, I don't think you should add randomness – it's ok to test it once.
Also, in terms of test coverage, there are other cases which I think would be good if they were tested:
- Docs + Scores (discussed above)
- Multi-segment indexes (ensuring we work well there)
- Different number of hits per-segment (to make sure our sampling on tiny segments works well too)
I wouldn't, for example, use RandomIndexWriter, because we're only testing search (so it just adds noise to this test). If we want many segments, we should commit/NRT-open every few docs, disable the merge policy, etc. These can be separate, real "unit" tests.
Sorry, I don't get what you mean by this.
I meant that if you set numDocs = atLeast(8000), then the 10% sampler's expectation should not be hardcoded to 1,000, but derived as numDocs * 0.1.
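To make that concrete – a tiny sketch (the variable names are mine, not from the patch) deriving the expectation from numDocs instead of hardcoding 1,000:

```java
import java.util.Random;

public class ExpectedSampleSize {
    public static void main(String[] args) {
        // Stand-in for atLeast(8000): the run picks a random doc count.
        int numDocs = 8000 + new Random().nextInt(2000);
        double samplingRatio = 0.1;
        // Not a hardcoded 1,000 – scales with whatever numDocs this run picked.
        int expectedSampleSize = (int) (numDocs * samplingRatio);
        System.out.println("expectedSampleSize=" + expectedSampleSize);
    }
}
```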
the original totalHits .. is used
I think that's OK. In fact, if we don't record that, it would be hard to fix the counts, no?
There will be 5 facet values (0, 2, 4, 6 and 8), as only the even documents (i % 10) are hits. There is a REALLY small chance that one of the five values will be entirely missed when sampling. But that is 0.8 (the chance not to take a value) ^ 2000, times 5 (any of them could be the missing one) ~ 10^-193, so it's probably not going to happen.
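For the curious, the arithmetic can be checked in log space, since 0.8^2000 underflows a double. The constants below are taken straight from the comment above; the class and variable names are mine:

```java
public class MissedValueOdds {
    public static void main(String[] args) {
        double pSkip = 0.8;       // chance a single hit carrying the value is not sampled
        int docsPerValue = 2000;  // hits carrying a given facet value
        int numValues = 5;        // facet values 0, 2, 4, 6, 8
        // 0.8^2000 underflows a double, so stay in log10 space.
        double log10MissOne = docsPerValue * Math.log10(pSkip);
        // Union bound over the 5 values: multiply by 5, i.e. add log10(5).
        double log10MissAny = log10MissOne + Math.log10(numValues);
        System.out.printf("P(some value entirely missed) ~ 10^%.0f%n", log10MissAny);
        // prints roughly 10^-193
    }
}
```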
Ahh thanks, I missed that. I agree it's very improbable that one of the values is missing, but if we can avoid that at all, it's better. First, it's not just one of the values – we could even be missing 2, right? It really depends on the randomness. I find this assert just redundant – if we always expect 5, we shouldn't assert that we received 5. And if we say that very infrequently we might get <5 and we're OK with that .. what's the point of asserting it at all?
I renamed the sampleThreshold to sampleSize. It currently picks a samplingRatio that will reduce the number of hits to the sampleSize, if the number of hits is greater.
It looks like it hasn't changed? I mean besides the rename. So if I set sampleSize=100K, it's 100K whether there are 101K docs or 100M docs, right? Is that your intention?
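If I read the description right, the ratio is derived as sampleSize / totalHits whenever totalHits exceeds sampleSize. A hypothetical sketch of that behavior (the method name is mine, not the patch's API):

```java
public class SampleSizeCap {
    // Hypothetical reconstruction of the described behavior: pick a ratio
    // that shrinks the hit set down to sampleSize; below that, keep everything.
    static double samplingRatio(long totalHits, long sampleSize) {
        return totalHits <= sampleSize ? 1.0 : (double) sampleSize / totalHits;
    }

    public static void main(String[] args) {
        long sampleSize = 100_000;
        // Whether there are 101K or 100M hits, the expected sample is ~100K:
        System.out.println(Math.round(101_000 * samplingRatio(101_000, sampleSize)));
        System.out.println(Math.round(100_000_000 * samplingRatio(100_000_000, sampleSize)));
        // both print 100000
    }
}
```

Under this reading, yes: the sample is capped at sampleSize no matter how far totalHits exceeds it.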