Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.6, 4.0-ALPHA
    • Fix Version/s: 4.0-BETA, 5.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      An addition to each segment that stores a Bloom filter for selected fields, in order to fast-fail term searches and avoid wasted disk access.

      Best suited for low-frequency fields, e.g. primary keys on big indexes with many segments, but it also speeds up general searching in my tests.

      Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

      Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

      Patch based on 3.6 codebase attached.
      There are no 3.6 API changes currently - to play, just add a field with "_blm" on the end of the name to invoke the special indexing/querying capability. Clearly a new Field or schema declaration would need to be added to the APIs to configure the service properly.

      Also attached: a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.
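
      For readers new to the idea, below is a minimal, self-contained Java sketch of the fast-fail principle the patch applies per segment. It is illustrative only - the class and method names are made up and are not the patch's API; the real code hashes terms with MurmurHash2 and stores one filter per selected field per segment.

          import java.nio.charset.StandardCharsets;
          import java.util.BitSet;

          /** Toy per-segment Bloom filter: may answer "maybe present" falsely, never "absent" falsely. */
          class SimpleBloomFilter {
            private final BitSet bits;
            private final int mask; // power-of-two size lets us mask instead of mod

            SimpleBloomFilter(int numBitsPowerOfTwo) {
              bits = new BitSet(numBitsPowerOfTwo);
              mask = numBitsPowerOfTwo - 1;
            }

            void addTerm(String term) {
              bits.set(hash(term) & mask);
            }

            /** Fast-fail check: false means the term is definitely absent from this segment,
             *  so the terms-dictionary/disk lookup can be skipped entirely. */
            boolean mayContain(String term) {
              return bits.get(hash(term) & mask);
            }

            private static int hash(String term) {
              // Any reasonable hash works for the sketch (the patch itself uses MurmurHash2).
              int h = 0x9747b28c;
              for (byte x : term.getBytes(StandardCharsets.UTF_8)) {
                h ^= x;
                h *= 0x5bd1e995;
                h ^= h >>> 13;
              }
              return h;
            }

            public static void main(String[] args) {
              SimpleBloomFilter filter = new SimpleBloomFilter(1 << 20);
              filter.addTerm("doc-12345");                        // indexed primary key
              System.out.println(filter.mayContain("doc-12345")); // true
              System.out.println(filter.mayContain("doc-99999")); // almost certainly false -> skip disk access
            }
          }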

        Attachments

      1. 4069Failure.zip
        10 kB
        Mark Harwood
      2. BloomFilterPostingsBranch4x.patch
        54 kB
        Mark Harwood
      3. LUCENE-4069-tryDeleteDocument.patch
        3 kB
        Michael McCandless
      4. LUCENE-4203.patch
        3 kB
        Michael McCandless
      5. MHBloomFilterOn3.6Branch.patch
        17 kB
        Mark Harwood
      6. PKLookupUpdatePerfTest.java
        19 kB
        Michael McCandless
      7. PKLookupUpdatePerfTest.java
        19 kB
        Mark Harwood
      8. PKLookupUpdatePerfTest.java
        19 kB
        Michael McCandless
      9. PKLookupUpdatePerfTest.java
        9 kB
        Mark Harwood
      10. PrimaryKeyPerfTest40.java
        10 kB
        Mark Harwood

        Issue Links

          Activity

          Mark Harwood added a comment -

          Removing misleading 2X perf gain: it seems to depend heavily on the exact use case.

          Fair enough - the original patch targeted Lucene 3.6 which benefited more heavily from this technique. The issue then morphed into a 4.x patch where performance gains were harder to find.
          I think the sweet spot is in primary key searches on indexes with ongoing heavy changes (more segment fragmentation, less OS-level caching?). This is the use case I am targeting currently and my final tests using our primary-key-counting test rig saw a 10 to 15% improvement over Pulsing.

          I'm asking because I need this feature but I'm stuck with 3.x for a while.

          I have a client in a similar situation who are contemplating using the 3.6 patch.

          Are there bugs which should be fixed in the initial 3.6 patch?

          It has been a while since I looked at it - a quick run of "ant test" on my copy here showed no errors. I will be giving it a closer review if my client decides to go down this route and can post any fixes here.
          I expect if you use the patch and get into trouble you can use an un-patched version of 3.6 to read the same index files (it should just ignore the extra "blm" files created by the patched version).

          Mikhail Khludnev added a comment -

          Mark,

          I see that several issues were fixed to land it in trunk/4.0. Are there bugs which should be fixed in the initial 3.6 patch? I'm asking because I need this feature but I'm stuck with 3.x for a while. Do you recommend MHBloomFilterOn3.6Branch.patch for production use?

          PS: thanks for your contribution. It's awesome!

          Michael McCandless added a comment -

          Removing misleading 2X perf gain: it seems to depend heavily on the exact use case.

          Michael McCandless added a comment -

          Hmm in testing Bloom + Lucene40 PF on luceneutil's "id" field I don't see any gains with what was committed (base=Lucene40 for id field, comp=Bloom + Lucene40 for id field; all other fields are just Lucene40):

                        Task    QPS base  StdDev base    QPS bloom  StdDev bloom    Pct diff
                      PKLookup      196.71        9.29      192.94        5.40   -8% -    5%
              HighSloppyPhrase        0.75        0.04        0.75        0.04  -10% -    9%
                    HighPhrase        1.78        0.02        1.76        0.01   -2% -    0%
                     MedPhrase        5.00        0.09        4.95        0.15   -5% -    3%
                    AndHighMed       74.66        3.52       74.02        4.50  -11% -   10%
                   AndHighHigh       16.94        0.82       16.83        0.78   -9% -    9%
               MedSloppyPhrase        4.10        0.11        4.08        0.12   -5% -    4%
                     LowPhrase       13.35        0.19       13.31        0.24   -3% -    2%
               LowSloppyPhrase       31.91        0.52       31.90        0.58   -3% -    3%
                    AndHighLow     1114.88       43.26     1121.52       33.72   -6% -    7%
                   LowSpanNear        9.67        0.22        9.76        0.26   -4% -    5%
                       LowTerm      421.55       14.96      426.55        5.23   -3% -    6%
                       MedTerm      176.45        7.66      179.40        2.84   -4% -    7%
                  HighSpanNear        1.43        0.03        1.45        0.05   -3% -    7%
                   MedSpanNear        2.51        0.05        2.56        0.09   -3% -    7%
                      HighTerm       15.52        0.95       15.90        0.24   -4% -   10%
                       Respell       69.77        3.20       71.65        3.61   -6% -   13%
                        Fuzzy1       88.19        2.49       91.04        3.06   -2% -    9%
                        Fuzzy2       36.18        1.26       37.50        1.44   -3% -   11%
                      Wildcard       45.33        1.52       47.04        1.10   -1% -    9%
                       Prefix3       65.65        2.53       68.26        1.99   -2% -   11%
                     OrHighLow       37.45        0.88       39.87        2.48   -2% -   15%
                    OrHighHigh        9.58        0.10       10.23        0.64    0% -   14%
                     OrHighMed       14.89        0.14       15.96        1.01    0% -   15%
                        IntNRQ       10.24        1.04       11.07        1.33  -13% -   34%
          

          I just used the DefaultBloomFilterFactory, which should be good here since it targets 10% fill based on doc count.

          Maybe even slight loss (but could easily be noise). I confirmed that I'm getting .blm files and that at search time the filter does in fact "work" (quickly returns false from seekExact).

          The differences vs. the dedicated benchmark on this issue are: 1) luceneutil is only testing lookups, 2) the "id" field is just one field (ie we also have eg body/title/date for searching), 3) luceneutil is testing a bunch of other queries in other threads in addition to PKLookup.

          Out of curiosity I also tried Memory vs Bloom + Memory, and Bloom was ~15% slower, I guess because computing the hash & doing the lookup is a non-trivial cost vs. just walking the FST to find out the term doesn't exist.
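
          For context on the "targets 10% fill" remark above, the standard Bloom filter arithmetic relates bitset size, key count and hash count to the expected false-positive rate. A hedged sketch in Java (the general formula, not necessarily the exact heuristic DefaultBloomFilterFactory uses):

            /** Classic Bloom filter estimate; illustrative only. */
            final class BloomMath {
              /** Expected false-positive rate for m bits, n inserted keys, k hash functions. */
              static double falsePositiveRate(long m, long n, int k) {
                return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
              }

              public static void main(String[] args) {
                // e.g. a 1M-term field given ~10x as many bits as keys and a single hash:
                System.out.printf("fp ~ %.3f%n", falsePositiveRate(10_000_000L, 1_000_000L, 1)); // ~0.095
              }
            }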

          Mark Harwood added a comment -

          Applied to trunk in revision 1368567

          Mark Harwood added a comment -

          Will do.

          Adrien Grand added a comment -

          Mark, is there a reason why this patch hasn't been committed to trunk too?

          Robert Muir added a comment -

          Hi Mark: I noticed this was committed only to the 4.x branch.

          Can you also merge the change to trunk?

          Mark Harwood added a comment -

          Committed to 4.0 branch, revision 1368442

          Mark Harwood added a comment -

          Updated the patch to bring it in line with the latest core API changes.
          All tests now pass cleanly, so I will commit soon.

          Mark Harwood added a comment -

          Updated with a fix for the issue explored in LUCENE-4275.

          Mark Harwood added a comment -

          Attached a log of thread activity showing how TestIndexWriterCommit.testCommitThreadSafety() is failing.
          At this stage I can't tell if this is a failure in MockDirectoryWrapper, the test, or the BloomPF class, but it is related to files being removed unexpectedly.

          Mark Harwood added a comment -

          I wonder if it has to do w/ only opening the file in the close() method

          Just tried opening the file earlier (in BloomFilteredConsumer constructor) and that didn't fix it.
          I previously also added an extra Directory.fileExists() sanity check immediately after closing the IndexOutput and all was well so I think it's something happening after that. Will need to dig deeper.
          I'm running on WinXP 64bit if that is of any significance to MDW's behaviour.

          Michael McCandless added a comment -

          Presumably I am doing the "right thing" in always throwing an exception if the .blm file is missing?

          Definitely! This should be considered a corrupt index.

          If MDW is intended to only delete uncommitted files, I'm not sure how we end up in a scenario where BloomPF is being asked to open the uncommitted segment?

          Very strange... IW will sync all files written by the codec, and only once all files for a given segment are sync'd will it write a segments_N referencing that segment. Not sure offhand why this file isn't being sync'd... I wonder if it has to do w/ only opening the file in the close() method (most other PFs open files well before then). It should be fine to do that ... but it's something different.

          Mark Harwood added a comment -

          One remaining issue before I commit: it has appeared sporadically and looks to be consistently raised by this test:
          ant test -Dtestcase=TestIndexWriterCommit -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE -Dtests.slow=true -Dtests.postingsformat=TestBloomFilteredLucene40Postings -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=ISO-8859-1

          The error it produces is this:
          [junit4:junit4] > Caused by: java.lang.IllegalStateException: Missing file:_9_TestBloomFilteredLucene40Postings_0.blm
          [junit4:junit4] > at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat$BloomFilteredFieldsProducer.<init>(BloomFilteringPostingsFormat.java:175)
          [junit4:junit4] > at org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat.fieldsProducer(BloomFilteringPostingsFormat.java:156)

          MockDirectoryWrapper looks to be randomly deleting files (probably my "blm" file shown above) to simulate the effects of crashes.
          Presumably I am doing the "right thing" in always throwing an exception if the .blm file is missing? The alternative would be to silently ignore the missing file which seems undesirable.
          If MDW is intended to only delete uncommitted files, I'm not sure how we end up in a scenario where BloomPF is being asked to open the uncommitted segment?

          Michael McCandless added a comment -

          Thanks Mark. +1 to commit.

          I still think we could/should somehow fold this into the terms dict but we can do that another day... this is a good step forward.

          because the PerFieldPF isn't smart enough to figure out that these Bloom-ing choices do not require different physical files for all the delegated tii etc structures.

          We need to fix that, so that every PF wrapper that comes along doesn't feel like it must handle per-field-ness itself.

          Mark Harwood added a comment -

          A quick benchmark suggests the new right-sized bitset, as opposed to the old worst-case-scenario-sized bitset, is buying us a small performance improvement.

          I also don't think this PF should be per-field

          There was a lengthy discussion earlier on this topic. The approach presented here seems reasonable.
          For the average user there is the DefaultBloomFilterFactory default, which now has reasonable sizing for all fields passed its way (assuming a numDocs=numKeys heuristic to anticipate sizing). For expert users, you can provide a BloomFilterFactory with a custom sizing heuristic per field and can also simply return "null" for non-bloomed fields.

          Having a single, carefully configured BloomPF wrapper is preferable because you can channel appropriately configured bloom settings to a common PF delegate and avoid creating multiple .tii, .tis files etc.; the PerFieldPF isn't smart enough to figure out that these Bloom-ing choices do not require different physical files for all the delegated .tii etc. structures.

          You don't have to use the Per-field stuff in BloomPF but there are benefits to be had in doing so which can't otherwise be achieved.

          Can you add @lucene.experimental to all the new APIs?

          Done.
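
          To make the per-field factory idea above concrete, here is a shape-only sketch in Java. The interface and class names below are illustrative stand-ins, not the actual classes in the patch; the real factory works with FuzzySet, FieldInfo and (as discussed further down the thread) SegmentWriteState.

            import java.util.BitSet;

            /** Illustrative stand-in for the per-field factory idea (not the patch's API). */
            interface BloomSetFactory {
              /** Return a sized bitset for this field, or null to skip bloom filtering it. */
              BitSet getSetForField(String fieldName, int expectedUniqueKeys);
            }

            class PrimaryKeyOnlyFactory implements BloomSetFactory {
              @Override
              public BitSet getSetForField(String fieldName, int expectedUniqueKeys) {
                if (!"id".equals(fieldName)) {
                  return null; // non-bloomed fields pay no cost
                }
                // Roughly 10 bits per expected key keeps a single-hash false-positive rate near 10%.
                return new BitSet(Math.max(64, expectedUniqueKeys * 10));
              }
            }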

          Mark Harwood added a comment -

          New patch with use of SegmentWriteState to right-size the choice of bitset for the volume of content.

          Michael McCandless added a comment -

          At a minimum I think before committing we should make the SegmentWriteState accessible.

          OK. Will that be the subject of a new Jira?

          No, I mean we shouldn't commit this patch until SegmentWriteState is
          accessible when creating the FuzzySet. I think we can just pass it to
          BloomFilterFactory.getSetForField? This way if the app knows it's a
          PK field then it can use maxDoc to always size an appropriate
          bit set up front.

          I think we are in agreement on the broad principles. The fundamental question here though is do you want to treat an index's choice of Hash algo as something that would require a new SPI-registered PostingsFormat to decode or can that be handled as I have done here with a general purpose SPI framework for hashing algos?

          +1, that's exactly the question.

          Ie, where to draw the line between "config of an existing PF" and
          "different PF".

          But I guess swapping in different hash impl should be seen as simple
          config change, so I think using SPI to find it at read time is OK.

          I still don't like how trappy this approach is: the default hardwired
          (8 MB) can be way too big (silently slows down your NRT reopens,
          especially if you bloom all fields) or way too small (silently turns
          off bloom filter for fields that have too many unique terms).

          I also don't think this PF should be per-field: we have
          PerFieldPostingsFormat for that, and if there are limitations in PFPF,
          we should address them rather than having to make all future PFs
          handle per-field-ness themselves. This PF should really handle one
          field.

          But I don't think these issues need to hold up commit (except for
          making SegmentWriteState accessible)... we can improve over time. I
          think we may simply want to fold this into the terms dict somehow.

          Can you add @lucene.experimental to all the new APIs?

          Mark Harwood added a comment -

          MessageDigest.getInstance(name) should be the way to go

          I'm less keen now - a quick scan of the docs around MessageDigest throws up some issues:
          1) SPI registration of MessageDigest providers looks to get into permissions hell as it is closely related to security - see http://docs.oracle.com/javase/1.4.2/docs/guide/security/CryptoSpec.html#ProviderInstalling which talks about the steps required to approve a trusted "provider".
          2) MessageDigest as an interface is designed to stream content past the hashing algo in potentially many method calls. MurmurHash2.java is not currently written to process content this way; it suits our needs by hashing small blocks of content in one hit.

          For these 2 reasons it looks like MessageDigest may be a pain to adopt and the existing approach proposed in this patch may be preferable.
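
          To make point 2 concrete: MessageDigest is a stateful streaming API, while the patch only needs a one-shot hash over a small byte block. A quick comparison using only standard JDK calls (the MurmurHash2 call shown is commented out and its signature is hypothetical):

            import java.nio.charset.StandardCharsets;
            import java.security.MessageDigest;
            import java.security.NoSuchAlgorithmException;

            class HashStyles {
              public static void main(String[] args) throws NoSuchAlgorithmException {
                byte[] term = "doc-12345".getBytes(StandardCharsets.UTF_8);

                // MessageDigest style: stateful object, content streamed in via update() calls,
                // result produced (and state reset) by digest(). Overkill for one small term.
                MessageDigest md = MessageDigest.getInstance("MD5");
                md.update(term, 0, term.length);
                byte[] digest = md.digest();
                System.out.println("digest bytes: " + digest.length);

                // One-shot style as used by the patch's MurmurHash2: a single stateless call.
                // int h = MurmurHash2.hash32(term, 0, term.length); // hypothetical signature
              }
            }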

          Uwe Schindler added a comment -

          Actually, re-thinking this, I suspect rather than creating our own, I can use Java's existing SPI framework for hashing in the form of MessageDigest. I'll take a closer look into that...

          That's a good idea. MessageDigest.getInstance(name) should be the way to go, and this name string is encoded into the index. If the MessageDigest API is enough then one can e.g. use BouncyCastle by just adding it to the classpath. If, on opening the index, it is not available, you will get an exception just like when a codec/postings format is missing.

          Mark Harwood added a comment -

          If a special decoder for foobar is needed, it must be loadable by SPI.

          I think we are in agreement on the broad principles. The fundamental question here though is do you want to treat an index's choice of Hash algo as something that would require a new SPI-registered PostingsFormat to decode or can that be handled as I have done here with a general purpose SPI framework for hashing algos?

          Actually, re-thinking this, I suspect rather than creating our own, I can use Java's existing SPI framework for hashing in the form of MessageDigest. I'll take a closer look into that...

          Uwe Schindler added a comment -

          It would be a pain if user config settings require a custom SPI-registered class around just to decode the index contents. There's the resource/classpath hell, the chance for misconfiguration and running Luke suddenly gets more complex.
          The line to be drawn is between what are just config settings (field names, memory limits) and what are fundamentally different file formats (e.g. codec choices).
          The design principle that looks to be adopted is that the former ought to be accommodated without the need for custom SPI-registered classes and the latter would need to locate an implementation via SPI to decode stored content. Seems reasonable.
          The choice of hash algo does not fundamentally alter the on-disk format (they all produce an int) so I would suggest we treat this as a config setting rather than a fundamentally different choice of file format.

          The design principle here is very easy: we must follow the SPI pattern if you write an index that could otherwise not be read with default settings and would produce e.g. a CorruptIndexException. If you have a codec that writes some special things for specific fields, it is required to write this information about the fields to the index. If you want to open this index using IndexReader again, there must not be any requirement for configuration settings on the reader itself - a simple DirectoryReader.open() must be possible and a query must be able to execute. The IndexReader must be able to get all this information from the index files. If a special decoder for foobar is needed, it must be loadable by SPI. This is similar to postings: a new postings format needs a new SPI, otherwise you cannot read the index.

          And it is not true that Luke is more complex to configure: just put the JAR file that contains the SPI into the classpath and you are fine. Setting up a build environment is more complicated, but that's more a problem of shitty Eclipse resource handling. Ant/Maven/IDEA is easy.
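
          For illustration, the read-time SPI principle described above looks roughly like this in Lucene 4.x, where postings formats are resolved by the name recorded in the index (the format name shown is illustrative):

            import org.apache.lucene.codecs.PostingsFormat;

            class SpiLookupExample {
              public static void main(String[] args) {
                // The reader resolves the name written into the segment via SPI. If no jar on
                // the classpath registers a format with this name, the lookup fails, just as
                // it would for any missing codec/postings format.
                PostingsFormat pf = PostingsFormat.forName("BloomFilter"); // name is illustrative
                System.out.println("resolved: " + pf.getName());
              }
            }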

          Mark Harwood added a comment -

          At a minimum I think before committing we should make the SegmentWriteState accessible.

          OK. Will that be the subject of a new Jira?

          Hmm why is anonymity at search time important?

          It would seem to be an established design principle - see https://issues.apache.org/jira/browse/LUCENE-4069#comment-13285726

          It would be a pain if user config settings require a custom SPI-registered class around just to decode the index contents. There's the resource/classpath hell, the chance for misconfiguration and running Luke suddenly gets more complex.
          The line to be drawn is between what are just config settings (field names, memory limits) and what are fundamentally different file formats (e.g. codec choices).
          The design principle that looks to be adopted is that the former ought to be accommodated without the need for custom SPI-registered classes and the latter would need to locate an implementation via SPI to decode stored content. Seems reasonable.
          The choice of hash algo does not fundamentally alter the on-disk format (they all produce an int) so I would suggest we treat this as a config setting rather than a fundamentally different choice of file format.

          Michael McCandless added a comment -

          It's the unique term count (for this one segment) that you need right?

          Yes, I need it before I start processing the stream of terms being flushed.

          At a minimum I think before committing we should make the
          SegmentWriteState accessible. EG then at least I can use numDocs for
          my primary key field.

          I really don't like how easy it is to silently mis-configure this PF:
          the default 8 MB is way too high for an NRT setting and way too low for
          a large index.

          Currently all PostingFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI).

          Hmm why is anonymity at search time important? I think that's a
          non-feature, and we shouldn't make our core code more complex for it?

          Ie, it's fine to require the app to have to make a named PF (that is
          accessible via SPI), implementing all their custom bloom logic (which
          fields are bloom'd, what hash to use, etc.).

          When an app makes a custom Codec/PostingsFormat, it's expected that
          that class is accessible via SPI at both index time and search time.

          Mark Harwood added a comment -

          It's the unique term count (for this one segment) that you need right?

          Yes, I need it before I start processing the stream of terms being flushed.

          Seems like LUCENE-4198 needs to solve this same problem.

          Another possibly related point on more access to "merge context" - custom codecs have a great opportunity at merge time to piggy-back some analysis on the data being streamed e.g. to spot "trending" terms whose term frequencies differ drastically between the merging source segments. This would require access to "source segment" as term postings are streamed to observe the change in counts.

          Also, why do we need to use SPI to find the HashFunction? Seems like overkill... we don't (yet) have a bunch of hash functions that are vying here right?

          There's already a MurmurHash3 algo - we're currently using v2 and so could anticipate an upgrade at some stage. This patch provides that future-proofing.

          can't the postings format impl pass in an instance of HashFunction when making the FuzzySet

          I don't think that is going to work. Currently all PostingFormat impls that extend BloomFilterPostingsFormat can be anonymous (i.e. unregistered via SPI). All their settings (fields, hash algo, thresholds) etc are recorded at write time by the base class in the segment. At read-time it is the BloomFilterPostingsFormat base class that is instantiated, not the write-time subclass and so we need to store the hash algo choice. We can't rely on the original subclass being around and configured appropriately with the original write-time choice of hashing function.

          I think the current way feels safer overall and also allows other Lucene functions to safely record hashes along with a hash-name string that can be used to reconstitute results.

          Can you move the imports under the copyright header?

          Will do

          Michael McCandless added a comment -

          I don't think we should use a separate factory class to provide the
          bloom filters; I think we should simplify. EG add overridable methods
          to BloomFilteringPF; this way the default impl (8 MB, isSaturated)
          would be in BloomFilteringPF.

          Also, why do we need to use SPI to find the HashFunction? Seems like
          overkill... we don't (yet) have a bunch of hash functions that are
          vying here right? We can make this pluggable at a later time... can't
          the postings format impl pass in an instance of HashFunction when
          making the FuzzySet? Default would just be MurmurHash.

          Can you move the imports under the copyright header? (eg in
          HashFunction.java but maybe others).

          Michael McCandless added a comment -

          Estimating key volumes in context 1 is probably hard without some additional hints from the end user.

          Actually, the indexer knows exactly how many unique terms it's about to
          flush. It's the unique term count (for this one segment) that you
          need right? We just have to plumb it somehow (add to
          SegmentWriteState?).

          2) Merged segments e.g. guessing how many unique keys survive a merge operation

          Seems like LUCENE-4198 needs to solve this same problem.

          Not sure how we get the OneMerge instance fed through the call stack - could that be held somewhere on a ThreadLocal as generally useful context?

          Yeah I'm not sure either ... but I agree an approx heuristic based on
          merging segments unique term counts is better than nothing.

          I don't like how you silently just get bad perf now if your up-front
          guess was too small ... I also don't like that the default guessing is
          static (8 MB): flushing tiny NRT segs are gonna pay an absurd price.
          I think we need a solution for this...

          Maybe another option is to simply make a 2nd pass after the segment
          (flush or merge) is written, to build up the bit set; this way we know
          exactly how many unique terms we have.

          Mark Harwood added a comment -

          Added bloom package.html and changes.txt. I plan to commit in a day or two if there are no objections.

          Hoss Man added a comment -

          bulk cleanup of 4.0-ALPHA / 4.0 Jira versioning. all bulk edited issues have hoss20120711-bulk-40-change in a comment

          Mark Harwood added a comment -

          So now we are close to 1M lookups/sec for a single thread!

          Cool!

          I wonder if somehow we can do a better job picking the right sized bit vector up front?

          You basically need to know up front how many unique terms will be in the given field for this segment right?

          Yes - the job of anticipating the number of unique keys probably has 2 different contexts:
          1) Net new segments e.g. guessing up front how many docs/keys a user is likely to generate in a new segment before the flush settings kick in.
          2) Merged segments e.g. guessing how many unique keys survive a merge operation

          Estimating key volumes in context 1 is probably hard without some additional hints from the end user. Arguably the BloomFilterFactory.getSetForField() method already represents where this setting can be controlled.
          In context 2 where potentially large merges occur we could look at adding an extra method to BloomFilterFactory to handle this different context e.g. something like
          FuzzySet getSetForMergeOpOnField(FieldInfo fi, OneMerge mergeContext)
          Based on the size of the segments being merged and the volume of deletes, a more appropriately sized Bloom bitset could be allocated from a worst-case estimate.
          Not sure how we get the OneMerge instance fed through the call stack - could that be held somewhere on a ThreadLocal as generally useful context?
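
          As a concrete form of the worst-case sizing idea, the standard sizing formula gives the number of bits needed for a target false-positive rate. A hedged arithmetic sketch in Java (not the getSetForMergeOpOnField API proposed above, which is hypothetical at this point):

            /** Standard Bloom sizing arithmetic, as a sketch of the worst-case estimate idea. */
            final class BloomSizing {
              /** Bits needed for n expected keys at target false-positive rate p (optimal hash count). */
              static long bitsFor(long n, double p) {
                return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
              }

              public static void main(String[] args) {
                // Worst case for a merge: assume every unique key from every merging segment survives.
                long worstCaseKeys = 3_000_000L + 2_500_000L + 500_000L;
                System.out.printf("%.1f MB%n", bitsFor(worstCaseKeys, 0.10) / 8.0 / (1024 * 1024)); // ~3.4 MB
              }
            }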

          Michael McCandless added a comment -

          I opened LUCENE-4203 for IndexWriter.tryDeleteDocument.

          Urgh, patch is on the wrong issue ...

          Michael McCandless added a comment -

          Initial rough patch ... not ready to commit and needs a good random test.

          Michael McCandless added a comment -

          The only substantial change was that I fixed the updates-vs-inserts counting to work correctly when doDirectDelete is true.

          Michael McCandless added a comment -

          I ran with the fastest configuration (lookupByReader=false,
          shareEnums=true, doSortKeys=true, useDocValues=true,
          doDirectDelete=true, useBase36=true).

          I used MMapDir, 10M numDocs and keySpaceSize, Java 1.7.0_04, 2 GB
          starting/max heap.

          With doWorstCaseFlushing=false:

            Lucene40
              55978 ms
          
            Bloom
              49880 ms
          
            Memory
              55965 ms
          
            Pulsing
              56059 ms
          

          With doWorstCaseFlushing=true:

            Lucene40
              61097 ms
          
            Bloom
              53698 ms
          
            Memory
              61066 ms
          
            Pulsing
              61355 ms
          

          So Bloom is ~11-12% faster. Very curious that Lucene40, Pulsing and
          Memory are basically the same!

          Michael McCandless added a comment -

          I'll open a separate issue for IndexWriter.tryDeleteDocument; I think it's worth exploring...

          Michael McCandless added a comment -

          I added a 100mb start size for the BloomFilter for large-scale tests because without this it gets saturated and there were occasional big spikes in batch times.

          I wonder if somehow we can do a better job picking the right sized bit vector up front? You basically need to know up front how many unique terms will be in the given field for this segment right? This seems similar to LUCENE-4198, where we want to improve Codec API so it has access to collection stats (like "number of unique terms in this field") up front.

          Michael McCandless added a comment -

          Also: in digging into this I found and fixed a silly performance bug in the nightly PKLookup perf test:

          http://people.apache.org/~mikemccand/lucenebench/PKLookup.html

          So now we are close to 1M lookups/sec for a single thread!

          Michael McCandless added a comment -

          That's tightened performance but that looks like a scary amount of code for the optimal solution to this basic incrementing operation

          Well the code now looks horribly complex since we have a ton of booleans to do the same thing in different ways

          From my limited poking-around testing the fastest way seems to be lookupByReader=false, shareEnums=true, doSortKeys=true, useDocValues=true, useBase36=true, doDirectDelete=true. Once you specialize the code for those booleans then it looks much more sane.

          Mark Harwood added a comment -

          Thanks for the extra tests, Mike. That's tightened performance but that looks like a scary amount of code for the optimal solution of this basic incrementing operation.

          I've done some more benchmarks with the updated test and the performance characteristics are becoming clearer as shown in these results: http://goo.gl/dtWSb
          Bloom performance is better than Pulsing but the gap narrows as the volume of deletes lying around in old segments (caused by updates) grows. In these cases the BloomFilter gives a false positive and falls back to the equivalent operations of Pulsing. I added a 100 MB start size for the BloomFilter for large-scale tests because without this it gets saturated and there were occasional big spikes in batch times.
          So overall there still looks to be a benefit and especially in low-frequency update scenarios.

          I'll wait for the dust to settle on LUCENE-4190 (given this Codec introduces a new file) before thinking about committing.

          Cheers
          Mark

          Mark Harwood added a comment -

          Updated performance test with option to alter the ratio of inserts vs updates via keyspace size.

          Michael McCandless added a comment -

          I made various improvements to the PK lookup/update performance tester (called out as booleans at the top) so we can test the impact of:

          • Reusing the enums
          • Pre-sorting the keys before lookup
          • Best case (flush once after each update) vs worst case
          • Using DocValues direct source instead of a stored field to hold the counter
          • Using base 36 for the PK vs base 10

          Also, it's really wasteful that we go and lookup the docID by PK, to retrieve the old count, and then do an updateDocument call which forces IW to go and do the exact same (costly) lookup again.

          So I added a new method enabling deletion by document ID in IndexWriter. I named it tryDeleteDocument, and it takes SegmentInfo and int docID. If that segment has not been merged away, the delete succeeds, else it fails (and the app must delete through the "normal" way). This seems to give ~20% speedup.

          I think eg Solr's "update document field by retrieving old doc, changing field, calling IW.updateDocument" could also use this method.
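
          To make the intent concrete, a rough sketch of that update loop might look like the following (this is not the benchmark code; lookupSegmentAndDocId and SegmentDoc are hypothetical stand-ins for the usual per-segment primary-key lookup, and the tryDeleteDocument signature follows this comment's description, which may differ from what finally gets committed; Lucene imports omitted):

              // Hypothetical holder for the result of a per-segment PK lookup.
              class SegmentDoc {
                SegmentInfo segmentInfo;
                int docID;
              }

              void updateByPrimaryKey(IndexWriter writer, Term pkTerm, Document newDoc) throws IOException {
                SegmentDoc hit = lookupSegmentAndDocId(pkTerm); // hypothetical helper: PK -> (segment, docID)
                if (hit != null && writer.tryDeleteDocument(hit.segmentInfo, hit.docID)) {
                  // Old document deleted in place: a plain addDocument avoids repeating
                  // the costly PK lookup that updateDocument would otherwise perform.
                  writer.addDocument(newDoc);
                } else {
                  // Segment merged away (or PK not found): fall back to the normal path.
                  writer.updateDocument(pkTerm, newDoc);
                }
              }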

          Mark Harwood added a comment -

          Added customizable saturation threshold after which Bloom filters are retired and no longer maintained (due to merges creating very large segments)

          Mark Harwood added a comment -

          Attached a performance test (adapted from Mike's PKLookupPerfTest) that demonstrates the worst-case scenario where BloomFilter offers the 2x speed up not previously revealed in Mike's other tests.

          This test case mixes reads and writes on a growing index and is representative of the real-world scenario I am seeking to optimize. See the javadoc for test details.

          Mark Harwood added a comment -

          Updated Performance test code based on new IndexReader changes for accessing subreaders

          Mark Harwood added a comment -

          Fix for the "not downsizing" bug and a subsequent issue which that fix revealed. The 2nd issue was that on saturation, the downsize method would actually upsize into a bigger bitset. This causes false negatives on searches - it's safe to downsize the indexing bitset but not upsize as there is already some information loss involved.
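
          To see why shrinking is safe while growing is not (this illustration uses plain java.util.BitSet and is independent of the patch's FuzzySet internals; sizes are powers of two and a term's position is hash & (size - 1)):

              import java.util.BitSet;

              public final class FoldDemo {
                // Fold a bit set of size 2n down to n by OR-ing the top half into the bottom half.
                static BitSet downsize(BitSet bits, int size) {
                  int half = size / 2;
                  BitSet folded = new BitSet(half);
                  for (int i = 0; i < half; i++) {
                    if (bits.get(i) || bits.get(i + half)) {
                      folded.set(i);
                    }
                  }
                  return folded;
                }

                public static void main(String[] args) {
                  int size = 8;
                  BitSet bits = new BitSet(size);
                  int hash = 6;                        // some term's hash
                  bits.set(hash & (size - 1));         // recorded at position 6 in the full-size set

                  BitSet folded = downsize(bits, size);
                  // Position 6 aliases onto position 2 (6 & 3), and the fold guarantees that
                  // aliased position is set, so membership is preserved: prints true.
                  System.out.println(folded.get(hash & (size / 2 - 1)));

                  // Going the other way is lossy: the folded set alone cannot tell whether
                  // position 2 or position 6 was originally set, so an "upsized" set can leave
                  // the term's true position clear - exactly the false negatives described above.
                }
              }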

          Michael McCandless added a comment -

          Results after fixing the "not downsizing" bug:

                Task    QPS base  StdDev base    QPS bloom  StdDev bloom      Pct diff
                        Fuzzy1       97.02        1.43       88.67        3.08  -13% -   -4%
                        Fuzzy2       42.50        1.15       40.32        1.75  -11% -    1%
                   AndHighHigh       17.99        0.29       17.74        0.47   -5% -    2%
                       Respell       82.76        2.96       81.66        3.54   -8% -    6%
                    AndHighMed       55.77        1.02       55.06        1.51   -5% -    3%
                   TermGroup1M       42.35        0.79       41.98        0.85   -4% -    3%
                  TermBGroup1M       49.55        0.58       49.14        0.60   -3% -    1%
                      SpanNear       12.53        0.15       12.49        0.09   -2% -    1%
                TermBGroup1M1P       64.36        0.83       64.34        0.83   -2% -    2%
                       Prefix3       34.18        1.25       34.45        1.38   -6% -    8%
                      Wildcard       34.42        1.12       34.75        1.35   -6% -    8%
                  SloppyPhrase       11.48        0.17       11.61        0.22   -2% -    4%
                        Phrase        8.47        0.30        8.57        0.25   -5% -    7%
                          Term       94.06        4.50       95.45        2.34   -5% -    9%
                        IntNRQ        9.84        0.63       10.07        0.71  -10% -   16%
                     OrHighMed       34.57        2.07       35.78        2.24   -8% -   16%
                    OrHighHigh       11.66        0.64       12.11        0.73   -7% -   16%
                      PKLookup      129.75        2.37      153.80        2.31   14% -   22%
          

          Basically same as before ... which is good (it means target 10% saturation doesn't lose any perf).

          Michael McCandless added a comment -

          OK I think I found the problem w/ the patch: in BloomFilteringPostingsFormat.saveAppropriatelySizedBloomFilter, it should be rightSizedSet.serialize(bloomOutput); not bloomFilter.serialize(bloomOutput); (it was saving the un-downsized version). Now I see my .blm files varying in size according to how large the segment is...

          Michael McCandless added a comment -

          Hmm somehow I messed up when testing the bloom postings format: all of my on-disk bloom bit sets are 8 MB, regardless of the size of the segment... so something is wrong (I would expect segments w/ more terms to have a larger bit set...). This is what I did for the codec:

              final Codec codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                  final String pfName = field.equals("id") ? idFieldPostingsFormat : defaultPostingsFormat;
                  if (pfName.equals("Bloom")) {
                    return new BloomFilteringPostingsFormat(PostingsFormat.forName("Lucene40"),
                                                            new BloomFilterFactory() {
                                                              @Override
                                                              public FuzzySet getSetForField(FieldInfo info) {    
                                                                return FuzzySet.createSetBasedOnMaxMemory(100*1024*1024, new MurmurHash2());
                                                              }
                                                            });
                  } else {
                    return PostingsFormat.forName(pfName);
                  }
                }
              };
          

          I thought that would use plenty of RAM (100 MB), and then downsize to the appropriately sized bit set (~10% saturation) when writing ... but somehow it didn't.

          Michael McCandless added a comment -

          My Windows-based woes were:
          1) Had to install python (used 2.7)

          OK.

          2) Figure out python proxy settings for Wikipedia download

          Hmmm ... you can also manually download separately.

          3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't installed/available so gave up and did svn checkout manually

          Oh I see: setup.py uses PySVN; hmm. Just checking out manually is better...

          4) Ran first python test and it aborted with complaint about GnuPlot missing

          Hmm we should default printIndexCharts to False (I'll fix).

          I imagine most of what is needed here comes out of the box on typical OSX/Linux setup.

          Hmm I don't think pysvn does, nor gnuplot ...

          Michael McCandless added a comment -

          Results from last patch:

                Task    QPS base  StdDev base    QPS bloom  StdDev bloom      Pct diff
                        IntNRQ       11.35        1.27       10.14        0.62  -24% -    6%
                        Fuzzy1      108.52        3.34      101.82        2.90  -11% -    0%
                       Prefix3       64.87        2.17       61.55        1.61  -10% -    0%
                      Wildcard       43.18        1.74       41.33        1.17  -10% -    2%
                        Fuzzy2       41.76        1.40       40.05        1.00   -9% -    1%
                          Term      151.71        4.38      147.24        4.42   -8% -    2%
                      SpanNear        5.23        0.09        5.11        0.12   -6% -    1%
                     OrHighMed       12.60        0.88       12.34        0.48  -11% -    9%
                  SloppyPhrase        8.25        0.20        8.09        0.07   -5% -    1%
                  TermBGroup1M       69.98        0.68       68.80        1.13   -4% -    0%
                    OrHighHigh       10.06        0.66        9.93        0.39  -11% -    9%
                        Phrase       12.73        0.30       12.57        0.35   -6% -    3%
                   TermGroup1M       35.44        0.42       35.08        0.67   -4% -    2%
                    AndHighMed       63.40        2.27       62.90        1.11   -5% -    4%
                       Respell       93.11        3.70       92.81        2.33   -6% -    6%
                TermBGroup1M1P       50.93        1.53       50.96        1.75   -6% -    6%
                   AndHighHigh       15.86        0.71       15.93        0.27   -5% -    6%
                      PKLookup      127.44        2.15      134.85        8.68   -2% -   14%
          

          Looks like FuzzyN/Respell is good again ... PKLookup is a bit faster ... the rest is likely noise.

          Mark Harwood added a comment -

          No problem: I'll run perf test again. It's easy...

          Great, thanks.

          Alas it's not easy ... please report back on how to make it easier to set up!

          My Windows-based woes were:
          1) Had to install python (used 2.7)
          2) Figure out python proxy settings for Wikipedia download
          3) PySVN missing - downloaded install exe but it claimed Python 2.7 wasn't installed/available so gave up and did svn checkout manually
          4) Ran first python test and it aborted with complaint about GnuPlot missing

          I imagine most of what is needed here comes out of the box on typical OSX/Linux setup.

          Michael McCandless added a comment -

          Mike, currently having various issues getting this benchmark framework up and running on my Windows platform here

          Alas it's not easy ... please report back on how to make it easier to set up!

          is it easy for you to kick off another run with the latest patch on your setup?

          No problem: I'll run perf test again. It's easy...

          Mark Harwood added a comment -

          Mike, currently having various issues getting this benchmark framework up and running on my Windows platform here - is it easy for you to kick off another run with the latest patch on your setup? The latest change to the patch shouldn't require an index rebuild from your last run.

          No worries if this is too much hassle for you - I'll probably just try switch to testing on OSX at home.

          Cheers,
          Mark

          Michael McCandless added a comment -

          Didn't get too far with running the Wikipedia perf tests due to missing data file (see http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=7 )

          Woops sorry: I posted a comment there (new, more recent wikipedia export).

          Mark Harwood added a comment -

          New patch with Terms.intersect overridden for faster Fuzzy queries.
          Didn't get too far with running the Wikipedia perf tests due to missing data file (see http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=7 )
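
          The Terms.intersect override itself amounts to a straight hand-off to the wrapped postings format; a sketch of the idea (the delegateTerms field name is illustrative, not necessarily what the patch calls it):

              @Override
              public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throws IOException {
                // Multi-term queries (FuzzyQuery, Respell etc.) drive intersect directly and
                // enumerate many terms; the Bloom filter adds nothing there, so hand straight
                // off to the wrapped Terms instead of paying the wrapping overhead per call.
                return delegateTerms.intersect(compiled, startTerm);
              }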

          Mark Harwood added a comment -

          I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms

          Good catch - a quick test indeed shows a speed up on fuzzy queries.
          I'll prepare a new patch.

          I'm not sure why 3.6+Bloom is faster than 4+Bloom in my tests. I'll take a closer look at your benchmark.

          Michael McCandless added a comment -

          Interesting results, Mike - thanks for taking the time to run them.

          You're welcome!

          BloomFilteredFieldsProducer should just pass through intersect to the delegate?

          I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may end up making a call to TermsEnum.seekExact(..) on a filtered field, I need to have a wrapper object in place which can intercept this call. In all other method invocations I just end up delegating calls, so I wonder if all these extra method calls are the cause of the slowdown you see, e.g. when Fuzzy is enumerating over many terms.
          The only other alternatives to endlessly wrapping in this way are:
          a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this one method.
          b) Mess around with byte-code manipulation techniques to weave in Bloom filtering (the sort of thing I recall Hibernate resorts to)

          Neither of these seem particularly appealing options so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy.

          I think the fix is simple: you are not overriding Terms.intersect now, in BloomFilteredTerms. I think you should override it and immediately delegate, and then FuzzyN/Respell performance should be just as good as the Lucene40 codec.

          For completeness sake - I don't have access to your benchmarking code

          All the benchmarking code is here: http://code.google.com/a/apache-extras.org/p/luceneutil/

          I run it nightly (trunk) and publish the results here: http://people.apache.org/~mikemccand/lucenebench/

          but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment as that's where the Bloom filters get loaded from disk so there's inherent cost there too. I can't imagine this is the case.

          It's only called once on init'ing the SegmentReader (or at least it better be!).

          BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv
          This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I see roughly a 3.5x speed-up in Lucene 4, and Lucene 3.6 gets nearly a 9x speed-up over the comparatively slower 3.6 codebase.

          Nice results! It looks like bloom(3.6) is faster than bloom(4.0)? Why is that...

          Also I wonder why you see such sizable (3.5X speedup) gains on PK lookup but in my benchmark I see only ~13% - 24%. My index has 5 segments per level...

          Mark Harwood added a comment -

          Interesting results, Mike - thanks for taking the time to run them.

          BloomFilteredFieldsProducer should just pass through intersect to the delegate?

          I have tried to make the BloomFilteredFieldsProducer get out of the way of the client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. when the user is safely focused on a non-filtered field. While there is a chance the client may end up making a call to TermsEnum.seekExact(..) on a filtered field, I need to have a wrapper object in place which can intercept this call. In all other method invocations I just end up delegating calls, so I wonder if all these extra method calls are the cause of the slowdown you see, e.g. when Fuzzy is enumerating over many terms.
          The only other alternatives to endlessly wrapping in this way are:
          a) API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out for just this one method.
          b) Mess around with byte-code manipulation techniques to weave in Bloom filtering (the sort of thing I recall Hibernate resorts to)

          Neither of these seem particularly appealing options so I think we may have to live with fuzzy+bloom not being as fast as straight fuzzy.
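
          For readers following along, the wrapping pattern described here boils down to something like the sketch below (these are illustrative stand-in types, not the patch's actual TermsEnum subclass; only the exact-seek path consults the Bloom filter, everything else is pure delegation):

              // Stand-in interfaces for illustration only.
              interface TermLookup {
                boolean seekExact(byte[] term) throws java.io.IOException;
              }

              interface BloomBits {
                boolean mayContain(byte[] term); // false => definitely absent, true => maybe present
              }

              final class BloomFilteredTermLookup implements TermLookup {
                private final TermLookup delegate; // the wrapped postings format's enum
                private final BloomBits bloom;     // bit set loaded from the segment's .blm file

                BloomFilteredTermLookup(TermLookup delegate, BloomBits bloom) {
                  this.delegate = delegate;
                  this.bloom = bloom;
                }

                @Override
                public boolean seekExact(byte[] term) throws java.io.IOException {
                  if (!bloom.mayContain(term)) {
                    return false;                   // fast fail: no disk access at all
                  }
                  return delegate.seekExact(term);  // possible false positive: ask the delegate
                }
              }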

          For completeness sake - I don't have access to your benchmarking code but I would hope that PostingsFormat.fieldsProducer() isn't called more than once for the same segment as that's where the Bloom filters get loaded from disk so there's inherent cost there too. I can't imagine this is the case.

          BTW I've just finished a long-running set of tests which mixes up reads and writes here: http://goo.gl/KJmGv
          This benchmark represents how graph databases such as Neo4j use Lucene for an index when loading (I typically use the Wikipedia links as a test set). I see roughly a 3.5x speed-up in Lucene 4, and Lucene 3.6 gets nearly a 9x speed-up over the comparatively slower 3.6 codebase.

          Michael McCandless added a comment -

          I ran a benchmark on 10 M Wikipedia index; for the factory I used createSetBasedOnMemory and passed it 100 MB; I think that's enough to ensure we get the 10% saturation on save ...:

                Task    QPS base  StdDev base    QPS bloom  StdDev bloom      Pct diff
                        Fuzzy1      102.47        3.67       41.95        0.78  -61% -  -56%
                        Fuzzy2       38.36        1.76       18.68        0.37  -54% -  -47%
                       Respell       89.89        4.38       44.09        0.52  -53% -  -47%
                      Wildcard       40.48        2.82       36.20        0.64  -17% -   -2%
                  SloppyPhrase        7.96        0.28        8.07        0.07   -3% -    5%
                       Prefix3       61.94        5.34       63.35        0.37   -6% -   12%
                  TermBGroup1M       71.37        6.79       73.73        1.55   -7% -   16%
                    AndHighMed       64.09        5.51       66.73        1.75   -6% -   16%
                TermBGroup1M1P       49.55        3.78       51.75        2.67   -7% -   18%
                   AndHighHigh       16.05        1.12       16.77        0.53   -5% -   15%
                   TermGroup1M       35.87        3.07       37.56        0.74   -5% -   16%
                    OrHighHigh        9.60        1.38       10.15        0.65  -13% -   31%
                     OrHighMed       11.93        1.91       12.63        0.93  -15% -   35%
                        IntNRQ        9.12        1.25        9.68        0.11   -7% -   24%
                          Term      154.55       19.60      165.32        0.97   -5% -   23%
                        Phrase       11.40        0.33       12.21        0.18    2% -   11%
                      SpanNear        4.31        0.07        4.73        0.03    7% -   12%
                      PKLookup      122.78        1.42      145.95        5.22   13% -   24%
          

          Baseline is Lucene40 PostingsFormat even for the id field ... so PKLookup gets a good improvement. This is on an index w/ 5 segments at each level.

          Other queries seem to speed up as well (eg Term, Or*).

          The queries that rely on Terms.intersect got much worse: shouldn't BloomFilteredFieldsProducer just pass through intersect to the delegate?

          Mark Harwood added a comment -

          Benchmark tool adapted from Mike's original Pulsing codec benchmark. Now includes Bloom postings example.

          Mark Harwood added a comment -

          Updated as follows:

          • Extracted Bloom filter functionality as new oal.util.FuzzySet class - the name is changed because Bloom filtering is one application for a FuzzySet, fuzzy count distincts being another.
          • BloomFilterPostingsFormat now takes a factory that can tailor the choice of BloomFilter per field (bitset size/saturation settings and choice of hash algo). Provided a default factory implementation.
          • All unit tests pass now that I have a test PostingsFormat class that uses very small bitsets, where before the many-field unit tests would cause OOM.

          Will follow up with benchmarks when I have more time to run and document them. Initial results from my large-scale tests on growing indexes show a nice flat line in the face of a growing index whereas a non-Bloomed index saw-tooths upwards as segments grow/merge.

          Mark Harwood added a comment -

          I've thought some more about option 2 (PerFieldPF reusing wrapped PFs) and it looks to get very ugly very quickly.
          There's only so much PerFieldPF can do to rationalize a random jumble of PF instances presented to it by clients. I think the right place to draw the line is LUCENE-4093, i.e. a simple .equals() comparison on top-level PFs to eliminate any duplicates. Any other approach that also tries to de-dup nested PFs looks to be adding a lot of complexity, especially when you consider what that does to the model of read-time object instantiation. This would be significant added complexity to solve a problem you have already suggested is insignificant (i.e. too many files doesn't really matter when using CFS).

          I can remove the per-field stuff from BloomPF if you want but I imagine I will routinely subclass it to add this optimisation back in to my apps.

          Simon Willnauer added a comment - edited

          This seemed relevant given the earlier comments about Solr's use of non-compound files:

          We can't make "wrong" decisions just because higher level apps make "wrong" decisions. The dependency goes Solr -> Lucene, not the other way around. We provide fine-grained control over when to use CFS, i.e. for smallish segments etc. If you have hundreds of fields all using different PFs you have to deal with tons of files, but that is, to be honest, not very likely to be the common case.

          Create a PerFieldPF implementation that reuses wrapped PFs using some generic means of discovering recyclable wrapped PFs (i.e. go further than what LUCENE-4093 currently proposes in adding .equals support)

          I think we should investigate this further. Let's keep this issue simple, remove the field handling, and fix this on a higher level!

          Mark Harwood added a comment -

          why is this a speed improvement?

          Sorry - misleading. Replace the word "faster" in my comment with "better" and that makes more sense - I mean better in terms of resource usage and reduced open file handles. This seemed relevant given the earlier comments about Solr's use of non-compound files:

          [Solr] create massive amounts of files if we did so (add to the fact it disables compound files by default and its a disaster...)

          I can see there is a useful simplification being sought here if PerFieldPF can consider each of the unique top-level PFs presented to it as looking after an exclusive set of files. As the centralised allocator of file names it can then simply call each unique PF with a choice of segment suffix to name its various files without conflicting with other PFs. LUCENE-4093 is all about better determining which PF is unique using .equals(). Unfortunately I don't think this approach goes far enough. In order to avoid allocating unnecessary file names, PerFieldPF would have to further understand the nuances of which PFs were being wrapped by other PFs and which wrapped PFs would be reusable outside of their wrapping PF (as is the case with BloomPF's wrapped PF). That seems a more complex task than implementing equals().

          So it seems we have 3 options:
          1) Ignore the problems of creating too many files in the case of BloomPF and any other examples of "wrapping" PFs
          2) Create a PerFieldPF implementation that reuses wrapped PFs using some generic means of discovering recyclable wrapped PFs (i.e. go further than what LUCENE-4093 currently proposes in adding .equals support)
          3) Retain my BloomPF-specific solution to the problem for those prepared to use lower-level APIs.

          Am I missing any other options and which one do you want to go for?

          Robert Muir added a comment -

          For me this falls into one of the many faster-if-you-know-about-it optimisations like FieldSelectors or recycling certain objects. Basically a useful hint to Lucene to save some extra effort but one which you don't need to use.

          I agree with Simon, it's not going to be faster.

          Worse, it creates a situation from the per-field perspective where multiple postings formats are sharing the same files for a segment.

          This would make it harder to do things like refactorings of codec apis in the future.

          So where is the benefit?

          Simon Willnauer added a comment -

          I really don't think we should make this complex to save 2 or 3 files total (even in a complex config with many fields). It's not worth the complexity.

          I agree. I think those postings formats should only deal with encoding and not with handling certain fields differently. A user / app should handle this in the codec. Ideally you don't have any conditions in the relevant methods like termsConsumer etc.

          For me this falls into one of the many faster-if-you-know-about-it optimisations like FieldSelectors or recycling certain objects. Basically a useful hint to Lucene to save some extra effort but one which you don't need to use.

          Why is this a speed improvement? Reading from one file vs. multiple is not really faster though.

          Anyway, I think we should make this patch as simple as possible and don't handle fields in the PF. We can still open another issue or wait until LUCENE-4093 is in to discuss this issue?

          Mark Harwood added a comment -

          It's not worth the complexity

          There's no real added complexity in BloomFilterPostingsFormat - it has to be capable of storing blooms for >1 field anyway and using the fieldname set is roughly 2 extra lines of code to see if a TermsConsumer needs wrapping or not.

          From a client side you don't have to use this feature - the fieldname set can be null in which case it will wrap all fields sent its way. If you do choose to supply a set the wrapped PostingsFormat will have the advantage of being shared for bloomed and non-bloomed fields. We could add a constructor that removes the set and mark the others "expert".

          For me this falls into one of the many faster-if-you-know-about-it optimisations like FieldSelectors or recycling certain objects. Basically a useful hint to Lucene to save some extra effort but one which you don't need to use.

          LUCENE-4093 may in future resolve the multi-file issue but I'm not sure it will do so without significant complication.

          Robert Muir added a comment -

          Contained but not optimal - roughly double the number of required files if I want the common case of a primary key indexed with Bloom.

          Then use CFS, it's optimal always (1).

          I really don't think we should make this complex to save 2 or 3 files total (even in a complex config with many fields). It's not worth the complexity.

          Mark Harwood added a comment -

          It's true to say that Bloom is a different case to Pulsing - Bloom does not interfere in any way with the normal recording of content in the wrapped delegate whereas Pulsing does.
          It may prove useful for us to mark a formal distinction between these mutating/non-mutating types so we can treat them differently and provide optimisations?

          And separately, you can always contain the number of files even today by using only unique instances yourself when writing

          Contained but not optimal - roughly double the number of required files if I want the common case of a primary key indexed with Bloom. I can't see a way of indexing with Bloom-plus-Lucene40 on field "A" and indexing with just Lucene40 on fields B, C and D and winding up with only one Lucene40 set of files with a common segment suffix. The one way I did find of achieving this was to add a "bloomFilteredFields" set into my single Bloom+Lucene40 instance used for all fields. Is there any other option here currently?

          Looking to the future, LUCENE-4093 may have more scope for optimising if it understands the distinction between mutating wrappers and non-mutating ones and how they are composed?

          Robert Muir added a comment -

          And separately, you can always contain the number of files even today by:

          • using only unique instances yourself when writing (rather than waiting on LUCENE-4093)
          • using the compound file format.

          The purpose of LUCENE-4093 is just to make this simpler, but I opened it as a separate
          issue because its really solely an optimization, and only for a pretty rare case where
          people are customizing the index format for different fields.

          Robert Muir added a comment -

          but overlapping choices of fields e.g we want a single Lucene40 to process all fields but we want Bloom to handle only a subset of these fields.

          That's not true: I disagree. It's an implementation detail that Bloom as a postingsformat wraps another one (that's just the abstract implementation), and the file formats should not expose this in general for any format.

          This is true for a number of reasons: e.g. in the pulsing case the wrapped writer only gets a subset of the postings: therefore the wrapped writer's files are "incomplete" and an implementation detail.

          It's enough here that if you have 5 fields, 2 bloom and 3 not, we detect there are only two postings formats in use, regardless of whether you have 2 or 5 actual object instances.

          Mark Harwood added a comment -

          To solve what you speak of we just need to resolve LUCENE-4093.

          Presumably the main objective here is that in order to cut down on the number of files we store, content consumers of various types should aim to consolidate multiple fields' contents into a single file (if they share common config choices).

          Then multiple postings format instances that are 'the same' will be deduplicated correctly.

          The complication in this case is that we essentially have 2 consumers (Bloom and Lucene40), one wrapped in the other with different but overlapping choices of fields e.g we want a single Lucene40 to process all fields but we want Bloom to handle only a subset of these fields. This will be a tough one for PFPF to untangle while we are stuck with a delegating model for composing consumers.

          This may be made easier if instead of delegating a single stream we have a stream-splitting capability via a multicast subscription e.g. Bloom filtering consumer registers interest in content streams for fields A and B while Lucene40 is consolidating content from fields A, B, C and D. A broadcast mechanism feeds each consumer a copy of the relevant stream and each consumer is responsible for inventing their own file-naming convention that avoids muddling files.

          While that may help for writing streams it doesn't solve the re-assembly of "producer" streams at read-time where BloomFilter absolutely has to position itself in front of the standard Lucene40 producer in order to offer fast-fail lookups.

          In the absence of a fancy optimised routing mechanism (this all may be overkill) my current solution was to put BloomFilter in the delegate chain armed with a subset of fieldnames to observe as a larger array of fields flow past to a common delegate. I added some Javadocs to describe the need to do it this way for an efficient configuration.
          You are right that this is messy (ie open to bad configuration) but operating this deep down in Lucene that's always a possibility regardless of what we put in place.

          Robert Muir added a comment -

          That would be inefficient because your PFPF will see BloomFilteringPostingsFormat(field1 + Lucene40) and BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different PostingsFormat instances and consequently create multiple files named differently because it assumes these instances may be capable of using radically different file structures.

          But adding per-field handling here is not the way to solve this: it's messy.

          Per-Field handling should all be handled at a level above in PerFieldPostingsFormat.

          To solve what you speak of we just need to resolve LUCENE-4093. Then multiple postings format instances that are 'the same' will be deduplicated correctly.

          Mark Harwood added a comment -

          I don't understand why this handles fields. Someone should just pick that with perfieldpostingsformat.

          That would be inefficient because your PFPF will see BloomFilteringPostingsFormat(field1 + Lucene40) and BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different PostingsFormat instances and consequently create multiple files named differently because it assumes these instances may be capable of using radically different file structures.
          In reality, the choice of BloomFilter with field 1 or BloomFilter with field 2 or indeed no BloomFilter does not fundamentally alter the underlying delegate PostingsFormat's file format - it only adds a supplementary "blm" file on the side with the field summaries. For this reason it is a mistake to configure separate BloomFilterPostingsFormat instances on a per-field basis if they can share a common delegate.

          Mark Harwood added a comment -

          An alternative would be to just pick this less often in RandomCodec: see the SimpleText hack

          Another option might be to make the TestBloomFilteredLucene40Postings pick a ludicrously small Bitset sizing option for each field so that we can accommodate tests that create silly numbers of fields. The bitsets being so small will just quickly reach saturation and force all reads to hit the underlying FieldsProducer.

          Robert Muir added a comment -

          I don't understand why this handles fields. Someone should just pick that with perfieldpostingsformat.

          So you have the abstract wrapper(takes the wrapped postings format, and a String name), not registered.
          And you have a concrete impl registered that is just abstractWrapper(lucene40, "Bloom40"): done.
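          A minimal sketch of that pattern might look like the following (the class name and the abstract wrapper's constructor signature are illustrative only, taken from Rob's description rather than the patch):

          Bloom40PostingsFormat.java
            // Hypothetical concrete subclass: binds the abstract bloom wrapper to Lucene40 under a fixed, registerable name.
            public final class Bloom40PostingsFormat extends BloomFilteringPostingsFormat {
              public Bloom40PostingsFormat() {
                super("Bloom40", new Lucene40PostingsFormat());
              }
            }

          The concrete class is then what gets listed in the SPI registration file, so the index can be decoded by name at read time.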

          Mark Harwood added a comment -

          Instead I think the concrete Bloom+Lucene40 that you have in tests should be moved into src/java and registered there

          What problem would that be trying to solve? Registration (or creation) of any BloomFilteringPostingsFormat subclasses is not necessary to decode index contents. Offering a "Bloom40" would only buy users a pairing of Lucene40Postings and Bloom filtering but they would still have to declare which fields they want Bloom filtering on at write time. This isn't too hard using the code in the existing patch:

          ThisWorks.java
                  final Set<String>bloomFilteredFields=new HashSet<String>();
                  bloomFilteredFields.add(PRIMARY_KEY_FIELD_NAME);
          
                  iwc.setCodec(new Lucene40Codec(){
                    BloomFilteringPostingsFormat postingOptions=new BloomFilteringPostingsFormat(new Lucene40PostingsFormat(), bloomFilteredFields);
                    @Override
                    public PostingsFormat getPostingsFormatForField(String field) {
                      return postingOptions;
                    }          
                  });
          

          No extra subclasses/registration required here to read the index built with the above setup.

          Robert Muir added a comment -

          Seeing the tests in question though, i dont think you want to disable this for these entire test classes.

          We don't have a way to disable this on a per-method basis, and I think it's generally not possible because
          many classes create indexes in @BeforeClass etc.

          An alternative would be to just pick this less often in RandomCodec: see the SimpleText hack

          Robert Muir added a comment -

          I don't think the abstract class should be registered in the SPI.

          Instead I think the concrete Bloom+Lucene40 that you have in tests should be moved into src/java and registered there, just call it Bloom40 or something. The abstract API is still available for someone that wants to do something more specialized.

          This is just like how pulsing (another wrapper) is implemented.

          As far as disabling this for certain tests, import o.a.l.util.LuceneTestCase.SuppressCodecs and put something like this at class level:

          @SuppressCodecs("Bloom40")
          public class TestFoo...
          
          @SuppressCodecs({"Bloom40", "Memory"})
          public class TestBar...
          

          The strings in here can be codecs or postings formats

          Mark Harwood added a comment -

          Added missing class

          Mark Harwood added a comment -

          This is looking more promising.

          Running "ant test-core -Dtests.postingsformat=TestBloomFilteredLucene40Postings" now passes all tests but causes OOM exception on 3 tests:

          • TestConsistentFieldNumbers.testManyFields
          • TestIndexableField.testArbitraryFields
          • TestIndexWriter.testManyFields

          Any pointers on how to annotate or otherwise avoid the BloomFilter class for "many-field" tests would be welcome. These are not realistic tests for this class (we don't expect indexes with 100s of primary-key like fields).

          In this patch I've

          • added an SPI lookup mechanism for pluggable hash algos.
          • documented the file format
          • fixed issues with TermVector tests
          • changed the API

          To use:
          BloomFilteringPostingsFormat now takes a delegate PostingsFormat and a set of field names that are to have bloom filters created.
          Fields that are not listed in the filter set can be safely indexed as per normal and doing so is beneficial because it allows filtered and non filtered field data to co-exist in the same physical files created by the delegate PostingsFormat.

          Mark Harwood added a comment -

          When I run all tests with the bloom 4.0 postings format (ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat),

          Thanks for the pointer on targeting codec testing. I have another patch to upload with various tweaks e.g. configurable choice of hash functions, RandomCodec additions so I will concentrate testing on that before uploading.

          Robert Muir added a comment -

          Right, it's not bad. Separately I'm not happy that it's a little trappy in situations like your ThisToo.java:

          it's certainly easy and safe to continue to use IdentityHashMap but I think we will need to require equals/hashCode on postings formats.

          Otherwise it's going to make the situation trappy in the future, e.g. we will want to add mechanisms to Solr so that you can specify these parameters
          in the schema.xml (today you cannot), and this would create massive amounts of files if we did so (add to the fact it disables compound files by default
          and it's a disaster...)
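          To make that concrete, requiring equals/hashCode would mean a wrapping format defines equality over its configuration, so PerFieldPostingsFormat could deduplicate instances that are 'the same' - a rough sketch only, with hypothetical field names:

            // Sketch: equality based on the wrapped delegate and the bloom-filtered field set.
            @Override
            public boolean equals(Object other) {
              if (this == other) return true;
              if (!(other instanceof BloomFilteringPostingsFormat)) return false;
              BloomFilteringPostingsFormat that = (BloomFilteringPostingsFormat) other;
              return delegate.equals(that.delegate) && bloomFilteredFields.equals(that.bloomFilteredFields);
            }

            @Override
            public int hashCode() {
              return 31 * delegate.hashCode() + bloomFilteredFields.hashCode();
            }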

          Mark Harwood added a comment -

          it's just an issue with PerFieldPostingsFormat

          OK, thanks. My guess is you'll effectively have to supplement postingsformat.getName() with an object-instance ID in file names.

          Robert Muir added a comment -

          Mark: I opened LUCENE-4090.. I broke it, so I'll fix it

          Michael McCandless added a comment -

          When I run all tests with the bloom 4.0 postings format (ant test-core -Dtests.postingsformat=BloomFilteredLucene40PostingsFormat), I see a lot of failures..., eg lots of CheckIndex failures like:

             [junit4]   1> java.lang.RuntimeException: vector term=[e4 b8 80 74 77 6f] field=textField2Utf8 does not exist in postings; doc=0
             [junit4]   1> 	at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1440)
             [junit4]   1> 	at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:575)
             [junit4]   1> 	at org.apache.lucene.util._TestUtil.checkIndex(_TestUtil.java:186)
             [junit4]   1> 	at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:587)
             [junit4]   1> 	at org.apache.lucene.index.TestFieldsReader.afterClass(TestFieldsReader.java:72)
             [junit4]   1> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
             [junit4]   1> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          

          But also other spooky failures. Are these known issues?

          Robert Muir added a comment -

          As far as your problem with parameters, this is actually just a limitation I created (by accident) of PerFieldPostingsFormat
          on LUCENE-4055. Because we use the postingsformat name as the segment suffix, it's not possible to have e.g. an id field with Pulsing(1)
          and a date field with Pulsing(2).

          I'll open an issue and fix this, again it's just an issue with PerFieldPostingsFormat. You should be able to use the same name here
          with different configs, as long as your PostingsFormat writes what it needs to read the files. Thanks for bringing this up.

          Mark Harwood added a comment -

          This fails if you add docs with title and text fields:

          ThisCrashes.java
                Codec fooCodec=new Lucene40Codec() {
                  @Override
                  public PostingsFormat getPostingsFormatForField(String field) {
                    if ("text".equals(field)) {
                      return new MemoryPostingsFormat(false);
                    }
                    if ("title".equals(field)) {
                      return new MemoryPostingsFormat(true);
                    }          
                    else {
                      return super.getPostingsFormatForField(field);
                    }
                  } 
                };
          

          Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs_2_Memory.ram

          This also fails:

          ThisToo.java
                 Codec fooCodec=new Lucene40Codec() {
                  SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
                  @Override
                  public PostingsFormat getPostingsFormatForField(String field) {
                    if ("text".equals(field)) {
                      return new SimpleTextPostingsFormat();
                    }
                    if ("title".equals(field)) {
                      return new SimpleTextPostingsFormat();
                    }          
                    else {
                      return super.getPostingsFormatForField(field);
                    }
                  } 
                };
          

          with
          Exception in thread "main" java.io.IOException: Cannot overwrite: C:\temp\luceneCodecs_1_SimpleText.pst

          Whereas sharing the same instance of a PostingsFormat class across fields works:

          ThisWorks.java
                Codec fooCodec=new Lucene40Codec() {
                  SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
                  @Override
                  public PostingsFormat getPostingsFormatForField(String field) {
                    if (("text".equals(field))|| ("title".equals(field))) {
                      return theSimple;
                    }          
                    else {
                      return super.getPostingsFormatForField(field);
                    }
                  } 
                };
          
          Robert Muir added a comment -

          I'm not sure this is true: e.g. if your postings format requires parameters to decode the segment, then this enforces that it records said parameters,
          e.g. Pulsing records these parameters.

          Codec parameters are at index-time, at read-time it's your responsibility to be able to decode them solely from the index (this enforces that there doesn't need
          to be a crazy matching of user configuration at write and read time).

          Mark Harwood added a comment -

          Aaaargh. Unless I've missed something, I have concerns with the fundamental design of the current Codec loading mechanism.

          It seems too tied to the concept of a ServiceProvider class-loading mechanism, forcing users to write new SPI-registered classes in order to simply declare what amount to index schema configuration choices.

          Example: If I take Rob's sample Codec above and choose to use a subtly different configuration of the same PostingsFormat class for different fields it breaks:

          ThisBreaks.java
                Codec fooCodec=new Lucene40Codec() {
                  @Override
                  public PostingsFormat getPostingsFormatForField(String field) {
                    if ("text".equals(field)) {
                      return new FooPostingsFormat(1);
                    }
                    if ("title".equals(field)) {
                      //same impl as "text" field, different constructor settings        
                      return new FooPostingsFormat(2);
                    }          
                    return super.getPostingsFormatForField(field);
                  } 
                };
          

          This causes a file overwrite error as PerFieldPostingsFormat uses the same name from FooPostingsFormat(1) and FooPostingsFormat(2) to create files.
          In order to safely make use of differently configured choices of the same PostingsFormat we are forced to declare a brand new subclass with a unique new service name and entry in the service provider registration. This is essentially where I have got to in trying to integrate this Bloom filtering logic.

          This dependency on writing custom classes seems to make everything a bit fragile, no? What hope has Luke got in opening the average index without careful assembly of classpaths etc?
          If I contrast this with the world of database schemas it seems absurd to have a reliance on writing custom classes with no behaviour simply in order to preserve a configuration of an application's schema settings. Even an IOC container with XML declarations would offer a more agile means of assembling pre-configured beans rather than relying on a Service Provider mechanism that is only serving as a registry of classes.

          Anyone else see this as a major pain?

          Mark Harwood added a comment -

          Thanks for the comment, Rob.
          While the choice of Codec can be an anonymous inner class, resolving the choice of PostingsFormat is trickier.
          BloomFilterPostingsFormat is now intended to wrap any other choice of PostingsFormat and Simon has suggested leaving the Bloom support purely abstract.
          However, as an end user if I want to use Bloom support on the standard Lucene codec I would then have to write one of these:

          MyBloomFilteredLucene40Postings.java
          public class MyBloomFilteredLucene40Postings extends BloomFilteringPostingsFormatBase {
          
            public MyBloomFilteredLucene40Postings() {
              //declare my choice of PostingsFormat to be wrapped and provide a unique name for this combo of Bloom-plus-delegate
              super("myBL40", new Lucene40PostingsFormat());
            }
          }
          

          The resulting index files are then named [segname]_myBL40.[filetypesuffix].
          At read-time the "myBL40" bit of the filename is used to lookup via Service Provider registrations the decoding class so "com.xx.MyBloomFilteredLucene40Postings" would need adding to a o.a.l.codecs.PostingsFormat file for the registration to work.

          I imagine Bloom-plus-Lucene40Postings would be a common combo and if both are in core it would be annoying to have to code support for this in each app or for things like Luke to have to have classpaths redefined to access the app-specific class that was created purely to bind this combo of core components.

          I think a better option might be to change the Bloom filtering base class to record the choice of delegate PostingsFormat in its own "blm" file at write-time and instantiate the appropriate delegate instance at read-time using the recorded name. The BloomFilteringBaseClass would need changing to a final class rather than an abstract one so that core Lucene could load it as the handler for [segname]_BloomPosting.xxx files and it would then have to examine the [segname].blm file to discover and instantiate the choice of delegate PostingsFormat using the standard service registration mechanism. At write-time clients would need to instantiate the BloomFilterPostingsFormat, passing a choice of PostingsFormat delegate to the constructor. At read-time Lucene core would invoke a zero-arg constructor.
          I'll look into this as an approach.
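          A rough sketch of that approach, assuming the "blm" header has room for the delegate's name and the delegate is resolvable through the normal SPI lookup (variable names are illustrative):

            // Write-time: the wrapper records which PostingsFormat produced the postings files.
            blmOutput.writeString(delegatePostingsFormat.getName());

            // Read-time: the zero-arg-constructed bloom format rediscovers and opens the delegate.
            PostingsFormat delegate = PostingsFormat.forName(blmInput.readString());
            FieldsProducer delegateProducer = delegate.fieldsProducer(readState);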

          Robert Muir added a comment -

          I was trying to limit the extensions apps would have to write to use this service (1 custom postings format subclass, 1 custom Codec subclass and 1 custom SPI config file)

          As far as Codec goes, we don't need to provide one for specialized per-field PostingsFormats.

          Users can just use an anonymous Lucene40Codec, which already has a hook for doing stuff like this (a protected method you can override to changeup postings format per-field).

          example that sets SimpleText on just the "id" field:

              indexWriterConfig.setCodec(new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                  if ("id".equals(field)) {
                    return new SimpleTextPostingsFormat();
                  } else {
                    return super.getPostingsFormatForField(field);
                  }
                } 
              });
          

          this is only necessary to set on the indexwriterconfig for writing: PerFieldPostingsFormat records for
          that id field that it was encoded with "SimpleText"

          when the segment is opened, assuming "SimpleText" is in the classpath and registered, it will just work.

          From the Solr side, they don't need to set any codec either, as its schema already uses this hook, so there it's just:

          <fieldType name="string_simpletext" class="solr.StrField" postingsFormat="SimpleText"/>
          
          Mark Harwood added a comment -

          I think you should really not provide any field handling at all.

          By that I think you mean keep the abstract BloomFilteringPostingsFormatBase and dispense with the BloomFilteringLucene40Codec (and BloomFilteredLucene40PostingsFormat?). I was trying to limit the extensions apps would have to write to use this service (1 custom postings format subclass, 1 custom Codec subclass and 1 custom SPI config file) but I can see that equally we shouldn't offer implementations for all the many different service permutations.

          I'll look at adding something to RandomCodec for Bloom-plus-random delegate PostingsFormat.

          I am still worried about the TermsEnum reuse code, are you planning to look into this?

          you keep on switching back and forth creating new delegated TermsEnum instances

          I'm not sure what you mean by creating new delegated TermsEnum instances?
          In my impl of "iterator(TermsEnum reuse)" I take care to unwrap my wrapper for TermsEnum to find the original delegate's TermsEnum and then call the delegateTerms iterator method with this object as the reuse parameter. At this stage shouldn't the delegate Terms just recycle that unwrapped TermsEnum as per the normal reuse contract when no wrapping has been done?

          you should also add license headers to the files you are adding

          Will do.

          Simon Willnauer added a comment - - edited

          hey mark,

          I think you should really not provide any field handling at all. This should be done in user code, i.e. a general codec/postings format per field. What would be very useful here is integration into RandomCodec so we can randomly swap it in. Same is true for the hash algos & memory settings: I think they should be part of the abstract class ctor, and if a user wants to use their own algo they should write their own codec, i.e. a subclass. This stuff is very expert and apps like Solr can handle this on top.

          I am still worried about the TermsEnum reuse code, are you planning to look into this?

          you should also add license headers to the files you are adding, note that we don't do javadoc license headers anymore so they should just be block comments.

          Mark Harwood added a comment -

          Updated to work with trunk.

          • Changed to use FixedBitSet
          • Is now a PostingsFormat abstract base class
          • Added missing MurmurHash class

          TODOs

          • Move Bloom filter logic to common utils classes
          • Use Service Providers for pluggable choice of hash algos?
          • Expose settings for memory/saturation
          Michael McCandless added a comment -

          It looks like MurmurHash.java is missing from the patch?

          Also, with LUCENE-4055 landing it looks like the patch no longer compiles (sorry!)...

          Simon Willnauer added a comment -

          Thanks for the tips for making the patch more generic. I'll get on it tomorrow and change to FixedBitSet while I'm at it.

          YW! A couple of things regarding your patch. I think you should write a header for the bloomfilter file using CodecUtils.writeHeader() to be consistent. Once you are ready you should also look at the other *PostingsFormat impls to see how we document the file formats there. We try to have versioned file formats instead of the monolithic one on the website. I saw some minor things like

          finally {
            if (output != null)  {
              output.close();
            }
          }
          

          you can use IOUtils.close*() for that, which handles some exception cases etc.
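          For illustration, the IOUtils variants look roughly like this:

            // Success path: close and rethrow the first exception encountered.
            IOUtils.close(output);
            // Error path: close without masking an exception that is already in flight.
            IOUtils.closeWhileHandlingException(output);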

          I briefly looked at your reuse code for TermsEnum and it seems it has the same issues as pulsing had (solved). When you get a TermsEnum that is in fact a Bloomfilter wrapper but wraps another codec underneath, you keep on switching back and forth creating new delegated TermsEnum instances. You can look at PulsingPostingsReader and in particular at PulsingEnumAttributeImpl to see how we solved that in Pulsing.

          Mark Harwood added a comment -

          I wonder if that also helps indexing in terms of applying deletes. Did you test that

          I've not looked into that particularly but I imagine this may be relevant.

          Thanks for the tips for making the patch more generic. I'll get on it tomorrow and change to FixedBitSet while I'm at it.

          I also wonder if we can extract a "bloomfilter" class into utils

          There's some reusable stuff in this patch for downsizing the bitset (according to desired saturation levels) once it has accumulated a stream of values.
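          One common way to do that kind of downsizing (not necessarily exactly what the patch does) is to fold a power-of-two-sized bitset in half while it is still well under the target saturation, since position i and position i + size/2 collapse onto the same bit after halving:

            // Sketch: shrink an under-saturated bitset by OR-ing its two halves together.
            static FixedBitSet downsize(FixedBitSet bits, float targetSaturation) {
              FixedBitSet current = bits;
              while (current.length() > 1
                  && current.cardinality() / (float) current.length() < targetSaturation / 2) {
                int half = current.length() / 2;
                FixedBitSet smaller = new FixedBitSet(half);
                for (int i = 0; i < current.length(); i++) {
                  if (current.get(i)) {
                    smaller.set(i % half);
                  }
                }
                current = smaller;
              }
              return current;
            }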

          Simon Willnauer added a comment -

          hey mark, this looks very interesting. I wonder if that also helps indexing in terms of applying deletes. Did you test that at all or is that something you are looking into as well? For deletes this could be very handy since bloom filters should help a lot when a given key is likely to be in only one segment.

          Regarding the code, I think BloomFilter should be like Pulsing and only be a PostingsFormat. You should make the BloomFilterPostingsFormat abstract and pass in the "wrapped" postings format, delegating to it independent of the field. PostingsFormat is per field and is resolved by the codec. An actual implementation should subclass the abstract BloomFilterPostingsFormat; that way you have a fixed PostingsFormat per "delegate" and don't need to deal with "which field got which delegate" bla bla.

          We should not have a Bloomfilter codec but rather swap in the bloomfilter postings format in test via random codec. I also wonder if we can extract a "bloomfilter" class into utils that builds an BloomFilter this could be useful elsewhere.

          Mark Harwood added a comment -

          Fixed the issue with >1 field in an index.
          Tests on random lookups on Wikipedia titles (unique keys) now show a 3x speed-up for a Bloom-filtered index over the standard 4.0 Codec for fully warmed indexes.

          Mark Harwood added a comment -

          Update- I've discovered this Bloom Filter Codec currently has a bug where it doesn't handle indexes with >1 field.
          It's probably all tangled up in the "PerField..." codec logic so I need to do some more digging.

          Mark Harwood added a comment -

          My current focus is speeding up primary key lookups but this may have applications outside of that (Zipf tells us there is a lot of low frequency stuff in free text).

          Following the principle that the best IO is no IO, the Bloom Filter helps us quickly understand which segments to even bother looking in. That has to be a win overall.

          I started trying to write this Codec as a wrapper for any other Codec (it simply listens to a stream of terms and stores a bitset of recorded hashes in a ".blm" file). However that was trickier than I expected - I'd need to encode a special entry in my blm files just to know the name of the delegated codec I needed to instantiate at read-time because Lucene's normal Codec-instantiation logic would be looking for "BloomCodec" and I'd have to discover the delegate that was used to write all of the non-blm files.

          Not looked at FixedBitSet but I imagine that could be used instead.
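          As a stripped-down illustration of the record/fast-fail idea (the patch hashes terms with MurmurHash; BytesRef.hashCode() is used below purely to keep the sketch short):

            // Record each indexed term in a bitset sized for the expected number of unique terms.
            FixedBitSet bloom = new FixedBitSet(1 << 20);

            void recordTerm(BytesRef term) {
              int h = term.hashCode() & 0x7fffffff;   // stand-in for the real hash function
              bloom.set(h % bloom.length());
            }

            // A clear bit proves the term is absent from this segment; a set bit means "maybe",
            // so the caller falls through to the delegate's real terms dictionary.
            boolean mightContain(BytesRef term) {
              int h = term.hashCode() & 0x7fffffff;
              return bloom.get(h % bloom.length());
            }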

          Michael McCandless added a comment -

          I see a 35% improvement over standard Codecs on random lookups on a warmed index.

          Impressive! This is for primary key lookups?

          It looks like the primary keys are GUID-like, right? (i.e. randomly generated). I wonder how the results would look if they had some structure instead (e.g. '%09d' % (id++))...

          I also notice that the PulsingCodec is no longer faster than standard Codec - is this news to people as I thought it was supposed to be the way forward?

          That's baffling to me: it should only save seeks vs Lucene40 codec, so on a cold index you should see substantial gains, and on a warm index I'd still expect some gains. Not sure what's up...

          I can open a separate JIRA issue for this 4.0 version of the code if that makes more sense.

          I think it's fine to do it here? Really 3.6.x is only for bug fixes now ... so I think we should commit this to trunk.

          I wonder if you can wrap any other PostingsFormat (ie instead of hard-coding to Lucene40PostingsFormat)? This way users can wrap any PF they have w/ the bloom filter...

          Can you use FixedBitSet instead of OpenBitSet? Or is there a reason to use OpenBitSet here...?
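
          For illustration, a minimal sketch of the bloom filter core being discussed, built on FixedBitSet as suggested above. The single FNV-1a style hash and the modulo-sized bitset are assumptions for this sketch, not the hashing used in the patch:

          import org.apache.lucene.util.BytesRef;
          import org.apache.lucene.util.FixedBitSet;

          final class TermBloomFilter {

            private final FixedBitSet bits;

            TermBloomFilter(int numBits) {
              this.bits = new FixedBitSet(numBits);
            }

            // Record a term at index time.
            void add(BytesRef term) {
              bits.set(hash(term) % bits.length());
            }

            // At search time: false means the term is definitely absent from this segment,
            // so the delegate's term dictionary never needs to be consulted.
            boolean mightContain(BytesRef term) {
              return bits.get(hash(term) % bits.length());
            }

            private static int hash(BytesRef term) {
              int h = 0x811c9dc5;                       // FNV-1a offset basis
              for (int i = term.offset; i < term.offset + term.length; i++) {
                h ^= term.bytes[i] & 0xff;
                h *= 0x01000193;                        // FNV prime
              }
              return h & 0x7fffffff;                    // keep the index non-negative
            }
          }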

          Mark Harwood added a comment -

          I've ported this Bloom Filtering code to work as a 4.0 Codec now.
          I see a 35% improvement over standard Codecs on random lookups on a warmed index.

          I also notice that the PulsingCodec is no longer faster than standard Codec - is this news to people as I thought it was supposed to be the way forward?

          My test rig (adapted from Mike's original primary key test rig here http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html) is attached as a zip.
          The new BloomFilteringCodec is also attached here as a patch.

          Searches against plain text fields also look to be faster (using AOL500k queries searching Wikipedia English) but obviously that particular test rig is harder to include as an attachment here.

          I can open a separate JIRA issue for this 4.0 version of the code if that makes more sense.

          Mark Harwood added a comment -

          Initial patch


            People

            • Assignee:
              Mark Harwood
            • Reporter:
              Mark Harwood
            • Votes:
              4
            • Watchers:
              10
