Lucene - Core
  LUCENE-550

InstantiatedIndex - faster but memory consuming index

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: core/store
    • Labels: None
    • Lucene Fields: Patch Available

      Description

      Represented as a coupled graph of class instances, this all-in-memory index store implementation delivers search results up to 100 times faster than the file-centric RAMDirectory, at the cost of greater RAM consumption.

      Performance seems to be a little bit better than log2(n) (binary search). I have no real data on that, just what I have seen by eye.

      Populated with a single document InstantiatedIndex is almost, but not quite, as fast as MemoryIndex.

      At 20,000 documents of 10-50 characters length, InstantiatedIndex outperforms RAMDirectory some 30x,
      15x at 100 documents of 2000 characters length,
      and is linear to RAMDirectory at 10,000 documents of 2000 characters length.

      Mileage may vary depending on term saturation.
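
As a rough illustration of what "a coupled graph of class instances" can mean, here is a minimal sketch in which each term holds direct references to its posting objects, so a lookup walks object references instead of decoding bytes from a file-like store. The class and method names (ObjectGraphIndexSketch, Posting, addDocument, docsFor) are hypothetical and are not taken from the patch; the whitespace tokenizer is likewise an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of an object-graph index; not the classes from the patch.
class ObjectGraphIndexSketch {

    /** One (term, document) association with the token positions inside that document. */
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }
    }

    // Term text -> postings ordered by document number (documents arrive in increasing order).
    private final Map<String, List<Posting>> postingsByTerm = new TreeMap<>();
    private int nextDocId = 0;

    /** Tokenizes on whitespace (an assumption) and links every term to its posting objects. */
    int addDocument(String text) {
        int docId = nextDocId++;
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> postings =
                postingsByTerm.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last == null || last.docId != docId) {
                last = new Posting(docId);
                postings.add(last);
            }
            last.positions.add(pos);
        }
        return docId;
    }

    /** Returns the document numbers containing the term by following object references only. */
    List<Integer> docsFor(String term) {
        List<Integer> docs = new ArrayList<>();
        for (Posting p : postingsByTerm.getOrDefault(term, new ArrayList<>())) {
            docs.add(p.docId);
        }
        return docs;
    }

    public static void main(String[] args) {
        ObjectGraphIndexSketch index = new ObjectGraphIndexSketch();
        index.addDocument("all work and no play");
        index.addDocument("makes Jack a dull boy");
        index.addDocument("no play no work");
        System.out.println(index.docsFor("play")); // prints [0, 2]
    }
}
```

The trade-off from the description shows up even in this toy: every posting is a full object with its own list, which is cheap to traverse but costs far more memory than a packed byte representation.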

      1. BinarySearchUtils.Apache.java (10 kB, Olivier Chafik)
      2. classdiagram.png (61 kB, Karl Wettin)
      3. HitCollectionBench.jpg (156 kB, Karl Wettin)
      4. LUCENE-550_20071021_no_core_changes.txt (109 kB, Karl Wettin)
      5. LUCENE-550.patch (113 kB, Karl Wettin)
      6. LUCENE-550.patch (112 kB, Karl Wettin)
      7. LUCENE-550.patch (112 kB, Grant Ingersoll)
      8. test-reports.zip (90 kB, Hoss Man)

          Activity

          Grant Ingersoll added a comment -

          Committed revision 636745. Thanks Karl!

          Karl Wettin added a comment -

          Some dull colors, rendered via PDF to PNG and then scaled to fit 1024x768. Also softscaled in package.html, but linked to when clicked on.

          Karl Wettin added a comment -

          Added more javadocs.
          The patch is not sticky enough for instantiated/docs/classdiagram.jpg.

          Grant Ingersoll added a comment -

          Cleaned up a few things, added CHANGES.txt, added ASL to a file. I'll commit tomorrow, pending any more feedback.

          Olivier Chafik added a comment -

          Here is an enhanced binarySearch method for int arrays, which I wrote and wish to donate to the ASF (for the Lucene project or any other purpose), following Karl Wettin's request.
          This code was initially published on my blog : http://ochafik.free.fr/blog/?p=106
          Have fun with it !

          Olivier Chafik

          Grant Ingersoll added a comment -

          Commons would be all right, since this is a contrib and it can have
          dependencies. But putting it on this patch would be just as useful.
          Your call. Putting it into Lucene makes it more likely that it will
          be addressed as part of this patch, and thus committed. Going the
          Commons route is probably for the greater good, but we may not see it
          for a good long time, depending on their commit/release needs.

          OK. Once we get the legal piece resolved, I am going to commit.

          -Grant

          Karl Wettin added a comment -

          > Did zOlive ever post his code to Jakarta Commons? Without him actually doing it, I don't know that it is good enough legally to accept it.

          He did not. Should I ask him to post the code as an ASL-tagged attachment to this issue? Or is Commons a better place?

          > Also, is your last comment such that you think there is a new patch?

          Not anytime soon. They are only ideas that could make it a bit less ad hoc. But I'm actually quite happy with the way it works now. The code has successfully been used in a handful of commercial projects.

          Grant Ingersoll added a comment -

          Did zOlive ever post his code to Jakarta Commons? Without him actually doing it, I don't know that it is good enough legally to accept it.

          Also, is your last comment such that you think there is a new patch?

          Karl Wettin added a comment -

          I was poking around in the javadocs of this and came to the conclusion that InstantiatedIndexWriter is deprecated code: it is enough that one can construct an InstantiatedIndex from an optimized IndexReader. This makes all InstantiatedIndexes immutable, which makes the no-locks caveat go away.

          Also, it is a hassle to make sure that InstantiatedIndexWriter works just as IndexWriter does.

          In the future, a segmented Directory facade could be built on top of this, where each InstantiatedIndex is a segment created by an IndexWriter flush. It would potentially be slower to populate, but it would be compatible with everything. Adding more than one segment will require merging and optimizing indices back and forth in RAMDirectories a bit, but InstantiatedIndexes are usually quite small.

          It feels like much of that code is already there.

          On the matter of RAM consumption, using a profiler I recently noticed a 3.2MB directory of 3-5;3-3;3-5 ngrams with term vectors consumed something like 35MB RAM when loaded to an InstantiatedIndex.

          Karl Wettin added a comment -

          Grant Ingersoll - 10/Dec/07 02:11 PM
          > courtesy of Olivier Chafik
          > What does this mean? He contributed the code personally or you got it from him? In other words, do you have the authority to assign the ASF copyright for said code?

          Yes,

          http://ochafik.free.fr/blog/?p=106

          Karl Wettin said:
          20 October 2007 at 7:54 pm
          Hi Olivier,

          I was just going nuts over the lack of offset and length in Collections.binarySearch. I was thinking that perhaps a subList would be OK, but it turns out that the overhead of AbstractList.subList (in my case an ArrayList) is huge. Searching the complete owner list of 5000 instances takes 1/3 of the time it takes to instantiate and binarySearch a subListIn(2500, 5000).

          Google suggested your blog post.

          I have based some non-released optimization in http://issues.apache.org/jira/browse/LUCENE-550 on your code. Would you mind donating it to the Apache Software Foundation? Lucene does not state author credits in source code, only in CHANGES.TXT.

          LUCENE-550 is an alternative RAM index store that is up to 100x faster than the standard RAMDirectory and it is built to support my machine learning projects such as http://issues.apache.org/jira/browse/LUCENE-626 and http://issues.apache.org/jira/browse/LUCENE-1025

          zOlive said:
          21 October 2007 at 9:02 am
          Hi Karl,

          Thanks for your message, I'm happy to hear that someone actually made some use of this code !
          Apart from the offset feature, the only specificity of my code is its relative speed for lookups in sorted integer lists, which I'm unsure whether it's exactly your use case or not.
          However, I will be more than pleased to contribute this tiny piece of code to Apache, and I must say I'm a bit surprised that there isn't such a method in any of their projects yet (say, in Jakarta Commons - http://commons.apache.org/collections/).
          Where shall I post it to ?

          Karl Wettin said:
          21 October 2007 at 4:32 pm
          Thanks!

          You don't need to post it anywhere, I have simply pasted it in this class of mine and adapted it to fit my needs.

          It is indeed an int[] (actually MyClass[].getInt()) I'm seeking in, the variable pivot is most welcome.
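
To make the "offset and length" point concrete, the following is a minimal sketch of a binary search restricted to a sub-range of a sorted int[], which avoids creating a subList view at all. It is a plain textbook binary search written for illustration, not Olivier Chafik's donated code; the class name RangeBinarySearch is hypothetical. The JDK offers the same contract through the range overload java.util.Arrays.binarySearch(int[], int, int, int), available since Java 6.

```java
// Hypothetical illustration; not the code attached to this issue.
class RangeBinarySearch {

    /**
     * Searches sorted[from..to) for key.
     * Returns the index of key, or -(insertionPoint) - 1 when key is absent,
     * which is the same convention java.util.Arrays.binarySearch uses.
     */
    static int search(int[] sorted, int from, int to, int key) {
        int low = from;
        int high = to - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            int value = sorted[mid];
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -(low + 1);
    }

    public static void main(String[] args) {
        int[] docIds = {1, 3, 5, 7, 9, 11};
        // Only the upper half [2, 6) is inspected; no sub-list object is allocated.
        System.out.println(search(docIds, 2, 6, 7));
        System.out.println(search(docIds, 2, 6, 3));
    }
}
```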

          Grant Ingersoll added a comment -

          courtesy of Olivier Chafik

          What does this mean? He contributed the code personally or you got it from him? In other words, do you have the authority to assign the ASF copyright for said code?

          FYI, the patch applies cleanly and compiles. I still have some benchmarking to do, but would like to commit.

          Karl Wettin added a comment -

          In this patch:

          • Replaced all List<T> with T[], as Arrays.binarySearch is 20% faster than Collections.binarySearch.
          • Ad hoc binarySearch using a variable pivot increases the speed of TermDocs.skipTo by 20%-400%, courtesy of Olivier Chafik.
          • Default InstantiatedWriter.mergeFactor changed from 1 to 2500.
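
The thread does not spell out what "variable pivot" means. One plausible reading, and it is only an assumption, is that the pivot is estimated from the values at both ends of the range (interpolation) instead of always taking the midpoint. The sketch below applies that idea to a skipTo-style lookup over sorted document numbers. It is not the code from the patch or from Olivier Chafik's attachment, and the names InterpolatedPivotSearch and skipTo are hypothetical.

```java
// Hypothetical sketch of a "variable pivot" lookup, read here as interpolation search.
class InterpolatedPivotSearch {

    /**
     * Returns the index of the first element >= target in sorted[from..to),
     * or 'to' when every element in the range is smaller than target.
     */
    static int skipTo(int[] sorted, int from, int to, int target) {
        int low = from;
        int high = to; // exclusive
        while (low < high) {
            int first = sorted[low];
            int last = sorted[high - 1];
            if (target <= first) {
                return low;
            }
            if (target > last) {
                return high;
            }
            // Here first < target <= last, so the divisor below is positive.
            long span = (long) last - first;
            // Guess the pivot from where target sits between first and last;
            // long arithmetic guards the multiplication for realistic array sizes.
            int pivot = low + (int) (((long) target - first) * (high - 1 - low) / span);
            if (sorted[pivot] < target) {
                low = pivot + 1;  // answer lies strictly right of the pivot
            } else {
                high = pivot;     // answer is the pivot or lies left of it
            }
        }
        return low;
    }

    public static void main(String[] args) {
        int[] docIds = {2, 5, 9, 14, 20, 33, 47};
        System.out.println(skipTo(docIds, 0, docIds.length, 15));
    }
}
```

On evenly distributed document numbers the estimated pivot tends to land close to the target, which is where a gain over a fixed midpoint would come from; on skewed data it can degrade, so the 20%-400% figure above should be read as specific to the author's measurements.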
          Karl Wettin added a comment -

          In this patch:

          • IndexReader.terms(Term) optimization: given that the term exists, the initial seek is now a jit-call away rather than a binary search.
          • A handful of minor optimizations.
          • IndexReader.version() mimics the Segment ditto.
          Karl Wettin added a comment -

          In this patch:

          • Like the Segment ditto, non-mapper term vector methods return null rather than throwing an NPE when a term vector is not available.
          Karl Wettin added a comment -

          In this patch:

          • Minor discrepancy in IndexReader#norms(String field, byte[] bytes, int offset) between SegmentReader and InstantiatedIndexReader fixed and demonstrated in TestIndicesEquals.

          http://www.nabble.com/norms%28String-field%2C-byte---bytes%2C-int-offset%29-tf4580460.html#a13075367

          • Updated maven pom and fixed some typos in documentation.
          Karl Wettin added a comment -

          Oups, the patch is of course granted ASF licence.

          Karl Wettin added a comment -

          New in this patch:

          • Payloads added to TestIndicesEquals
          • Package level java docs with UMLet class diagram
          • Some additional todo-tags in the code that shows what can be improved

          I've noticed that there are some differences in the behavior of IndexWriter and InstantiatedIndexWriter when a document contains multiple fields with the same name but different settings, such as:

           d.add(new Field("f", " All work and no play makes Jack a dull boy", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
           d.add(new Field("f", " All work and no play makes Jack a dull boy", Field.Store.NO));
           d.add(new Field("f", " All work and no play makes Jack a dull boy", Field.Store.YES, Field.Index.NO_NORMS, Field.TermVector.NO));
          

          Would this be considered an invalid document? Should there be a term vector or not? Or perhaps just term vector for the tokens in the first field?

          Hoss Man added a comment -

          > Any comments on how to include graphics in the documentation? (I'm a big fan of UML,
          > you might have noticed there is quite a bit of ASCII class diagram stubs in the javadocs of
          > fields that represent binary associations, association classes and qualifications.) Also, where
          > should I store the XML used to render the graphics? Just pop it all in the src classpath?

          images that you want to embed in (or files you want to link to from) javadocs should live in a "doc-files" directory in the package....

          http://java.sun.com/j2se/javadoc/writingdoccomments/#images

          ...I would put the XML source for the image in there as well, and put a link to it in the javadocs as well.

          Karl Wettin added a comment -

          Grant Ingersoll - 22/Sep/07 05:52 AM

          > I would like to see payloads tested as well.

          I'm new to payloads and don't know what makes sense when it comes to populating the a priori/test indices. Any preferences? Or should I just randomly add some payloads to the positions of a couple of terms in a couple of documents?

          > package level javadoc

          Any comments on how to include graphics in the documentation? (I'm a big fan of UML, you might have noticed there is quite a bit of ASCII class diagram stubs in the javadocs of fields that represent binary associations, association classes and qualifications.) Also, where should I store the XML used to render the graphics? Just pop it all in the src classpath?

          > I notice a TODO as well saying implement locking. Thoughts on implementing it?

          It used to be a ReentrantLock, but for some reason I can't seem to recall, this was a bad idea. There are TODO: lock and TODO: release lock tags left throughout the code. I should probably take a look at o.a.l.store.Lock.
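
For readers unfamiliar with the lock / release-lock pattern that such TODO tags usually stand for, here is a generic sketch using java.util.concurrent.locks.ReentrantLock. It is not the code that was removed from the patch, and the thread does not say why that approach was dropped; LockedCommitSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of guarding index mutations with a ReentrantLock.
class LockedCommitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> committedDocuments = new ArrayList<>();

    /** Moves pending documents into the shared structure while holding the lock. */
    void commit(List<String> pendingDocuments) {
        lock.lock();                 // corresponds to a "TODO: lock" marker
        try {
            committedDocuments.addAll(pendingDocuments);
        } finally {
            lock.unlock();           // corresponds to a "TODO: release lock" marker
        }
    }

    /** Reads take the same lock so they never observe a half-applied commit. */
    int size() {
        lock.lock();
        try {
            return committedDocuments.size();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockedCommitSketch index = new LockedCommitSketch();
        Thread first = new Thread(() -> index.commit(Arrays.asList("doc1", "doc2")));
        Thread second = new Thread(() -> index.commit(Arrays.asList("doc3")));
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println(index.size());
    }
}
```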

          There are three more caveats I know of, but I'm not certain how important they are to fix.

          IndexReader:

           public Document document(int n, FieldSelector fieldSelector) throws IOException {
             // todo: it does not make to much sense to use field selector using this implementation,
             // todo: so it simply ignores this and return everything.
             return document(n);
           }

           public Collection getFieldNames(FieldOption fldOption) {
             if (fldOption != FieldOption.ALL) {
               throw new IllegalArgumentException("Only FieldOption.ALL implemented."); // todo
             }

          IndexWriter.addDocument does not support readerValue and binaryValue.

           if (field.isTokenized()) {
             int termCounter = 0;
             final TokenStream tokenStream;
             // todo readerValue(), binaryValue()
             if (field.tokenStreamValue() != null) {

          Grant Ingersoll added a comment -

          If I understand your test correctly, you have gone through and compared term by term, etc. (vectors, etc.)

          I would like to see payloads tested as well.

          I also think you need a package level javadoc that explains the use cases for this and the basics of using it.

          Also, I notice the caveat about no locking (in the javadocs for InstantiatedIndex) and I notice a TODO as well saying implement locking. Thoughts on implementing it?

          Karl Wettin added a comment -

          Previously mentioned problems deloused. The phrase (term position) problem turned out to be a bug in the constructor InstantiatedIndex(IndexReader), which ended up with an index not equal to one created via InstantiatedIndexWriter.

          I also did a bunch of tests on how much it would speed up by replacing the binary searches over lists with hash tables (maps). Gained perhaps 5% speed, but lost quite a bit of RAM, so I reverted those things.

          Do you want more test cases than the TestIndicesEquals?

          Payloads need to be verified. I never really worked with them, and the Directory-centric test will not be ported easily.

          Karl Wettin added a comment -

          I just found a bug that I can not explain.

          While scoring this one specific phrase query in this one specific corpus of mine, the scorer calls TermPositions.nextPosition() more than TermPositions.freq() times. I have never seen this error before, and it does not happen when running against a Directory. TestIndicesEquals does however pass, so it must be me not resetting the currentTermPosition counter, or something along those lines.

          I have been debugging for hours and hours in the scorer code in order to understand what the difference between II and Directory is, but I can't figure it out. Completely lost in this (read: any) scorer code.

          It sure is a show stopper if it sometimes does not work, so I'll try to find the bug. This is the first time I've seen it though. I mean, I do use phrase queries in other places in conjunction with this store, and that makes it even more strange.

          I have tried to come up with an isolated test case, but I can't. I can however pass the corpus and code that produce this error to some specific person, but I'm afraid I can't post it here.

          There is also a minor TermFreqVector bug that throws an NPE, solved in the next patch.

          Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 12
          at org.apache.lucene.store.instantiated.InstantiatedTermPositions.nextPosition(InstantiatedTermPositions.java:70)
          at org.apache.lucene.search.PhrasePositions.nextPosition(PhrasePositions.java:76)
          at org.apache.lucene.search.PhrasePositions.firstPosition(PhrasePositions.java:65)
          at org.apache.lucene.search.ExactPhraseScorer.phraseFreq(ExactPhraseScorer.java:34)
          at org.apache.lucene.search.PhraseScorer.doNext(PhraseScorer.java:94)
          at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:81)
          at org.apache.lucene.search.DisjunctionSumScorer.initScorerDocQueue(DisjunctionSumScorer.java:105)
          at org.apache.lucene.search.DisjunctionSumScorer.next(DisjunctionSumScorer.java:144)
          at org.apache.lucene.search.BooleanScorer2.next(BooleanScorer2.java:360)
          at org.apache.lucene.search.DisjunctionSumScorer.initScorerDocQueue(DisjunctionSumScorer.java:105)
          at org.apache.lucene.search.DisjunctionSumScorer.next(DisjunctionSumScorer.java:144)
          at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
          at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
          at org.apache.lucene.search.Searcher.search(Searcher.java:118)
          at org.apache.lucene.search.Searcher.search(Searcher.java:97)
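
The contract at stake in this comment is that nextPosition() may be called at most freq() times per document, and that the position cursor starts over whenever the enumeration moves to the next document. The toy cursor below illustrates that contract only. It is a hypothetical class, not InstantiatedTermPositions, and it does not reproduce the actual defect, which a later comment in this thread attributes to a bug in the InstantiatedIndex(IndexReader) constructor.

```java
// Hypothetical cursor illustrating the freq()/nextPosition() contract.
class PositionsCursorSketch {
    // positionsByDoc[d] holds the token positions of one term in its d-th matching document.
    private final int[][] positionsByDoc;
    private int doc = -1;
    private int nextIndex = 0;

    PositionsCursorSketch(int[][] positionsByDoc) {
        this.positionsByDoc = positionsByDoc;
    }

    /** Advances to the next document; returns false when there are no more. */
    boolean next() {
        if (doc + 1 >= positionsByDoc.length) {
            return false;
        }
        doc++;
        nextIndex = 0; // without this reset, reads would run past the positions array
        return true;
    }

    /** Number of occurrences of the term in the current document. */
    int freq() {
        return positionsByDoc[doc].length;
    }

    /** Returns the next position; fails loudly instead of reading out of bounds. */
    int nextPosition() {
        if (nextIndex >= freq()) {
            throw new IllegalStateException("nextPosition() called more than freq() times");
        }
        return positionsByDoc[doc][nextIndex++];
    }

    public static void main(String[] args) {
        PositionsCursorSketch cursor = new PositionsCursorSketch(new int[][] {{1, 4}, {7}});
        while (cursor.next()) {
            for (int i = 0; i < cursor.freq(); i++) {
                System.out.println(cursor.nextPosition());
            }
        }
    }
}
```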

          Karl Wettin added a comment -

          Grant Ingersoll - 15/Aug/07 05:17 PM
          > Should I wait on this until you figure this out?

          Please don't. I'm just thinking really loud.

          Hide
          Grant Ingersoll added a comment -

          Should I wait on this until you figure this out?

          Hide
          Karl Wettin added a comment -

          > It also hit me that I could have a HashMap<Term, Integer> parallel to the
          > List<Term> orderdTerms. The latter is currently being binary searched
          > in TermEnum, and a HashMap would make it much faster, especially as
          > the index grows.

          Just looked into this. There is some performance to gain, but not much. I'll run some benchmarks later on and see if it is worth it.

          Most binary searches take place in the IndexWriter, and I honestly don't care too much about making that part faster if it slows down searching or makes it hog more RAM.
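
          The trade-off discussed above can be illustrated with a minimal sketch. It is not the code from the patch: the class name TermLookupSketch and its methods are hypothetical, and plain String values stand in for Lucene Term instances so that the example runs without Lucene on the classpath. It keeps a sorted list for binary search and a parallel hash map from term to position, which costs one extra map entry per term in RAM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; String stands in for Lucene's Term.
public class TermLookupSketch {
    // Sorted, distinct terms, analogous to the ordered term list in the comment.
    private final List<String> orderedTerms;
    // Parallel map from term to its position in orderedTerms.
    private final Map<String, Integer> termPositions = new HashMap<String, Integer>();

    public TermLookupSketch(List<String> terms) {
        // TreeSet sorts the terms and drops duplicates.
        orderedTerms = new ArrayList<String>(new TreeSet<String>(terms));
        for (int i = 0; i < orderedTerms.size(); i++) {
            termPositions.put(orderedTerms.get(i), i);
        }
    }

    // O(log n) lookup; returns -1 when the term is absent.
    public int findByBinarySearch(String term) {
        int position = Collections.binarySearch(orderedTerms, term);
        return position >= 0 ? position : -1;
    }

    // Expected O(1) lookup at the cost of the extra map; returns -1 when absent.
    public int findByHash(String term) {
        Integer position = termPositions.get(term);
        return position != null ? position : -1;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<String>();
        Collections.addAll(terms, "lucene", "index", "ram", "index");
        TermLookupSketch lookup = new TermLookupSketch(terms);
        System.out.println(lookup.findByBinarySearch("ram"));
        System.out.println(lookup.findByHash("ram"));
    }
}
```

          Both methods return the same position for any term; the map only pays off when lookups dominate, which matches the observation that most of these searches happen at write time.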

          Hide
          Karl Wettin added a comment -

          Added support for payloads
          Reintroduced InstantiatedIndexWriter (no locks!)
          Reintroduced TestIndicesEquals
          Introduced build.xml
          Introduced pom.xml (this file is missing java 1.5 setting)
          Added some silly javadocs

          It also hit me that I could have a HashMap<Term, Integer> parallel to the List<Term> orderdTerms. The latter is currently being binary searched in TermEnum, and a HashMap would make it much faster, especially as the index grows. Might speed things up a lot.

          Hide
          Grant Ingersoll added a comment -

          On the Payload question, it is still marked as experimental, but if your patch gets in before anyone changes it, the onus is on that person to make sure the change is functional, so I would think you are fine to assume the current payload is fixed for the time being.

          Hide
          Karl Wettin added a comment -

          Grant Ingersoll - 07/Aug/07 06:22 PM
          > 1. No build file
          > 2. Tests are virtually non-existent
          >
          > It could also use some documentation, especially on the how and why of the InstantiatedIndex.

          I'll come up with some stuff asap.

          About tests, the new patch is more or less a reduction of the previous patch. The latter contains more or less all tests assimilated to run on an instantiated index. With the new patch there is no IndexWriter, so I will have to reassimilate it all.

          In the old patch there is a test case that compares two index readers: it enumerates all parts of an a priori reader and a test reader and compares the values. It passed in the old patch, so I don't think there is any problem. I'll reintroduce it though. Do you think that would be enough, or do you want the assimilated tests back?

          Is the payload API fixed? There are a bunch of TODOs and warnings here and there in the code, which is the reason I have not implemented it in this store.

          Hide
          Grant Ingersoll added a comment -

          Hey Karl,

          I started to look at this, but there are a few stoppers at this point for me:
          1. No build file
          2. Tests are virtually non-existent

          It could also use some documentation, especially on the how and why of the InstantiatedIndex.

          Cheers,
          Grant

          Hide
          Karl Wettin added a comment -

          This is a small and completely isolated version of InstantiatedIndex, the results of my "last attempt" thread:
          http://www.nabble.com/Last-attempt-tf4153815.html

          It requires no changes to the Lucene core but hogs a bit more RAM and probably depends on your JIT to avoid wasting CPU. The definalization and generalization that were previously required are replaced by aggregation (strategy pattern). I also had to remove all the polymorphic index handling (IndexWriterInterface etc.), and I have removed the IndexWriter in InstantiatedIndex. One now has to create a new InstantiatedIndex and pass down an IndexReader instead, so no appending is allowed. Also, there are no locks anymore, but they should no longer be needed.

          The port of the complete test suite from Lucene to the unison index handling has been removed, i.e. there are no real test cases that demonstrate this patch. Anything but term vectors and payloads should work great though. The code base is over a year old and these are new features I did not have time to implement or test.

          No new benchmarks. The greatest loss is the loss of features, not CPU and RAM. Perhaps it wastes 15% more resources than the previous patch?

          As I personally enjoy the features removed in this patch, I will keep on running Lucene 2.0 and the old version, but this should be easier to understand and maintain if anyone else wants to take a look at it.

          Hide
          Karl Wettin added a comment -

          x/y axis names updates

          Hide
          Karl Wettin added a comment -

          > Nicolas Lalevée [18/Mar/07 02:04 AM]

          > This is a very interesting benchmark graph! Note that there is just a little mistake in there: the labels of the axes are switched.

          The test is sort of crude: a set of queries of variable complexity that, for each iteration, is run on a new IndexSearcher and IndexReader. The index is optimized at all measure points.

          > And you said that you still have a lot of gain with 250 000 documents because of the
          > retrieving cost. But if I have to make the choice of having everything in memory,
          > I won't put the data of my own model into Lucene. I will keep them in memory
          > while not transforming them into stored Lucene Documents. I will just transform
          > them for indexing purposes and just keep an ID in the Lucene store which will
          > help me map the search result to my own model data. This will avoid the
          > transformation Lucene-Document -> MyModel-Data.

          I can only agree.

          >(after relooking at the UML diagram) : Unless you allow to put POJO objects in a Document ?

          That is the hypothesis. I've actually been a bit baffled by the results I've seen the last days while benchmarking.

          The application this was originally built for (the one with 250 000 documents) is fairly busy, on average one query every 10 ms, 24/7. It peaks at one every 2 ms. On the single machine setup with 4 GB and Solaris, the CPU went from 90% busy to 90% idle when switching from RAMDirectory to InstantiatedIndex. At this point I cannot say if this is due to bad use of Lucene and compensating for that with a crazy solution. But I don't think so. I think I've missed a bunch of benchmark factors.

          Since that project, and that was some time ago, I have not implemented any applications with a "normal" corpus using InstantiatedIndex.

          It is the backbone of the active cache (also available in this patch). I'm sure people have made similar things with MemoryIndex. For each batch of new documents inserted, I apply cached queries on the batch-index to detect if the new data would affect the results associated with the cached query. (The cache does other active things too.)

          In the didyoumean issue I use InstantiatedIndex as a speedy a priori index, a small index with feature-selected text (common user queries known to be correct, very common phrases in document titles, etc.) that is used to build ngrams for token suggestions, build phrase suggestions, rearrange term order in phrases, etc. As these documents are very small (a small phrase), it is some 10x-20x faster than a RAMDirectory at 50 000 documents.

          Hide
          Nicolas Lalevée added a comment -

          This is a very interesting benchmark graph! Note that there is just a little mistake in there: the labels of the axes are switched.

          And you said that you still have a lot of gain with 250 000 documents because of the retrieving cost. But if I have to make the choice of having everything in memory, I won't put the data of my own model into Lucene. I will keep them in memory while not transforming them into stored Lucene Documents. I will just transform them for indexing purposes and just keep an ID in the Lucene store which will help me map the search result to my own model data. This will avoid the transformation Lucene-Document -> MyModel-Data.

          (after relooking at the UML diagram) : Unless you allow to put POJO objects in a Document ?

          Hide
          Karl Wettin added a comment -

          made graph more readable

          Hide
          Karl Wettin added a comment -

          A graph showing performance of hit collection using InstantiatedIndex, RAMDirectory and FSDirectory.

          In essence, there is no great win in pure search time when there are more than 7000 documents. However, retrieving documents is still not associated with any cost whatsoever, so in an index of 250 000 documents that uses Lucene for persistency of fields, I still see a boost of 6-10x or so compared to RAMDirectory.

          documents in corpus \t queries per second

          org.apache.lucene.store.instantiated.InstantiatedIndex@628704
          250 37530,00
          500 29610,00
          750 22612,50
          1000 19267,50
          1250 16027,50
          1500 14737,50
          1750 13230,00
          2000 12322,50
          2250 11482,50
          2500 10125,00
          2750 9802,50
          3000 8508,25
          3250 8469,80
          3500 7788,61
          3750 5207,29
          4000 5484,52
          4250 4912,50
          4500 4420,58
          4750 4006,49
          5000 4357,50
          5250 3886,67
          5500 3573,93
          5750 3236,76
          6000 3602,10
          6250 3420,00
          6500 3075,00
          6750 2805,00
          7000 2680,98
          7250 2908,55
          7500 2769,46
          7750 2644,86
          8000 2496,25
          8250 2377,50
          8500 2578,71
          8750 2390,11
          9000 2160,00
          9250 2037,96
          9500 1872,19
          9750 2041,38
          10000 1959,12
          Created 10000 documents

          org.apache.lucene.index.facade.RAMDirectoryIndex@af993e
          250 4845,00
          500 3986,01
          750 4330,67
          1000 4682,82
          1250 4148,78
          1500 4847,65
          1750 4535,23
          2000 4192,50
          2250 4203,30
          2500 3695,65
          2750 3742,50
          3000 3485,76
          3250 3470,76
          3500 3525,00
          3750 2877,61
          4000 3221,78
          4250 2983,51
          4500 2982,02
          4750 2724,55
          5000 3092,86
          5250 2646,18
          5500 2940,00
          5750 2709,58
          6000 2423,30
          6250 2602,50
          6500 2305,39
          6750 2462,57
          7000 1815,00
          7250 2431,42
          7500 2171,74
          7750 2297,90
          8000 2134,30
          8250 2308,85
          8500 2038,98
          8750 2231,65
          9000 2097,90
          9250 2041,38
          9500 1819,77
          9750 2102,24
          10000 1876,87
          Created 10000 documents

          org.apache.lucene.index.facade.FSDirectoryIndex@4112c0
          250 3448,28
          500 2422,50
          750 2677,50
          1000 2607,39
          1250 2241,92
          1500 2486,27
          1750 2472,53
          2000 1733,52
          2250 2325,00
          2500 2194,21
          2750 1969,55
          3000 2125,75
          3250 2009,00
          3500 1473,08
          3750 1858,14
          4000 1925,57
          4250 1671,66
          4500 1786,25
          4750 1694,15
          5000 1217,63
          5250 1595,11
          5500 1745,75
          5750 1526,18
          6000 1431,78
          6250 1524,66
          6500 1648,35
          6750 1544,23
          7000 1428,22
          7250 1487,29
          7500 1494,02
          7750 1106,13
          8000 1455,00
          8250 1284,86
          8500 1182,63
          8750 1292,33
          9000 1399,70
          9250 1000,00
          9500 1291,04
          9750 1359,56
          10000 1194,62
          Created 10000 documents
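
          To make the claim about diminishing returns concrete, the following minimal sketch divides the InstantiatedIndex throughput by the RAMDirectory throughput for a few rows copied from the lists above (decimal commas written as decimal points). The class name SpeedupSketch and the choice of rows are illustrative assumptions, not part of the patch or of the original benchmark code.

```java
// Illustrative arithmetic on the figures posted above; not part of the patch.
public class SpeedupSketch {
    // Ratio of InstantiatedIndex queries/second to RAMDirectory queries/second.
    public static double speedup(double instantiatedQps, double ramDirectoryQps) {
        return instantiatedQps / ramDirectoryQps;
    }

    public static void main(String[] args) {
        // Rows for 250, 1000, 5000 and 10000 documents, taken from the lists above.
        int[] corpusSizes = {250, 1000, 5000, 10000};
        double[] instantiated = {37530.00, 19267.50, 4357.50, 1959.12};
        double[] ramDirectory = {4845.00, 4682.82, 3092.86, 1876.87};

        for (int i = 0; i < corpusSizes.length; i++) {
            System.out.printf("%5d documents: %.2fx%n",
                    corpusSizes[i], speedup(instantiated[i], ramDirectory[i]));
        }
    }
}
```

          On these rows the ratio falls from several times faster at 250 documents to roughly parity at 10000 documents, which is consistent with the remark that pure search time shows no great win beyond about 7000 documents.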

          Hide
          Doron Cohen added a comment -

          That's a good point about the task-benchmark, Karl!

          All 4 ReaderTasks are reusing the reader if it is already open, but if it is not already open, each task opens a private one, and closes it after the task is done.

          I now see that the javadocs can be improved here - especially in the reader sub-tasks. I will update the documentation to clarify this point.

          Anyhow, for the running tasks to share a reader, the alg part of the .alg file should have something like this:

          OpenReader

          ReaderTaskA
          ReaderTaskB
          ReaderTaskC

          CloseReader

          This way all three tasks would share the same, already open, reader.

          Hide
          Karl Wettin added a comment -

          A note on, and output from contrib/benchmark:

          I'm getting really poor results compared to my own test and live environment stats. At query time I expected InstantiatedIndex to spend at most 1/6th of the time RAMDirectory does, but it turns out that in the benchmarker the speed is almost the same as RAMDirectory. Retrieving documents takes only 1/5th of the time rather than at most 1/60th as expected.

          I investigated the code a bit and noticed that ReadTask creates a new instance of IndexReader and IndexSearcher for each query. Could this be the reason?

          Memory consumption is 3x that of a RAMDirectory, but half of the memory is spent on keeping the Document instances in heap. Perhaps it would be interesting to use the same persistency for these as in the Directory implementations.

          The merge factor sweet spot is around 2500, where it turns out to be a little bit faster than the RAMDirectory sweet spot. At the default of 10, InstantiatedIndex consumes about 5x more time than a RAMDirectory. If I fix the locklessness as suggested in the previous comment, it will most probably be much faster than a RAMDirectory at any setting.

          /**
           * The sweet spot for this implementation is at 2500.
           * <p/>
           * Benchmark output:
           * <pre>
           * ------------> Report sum by Prefix (MAddDocs) and Round (8 about 8 out of 160153)
           * Operation round mrg buf cmpnd runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
           * MAddDocs_20000 0 10 10 true 1 20000 81,4 245,68 200 325 152 268 156 928
           * MAddDocs_20000 - 1 1000 10 true - - 1 - - 20000 - - 494,1 - - 40,47 - 247 119 072 - 347 025 408
           * MAddDocs_20000 2 10 100 true 1 20000 104,8 190,81 233 895 552 363 720 704
           * MAddDocs_20000 - 3 2000 100 true - - 1 - - 20000 - - 527,2 - - 37,94 - 266 136 448 - 378 273 792
           * MAddDocs_20000 4 10 10 false 1 20000 103,2 193,75 222 089 792 378 273 792
           * MAddDocs_20000 - 5 3000 10 false - - 1 - - 20000 - - 545,2 - - 36,69 - 237 917 152 - 378 273 792
           * MAddDocs_20000 6 10 100 false 1 20000 102,7 194,67 237 018 976 378 273 792
           * MAddDocs_20000 - 7 4000 100 false - - 1 - - 20000 - - 535,8 - - 37,33 - 309 680 640 - 501 968 896
           * </pre>
           *
           * @see org.apache.lucene.index.IndexWriterInterface#setMergeFactor(int)
           */
          public void setMergeFactor(int mergeFactor) {

          I would not pay too much attention to the numbers below until I've got the benchmarker under control, but here are the stats:

          Output from InstantiatedIndex:

          [java] ------------> Report Sum By (any) Name (19 about 160153 out of 160153)
          [java] Operation round mrg buf cmpnd runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] Rounds_8 0 10 10 true 1 25142792 19 842,0 1 267,15 291 055 680 377 163 776
          [java] Populate - - - - - - - - - - - - - - - - - - 8 - - 20003 - - 148,1 - 1 080,73 - 249 711 264 - 354 926 592
          [java] CreateIndex - - - - 8 1 1 142,9 0,01 178 670 624 322 181 120
          [java] MAddDocs_20000 - - - - - - - - - - - - - - - - 8 - - 20000 - - 148,0 - 1 080,72 - 249 706 256 - 354 926 592
          [java] AddDoc - - - - 160000 1 156,2 1 024,02 228 890 976 339 588 384
          [java] Optimize - - - - - - - - - - - - - - - - - - 8 - - - - 1 - - 8 000,0 - - 0,00 - 249 679 056 - 354 926 592
          [java] CloseIndex - - - - 8 1 2 666,7 0,00 249 689 056 354 926 592
          [java] OpenReader - - - - - - - - - - - - - - - - - 16 - - - - 1 - 16 000,0 - - 0,00 - 246 507 072 - 354 926 592
          [java] SearchSameRdr_5000 - - - - 8 5000 806,6 49,59 250 121 728 354 926 592
          [java] CloseReader - - - - - - - - - - - - - - - - - 16 - - - - 1 - 16 000,0 - - 0,00 - 249 146 336 - 354 971 648
          [java] WarmNewRdr_50 - - - - 8 1000000 3 118 908,5 2,57 249 616 272 354 926 592
          [java] SrchNewRdr_500 - - - - - - - - - - - - - - - - 8 - - - 500 - - 806,5 - - 4,96 - 252 762 128 - 354 926 592
          [java] SrchTrvNewRdr_300 - - - - 8 335500 135 891,9 19,75 250 484 240 354 926 592
          [java] SrchTrvRetNewRdr_100 - - - - - - - - - - - - - - 8 - - 209216 - 267 326,0 - - 6,26 - 245 991 776 - 354 926 592
          [java] SearchSameRdr_5000_2500/sec_Par - - - - 8 5000 1 163,3 34,39 250 892 304 355 016 704
          [java] WarmNewRdr_50_25/sec_Par - - - - - - - - - - - - - 8 - - 1000000 - 507 872,0 - - 15,75 - 250 855 648 - 355 016 704
          [java] SrchNewRdr_50_25/sec_Par - - - - 8 50 25,5 15,69 254 289 584 355 016 704
          [java] SrchTrvNewRdr_300_150/sec_Par - - - - - - - - - - - 8 - - 335500 - 177 807,2 - - 15,10 - 251 699 584 - 355 016 704
          [java] SrchTrvRetNewRdr_100_50/sec_Par - - - - 8 232076 117 106,6 15,85 252 423 376 355 016 704

          Output from RAMDirectory:
          [java] ------------> Report Sum By (any) Name (19 about 160153 out of 160153)
          [java] Operation round mrg buf cmpnd runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
          [java] Rounds_8 0 10 10 true 1 25142792 36 177,3 694,99 119 427 680 182 538 240
          [java] Populate - - - - - - - - - - - - - - - - - - 8 - - 20003 - - 482,0 - - 331,99 - 114 288 472 - 140 156 416
          [java] CreateIndex - - - - 8 1 2 666,7 0,00 48 867 204 124 752 384
          [java] MAddDocs_20000 - - - - - - - - - - - - - - - - 8 - - 20000 - - 499,2 - - 320,51 - 111 734 320 - 135 969 280
          [java] AddDoc - - - - 160000 1 604,9 264,49 90 860 048 130 812 488
          [java] Optimize - - - - - - - - - - - - - - - - - - 8 - - - - 1 - - - 0,7 - - 11,48 - 123 532 104 - 140 156 416
          [java] CloseIndex - - - - 8 1 8 000,0 0,00 114 288 472 140 156 416
          [java] OpenReader - - - - - - - - - - - - - - - - - 16 - - - - 1 - - 197,5 - - 0,08 - 113 600 096 - 143 475 712
          [java] SearchSameRdr_5000 - - - - 8 5000 1 209,4 33,07 115 720 920 143 314 944
          [java] CloseReader - - - - - - - - - - - - - - - - - 16 - - - - 1 - 16 000,0 - - 0,00 - 102 590 368 - 145 079 552
          [java] WarmNewRdr_50 - - - - 8 1000000 65 734,9 121,70 105 734 472 143 314 944
          [java] SrchNewRdr_500 - - - - - - - - - - - - - - - - 8 - - - 500 - - 417,4 - - 9,58 - 104 480 168 - 146 795 008
          [java] SrchTrvNewRdr_300 - - - - 8 335500 133 532,3 20,10 116 353 456 146 795 008
          [java] SrchTrvRetNewRdr_100 - - - - - - - - - - - - - - 8 - - 209216 - 60 686,3 - - 27,58 - 124 211 040 - 146 795 008
          [java] SearchSameRdr_5000_2500/sec_Par - - - - 8 5000 1 596,0 25,06 114 145 856 146 844 160
          [java] WarmNewRdr_50_25/sec_Par - - - - - - - - - - - - - 8 - - 1000000 - 105 678,9 - - 75,70 - 104 830 320 - 146 844 160
          [java] SrchNewRdr_50_25/sec_Par - - - - 8 50 25,5 15,70 107 417 728 146 844 160
          [java] SrchTrvNewRdr_300_150/sec_Par - - - - - - - - - - - 8 - - 335500 - 178 635,6 - - 15,02 - 116 779 312 - 146 835 968
          [java] SrchTrvRetNewRdr_100_50/sec_Par - - - - 8 232076 100 569,2 18,46 111 881 152 146 819 584

          Karl Wettin added a comment -

          Patched contrib/benchmark to support InstantiatedIndex.

          Fixed a bug with mergeFactor.

          Reverted the Java 1.5 generics changes in PriorityQueue back to class casting. (This actually belongs to the spell checker, but due to local dependencies the changes are included in this patch.)

          Removed write locks. These had severe bugs and need to be reconsidered; they should be back in the next patch. By using multiple InstantiatedIndex instances as segments under a MultiReader, rather than updating the same index, this can be made completely lockless.

          Karl Wettin added a comment -

          Removed the dependencies on LUCENE-626.

          Karl Wettin added a comment -

          Switched from java.util.PriorityQueue to org.apache.lucene.util.PriorityQueue, and made the latter generic.

          Fixed some major bugs in the TermFreqVector inspection for the spell checker.

          TestGoalJuror now demonstrates how to build an a priori corpus for the ngram token suggester from user input by inverting the suggestion dictionary. That should probably be extracted to a helper class in the future. This makes the a priori corpus faster to query, but it also means that what the system takes for granted as correct comes from user input, and even if that data is what users point out as a real query goal, it does not have to be correct. Still, it makes the suggester much faster.

          Karl Wettin added a comment -

          New Patch. Mainly updates in contrib/didyoumean. Merged some core conflicts.

          TestGoalJuror now imports 200,000 real user queries from a log containing session id, query, category, timestamp and number of hits, ordered by session id and time.

          This means that the trainer and suggester are not aware of whether the user followed or ignored a suggestion from the system, which results were inspected, whether the query contained a goal, etc. So it does not work as it would if trained from the start with the adaptive layer.

          Still, the suggester navigates the dictionary fairly well, and misspelled queries get the correct suggestion, but many correctly spelled phrases get a silly recommendation. Once one starts reporting user interaction to the suggester, the silly recommendations should go away.

          In essence, it can only adapt the suggestions positively, based on what the QueryGoalJuror says is a goal. Negative adaptation happens only when a user does not take a suggestion. It could be solved with bootstrapping. Will mess with that later.

          Karl Wettin added a comment -

          Support for deleteDocuments in IndexWriterInterface, InstantiatedIndex and NotifiableIndex.

          Somewhat hacky solution to pick up the deletions in NotifiableIndex, but it is a solution.

          Karl Wettin added a comment -

          Added lots of documentation

          Karl Wettin added a comment -

          I'll try to keep updated and built javadocs at this location:

          http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html

          (Sorry for flooding..)

          Karl Wettin added a comment -

          (Now proofread.)
          Package-level Javadoc of the spell checker:

          A dictionary with weighted suggestions,
          ordered by user activity,
          backed by algorithmic suggestions.
          <p/>

          <h1>What, where, when and how.</h1>

          <h2>Goal trees</h2>
          A user session could contain multiple quests for content.
          For example,
          first the user looks for the Apache licence,
          spells it wrong, inspects different results,
          and then the user searches for the author Ivan Goncharov.
          <p/>
          In this package we call them different goals.
          <p/>
          User activities are represented by a tree of QueryGoalNodes.
          Each node describes a user query,
          whether the current query (goal node) was a suggestion from the system for a previous user query,
          which search results were further inspected,
          when it happened,
          and for how long.
          <p/>
          The biggest task for a consumer implementing this package
          will be to keep track of which goal node the user came from,
          so that new queries (goal nodes) become children of that parent.
          You will probably add the goal ID as metadata to all actions,
          e.g. in the <a href="?goalID=">, as <input type="hidden" name="goalID" value="">, etc.,
          and keep track of the nodes in a Map<Integer, QueryGoalNode> in the user session.
          <p/>
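The bookkeeping described above could look roughly like the following sketch. GoalSessionSketch and its Node class are hypothetical stand-ins written for illustration only; they are not the QueryGoalNode from the patch, and the integer IDs are simply assumed to be the values carried in the goalID request parameter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-session registry that links each new query to the goal node it came from. */
final class GoalSessionSketch {

    /** Minimal stand-in for a goal node: a query with a parent and children. */
    static final class Node {
        final Node parent;
        final String query;
        final List<Node> children = new ArrayList<>();

        Node(Node parent, String query) {
            this.parent = parent;
            this.query = query;
            if (parent != null) {
                parent.children.add(this);
            }
        }

        /** Walks up the parent links to the first query of the tree. */
        Node getRoot() {
            Node n = this;
            while (n.parent != null) {
                n = n.parent;
            }
            return n;
        }
    }

    private final Map<Integer, Node> nodesById = new HashMap<>();
    private int nextId = 0;

    /**
     * Registers a query. parentGoalId is the ID the request carried
     * (null for a fresh query); the returned ID goes into the next page's links.
     */
    int onQuery(Integer parentGoalId, String query) {
        Node parent = parentGoalId == null ? null : nodesById.get(parentGoalId);
        Node node = new Node(parent, query);
        int id = nextId++;
        nodesById.put(id, node);
        return id;
    }

    Node node(int id) {
        return nodesById.get(id);
    }
}
```

Each rendered link or form then carries the ID returned by onQuery, so the next request can name its parent node.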
          It is up to the QueryGoalTreeExtractor implementations to decide which
          events in a session are part of the same goal,
          since we don't want to suggest that the user check out Goncharov
          when they are looking for the Apache license.
          <p/>
          In the default query goal tree extractor,
          nodes are part of the same goal as their parent when:
          <ul>
          <li>The queries are the same.</li>
          <li>The user took a suggestion from the system.</li>
          <li>The current and the parent queries are similar enough.</li>
          <li>The queries were entered within a short enough time of each other.</li>
          </ul>
          <p/>
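As a rough illustration of the rules listed above, here is a minimal sketch of a "same goal" check. It is not the extractor from the patch: the class name, both thresholds, the use of Levenshtein distance as the similarity measure, and the decision to treat the four rules as alternatives (any one of them is enough) are all assumptions made for the example.

```java
/** Hypothetical sketch of the "same goal as parent" decision. */
final class SameGoalSketch {

    // Assumed thresholds, chosen only for illustration.
    static final int MAX_EDIT_DISTANCE = 3;
    static final long MAX_MILLIS_BETWEEN_QUERIES = 60_000L;

    /** Returns true when the child query is considered part of its parent's goal. */
    static boolean sameGoal(String parentQuery, String childQuery,
                            boolean childWasSuggestion, long millisBetween) {
        if (parentQuery.equals(childQuery)) {
            return true; // the queries are the same
        }
        if (childWasSuggestion) {
            return true; // the user took a suggestion from the system
        }
        if (editDistance(parentQuery, childQuery) <= MAX_EDIT_DISTANCE) {
            return true; // the queries are similar enough
        }
        return millisBetween <= MAX_MILLIS_BETWEEN_QUERIES; // entered soon enough
    }

    /** Plain Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With these assumed thresholds, "apache licence" followed ten minutes later by "apache license" stays in the same goal, while "ivan goncharov" entered ten minutes later starts a new one.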

          <h2>Adaptive training</h2>
          Adaptive means that the suggestions for a query
          depend on how users have acted previously.
          This means that the dictionary can be tampered with quite easily,
          so you should try to train only with data from trusted users.
          <p/>
          The default trainer implementation works like this:
          <ul>
          <li>If a user accepts the suggestion made by the system, then we increase the score for that suggestion. (positive
          adaptation)
          </li>
          <li>If a user does not accept the suggestion made by the system, then we decrease the score for that suggestion.
          (negative adaptation)
          </li>
          <li>
          If the goal tree consists of a single query (perhaps with multiple inspections),
          then we adapt negatively once more.
          </li>
          <li>
          Suggestions are the queries with inspections, ordered by the classification weight.
          All the queries in the goal without inspections will be adapted positively towards
          the query with inspections that has the shortest edit distance.
          </li>
          <li>Suggests back from the best goal to the second-best goal, e.g. homm -> heroes of might and magic -> homm</li>
          </ul>
          <p/>
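The first two rules above (positive and negative adaptation) can be pictured with a small score table. This is only a sketch under stated assumptions: AdaptiveTrainerSketch is a hypothetical class, and the initial score and the multiplicative factors are invented for the example; the DefaultTrainer in the patch may well update scores differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical score table illustrating positive and negative adaptation. */
final class AdaptiveTrainerSketch {

    // Assumed values, not taken from the patch.
    static final double INITIAL_SCORE = 1.0;
    static final double POSITIVE_FACTOR = 1.1;
    static final double NEGATIVE_FACTOR = 0.9;

    // query -> (suggestion -> score)
    private final Map<String, Map<String, Double>> scores = new HashMap<>();

    /** Current score of a suggestion for a query; untrained pairs start at INITIAL_SCORE. */
    double score(String query, String suggestion) {
        return scores
            .getOrDefault(query, Collections.<String, Double>emptyMap())
            .getOrDefault(suggestion, INITIAL_SCORE);
    }

    /** The user accepted the suggestion: raise its score. */
    void accepted(String query, String suggestion) {
        adapt(query, suggestion, POSITIVE_FACTOR);
    }

    /** The user ignored the suggestion: lower its score. */
    void ignored(String query, String suggestion) {
        adapt(query, suggestion, NEGATIVE_FACTOR);
    }

    private void adapt(String query, String suggestion, double factor) {
        scores.computeIfAbsent(query, q -> new HashMap<>())
              .merge(suggestion, INITIAL_SCORE * factor, (old, unused) -> old * factor);
    }
}
```

A multiplicative update keeps scores positive and lets repeated rejections push a suggestion steadily towards zero.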

          <h2>Suggesting</h2>
          Suggestions are created by the suggester, which navigates a dictionary.
          The default implementation works like this:
          <ul>
          <li>
          Returns the highest scoring suggestion available,
          unless its score is lower than the suggestion suppression threshold.
          </li>
          <li>
          If there are no suggestions available, the second level suggesters
          registered with the dictionary are used to produce the suggestions.
          </li>
          <li>
          If the top scoring suggestion is the same as the query,
          and the second best is not suppressed below the threshold,
          their order is swapped.
          </li>
          </ul>
          Ignoring a suggestion 50 times or so with a DefaultTrainer makes its score hit 0.05d.
          <p/>
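The selection rules above can be sketched as a single method. Everything named here is hypothetical: SuggesterSketch is not a class from the patch, the threshold value is an assumption, the second level suggester is modelled as a plain function, and returning nothing when the only unsuppressed suggestion equals the query is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch of picking a suggestion from scored dictionary entries. */
final class SuggesterSketch {

    // Assumed suppression threshold.
    static final double SUPPRESSION_THRESHOLD = 0.05;

    static Optional<String> didYouMean(String query,
                                       Map<String, Double> scoredSuggestions,
                                       Function<String, Optional<String>> secondLevel) {
        // No dictionary suggestions: fall back to the second level suggester.
        if (scoredSuggestions.isEmpty()) {
            return secondLevel.apply(query);
        }

        // Rank suggestions by score, highest first.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scoredSuggestions.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        Map.Entry<String, Double> best = ranked.get(0);
        if (best.getKey().equals(query)) {
            // Top suggestion equals the query: use the second best if it is not suppressed.
            if (ranked.size() > 1 && ranked.get(1).getValue() >= SUPPRESSION_THRESHOLD) {
                return Optional.of(ranked.get(1).getKey());
            }
            return Optional.empty();
        }

        // Otherwise return the best suggestion unless it is suppressed.
        return best.getValue() >= SUPPRESSION_THRESHOLD
            ? Optional.of(best.getKey())
            : Optional.empty();
    }
}
```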

          <h2>Second level suggestion</h2>
          If the dictionary does not contain a suggestion for a given query,
          the query is passed on to any available SecondLevelSuggester,
          usually an algorithmic suggestion scheme
          that hopefully can come up with a suggestion.
          When a user accepts such a suggestion, it is trained
          and becomes part of the adaptive layer.
          <h3>Token suggesters</h3>
          The lowest level of suggestion is single token suggestions,
          and the default implementation is a refactoring of contrib/spellcheck.
          <h3>TokenPhraseSuggester</h3>
          A layer on top of the single token suggester that enables multi-token (phrase) suggestions.
          <p/>
          For example, the user enters the query "thh best game".
          The matrix of similar tokens is:
          <pre>
          the best game
          tho rest fame
                   lame
          </pre>
          These can be combined in a finite number of ways:
          <pre>
          tho best game
          tho best fame
          tho best lame
          tho rest game
          tho rest fame
          tho rest lame
          the best game
          the best fame
          the best lame
          the rest game
          the rest fame
          the rest lame
          </pre>
          A query is created for each combination (by default a SpanNearQuery) to find valid suggestions.
          <p/>
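Enumerating the combinations above is a Cartesian product over the per-position candidate lists. The following is a minimal sketch; PhraseCombinationsSketch is a hypothetical helper, not the TokenPhraseSuggester from the patch, and it only builds the phrase strings without creating any queries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that expands per-token candidates into candidate phrases. */
final class PhraseCombinationsSketch {

    /** Returns every phrase that picks one candidate token per position. */
    static List<String> combinations(List<List<String>> candidatesPerPosition) {
        List<String> phrases = new ArrayList<>();
        if (candidatesPerPosition.isEmpty()) {
            return phrases;
        }
        phrases.add("");
        for (List<String> candidates : candidatesPerPosition) {
            // Extend every phrase built so far with every candidate at this position.
            List<String> extended = new ArrayList<>();
            for (String prefix : phrases) {
                for (String token : candidates) {
                    extended.add(prefix.isEmpty() ? token : prefix + " " + token);
                }
            }
            phrases = extended;
        }
        return phrases;
    }
}
```

For the matrix in the example (2 x 2 x 3 candidates) this yields the 12 phrases listed. The number of phrases is the product of the candidate counts, so it grows quickly with longer queries or more candidates per token.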
          If any of the valid hits contains a TermPositionVector,
          it is analyzed so that the suggestion follows the order of the terms in the index.
          E.g. the query "camel that broke the staw" is answered with the suggestion "straw that broke the camel".
          (todo: if term positions are available and stored, suggest that for cosmetic reasons.)

          <h1>Consumer interface example</h1>
          Code from the test cases.
          <pre>
          private SuggestionFacade<R> suggestionFacade;

          @Override
          protected void setUp() throws Exception {
            suggestionFacade = new SuggestionFacade<R>();
          }

          public void testBasicTraining() throws Exception {
            QueryGoalNode<R> node;

            node = new QueryGoalNode<R>(null, "heroes of nmight and magic", 3);
            node = new QueryGoalNode<R>(node, "heroes of night and magic", 3);
            node = new QueryGoalNode<R>(node, "heroes of might and magic", 10);
            node.new Inspection(23, QueryGoalNode.GOAL);
            suggestionFacade.queueGoalTree(node.getRoot());

            node = new QueryGoalNode<R>(null, "heroes of night and magic", 3);
            node = new QueryGoalNode<R>(node, "heroes of knight and magic", 7);
            node = new QueryGoalNode<R>(node, "heroes of might and magic", 20);
            node.new Inspection(23, QueryGoalNode.GOAL);
            suggestionFacade.queueGoalTree(node);

            node = new QueryGoalNode<R>(null, "heroes of might and magic", 20, 1L);
            suggestionFacade.queueGoalTree(node);

            node = new QueryGoalNode<R>(null, "heroes of night and magic", 7, 0L);
            node = new QueryGoalNode<R>(node, "heroes of light and magic", 14, 1L);
            node = new QueryGoalNode<R>(node, "heroes of might and magic", 2, 6L);
            node.new Inspection(23, QueryGoalNode.GOAL);
            node.new Inspection(23, QueryGoalNode.GOAL);
            suggestionFacade.queueGoalTree(node);

            node = new QueryGoalNode<R>(null, "heroes of night and magic", 4, 0L);
            node = new QueryGoalNode<R>(node, "heroes of knight and magic", 17, 1L);
            node = new QueryGoalNode<R>(node, "heroes of might and magic", 2, 2L);
            node.new Inspection(23, QueryGoalNode.GOAL);
            suggestionFacade.queueGoalTree(node);

            suggestionFacade.flush();

            assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes of light and magic"));
            assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes of night and magic"));
            assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes ofnight andmagic"));
          }

          </pre>
          <p/>
          Notice the last assertion:
          <pre>
          assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes ofnight andmagic"));
          </pre>
          The dictionary strips punctuation and whitespace from keys,
          resulting in better support for de/compositions of words.
          <p/>
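The key stripping described above can be illustrated with a one-line normalizer. This is a sketch only: DictionaryKeySketch and toKey are hypothetical names, and the exact set of characters the real dictionary removes is an assumption (here: ASCII punctuation and any whitespace).

```java
/** Hypothetical key normalizer: removes punctuation and whitespace from a query. */
final class DictionaryKeySketch {

    static String toKey(String query) {
        // \p{Punct} matches ASCII punctuation, \s matches whitespace.
        return query.replaceAll("[\\p{Punct}\\s]+", "");
    }
}
```

With such a key, "heroes ofnight andmagic" and "heroes of night and magic" map to the same dictionary entry, which is why the last assertion can succeed.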
          The example above relies only on user session analysis and adaptation;
          there are no algorithmic suggestions if the user types something nobody has misspelled before.
          To get those, simply add a second level suggester to the dictionary:
          <pre>
          protected void setUp() throws Exception {
            suggestionFacade = new SuggestionFacade<R>();

            // your primary index that suggestions must match.
            IndexFacade aprioriIndex = new IndexFacade(new RAMDirectoryIndex());
            String aprioriField = "title";

            // build the ngram suggester
            IndexFacade ngramIndex = new IndexFacade(new RAMDirectoryIndex());
            NgramTokenSuggester ngramSuggester = new NgramTokenSuggester(ngramIndex);
            ngramSuggester.indexDictionary(new TermEnumIterator(aprioriIndex.getReader(), aprioriField));

            // the greater the better results but with a longer response time.
            int maxSuggestionsPerToken = 3;

            // add ngram suggester wrapped in a single token phrase suggester as second level suggester.
            suggestionFacade.getDictionary().getPrioritesBySecondLevelSuggester().put(
                new SecondLevelTokenPhraseSuggester(ngramSuggester, aprioriField, false, maxSuggestionsPerToken, new WhitespaceAnalyzer(), aprioriIndex),
                1d);
          }

          </pre>
          <h1>Persistence and memory usage.</h1>
          By default the dictionary is soft referenced,
          meaning it will consume as much memory as it can get,
          and when memory is needed elsewhere,
          low priority instances (priority is decided by the JVM) will be released.
          <p/>
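To make the soft referencing concrete, here is a minimal sketch of a cache whose values the garbage collector may reclaim under memory pressure. SoftCacheSketch is a hypothetical class, not the dictionary implementation from the patch; it only shows the general java.lang.ref.SoftReference pattern, with a loader function assumed for rebuilding released values.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cache holding its values through soft references. */
final class SoftCacheSketch<K, V> {

    private final Map<K, SoftReference<V>> entries = new HashMap<>();

    /** Returns the cached value, or reloads it if it was never stored or has been released. */
    V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = entries.get(key);
        V value = ref == null ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, new SoftReference<>(value));
        }
        return value;
    }
}
```

The trade-off is that a released entry has to be rebuilt or reloaded, which is where a persistent backing map becomes useful.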
          There is currently no persistence other than java.io.Serializable available for the adaptive layer.
          You need to implement your own Map<String, SuggestionList> that is persistent
          and pass it to the constructor of your dictionary.

          Karl Wettin added a comment -

          UML class diagram of the adaptive spell checker with all java docs as comments

          Karl Wettin added a comment -

          Updated spell checker code

          Karl Wettin added a comment -

          Introduced a method in instantiated index that appends the entire content to any other index.

          /**
             * Adds the complete content of this instantiated index on to any other index using an index writer.
             * <p/>
             * This can for instance be used for
             * merging multiple instantiated indices
             * and periodically storing persistent snapshots in an FSDirectory.
             * <p/>
             * Non-stored offsets are partially rebuilt. This can be improved quite a bit. See comments in code.
             * <p/>
             * The analyzer creates one complete token stream of all fields with the same name the first time it is requested,
             * and after that an empty one for each remaining field. todo: is this a problem?
             * <p/>
             * It can be buggy if the same token appears as a synonym of itself (position increment 0). Probably not something to worry about, or is it?
             *
             * @param indexWriter represents the index to which all the content of this instantiated index is added.
             * @throws IOException when accessing parameter indexWriter
             */
            public void writeToIndex(IndexWriterInterface indexWriter) throws IOException {
          
          Karl Wettin added a comment -

          the last attachment is of course for ASF distribution. sorry.

          Karl Wettin added a comment -

          Can now be loaded from, and be persisted in, an FSDirectory.

          The actual implementation is a bit more abstract than that though. It is not super nice yet, but all low-level index comparator tests pass.

          Introduced functionality to load an instantiated index from any index reader (e.g. one opened on an FSDirectory).

            /**
             * Creates a new instantiated index that looks just like the index in a specific state as represented by a reader.
             * 
             * @param sourceIndexReader the source index this new instantiated index will be copied from.
             * @throws IOException if the source index is not optimized, or when accessing the source.
             */
            public InstantiatedIndex(IndexReader sourceIndexReader) throws IOException {
          

          Also introduced class SimpleSychronizedIndex, which works somewhat like the Unix command "tee": it makes sure that all changes to a main index (e.g. an instantiated index) are also applied to a mirror index (e.g. the FSDirectory the instantiated index was loaded from at construction time).

          Some class that handles these two things as a single entity will probably be added soon.

          Basically this replicates changes to a secondary index on commits. Thus it takes about twice the time to insert documents. Perhaps the secondary index should be updated in a secondary thread?
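
          The "tee" idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: DocWriter, ListWriter and TeeWriter are hypothetical stand-ins invented for this sketch, not classes from the patch or from Lucene, and documents are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

final class TeeIndexSketch {

    /** Hypothetical stand-in for a writer abstraction; not a Lucene API. */
    interface DocWriter {
        void add(String doc);
    }

    /** Keeps documents in a list; stands in for any concrete index. */
    static final class ListWriter implements DocWriter {
        final List<String> docs = new ArrayList<>();

        @Override
        public void add(String doc) {
            docs.add(doc);
        }
    }

    /** Applies every change to the main writer first, then to the mirror. */
    static final class TeeWriter implements DocWriter {
        private final DocWriter main;
        private final DocWriter mirror;

        TeeWriter(DocWriter main, DocWriter mirror) {
            this.main = main;
            this.mirror = mirror;
        }

        @Override
        public void add(String doc) {
            main.add(doc);   // e.g. the in-memory index
            mirror.add(doc); // e.g. the on-disk copy; this second write is the extra cost
        }
    }

    public static void main(String[] args) {
        ListWriter main = new ListWriter();
        ListWriter mirror = new ListWriter();
        DocWriter tee = new TeeWriter(main, mirror);
        tee.add("first");
        tee.add("second");
        // Both writers have now received the same documents in the same order.
        System.out.println(main.docs.equals(mirror.docs));
    }
}
```

          Because the mirror write happens synchronously inside add(), every insert pays for both writes, which matches the "about twice the time" observation. Moving the mirror write to a queue drained by a second thread would be one way to explore the question raised above.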

          Karl Wettin added a comment -

          Added support for contrib/memory MemoryIndex, so now it works with readers and writers as if it was any other index.

          Added a consumer level index implementation that handles cache, notifications, and all the stuff this issue is about:

          // This is the instance one is supposed to use for all access against the index in this JVM.
          IndexFacade index = new IndexFacade(new RAMDirectoryIndex());

          // Accessors
          IndexWriterInterface writer = index.indexWriterFactory(analyzer, true);
          Document doc = new Document();
          doc.add(...
          writer.add(doc);
          writer.close();
          IndexReader deleter = index.indexReaderFactory();
          index.getSearcher().search(...
          index.getReader().doc(0)
          deleter.close();
          assertEquals(0, index.getReader().numDocs());

          public class IndexFacade {

          /** Wraps any storage, with optional cache settings. */
          public IndexFacade(I index, CachedSearcher.HitCollectionCacheState hitCollectionCache, boolean topDocsCache, boolean topFieldsCache, boolean documentsCache) throws IOException {
          public CachedSearcher getSearcher() throws IOException {

          /** The general consumer searcher to be used when querying this index. Always fresh. */
          public Searcher getSearcher() throws IOException {

          /** The general consumer read only index reader to be used when inspecting this index. Always fresh. */
          public IndexReader getReader() throws IOException {

          Karl Wettin added a comment -

          Refactored the Term->Document relationships a bit for speed optimizations. As a side effect, all term frequency vector information except for offsets now comes free of charge. More information on that in the class diagram.

          Removed a whole bunch of todo:s in the writer and reader.

          The current lock implementation is worthless. I need to read up on ReentrantLock. Or should I perhaps use the locks that Directory implementations use?

          (And that class diagram is of course granted for ASF, my mistake.)

          Karl Wettin added a comment -

          new diagram with lots of notes
          (this is also available in the patch as an uxf-file for umlet)

          Karl Wettin added a comment -

          Patch of the week.

          Changes:

          • CachedSearcher – soft referenced hit collection-, TopDocs- and TopFieldDocs cache. Backed by NotifiableIndex.

          Removed Hits cache due to uncertainty but introduced:

          • CachedIndexReader – soft referenced documents cache. Backed by NotifiableIndex.

          TopDocs/TopFieldDocs- and IndexReader cache combined almost replace a fully cached Hits.
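
          A soft referenced cache that is flushed when the index changes could look roughly like the sketch below. It is an illustration only: SoftCache, its loader function and indexChanged() are hypothetical names made up here, and the real CachedSearcher and NotifiableIndex in the patch may be structured quite differently.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class SoftCacheSketch {

    /** Cache whose values are softly referenced, so the JVM may reclaim them under memory pressure. */
    static final class SoftCache<K, V> {
        private final Map<K, SoftReference<V>> entries = new HashMap<>();
        private final Function<K, V> loader;
        int loads = 0; // number of times the loader actually ran

        SoftCache(Function<K, V> loader) {
            this.loader = loader;
        }

        synchronized V get(K key) {
            SoftReference<V> ref = entries.get(key);
            V value = (ref == null) ? null : ref.get();
            if (value == null) { // never cached, or cleared by the garbage collector
                value = loader.apply(key);
                loads++;
                entries.put(key, new SoftReference<>(value));
            }
            return value;
        }

        /** To be called by whatever notifies listeners that the index was modified. */
        synchronized void indexChanged() {
            entries.clear();
        }
    }
}
```

          A searcher-side cache would use the query (plus sort and filter) as the key and the result object as the value; clearing everything on any index change is the simplest invalidation policy, not necessarily the one used in the patch.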

          The number of unit tests and detail of them is increasing.

          The plan is now to have the cached reader pre-load documents into memory from its own thread when server load allows it.

          Also added some abstraction levels used by the above:

          • AutofreshedIndexReader – always up to date with the index.
          • ReadOnlyIndexReader – makes sure the user doesn't delete stuff with the decorated reader.

          Had some problems with decorating the IndexModifierInterface against Directory in NotifiableIndex, so removed the Index.indexModifierFactory() and introduced an index facade backed version:

          org.apache.lucene.index.facade.IndexModifier(myIndex, analyzer, create)

          where all reader/writer creation is myIndex.indexReaderFactory() and indexWriterFactory();

          Makes the Notifiable code a bit simpler.

          Karl Wettin added a comment -

          New Sunday, new code.

          Hoss Man [15/Jan/07 12:16 AM]
          > I've only briefly looked at the new stuff in contrib, because I got lost ... there isn't
          > any package or class level javadocs or a build.xml in either contrib.

          Tried to do something about the Javadocs. Also made a fresh class diagram with some comments in it. I can make it PDF or XUL if preferred.

          That boxing error you fixed might be back. Where was it? I could not find it in the patch (all additions and no -+ fix) and it was too late to apply your patch to my local version.

          > Hoss Man [15/Jan/07 12:16 AM]
          >
          > 1) some of these changes seem to be duplicated in LUCENE-774 and LUCENE-775
          > ... just pointing that out for other people who might get confused.

          Is it considered better practice to keep all my changes in this one huge issue? I thought it could be nice to pop in minor patches such as those.

          > 4) I would personally prefer..
          > but that's a minor nit.

          There has been a lot of refactoring of packages and class names as suggested. (I'm still not happy with the notification listener classes.)

          A few new changes to the core:

          Lazy initialization of the fields collection in Document.

          Some definalization to allow decoration of IndexReader.
          http://www.nabble.com/IndexReader-can-not-be-decorated-tf3041647.html#a8461125

          > Hoss Man [15/Jan/07 12:16 AM]
          >
          > 3) i don't think the Hits.setSearcher method you added is safe

          It smeared out on java-dev: http://www.nabble.com/Decorative-cache-%28and-Hits.setSearcher%29-tf3009848.html#a8428139

          I did not investigate this any further with test code, but I have identified lazy fields as a problem. Instead I'm considering a supplementary decorated document cache on the IndexReader, and implementing a replacement for Hits.

          Hoss Man [15/Jan/07 12:39 AM]
          > I just realized that all of the tests in contrib/instantiated/src/test/java/org/apache/lucene/
          > instantiated/assimilated/ are duplicates of tests from the core with a few line changes
          > so they use an InstantiatedIndex to get a reader/writer/searcher etc.

          This is not a bad idea at all, but I will not have time to do it right anytime soon. It would be a simpler task if the facade was a part of the core, as this is just the thing it was built for – unison index handling.

          Hoss Man [15/Jan/07 01:35 AM]
          > Then i ran the tests, and got some errors – which are included in test-reports.zip so you can check them out.

          What tool do you recommend to inspect these reports?

          I know for a fact that remote searchable will fail. I hope for someone to show up, need it and fix it.

          Karl Wettin added a comment -

          Thanks a lot, Hoss, for taking the time. I sure do appreciate it.

          I'll get back on your comments.

          Hoss Man added a comment -

          Karl: the trunk.diff i just attached fixes a small autoboxing dependency your patch introduced into the core (preventing compilation on java 1.4). I also added build.xml files to the new contrib dirs and rearranged the directory layout of the contribs so they match the default for contribs and the build.xml files could be simple. Once i did this i discovered some unnecessary dependencies on commons-logging that i removed. Then i ran the tests, and got some errors – which are included in test-reports.zip so you can check them out.

          Hoss Man added a comment -

          I just realized that all of the tests in contrib/instantiated/src/test/java/org/apache/lucene/instantiated/assimilated/ are duplicates of tests from the core with a few line changes so they use an InstantiatedIndex to get a reader/writer/searcher etc.

          I think it would be much better if we changed the original versions of these tests to include accessors for constructing/fetching those objects, which could be overridden by subclasses in your contrib – that way any bugs found/fixed in those test classes and any additional test methods added to those classes would automatically be inherited by your versions (instead of winding up with duplicate cut/paste test code)
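
          The accessor approach suggested above might be sketched like this. Everything here is hypothetical and only illustrates the pattern: TinyIndex and the two implementations are toy stand-ins, not Lucene classes, and real tests would extend the existing JUnit test cases rather than these plain classes.

```java
import java.util.ArrayList;
import java.util.List;

final class SharedTestSketch {

    /** Hypothetical minimal index abstraction used only by this sketch. */
    interface TinyIndex {
        void add(String doc);
        int numDocs();
    }

    /** Stands in for the default index implementation used by the core tests. */
    static class ListIndex implements TinyIndex {
        private final List<String> docs = new ArrayList<>();
        @Override public void add(String doc) { docs.add(doc); }
        @Override public int numDocs() { return docs.size(); }
    }

    /** Stands in for an alternate implementation living in a contrib. */
    static class CountingIndex implements TinyIndex {
        private int count = 0;
        @Override public void add(String doc) { count++; }
        @Override public int numDocs() { return count; }
    }

    /** Core-style test: all index construction goes through one overridable accessor. */
    static class CoreIndexTest {
        protected TinyIndex newIndex() {
            return new ListIndex();
        }

        void testAddIncrementsNumDocs() {
            TinyIndex index = newIndex();
            index.add("doc");
            if (index.numDocs() != 1) {
                throw new AssertionError("expected exactly one document");
            }
        }
    }

    /** Contrib-style test: inherits every test method and swaps only the implementation. */
    static class ContribIndexTest extends CoreIndexTest {
        int created = 0; // counts how often the overridden accessor was used

        @Override
        protected TinyIndex newIndex() {
            created++;
            return new CountingIndex();
        }
    }
}
```

          With this shape, a test method added to the core-style class later runs against the alternate implementation without any copying.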

          Hoss Man added a comment -

          I've been trying to follow the work you've been doing Karl, but i must admit a lot of it is over my head – but since i've got a long weekend and your patch now makes so few changes to the core i could actually make sense of that part, so here are some comments on those changes...

          1) some of these changes seem to be duplicated in LUCENE-774 and LUCENE-775 ... just pointing that out for other people who might get confused.

          2) since the new ScoreDoc.docComparator and ScoreDoc.scoreComparator are public, they should have some javadocs clarifying what they are for.

          3) i don't think the Hits.setSearcher method you added is safe ... i believe that at a minimum hitDocs, first, last, and weight all need to be reset – weight's a tricky one since the instance doesn't currently hang on to the original query.

          4) I would personally prefer IndexWriterInterface and IndexModifierInterface over InterfaceIndexWriter and InterfaceIndexModifier – if for no other reason than so they sort together ... but that's a minor nit.

          I've only briefly looked at the new stuff in contrib, because I got lost ... there isn't any package or class level javadocs or a build.xml in either contrib. A big thing i did notice is that the code in indexfacade puts things in the o.a.l.search and o.a.l.index packages, which is being discouraged for contribs (among other reasons it makes it confusing to understand where a class is coming from). Ideally those classes should live under o.a.l.indexfacade.search and o.a.l.indexfacade.index (or maybe just o.a.l.facade - but you get the idea)

          Karl Wettin added a comment -

          New patch has all assimilated test cases moved to a new non conflicting package.

          Also contains contrib/cache that depends on everything else.

          Karl Wettin added a comment -

          Doug Cutting [12/Jan/07 10:16 AM]
          > I don't see a patch file here. Your proposal would be easier to evaluate as a patch file.

          Attached!

          > easier to accept if your new classes are in the contrib tree.

          There are a couple of changes in the core, the rest has been moved to contrib/indexfacade and contrib/instantiated. There is some clean up to do: a couple of static tests in instantiated. And perhaps some common logging artifacts left from debugging.

          I'm quite certain that both contrib packages depend on Java 1.5. At least the concurrency code in instantiated does.

          Doug Cutting added a comment -

          I don't see a patch file here. Your proposal would be easier to evaluate as a patch file. Also, a contribution like this will be easier to accept if your new classes are in the contrib tree. Then, if they prove popular, they can move into the core. Or perhaps folks will find them so obviously useful they'll want them in the core from the start, but contrib would require less convincing.

          Karl Wettin added a comment -

          > Jira admins: you are more than welcome to remove all old attachments, except images.

          oh, i had no clue my status was upgraded. cool. fixed it myself.

          Karl Wettin added a comment -

          This is the current version of my local Lucene branch, including InstantiatedIndex. As I have not merged with the trunk for a while, it also features my locally patched version. It really is just a few small changes. Some classes are no longer final, plus I have introduced InterfaceIndexWriter and InterfaceIndexModifier.

          /lucene2karl/lucene2-apache-karl-patched
          /lucene2karl/lucene2-karl/test <--- all (search) test cases adapted to run with instantiated index
          /lucene2karl/lucene2-karl/index
          /lucene2karl/lucene2-karl/instantiated
          /lucene2karl/lucene2-karl/searchfork <--- non important stuff
          /lucene2karl/lucene2-karl/analysis <--- just some stuff
          /lucene2karl/lucene2-karl/core <-- patches for the lucene trunk
          /memoryindex <--- stuff for wolfgang

          All tests pass, except remote, multi and parallel searchers.

          Jira admins: you are more than welcome to remove all old attachments, except images.

          wolfgang hoschek added a comment -

          I've now checked in a version of MemoryIndexTest into contrib/memory that more easily allows to switch between measuring indexing or querying. Example output for measuring query throughput on simple term queries: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.

          wolfgang hoschek added a comment -

          > All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries pass.

          Sounds like you're almost there

          Regarding indexing performance with MemoryIndex: Performance is more than good enough. I've observed and measured that often the bottleneck is not the MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalyzer) or the I/O for the input files or term lower casing (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809) or something else entirely.

          Regarding query performance with MemoryIndex: Some queries are more efficient than others. For example, fuzzy queries are much less efficient than wild card queries, which in turn are much less efficient than simple term queries. Such effects seem partly inherent due to the nature of the query type, partly a function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and partly a consequence of the overall Lucene API design.

          The query mix found in testqueries.txt is more intended for correctness testing than benchmarking. Therein, certain query types dominate over others, and thus, conclusions about the performance of individual aspects cannot easily be drawn.

          Wolfgang.

          Karl Wettin added a comment -

          wolfgang hoschek [21/Nov/06 12:50 PM]
          > Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in

          All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries passes. I have not looked into why, as I don't use them (yet). And I have written an in-depth index comparator to make sure that an InstantiatedIndex equals a Directory implementation. Hence I have already verified that the index works as expected.

          Today's postings from me are more to show that InstantiatedIndex is /almost/ as fast as MemoryIndex and could thus be an interesting replacement. As it handles more than one document, it might even be preferable in some cases.

          I will however run your suggested tests tomorrow and report back.
          And post the latest patches, including my adaptation of your unit test, in case you want to explore it yourself.

          Karl Wettin added a comment -

          > > diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388
          >
          > Actually, diff != 0 means the test fails, unless the diff is very small due to rounding error, say 10E-9.
          > The driver should report an IllegalStateException("BUG DETECTED:"

          Right, that was a bug in my code. The diff /output/ was calculated as scoreMEM - scoreRAM (where scoreMEM is 0) and not scoreII - scoreRAM ; )

          wolfgang hoschek added a comment -

          > diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388

          Actually, diff != 0 means the test fails, unless the diff is very small due to rounding error, say 10E-9. The driver should report an IllegalStateException("BUG DETECTED:"
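
          The check described here can be sketched roughly as follows. This is a minimal illustration, not the driver's actual code: the class name, method name, tolerance constant, and the text after "BUG DETECTED:" are all assumptions made for the example.

```java
public final class ScoreDiffCheck {
    // Assumed tolerance for float rounding noise; 10E-9 as suggested in the thread.
    static final float EPSILON = 10E-9f;

    /**
     * Returns scoreII - scoreRAM, or throws if the two scores disagree
     * by more than EPSILON. The message format is hypothetical.
     */
    static float checkedDiff(String query, float scoreII, float scoreRAM) {
        float diff = scoreII - scoreRAM;
        if (Math.abs(diff) > EPSILON) {
            throw new IllegalStateException("BUG DETECTED: query=" + query
                    + ", scoreII=" + scoreII + ", scoreRAM=" + scoreRAM);
        }
        return diff;
    }

    public static void main(String[] args) {
        // Identical scores: the diff is zero and the check passes.
        System.out.println(checkedDiff("term*", 0.024093388f, 0.024093388f));
        try {
            // A score of 0 against a non-zero reference score must be rejected.
            checkedDiff("term*", 0.0f, 0.024093388f);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

          With a check like this, a line such as diff=-0.024093388 next to two equal scores could only come from a mistake in how the diff was computed or printed, not from the comparison itself.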

          wolfgang hoschek added a comment -

          Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in

          src/test/org/apache/lucene/index/memory/testqueries.txt

          against matching files such as

          String[] files = listFiles(new String[] {
            "*.txt", //"*.html", "*.xml", "xdocs/*.xml",
            "src/java/test/org/apache/lucene/queryParser/*.java",
            "src/java/org/apache/lucene/index/memory/*.java",
          });

          See testMany() for details. Repeat for various analyzer, stopword, and toLowerCase settings, such as

          boolean toLowerCase = true;
          // boolean toLowerCase = false;
          // Set stopWords = null;
          Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

          Analyzer[] analyzers = new Analyzer[] {
            // new SimpleAnalyzer(),
            // new StopAnalyzer(),
            // new StandardAnalyzer(),
            PatternAnalyzer.DEFAULT_ANALYZER,
            // new WhitespaceAnalyzer(),
            // new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null),
            // new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, stopWords),
            // new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
          };

          Karl Wettin added a comment -

          wolfgang hoschek [21/Nov/06 10:22 AM]
          > Other question: when running the driver in test mode (checking for equality of query
          > results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great!

          It sure does!

          xfiles = [./CHANGES.txt, ./LICENSE.txt]

                              1. iteration=0
                              • FILE=./CHANGES.txt
                                diff=-0.020341659, query=term, scoreII=0.020341659, scoreRAM=0.020341659
                                diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388
                                diff=-0.025180675, query=term~, scoreII=0.025180675, scoreRAM=0.025180675
                                diff=-0.018685007, query=Apache, scoreII=0.018685007, scoreRAM=0.018685007
                                diff=-0.014089426, query=Apach~ AND Copy*, scoreII=0.014089426, scoreRAM=0.014089426
                              • FILE=./LICENSE.txt
                                diff=0.0, query=term, scoreII=0.0, scoreRAM=0.0
                                diff=-0.027122213, query=term*, scoreII=0.027122213, scoreRAM=0.027122213
                                diff=-0.028767452, query=term~, scoreII=0.028767452, scoreRAM=0.028767452
                                diff=-0.023488527, query=Apache, scoreII=0.023488527, scoreRAM=0.023488527
                                diff=-0.043373547, query=Apach~ AND Copy*, scoreII=0.043373547, scoreRAM=0.043373547

          secs = 3.766
          queries/sec= 2.655337
          MB/sec = 0.083386995
          No bug found. done.

          Process finished with exit code 0

          wolfgang hoschek added a comment -

          Other question: when running the driver in test mode (checking for equality of query results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great!

          wolfgang hoschek added a comment -

          What's the benchmark configuration? For example, is throughput bounded by indexing or querying? Measuring N queries against a single preindexed document vs. 1 precompiled query against N documents? See the line

          boolean measureIndexing = false; // toggle this to measure query performance

          in my driver. If measuring indexing, what kind of analyzer / token filter chain is used? If measuring queries, what kind of query types are in the mix, with which relative frequencies?

          You may want to experiment with modifying/commenting/uncommenting various parts of the driver setup, for any given target scenario. Would it be possible to post the benchmark code, test data, queries for analysis?

          Karl Wettin added a comment -

          Here is what I just sent to Wolfgang. I've adapted his bench test case to also work with InstantiatedIndex. It is worth noting that this is a test with one document only, and that the speed is not linear according to my previous tests. InstantiatedIndex is much more than 3x faster than RAMDirectory in a larger index. So this is really only to compare MemoryIndex with InstantiatedIndex, and not a benchmark against RAMDirectory.

          RAMDirectory:

          secs = 95.159
          queries/sec= 315.26184
          MB/sec = 9.900338
          Done benchmarking (without checking correctness).

          MemoryIndex:

          secs = 26.692
          queries/sec= 1123.9323
          MB/sec = 35.295456
          Done benchmarking (without checking correctness).

          InstantiatedIndex:

          secs = 27.44
          queries/sec= 1093.2944
          MB/sec = 34.333317
          Done benchmarking (without checking correctness).

          MemoryIndex is a bit faster than InstantiatedIndex. But I'm aware of a couple of small optimizations I can do.
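
          To put the three runs above side by side, here is a small sketch that derives relative speed and the implied query count from the reported figures. The class and method names are made up for this illustration; only the secs and queries/sec values come from the output above.

```java
public final class BenchRatios {
    /** How many times faster a run was than the baseline, from wall-clock seconds. */
    static double speedup(double baselineSecs, double secs) {
        return baselineSecs / secs;
    }

    /** Number of queries implied by a run's duration and its reported throughput. */
    static long impliedQueries(double secs, double queriesPerSec) {
        return Math.round(secs * queriesPerSec);
    }

    public static void main(String[] args) {
        // Figures copied from the benchmark output above.
        double ramSecs = 95.159, memSecs = 26.692, iiSecs = 27.44;

        System.out.printf("MemoryIndex vs RAMDirectory:       %.2fx%n", speedup(ramSecs, memSecs));
        System.out.printf("InstantiatedIndex vs RAMDirectory: %.2fx%n", speedup(ramSecs, iiSecs));
        System.out.printf("MemoryIndex vs InstantiatedIndex:  %.2fx%n", speedup(iiSecs, memSecs));

        // All three runs should imply the same amount of work.
        System.out.println(impliedQueries(ramSecs, 315.26184));
        System.out.println(impliedQueries(memSecs, 1123.9323));
        System.out.println(impliedQueries(iiSecs, 1093.2944));
    }
}
```

          On these single-document numbers both in-memory implementations come out roughly 3.5x faster than RAMDirectory, and within a few percent of each other.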

          Dejan Nenov added a comment -

          And while we wait, may we please have high-res PNGs, so that the zoomed-in versions are a little more readable?

          Karl Wettin added a comment -

          > Can we please get the class diagrams in PDF format -
          > the PNGs are so tiny - they are unreadable

          Shameless promotion:

          I'm actually in the process of porting all my old diagrams to <http://www.appliedmodels.com/>, a fantastic MDA tool a friend of mine just released to the public. So quite soon there will be new diagrams. Perhaps even PDF.

          Until then you're stuck with zooming.

          Dejan Nenov added a comment -

          Can we please get the class diagrams in PDF format - the PNGs are so tiny - they are unreadable

          Karl Wettin added a comment -

          Performance from a live environment:

          • 150,000 documents, average size 2K.
          • Consumes 2x the memory of a RAMDirectory.
          • The average user query matches 90 documents.
          • RAMDirectory takes 60x more time to collect and instantiate the resulting documents.

          I would very much appreciate it if someone with knowledge of the scoring code could take a look at the seven final(tm) failing tests. Their failing is not a problem for me, but it would be nice if they passed.

          Karl Wettin added a comment -

          Updated to match the current svn with Fieldable, etc.

          All changes to Lucene core are now gathered in a small patch (de-finalized Document and Term) and one new class (InterfaceIndexWriter implemented by IndexWriter in patch) instead of attaching the whole trunk.

          Still fails a few score- and RMI-tests.

          Karl Wettin added a comment -

          New code. More backwards compatible. Just a very few changes required to the Lucene core.

          Now with test cases from the distribution, but only search/* has been ported. It fails some (11 of 172) score- and RMI-related tests that I cannot explain. I could really use some help with that.

          Except for that, this seems to work really well now. I've been running this in a live environment for a few hours (some hundred thousand user queries) and it is really fast.

          Output from failing tests:

          junit.framework.AssertionFailedError: expected:<3> but was:<0>
          at org.apache.lucene.search.TestPhraseQuery.testSlopScoring(TestPhraseQuery.java:298)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          junit.framework.AssertionFailedError: Using 10 documents per index:
          at org.apache.lucene.search.TestMultiSearcher.testNormalization(TestMultiSearcher.java:247)
          at org.apache.lucene.search.TestMultiSearcher.testNormalization10(TestMultiSearcher.java:220)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          ------- testSimpleEqualScores1 -------
          #0: 1.000000000 - d3
          #1: 1.000000000 - d4
          #2: 0.500000000 - d1
          #3: 0.500000000 - d2

          junit.framework.AssertionFailedError: score #2 is not the same expected:<1.0> but was:<0.5>
          at org.apache.lucene.search.TestDisjunctionMaxQuery.testSimpleEqualScores1(TestDisjunctionMaxQuery.java:142)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          ------- testSimpleEqualScores2 -------
          #0: 1.000000000 - d2
          #1: 0.500000000 - d1
          #2: 0.500000000 - d4

          junit.framework.AssertionFailedError: score #1 is not the same expected:<1.0> but was:<0.5>
          at org.apache.lucene.search.TestDisjunctionMaxQuery.testSimpleEqualScores2(TestDisjunctionMaxQuery.java:166)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          ------- testSimpleEqualScores3 -------
          #0: 1.000000000 - d2
          #1: 1.000000000 - d3
          #2: 1.000000000 - d4
          #3: 0.500000000 - d1

          junit.framework.AssertionFailedError: score #3 is not the same expected:<1.0> but was:<0.5>
          at org.apache.lucene.search.TestDisjunctionMaxQuery.testSimpleEqualScores3(TestDisjunctionMaxQuery.java:191)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          junit.framework.AssertionFailedError: A,B,D, only B in range expected:<1> but was:<2>
          at org.apache.lucene.search.TestRangeQuery.testExclusive(TestRangeQuery.java:39)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          junit.framework.AssertionFailedError: A,B,D - A and B in range expected:<2> but was:<5>
          at org.apache.lucene.search.TestRangeQuery.testInclusive(TestRangeQuery.java:63)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          junit.framework.AssertionFailedError: Using 10 documents per index:
          at org.apache.lucene.search.TestMultiSearcher.testNormalization(TestMultiSearcher.java:247)
          at org.apache.lucene.search.TestMultiSearcher.testNormalization10(TestMultiSearcher.java:220)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          java.rmi.server.ExportException: internal error: ObjID already in use
          at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197)
          at sun.rmi.transport.Transport.exportObject(Transport.java:90)
          at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231)
          at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398)
          at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131)
          at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195)
          at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107)
          at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93)
          at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198)
          at org.apache.lucene.search.TestSort.startServer(TestSort.java:704)
          at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689)
          at org.apache.lucene.search.TestSort.testRemoteSort(TestSort.java:410)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          java.rmi.server.ExportException: internal error: ObjID already in use
          at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197)
          at sun.rmi.transport.Transport.exportObject(Transport.java:90)
          at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231)
          at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398)
          at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131)
          at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195)
          at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107)
          at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93)
          at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198)
          at org.apache.lucene.search.TestSort.startServer(TestSort.java:704)
          at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689)
          at org.apache.lucene.search.TestSort.testRemoteCustomSort(TestSort.java:417)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

          java.rmi.server.ExportException: internal error: ObjID already in use
          at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197)
          at sun.rmi.transport.Transport.exportObject(Transport.java:90)
          at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231)
          at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398)
          at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131)
          at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195)
          at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107)
          at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93)
          at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198)
          at org.apache.lucene.search.TestSort.startServer(TestSort.java:704)
          at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689)
          at org.apache.lucene.search.TestSort.testNormalizedScores(TestSort.java:440)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)
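
          For reference, the hit counts the two TestRangeQuery assertions above expect can be reproduced with a plain sorted set. This is only a sketch of the expected range semantics, not of Lucene's RangeQuery or of InstantiatedIndex. The terms A, B, D come from the assertion messages; the bounds A and C are an assumption, since the thread does not state them.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

public final class RangeSemantics {
    /** Counts the terms between lower and upper, with both ends inclusive or both exclusive. */
    static int countInRange(NavigableSet<String> terms, String lower, String upper, boolean inclusive) {
        return terms.subSet(lower, inclusive, upper, inclusive).size();
    }

    public static void main(String[] args) {
        // Terms taken from the assertion messages ("A,B,D").
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList("A", "B", "D"));

        // Assumed bounds A..C (hypothetical, not given in the thread).
        System.out.println(countInRange(terms, "A", "C", false)); // exclusive: only B
        System.out.println(countInRange(terms, "A", "C", true));  // inclusive: A and B
    }
}
```

          Under these assumptions the exclusive range yields 1 hit and the inclusive range 2, which are the expected values in the failing assertions.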

sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107) at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93) at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198) at org.apache.lucene.search.TestSort.startServer(TestSort.java:704) at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689) at org.apache.lucene.search.TestSort.testRemoteSort(TestSort.java:410) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90) java.rmi.server.ExportException: internal error: ObjID already in use at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197) at sun.rmi.transport.Transport.exportObject(Transport.java:90) at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231) at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398) at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131) at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195) at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107) at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93) at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198) at org.apache.lucene.search.TestSort.startServer(TestSort.java:704) at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689) at org.apache.lucene.search.TestSort.testRemoteCustomSort(TestSort.java:417) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90) java.rmi.server.ExportException: internal error: ObjID already in use at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197) at sun.rmi.transport.Transport.exportObject(Transport.java:90) at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231) at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398) at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131) at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195) at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107) at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93) at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198) at org.apache.lucene.search.TestSort.startServer(TestSort.java:704) at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689) at org.apache.lucene.search.TestSort.testNormalizedScores(TestSort.java:440) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)
          Hide
          Karl Wettin added a comment -

          In order to find the norm error I ported all test cases. I'm sorry to report that 70 of them fail.

          So if anyone is using this code, don't.

          Hopefully most of the failures share the same cause. I'll be working on the code this weekend, and perhaps a few days next week if needed.

          Karl Wettin added a comment -

          A comment on memory usage: about 2x that of a RAMDirectory (900 MB versus 1800 MB) on a 150,000-document corpus (once the corpus term count has been reached?)

          Karl Wettin added a comment -

          To make this index work flawlessly (I hope), remove the if-statement around the following row in InstatiatedIndexWriter (row 477 or so):

          termDocumentInformation.termPositions.add(fieldSettings.position);

          This will fix the term position bug noted in an earlier comment.

          I'll keep posting bugfixes as comments here, but the actual work happens in my branch of Lucene 2.0.0, available here: http://www.ginandtonique.org/trac/snigel/wiki/Lucene2-karl

          If someone feels that this layer is an interesting thing to add to Lucene, let me know what is required for commit and I'll make those changes. It still seems to be about 40 times faster than RAMDirectory (mean value on a "normal" index with a "normal" amount of terms; I have seen 20x-200x) when comparing search and document retrieval time combined.

          Karl Wettin added a comment -

          There is a bug with phrase queries, possibly in the term positions. Low priority for me.

          Karl Wettin added a comment -

          > I'll come with a new number soon enough.

          Right, it was 25% faster. So forget everything I said about anything.

          Karl Wettin added a comment -

          > If everything works as it should

          It doesn't. I keep celebrating victories in advance. I'll try not to in the future. So forget about the 1500. I'll come with a new number soon enough.

          Karl Wettin added a comment -

          ArrayIndexOutOfBounds bugfix.

          If everything works as it should (I think so) then I'm happy to report that a FuzzyQuery seems to be about 1500 (one thousand five hundred) times faster on this memory implementation than on a RAMDirectory. The speed is gained by not creating a new instance of each Term in a TermEnum.
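
The idea of not creating a new instance per enumerated term can be sketched as below. This is a minimal, self-contained illustration and not the code from the patch: `SimpleTerm`, `AllocatingTermEnum` and `InstantiatedTermEnum` are hypothetical names that only mimic the shape of Lucene's `Term` and `TermEnum`, and the sketch makes no claim about the size of the speedup.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a term: an immutable field/text pair. */
final class SimpleTerm {
    final String field;
    final String text;

    SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

/** Mimics a store-backed enumeration: every call to term() builds a fresh object. */
final class AllocatingTermEnum {
    private final String[][] raw; // {field, text} pairs as they might be decoded from a store
    private int pos = -1;
    int allocations = 0;          // counts how many SimpleTerm objects were created

    AllocatingTermEnum(String[][] raw) {
        this.raw = raw;
    }

    boolean next() {
        return ++pos < raw.length;
    }

    SimpleTerm term() {
        allocations++;
        return new SimpleTerm(raw[pos][0], raw[pos][1]);
    }
}

/** Mimics an instantiated index: the terms already exist as objects and are handed out as-is. */
final class InstantiatedTermEnum {
    private final List<SimpleTerm> terms;
    private int pos = -1;

    InstantiatedTermEnum(List<SimpleTerm> terms) {
        this.terms = terms;
    }

    boolean next() {
        return ++pos < terms.size();
    }

    SimpleTerm term() {
        return terms.get(pos); // no allocation, same instance every time
    }
}

class TermEnumSketch {
    public static void main(String[] args) {
        String[][] raw = { {"body", "apple"}, {"body", "apply"}, {"body", "maple"} };

        AllocatingTermEnum allocating = new AllocatingTermEnum(raw);
        while (allocating.next()) {
            allocating.term();
        }
        System.out.println("objects created while enumerating: " + allocating.allocations);

        List<SimpleTerm> terms = new ArrayList<SimpleTerm>();
        for (String[] pair : raw) {
            terms.add(new SimpleTerm(pair[0], pair[1])); // created once, at indexing time
        }
        InstantiatedTermEnum instantiated = new InstantiatedTermEnum(terms);
        int visited = 0;
        while (instantiated.next()) {
            instantiated.term();
            visited++;
        }
        System.out.println("terms visited without creating objects: " + visited);
    }
}
```

A query such as FuzzyQuery walks a large part of the term dictionary, so the per-term cost of an enumeration is paid many times per search; that is where reusing instances would be expected to matter most.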

          Karl Wettin added a comment -

          This update makes InstanciatedIndex compatible with Lucene, given that issues 580 and 581 are adopted.

          It depends on generics and concurrent locks from J2SE 5.0.

          Contains one update in Field:

          public void setFieldData(Object fieldData)

          And one in Document:

          public List<Field> getFields() { return fields; }

          Karl Wettin added a comment -

          This is the diagram of InstanciatedIndex as of 1.9-karl1

          Karl Wettin added a comment -

          Doug Cutting commented on LUCENE-550:

          > This looks very promising. Unfortunately the code you provide makes many incompatible API
          > changes (e.g., turning Term into an interface that has far fewer methods) removes lots of
          > useful javadoc, etc. So please don't expect it to be committed soon!

          I agree, there is lots of work to be done on it. It was easier for me to think clearly when everything was separated. Basically only a few changes to the API are needed:

          1. Neither Document nor Term may be final.
          2. Some other minor thing that I forgot about.

          It can all be fixed, but it is nothing that I prioritize right now. If you feel it would be a nice thing for 2.0, tell me what changes you are OK with and give me at least two weeks' notice, and I /might/ find time to back-factor the code.

          Doug Cutting added a comment -

          This looks very promising. Unfortunately the code you provide makes many incompatible API changes (e.g., turning Term into an interface that has far fewer methods), removes lots of useful javadoc, etc. So please don't expect it to be committed soon!

          A back-compatible way to add an interface is to add it above the old class. So you might add a TermInterface, AbstractTerm, and TermImpl, then change Term to extend TermImpl and deprecate it.

          Then there's also the question of whether you really must convert Term to an interface. I would not undertake that change for aesthetic reasons. Is it really required to achieve your goals? You should generally try hard to minimize the size of your diffs and maximize the back-compatibility.
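
The layering suggested above could look roughly like the sketch below. It is an illustration only, under stated assumptions: the interface is spelled `TermInterface`, only the `field()` and `text()` accessors are modeled, and the bodies are invented for the example rather than taken from Lucene or from the patch.

```java
/** New interface added above the old class. */
interface TermInterface {
    String field();
    String text();
}

/** Shared behaviour that any implementation can inherit. */
abstract class AbstractTerm implements TermInterface, Comparable<TermInterface> {
    public int compareTo(TermInterface other) {
        int c = field().compareTo(other.field());
        return c != 0 ? c : text().compareTo(other.text());
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TermInterface)) {
            return false;
        }
        TermInterface t = (TermInterface) o;
        return field().equals(t.field()) && text().equals(t.text());
    }

    @Override
    public int hashCode() {
        return 31 * field().hashCode() + text().hashCode();
    }

    @Override
    public String toString() {
        return field() + ":" + text();
    }
}

/** The default implementation that new code is meant to use. */
class TermImpl extends AbstractTerm {
    private final String field;
    private final String text;

    TermImpl(String field, String text) {
        this.field = field;
        this.text = text;
    }

    public String field() {
        return field;
    }

    public String text() {
        return text;
    }
}

/** The old class keeps its name and constructor, so existing callers still compile. */
@Deprecated
class Term extends TermImpl {
    Term(String field, String text) {
        super(field, text);
    }
}
```

Existing code that says `new Term("body", "lucene")` keeps working and only gains a deprecation warning, while new code can accept a `TermInterface` and be handed either the old class or an alternative implementation.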

          Karl Wettin added a comment -

          There is a minor norms bug. The values differ by ±3 from the Directory norms. Other than that it seems to work great.

          Now about 40x faster than RAMDirectory.

          Stats for test: 500 documents. 1-5K text content.
          10 000 * 5 span queries.
          10 000 * 13 term and boolean term queries.
          Collected the top 100 documents for each search result.

          InstanciatedIndex is 40x faster than the RAMDirectory.

          InstanciatedIndex running on Lucene 1.9-karl1
          Corpus creation took 14903 ms.
          Span queries took 12884 ms.
          Term queries took 30221 ms.

          RAMDirectory running on Lucene 1.9
          Corpus creation took 9337 ms.
          Span queries took 253412 ms.
          Term queries took 1188492 ms.
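
The ratios behind the headline figure can be recomputed from the timings listed above. The snippet below is plain arithmetic on those numbers; `SpeedupSketch` and `speedup` are names made up for this illustration.

```java
class SpeedupSketch {
    /** Ratio of baseline time to candidate time; above 1 means the candidate is faster. */
    static double speedup(long baselineMs, long candidateMs) {
        return (double) baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        // Timings in milliseconds, copied from the comment above.
        long iiSpan = 12884, iiTerm = 30221, iiCorpus = 14903;
        long ramSpan = 253412, ramTerm = 1188492, ramCorpus = 9337;

        System.out.printf("span queries:    %.1fx%n", speedup(ramSpan, iiSpan));
        System.out.printf("term queries:    %.1fx%n", speedup(ramTerm, iiTerm));
        // Corpus creation goes the other way: a value below 1 means the candidate is slower.
        System.out.printf("corpus creation: %.2fx%n", speedup(ramCorpus, iiCorpus));
    }
}
```

On these figures the term-query ratio is close to the quoted 40x, the span-query ratio is nearer 20x, and corpus creation is slower than with RAMDirectory.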

          Karl Wettin added a comment -

          Oops

          InstanciatedIndex:
          Corpus creation took 14011 ms.
          Term queries took 33608 ms.

          RAMDirectory:
          Corpus creation took 9144 ms.
          Term queries took 1123565 ms.

          That is 35x the speed.

          Something might be wrong. But my initial tests tell me that it is right. Will look into this tomorrow. Need to sleep now.

          Karl Wettin added a comment -

          Some new statistics.

          • A corpus of 500 documents, 1-5K text per document.
          • Placed 150 000 term and boolean queries.
          • Retrieved the top <100 hits from each result.

          Query alone is about 5x faster,
          but 9x if you include the hits collection.

          I believe that span queries will be about 10x-20x faster, as skipTo() is really, really optimized. There is a bug in my term position code, so I have not been able to measure it for real yet.

          Hope to have that working and an updated class diagram for you soon.

          Karl Wettin added a comment -

          This is a class diagram that explains what it will look like when I'm done.

          It is pretty much only the IndexReader that needs to be refactored.

          Karl Wettin added a comment -

          Due to read and write locks, this is how one must use the extension:

          InstanciatedIndex ii = new InstanciatedIndex();

          IndexWriter iw = ii.new InstanciatedIndexWriter(analyzer, clear); // locks
          iw.close(); // commits

          IndexReader ir = ii.new InstanciatedIndexReader();

          Searcher searcher = ii.getSearcher();
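
The lock-on-create, commit-on-close lifecycle described above can be sketched with `java.util.concurrent.locks.ReentrantReadWriteLock` from J2SE 5.0. This is an assumption about the general pattern, not the patch's implementation: `LockedIndexSketch`, its inner `Writer` and `Reader`, and the use of plain strings as documents are all invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockedIndexSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> committed = new ArrayList<String>();

    /** Inner writer: takes the write lock when created, commits and unlocks on close(). */
    class Writer {
        private final List<String> pending = new ArrayList<String>();
        private boolean open = true;

        Writer() {
            lock.writeLock().lock(); // readers in other threads now wait until close()
        }

        void addDocument(String doc) {
            pending.add(doc); // buffered, not yet visible to readers
        }

        /** Must be called from the thread that created the writer, since it owns the lock. */
        void close() {
            if (!open) {
                return;
            }
            try {
                committed.addAll(pending); // the commit
            } finally {
                open = false;
                lock.writeLock().unlock();
            }
        }
    }

    /** Inner reader: every access is guarded by the shared read lock. */
    class Reader {
        int numDocs() {
            lock.readLock().lock();
            try {
                return committed.size();
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    public static void main(String[] args) {
        LockedIndexSketch ii = new LockedIndexSketch();

        LockedIndexSketch.Writer iw = ii.new Writer(); // locks
        iw.addDocument("first document");
        iw.addDocument("second document");
        iw.close();                                    // commits

        LockedIndexSketch.Reader ir = ii.new Reader();
        System.out.println("documents visible to the reader: " + ir.numDocs());
    }
}
```

The inner-class form mirrors the `ii.new ...` syntax in the usage above: writer and reader share the enclosing index instance and therefore the same lock.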

          Karl Wettin added a comment -

          Class diagram of InstanciatedIndex

          Karl Wettin added a comment -

          The whole Lucene core branch.

          I think I've messed something up: queries with Directory implementations are much slower than normal.

          See the class diagram to understand what I did.

          Karl Wettin added a comment -

          > > You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do
          > > any good. Bit shifting doesn't take many ticks, so I might just revert that.

          > Since there are only 256 byte values, many scorers use a simple lookup table Similarity.getNormDecoder()
          > After I sped up norm decoding, a lookup table was only marginally faster anyway (see comments in SmallFloat
          > class). So I wouldn't expect float[] norms to be measurably faster than byte[] norms in the context of a complete
          > search.

          The hypothesis is that instantiation and unnecessary data parsing are the bad guys. Converting bytes to floats fits that profile, so I moved it to the IO classes (readFloat -> readByte). I realize that for the norms alone it is a marginal win, but if I find enough of these things it might show in the end. I don't know if I'll find enough things to work with, though. I have been looking at getting rid of things in the IndexReader, as the information it returns is in many situations already available in the information passed to the IndexReader, but I'm afraid it might be a Pyrrhic victory, as the JIT usually "caches" things like that automatically. There are more obvious places to save ticks, e.g. replacing collections with arrays.

          Yonik Seeley added a comment -

          Thanks Karl, it's interesting stuff...

          > You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do
          > any good. Bit shifting doesn't take many ticks, so I might just revert that.

          Since there are only 256 byte values, many scorers use a simple lookup table, Similarity.getNormDecoder().
          After I sped up norm decoding, a lookup table was only marginally faster anyway (see comments in SmallFloat class). So I wouldn't expect float[] norms to be measurably faster than byte[] norms in the context of a complete search.
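
The two decoding strategies compared above can be sketched as follows. This is an illustration, not Lucene source: `NormDecoderSketch`, `decodeNorm` and `lookupNorm` are hypothetical names, and the bit layout (3 mantissa bits, 5 exponent bits, zero exponent point 15) is an assumption about the single-byte float format that the comment attributes to SmallFloat.

```java
class NormDecoderSketch {
    /**
     * Decodes one norm byte by bit shifting.
     * Assumed layout: 5 exponent bits above 3 mantissa bits, zero exponent point 15.
     */
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f; // zero is a special case in this format
        }
        int bits = (b & 0xff) << (24 - 3); // move the byte's mantissa bits into float position
        bits += (63 - 15) << 24;           // rebias the exponent for a 32-bit float
        return Float.intBitsToFloat(bits);
    }

    /** All 256 possible norm bytes, decoded once up front. */
    private static final float[] NORM_TABLE = new float[256];

    static {
        for (int i = 0; i < 256; i++) {
            NORM_TABLE[i] = decodeNorm((byte) i);
        }
    }

    /** Decodes one norm byte with a single array read. */
    static float lookupNorm(byte b) {
        return NORM_TABLE[b & 0xff]; // mask so negative bytes index 128..255
    }

    public static void main(String[] args) {
        byte[] norms = { 0, 100, 124, (byte) 200 };
        for (byte b : norms) {
            System.out.println((b & 0xff) + " -> " + lookupNorm(b)
                    + " (shifted: " + decodeNorm(b) + ")");
        }
    }
}
```

Either way the index keeps one byte per norm, a quarter of the memory of a float[], and the decode step is a couple of integer operations or one array read per scored document.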


            People

            • Assignee:
              Grant Ingersoll
              Reporter:
              Karl Wettin

                Development