Karl Wettin made changes - 20/Apr/06 12:47 PM
Karl Wettin made changes - 20/Apr/06 12:48 PM
> > You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do
> > any good. Bit shifting doesn't take many ticks, so I might just revert that.
> Since there are only 256 byte values, many scorers use a simple lookup table: Similarity.getNormDecoder()
The hypothesis is that instantiation and unnecessary data parsing are the bad guys. Converting bytes to floats fits that profile, so I moved it to the IO classes (readFloat -> readByte). I realize that for the norms alone it is a marginal win, but if I find enough of these things it might show in the end. I don't know if I'll find enough things to work with, though.
I've been looking at getting rid of things in the IndexReader, as the information it returns is in many situations already available in the information passed to the IndexReader, but I'm afraid it might be a Pyrrhic victory, as the JIT usually "caches" things like that automatically. There are more obvious places to save ticks, e.g. replacing collections with arrays.
The whole Lucene core branch.
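To make the lookup-table remark concrete, here is a minimal sketch of the idea: since a byte has only 256 values, every decoded float can be computed once and then fetched with a single array read. The class name and the linear decodeSlow() function are assumptions made for illustration only; Lucene's real norm encoding is a different, lossy float format, and this is not the code behind Similarity.getNormDecoder().

```java
// Sketch only: NormLookupSketch and decodeSlow() are hypothetical, not Lucene code.
class NormLookupSketch {
    // One precomputed float per possible byte value.
    private static final float[] NORM_TABLE = new float[256];

    static {
        for (int i = 0; i < 256; i++) {
            NORM_TABLE[i] = decodeSlow((byte) i);
        }
    }

    // Placeholder decoding (assumption): maps the unsigned byte value linearly onto [0, 1].
    public static float decodeSlow(byte b) {
        return (b & 0xFF) / 255f;
    }

    // Fast path: mask to an unsigned index and read the table.
    public static float decodeNorm(byte b) {
        return NORM_TABLE[b & 0xFF];
    }

    public static void main(String[] args) {
        byte[] norms = {0, 64, (byte) 255};
        for (byte b : norms) {
            System.out.println(decodeNorm(b));
        }
    }
}
```

With a table like this, keeping norms as byte[] costs one array read per hit, which is the trade-off being weighed against storing float[] directly.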
I think I've messed something up, queries with Directory-implementations are much slower than normal. See the class diagram to understand what I did.
Karl Wettin made changes - 21/Apr/06 07:36 AM
Class diagram over InstanciatedIndex
Karl Wettin made changes - 21/Apr/06 07:37 AM
Due to read and write locks, this is how one must use the extension:
InstanciatedIndex ii = new InstanciatedIndex();
IndexWriter iw = ii.new InstanciatedIndexWriter(analyzer, clear); // locks
IndexReader ir = ii.new InstanciatedIndexReader();
Searcher searcher = ii.getSearcher();
This is a class diagram that explains what it will look like when I'm done.
It is pretty much only the IndexReader that needs to be refactored.
Karl Wettin made changes - 22/Apr/06 04:11 AM
Some new statistics.
Query alone is about 5x faster. I believe that span queries will be about 10x-20x faster, as skipTo() is really, really optimized. There is a bug in my term position code, so I have not been able to measure it for real yet. I hope to have that working and an updated class diagram for you soon.
Karl Wettin made changes - 10/May/06 04:46 AM
Oops
InstanciatedIndex: RAMDirectory: That is 35x the speed. Something might be wrong, but my initial tests tell me that it is right. Will look into this tomorrow. Need to sleep now.
There is a minor norms bug: the value differs by +-3 from the Directory norms. Other than that, it seems to work great.
Now about 40x faster than RAMDirectory. Stats for the test: 500 documents with 1-5K of text content each. InstanciatedIndex run on Lucene 1.9-karl1, RAMDirectory run on Lucene 1.9.
Karl Wettin made changes - 11/May/06 10:32 PM
This looks very promising. Unfortunately, the code you provide makes many incompatible API changes (e.g., turning Term into an interface that has far fewer methods), removes lots of useful javadoc, etc. So please don't expect it to be committed soon!
A back-compatible way to add an interface is to add it above the old class. So you might add a TermInterface, AbstractTerm, and TermImpl, then change Term to extend TermImpl and deprecate it. Then there's also the question of whether you really must convert Term to an interface. I would not undertake that change for aesthetic reasons. Is it really required to achieve your goals? You should generally try hard to minimize the size of your diffs and maximize back-compatibility.
Doug Cutting commented on
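The layering suggested in that comment could look roughly like the sketch below. Only the four type names come from the comment; the methods, fields and the demo class are assumptions for illustration and do not mirror the real Term API. The sketch also uses generics and @Deprecated, which need Java 5.

```java
// Sketch of "add the interface above the old class". Members are hypothetical.
interface TermInterface {
    String field();
    String text();
}

// Shared behaviour that every implementation can inherit.
abstract class AbstractTerm implements TermInterface, Comparable<TermInterface> {
    public int compareTo(TermInterface other) {
        int c = field().compareTo(other.field());
        return c != 0 ? c : text().compareTo(other.text());
    }

    @Override
    public String toString() {
        return field() + ":" + text();
    }
}

// The concrete implementation that new code may use directly.
class TermImpl extends AbstractTerm {
    private final String field;
    private final String text;

    public TermImpl(String field, String text) {
        this.field = field;
        this.text = text;
    }

    public String field() { return field; }
    public String text() { return text; }
}

/** @deprecated kept only so that existing call sites keep compiling. */
@Deprecated
class Term extends TermImpl {
    public Term(String field, String text) {
        super(field, text);
    }
}

class BackCompatDemo {
    // New-style code depends on the interface only.
    public static String describe(TermInterface t) {
        return t.field() + ":" + t.text();
    }

    public static void main(String[] args) {
        Term old = new Term("title", "lucene"); // old-style construction still works
        System.out.println(describe(old));
    }
}
```

Existing callers of `new Term(...)` are untouched, while alternative implementations only have to satisfy TermInterface.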
> This looks very promising. Unfortunately the code you provide makes many incompatible API
I agree, there is lots of work to be done on it. It was easier for me to think clearly when everything was separated. Basically, only a few changes to the API are needed:
1. Neither Document nor Term may be final.
It can all be fixed, but it is nothing that I prioritize right now. If you feel it would be a nice thing for 2.0, tell me what changes you are OK with and give me at least two weeks' notice; I /might/ find time to back-factor the code.
This is the diagram of InstanciatedIndex as of 1.9-karl1
Karl Wettin made changes - 12/May/06 04:21 AM
This update makes InstanciatedIndex compatible with Lucene, given that issues 580 and 581 are adopted.
It depends on generics and concurrent locks from J2SE 5.0. It contains one update in Field:
public setFieldData(Object fieldData)
And one in Document:
public List<Field> getFields() {
Karl Wettin made changes - 27/May/06 06:42 PM
Otis Gospodnetic made changes - 29/May/06 11:02 AM
Otis Gospodnetic made changes - 29/May/06 11:03 AM
ArrayBoundsOutOfIndex-bugfix.
If everything works as it should (I think so), then I'm happy to report that a FuzzyQuery seems to be about 1500 (one thousand five hundred) times faster on this memory implementation than on a RAMDirectory. The speed is gained by not creating a new instance of each Term in a TermEnum.
Karl Wettin made changes - 14/Jun/06 02:35 PM
> If everything works as it should
It doesn't. I keep claiming victories in advance. I'll try not to in the future. So forget about the 1500. I'll come with a new number soon enough.
> I'll come with a new number soon enough.
Right, it was 25% faster. So forget everything I said about anything.
There is a bug with phrase queries, possibly in term positions. Low priority for me.
To make this index work flawlessly (I hope), remove the if-statement around the following row in InstatiatedIndexWriter (row 477 or so):
termDocumentInformation.termPositions.add(fieldSettings.position);
This will fix the term position bug noted in an earlier comment.
I'll keep posting bugfixes as comments here, but when I work on it, it's really in my branch of Lucene 2.0.0, available here: http://www.ginandtonique.org/trac/snigel/wiki/Lucene2-karl
If someone feels that this layer is an interesting thing to add to Lucene, let me know what is required for commit and I'll make those changes.
It still seems to be about 40 times faster than RAMDirectory (mean value on a "normal" index with a "normal" amount of terms; I have seen 20x-200x) when comparing search time and document retrieval time combined.
A comment on memory usage: about 2x a RAMDirectory (900MB and 1800MB) on a 150,000 document corpus (when the corpus term count has been reached?)
In order to find the norm error I ported all test cases. I'm sorry to report that 70 of them fail.
So if anyone uses this code, don't. Hopefully most of the failures share the same cause. I'll be at the code this weekend, and perhaps a few days next week if needed.
New code. More backwards compatible. Only a very few changes are required to the Lucene core.
Now with test cases from the distribution, but only search/* has been ported. It fails some (11 of 172) score and RMI related tests that I cannot explain. I could really use some help with that. Except for that, this seems to work really great now. I've been running this in a live environment for a few hours (some hundred thousand user queries) and it is really fast.
Output from failing tests:
junit.framework.AssertionFailedError: expected:<3> but was:<0>
junit.framework.AssertionFailedError: Using 10 documents per index:
------- testSimpleEqualScores1 -------
junit.framework.AssertionFailedError: score #2 is not the same expected:<1.0> but was:<0.5>
------- testSimpleEqualScores2 -------
junit.framework.AssertionFailedError: score #1 is not the same expected:<1.0> but was:<0.5>
------- testSimpleEqualScores3 -------
junit.framework.AssertionFailedError: score #3 is not the same expected:<1.0> but was:<0.5>
junit.framework.AssertionFailedError: A,B,D, only B in range expected:<1> but was:<2>
junit.framework.AssertionFailedError: A,B,D - A and B in range expected:<2> but was:<5>
junit.framework.AssertionFailedError: Using 10 documents per index:
java.rmi.server.ExportException: internal error: ObjID already in use
java.rmi.server.ExportException: internal error: ObjID already in use
java.rmi.server.ExportException: internal error: ObjID already in use
Karl Wettin made changes - 22/Jul/06 07:50 PM
Updated to match the current svn with Fieldable, etc.
All changes to Lucene core are now gathered in a small patch (de-finalized Document and Term) and one new class (InterfaceIndexWriter implemented by IndexWriter in patch) instead of attaching the whole trunk. Still fails a few score- and RMI-tests.
Karl Wettin made changes - 23/Jul/06 01:59 PM
Hoss Man made changes - 23/Jul/06 08:41 PM
Performance from live environment:
I would very much appreciate it if someone with knowledge of the scoring code could take a look at the seven final(tm) failing tests. Their failing is not a problem for me, but it would be nice if they passed.
Can we please get the class diagrams in PDF format - the PNGs are so tiny - they are unreadable
> Can we please get the class diagrams in PDF format -
> the PNGs are so tiny - they are unreadable
Shameless promotion: I'm actually in the process of porting all my old diagrams to http://www.appliedmodels.com/ Until then you're stuck with zooming.
And while we wait - may we please have high-res PNGs - so that the zoomed-in versions are a little more readable?
Here is what I just sent to Wolfgang. I've adapted his bench test case to also work with InstantiatedIndex. It is worth noting that this is a test with one document only, and the speed is not linear according to my previous tests. InstantiatedIndex is much more than 3x faster than RAMDirectory in a larger index. So this is really only to compare MemoryIndex with InstantiatedIndex, and not as a bench against RAMDirectory.
RAMDirectory: secs = 95.159
MemoryIndex: secs = 26.692
InstantiatedIndex: secs = 27.44
MemoryIndex is a bit faster than InstantiatedIndex. But I'm aware of a couple of small optimizations I can do.
What's the benchmark configuration? For example, is throughput bounded by indexing or querying? Measuring N queries against a single preindexed document vs. 1 precompiled query against N documents? See the line
boolean measureIndexing = false; // toggle this to measure query performance
in my driver. If measuring indexing, what kind of analyzer / token filter chain is used? If measuring queries, what kind of query types are in the mix, with which relative frequencies? You may want to experiment with modifying/commenting/uncommenting various parts of the driver setup, for any given target scenario. Would it be possible to post the benchmark code, test data, queries for analysis?
Other question: when running the driver in test mode (checking for equality of query results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great!
wolfgang hoschek [21/Nov/06 10:22 AM]
> Other question: when running the driver in test mode (checking for equality of query
> results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great!
It sure does!
xfiles = [./CHANGES.txt, ./LICENSE.txt]
secs = 3.766
Process finished with exit code 0
Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in
src/test/org/apache/lucene/index/memory/testqueries.txt
against matching files such as
String[] files = listFiles(new String[] {
  "*.txt", //"*.html", "*.xml", "xdocs/*.xml",
  "src/java/test/org/apache/lucene/queryParser/*.java",
  "src/java/org/apache/lucene/index/memory/*.java",
});
See testMany() for details. Repeat for various analyzer, stopword and toLowerCase settings, such as
boolean toLowerCase = true;
Analyzer[] analyzers = new Analyzer[] {
  // new SimpleAnalyzer(),
  // new StopAnalyzer(),
  // new StandardAnalyzer(),
  PatternAnalyzer.DEFAULT_ANALYZER,
  // new WhitespaceAnalyzer(),
  // new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null),
  // new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, stopWords),
  // new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
};
> diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388
Actually, diff != 0 means the test fails, unless the diff is very small due to rounding error, say 10E-9. The driver should report an IllegalStateException("BUG DETECTED:"
> > diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388
> > Actually, diff != 0 means the test fails, unless the diff is very small due to rounding error, say 10E-9.
> The driver should report an IllegalStateException("BUG DETECTED:"
Right, that was a bug in my code. The diff /output/ was calculated as scoreMEM - scoreRAM (where scoreMEM is 0) and not scoreII - scoreRAM ; )
wolfgang hoschek [21/Nov/06 12:50 PM]
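The check discussed in this exchange can be sketched as follows. This is not the actual MemoryIndexTest driver code: the class and method names are hypothetical, and the tolerance simply reuses the 10E-9 figure mentioned above. The point is that the diff must be computed from the two scores actually being compared, and that anything beyond rounding error should fail loudly.

```java
// Sketch only: ScoreCheckSketch is a hypothetical helper, not the real driver.
class ScoreCheckSketch {
    // Tolerance for rounding error, taken from the figure quoted in the thread.
    static final float EPSILON = 10E-9f;

    public static void checkScores(String query, float scoreII, float scoreRAM) {
        // Compare the two indexes under test; mixing in a third, unused score
        // (such as a zero scoreMEM) is the reporting bug described above.
        float diff = scoreII - scoreRAM;
        if (Math.abs(diff) > EPSILON) {
            throw new IllegalStateException("BUG DETECTED: diff=" + diff
                    + ", query=" + query
                    + ", scoreII=" + scoreII
                    + ", scoreRAM=" + scoreRAM);
        }
    }

    public static void main(String[] args) {
        checkScores("term*", 0.024093388f, 0.024093388f); // equal scores pass silently
        System.out.println("scores match");
    }
}
```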
> Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in
All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries passes. I have not looked into why, as I don't use them (yet). I have also written an in-depth index comparator to make sure that an InstantiatedIndex equals a Directory implementation. Hence I have already verified that the index works as expected. Today's postings from me are more to show that InstantiatedIndex is /almost/ as fast as MemoryIndex and could thus be an interesting replacement; as it handles more than one document, it might even be preferable in some cases. I will however run your suggested tests tomorrow and report back.
> All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries pass.
Sounds like you're almost there.
Regarding indexing performance with MemoryIndex: Performance is more than good enough. I've observed and measured that often the bottleneck is not the MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalyzer) or the I/O for the input files or term lower casing (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809).
Regarding query performance with MemoryIndex: Some queries are more efficient than others. For example, fuzzy queries are much less efficient than wild card queries, which in turn are much less efficient than simple term queries. Such effects seem partly inherent due to the nature of the query type, partly a function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and partly a consequence of the overall Lucene API design.
The query mix found in testqueries.txt is more intended for correctness testing than benchmarking. Therein, certain query types dominate over others, and thus, conclusions about the performance of individual aspects cannot easily be drawn.
Wolfgang.
I've now checked in a version of MemoryIndexTest into contrib/memory that makes it easier to switch between measuring indexing and querying. Example output for measuring query throughput on simple term queries: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.
This is the current version of my local Lucene branch, including InstantiatedIndex. As I have not merged with the trunk for a while, it also features my locally patched version. It really is just a few small changes. Some classes are no longer final, plus I have introduced InterfaceIndexWriter and InterfaceIndexModifier.
/lucene2karl/lucene2-apache-karl-patched
All tests pass, except remote, multi and parallel searchers.
Jira admins: you are more than welcome to remove all old attachments, except images.
Karl Wettin made changes - 22/Nov/06 01:22 PM
Karl Wettin made changes - 22/Nov/06 03:14 PM
Karl Wettin made changes - 27/Nov/06 04:48 PM
> Jira admins: you are more than welcome to remove all old attachments, except images.
Oh, I had no clue my status was upgraded. Cool. Fixed it myself.
I don't see a patch file here. Your proposal would be easier to evaluate as a patch file. Also, a contribution like this will be easier to accept if your new classes are in the contrib tree. Then, if they prove popular, they can move into the core. Or perhaps folks will find them so obviously useful they'll want them in the core from the start, but contrib would require less convincing.
Karl Wettin made changes - 13/Jan/07 12:50 AM
Karl Wettin made changes - 13/Jan/07 12:51 AM
Doug Cutting [12/Jan/07 10:16 AM]
> I don't see a patch file here. Your proposal would be easier to evaluate as a patch file.
Attached!
> easier to accept if your new classes are in the contrib tree.
There are a couple of changes in the core; the rest has been moved to contrib/indexfacade and contrib/instantiated. There is some cleanup to do: a couple of static tests in instantiated, and perhaps some common logging artifacts left from debugging. I'm quite certain that both contrib packages depend on Java 1.5. At least the concurrency code in instantiated does.
Karl Wettin made changes - 14/Jan/07 04:59 PM
Karl Wettin made changes - 14/Jan/07 05:00 PM
New patch has all assimilated test cases moved to a new non-conflicting package.
Also contains contrib/cache that depends on everything else.
I've been trying to follow the work you've been doing Karl, but i must admit a lot of it is over my head – but since i've got a long weekend and your patch now makes so few changes to the core i could actually make sense of that part, so here are some comments on those changes...
1) some of these changes seem to be duplicated in LUCENE-774 and
2) since the new ScoreDoc.docComparator and ScoreDoc.scoreComparator are public, they should have some javadocs clarifying what they are for.
3) i don't think the Hits.setSearcher method you added is safe ... i believe that at a minimum hitDocs, first, last, and weight all need to be reset – weight's a tricky one since the instance doesn't currently hang on to the original query.
4) I would personally prefer IndexWriterInterface and IndexModifierInterface over InterfaceIndexWriter and InterfaceIndexModifier – if for no other reason than so they sort together .. but that's a minor nit.
I've only briefly looked at the new stuff in contrib, because I got lost ... there aren't any package or class level javadocs or a build.xml in either contrib. A big thing i did notice is that the code in indexfacade puts things in the o.a.l.search and o.a.l.index packages, which is being discouraged for contribs (among other reasons it makes it confusing to understand where a class is coming from); ideally those classes should live under o.a.l.indexfacade.search and o.a.l.indexfacade.index (or maybe just o.a.l.facade - but you get the idea).
I just realized that all of the tests in contrib/instantiated/src/test/java/org/apache/lucene/instantiated/assimilated/ are duplicates of tests from the core with a few line changes so they use an InstantiatedIndex to get a reader/writer/searcher etc.
I think it would be much better if we changed the original versions of these tests to include accessors for constructing/fetching those objects, which tests in your contrib could then subclass and override – that way any bugs found/fixed in those test classes and any additional test methods added to those classes would automatically be inherited by your versions (instead of winding up with duplicate cut/paste test code)
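A rough sketch of that test layout, under stated assumptions: TinyIndex and both implementations are hypothetical stand-ins for a Directory-backed index and an InstantiatedIndex, and the test classes use plain methods instead of JUnit so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal index: count(term) returns the number of documents containing the term.
interface TinyIndex {
    void add(String doc);
    int count(String term);
}

// Stand-in for the core implementation.
class ListBackedIndex implements TinyIndex {
    private final List<String> docs = new ArrayList<String>();

    public void add(String doc) { docs.add(doc); }

    public int count(String term) {
        int n = 0;
        for (String doc : docs) {
            if (Arrays.asList(doc.split(" ")).contains(term)) n++;
        }
        return n;
    }
}

// Stand-in for the contrib implementation.
class MapBackedIndex implements TinyIndex {
    private final Map<String, Integer> docFreq = new HashMap<String, Integer>();

    public void add(String doc) {
        Set<String> unique = new HashSet<String>(Arrays.asList(doc.split(" ")));
        for (String term : unique) {
            Integer old = docFreq.get(term);
            docFreq.put(term, old == null ? 1 : old + 1);
        }
    }

    public int count(String term) {
        Integer n = docFreq.get(term);
        return n == null ? 0 : n;
    }
}

// "Core" test: the index is obtained through an overridable accessor.
class CoreSearchTest {
    protected TinyIndex newIndex() { return new ListBackedIndex(); }

    public void testCount() {
        TinyIndex index = newIndex();
        index.add("a b");
        index.add("b c");
        if (index.count("b") != 2) throw new AssertionError("expected 2 documents for b");
        if (index.count("z") != 0) throw new AssertionError("expected 0 documents for z");
    }
}

// "Contrib" test: inherits every test method and only swaps the index.
class ContribSearchTest extends CoreSearchTest {
    @Override
    protected TinyIndex newIndex() { return new MapBackedIndex(); }
}
```

Any test method later added to CoreSearchTest then runs against the contrib index as well, with no copied test code.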
Hoss Man made changes - 15/Jan/07 09:30 AM
Hoss Man made changes - 15/Jan/07 09:31 AM
Hoss Man made changes - 15/Jan/07 09:32 AM
Karl: the trunk.diff i just attached fixes a small autoboxing dependency your patch introduced into the core (preventing compilation on java 1.4). I also added build.xml files to the new contrib dirs, and rearranged the directories of the contribs so they match the default for contribs and the build.xml files could be simple. Once i did this i discovered some unnecessary dependencies on commons-logging that i removed. Then i ran the tests, and got some errors – which are included in test-reports.zip so you can check them out.
Thanks a lot, Hoss, for taking the time. I sure do appreciate it.
I'll get back on your comments.
Karl Wettin made changes - 21/Jan/07 06:29 PM
Karl Wettin made changes - 21/Jan/07 06:32 PM
Karl Wettin made changes - 21/Jan/07 06:33 PM
Karl Wettin made changes - 21/Jan/07 06:44 PM
New Sunday, new code.
Hoss Man [15/Jan/07 12:16 AM]
Tried to do something about the javadocs. Also made a fresh class diagram with some comments in it. I can make it PDF or XUL if preferred.
That boxing error you fixed might be back. Where was it? I could not find it in the patch (all adding and no -+ fix) and it was too late to apply your patch on my local version.
> Hoss Man [15/Jan/07 12:16 AM]
Is it considered better practice to keep all my changes in this one huge issue? I thought it could be nice to pop in minor patches such as them.
> 4) I would personally prefer..
There has been a lot of refactoring of packages and class names as suggested. (I'm still not happy with the notification listener classes.)
A few new changes to the core: lazy initialization of the fields collection in Document, and some definalization to allow decoration of IndexReader.
> Hoss Man [15/Jan/07 12:16 AM]
It smeared out on java-dev: http://www.nabble.com/Decorative-cache-%28and-Hits.setSearcher%29-tf3009848.html#a8428139
I did not investigate this any further with test code, but I have identified lazy fields as a problem. Instead I'm considering a supplementary decorated document cache on the IndexReader, and implementing a replacement for Hits.
Hoss Man [15/Jan/07 12:39 AM]
This is not a bad idea at all, but I will not have time to do it right anytime soon. It would be a simpler task if the facade was a part of the core, as this is just the thing it was built for – unison index handling.
Hoss Man [15/Jan/07 01:35 AM]
What tool do you recommend to inspect these reports? I know for a fact that remote searchable will fail. I hope for someone to show up, need it and fix it.
Patch of the week.
Changes:
Removed Hits cache due to uncertainty but introduced:
TopDocs/TopFieldDocs- and IndexReader cache combined almost replace a fully cached Hits. The number of unit tests, and their level of detail, is increasing. The plan is now to have the cached reader pre-load documents into memory from a thread of its own when server load allows it. Also added some abstraction layers used by the above:
Had some problems with decorating the IndexModifierInterface against Directory in NotifiableIndex, so removed the Index.indexModifierFactory() and introduced a index facade backed version: org.apache.lucene.index.facade.IndexModifier(myIndex, analyzer, create) where all reader/writer creation is myIndex.indexReaderFactory() and indexWriterFactory(); Makes the Notifiable code a bit simpler.
Karl Wettin made changes - 27/Jan/07 04:52 PM
Karl Wettin made changes - 27/Jan/07 04:53 PM
Karl Wettin made changes - 27/Jan/07 04:54 PM
new diagram with lots of notes
(this is also available in the patch as an uxf-file for umlet)
Karl Wettin made changes - 28/Jan/07 06:22 PM
Karl Wettin made changes - 28/Jan/07 06:25 PM
Refactored the Term->Document relationships a bit for speed optimizations. It also resulted in getting all term frequency vector information, except for offsets, free of charge. More information on that in the class diagram.
Removed a whole bunch of todo:s in the writer and reader. The current lock implementation is worthless. I need to read up on ReentrantLock. Or should I perhaps use the lock that Directory:s use? (And that class diagram is of course granted for ASF, my mistake.)
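On the locking question, one option in java.util.concurrent.locks (Java 5) is ReentrantReadWriteLock, which lets many readers proceed together while a writer gets exclusive access. The sketch below is only an illustration of that pattern on a hypothetical term list; it is not the lock code from the patch and says nothing about how Directory locks behave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: LockedTermList is a hypothetical structure guarded by a read/write lock.
class LockedTermList {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> terms = new ArrayList<String>();

    // Writers take the exclusive lock.
    public void add(String term) {
        lock.writeLock().lock();
        try {
            terms.add(term);
        } finally {
            lock.writeLock().unlock(); // always release, even if add() throws
        }
    }

    // Readers share the lock with each other but not with a writer.
    public boolean contains(String term) {
        lock.readLock().lock();
        try {
            return terms.contains(term);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        LockedTermList list = new LockedTermList();
        list.add("lucene");
        System.out.println(list.contains("lucene"));
    }
}
```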
Karl Wettin made changes - 28/Jan/07 06:34 PM
Added support for contrib/memory MemoryIndex, so now it works with readers and writers as if it was any other index.
Added a consumer level index implementation that handles cache, notifications, and all the stuff this issue is about:
// This is the instance one is supposed to use for all access against the index in this JVM.
// Accessors
public class IndexFacade {
/** wraps any storage, optional cache settings */
/** The general consumer searcher to be used when querying this index. Always fresh. */
/** The general consumer read only index reader to be used when inspecting this index. Always fresh. */
Karl Wettin made changes - 03/Feb/07 06:15 PM
Can now be loaded from, and be persisted in, an FSDirectory.
The actual implementation is a bit more abstract than that though. It is not super nice yet, but all low level index comparator tests pass.
Introduced functionality to load an instantiated index from any index reader (e.g. an FSDirectory):
/**
 * Creates a new instantiated index that looks just like the index in a specific state as represented by a reader.
 *
 * @param sourceIndexReader the source index this new instantiated index will be copied from.
 * @throws IOException if the source index is not optimized, or when accessing the source.
 */
public InstantiatedIndex(IndexReader sourceIndexReader) throws IOException {
Also introduced class SimpleSychronizedIndex, a class that works kind of like the unix command "tee": it makes sure that all changes to a main index (e.g. an instantiated index) are also applied to a mirror index (e.g. the FS directory loaded into the instantiated index at constructor time). Some class that handles these two things as a single entity will probably be added soon.
Basically this is replicating changes to a secondary index on commits. Thus it takes about twice the time to insert documents. Perhaps the secondary index should be updated in a secondary thread?
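The "tee" behaviour described above can be sketched like this. DocWriter, ListWriter and TeeWriter are hypothetical names invented for the example; this is not the SimpleSychronizedIndex code from the patch, and it mirrors each change synchronously, which is why every insert is paid for twice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal writer interface standing in for an index writer.
interface DocWriter {
    void addDocument(String doc);
}

// Trivial in-memory writer used for both the main and the mirror index in this sketch.
class ListWriter implements DocWriter {
    final List<String> docs = new ArrayList<String>();

    public void addDocument(String doc) {
        docs.add(doc);
    }
}

// Applies every change to the main index and then to the mirror, like unix "tee".
class TeeWriter implements DocWriter {
    private final DocWriter main;
    private final DocWriter mirror;

    public TeeWriter(DocWriter main, DocWriter mirror) {
        this.main = main;
        this.mirror = mirror;
    }

    public void addDocument(String doc) {
        main.addDocument(doc);   // e.g. the fast in-memory index
        mirror.addDocument(doc); // e.g. the persistent copy; done synchronously here
    }

    public static void main(String[] args) {
        ListWriter main = new ListWriter();
        ListWriter mirror = new ListWriter();
        DocWriter tee = new TeeWriter(main, mirror);
        tee.addDocument("doc 1");
        System.out.println(main.docs.equals(mirror.docs));
    }
}
```

Moving the mirror update to a second thread, as asked above, would mean queueing the changes instead of calling the mirror inline.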
Karl Wettin made changes - 10/Feb/07 09:30 PM
the last attachment is of course for ASF distribution. sorry.
Introduced a method in instantiated index that appends the entire content to any other index.
/**
 * Adds the complete content of this instantiated index on to any other index using an index writer.
 * <p/>
 * This can for instance be used for
 * merging multiple instantiated indices
 * and periodically storing persistent snapshots in an FSDirectory.
 * <p/>
 * Non stored offsets are partially rebuilt. This can be improved quite a bit. See comments in code.
 * <p/>
 * The analyzer creates one complete token stream of all fields with the same name the first time it is requested,
 * and after that an empty one for each remaining. todo: this is a problem?
 * <p/>
 * It can be buggy if the same token appears as a synonym to itself (position increment 0). not really something to worry about.. or?
 *
 * @param indexWriter represents the index on which to add all the content of this instantiated index.
 * @throws IOException when accessing parameter indexWriter
 */
public void writeToIndex(IndexWriterInterface indexWriter) throws IOException {
Karl Wettin made changes - 11/Feb/07 05:21 PM
Karl Wettin made changes - 17/Feb/07 07:26 AM
UML class diagram of the adaptive spell checker with all java docs as comments
Karl Wettin made changes - 17/Feb/07 07:29 AM
Karl Wettin made changes - 17/Feb/07 07:54 AM
Karl Wettin made changes - 17/Feb/07 08:24 AM
(now proof read and all)
Package level javadoc of the spell checker:
A dictionary with weighted suggestions,
<h1>What, where, when and how.</h1>
<h2>Goal trees</h2>
<h2>Adaptive training</h2>
<h2>Suggesting</h2>
<h2>Second level suggestion</h2>
<h1>Consumer interface example</h1>
@Override
public void testBasicTraining() throws Exception {
QueryGoalNode<R> node;
node = new QueryGoalNode<R>(null, "heroes of nmight and magic", 3);
node = new QueryGoalNode<R>(node, "heroes of night and magic", 3);
node = new QueryGoalNode<R>(node, "heroes of might and magic", 10);
node.new Inspection(23, QueryGoalNode.GOAL);
suggestionFacade.queueGoalTree(node.getRoot());
node = new QueryGoalNode<R>(null, "heroes of night and magic", 3);
node = new QueryGoalNode<R>(node, "heroes of knight and magic", 7);
node = new QueryGoalNode<R>(node, "heroes of might and magic", 20);
node.new Inspection(23, QueryGoalNode.GOAL);
suggestionFacade.queueGoalTree(node);
node = new QueryGoalNode<R>(null, "heroes of might and magic", 20, 1l);
suggestionFacade.queueGoalTree(node);
node = new QueryGoalNode<R>(null, "heroes of night and magic", 7, 0l);
node = new QueryGoalNode<R>(node, "heroes of light and magic", 14, 1l);
node = new QueryGoalNode<R>(node, "heroes of might and magic", 2, 6l);
node.new Inspection(23, QueryGoalNode.GOAL);
node.new Inspection(23, QueryGoalNode.GOAL);
suggestionFacade.queueGoalTree(node);
node = new QueryGoalNode<R>(null, "heroes of night and magic", 4, 0l);
node = new QueryGoalNode<R>(node, "heroes of knight and magic", 17, 1l);
node = new QueryGoalNode<R>(node, "heroes of might and magic", 2, 2l);
node.new Inspection(23, QueryGoalNode.GOAL);
suggestionFacade.queueGoalTree(node);
suggestionFacade.flush();
assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes of light and magic"));
assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes of night and magic"));
assertEquals("heroes of might and magic", suggestionFacade.didYouMean("heroes ofnight andmagic"));
}
I'll try to keep updated and built javadocs at this location:
(Sorry for flooding..)
Karl Wettin made changes - 19/Feb/07 02:20 PM
Support for deleteDocuments in IndexWriterInterface, InstantiatedIndex and NotifiableIndex.
Somewhat hacky solution to pick up the deletions in NotifiableIndex, but it is a solution.
Karl Wettin made changes - 20/Feb/07 08:32 PM
New Patch. Mainly updates in contrib/didyoumean. Merged some core conflicts.
TestGoalJuror now imports 200,000 real user queries from a log containing session id, query, category, timestamp and number of hits, ordered by session id and time. This means that the trainer and suggester are not aware of whether the user followed or ignored a suggestion from the system, which results were inspected, whether the query contained a goal, etc. So it does not work as if it had been trained from the start with the adaptive layer. Still, the suggester navigates the dictionary fairly well and misspelled queries will get the correct suggestion, but many correctly spelled phrases will get a silly recommendation. As one starts reporting user interaction to the suggester, any silly recommendation should go away. In essence, it can only adapt the suggestions positively, based on what the QueryGoalJuror says is a goal. A negative signal only comes when a user doesn't take a suggestion. It could be solved with bootstrapping. Will mess with that later.
Karl Wettin made changes - 25/Feb/07 11:37 PM
Switched from java.util.PriorityQueue to org.apache.lucene.util.PriorityQueue, and made the latter <Generic>.
Fixed some major bugs in the TermFreqVector inspection for the spell checker. TestGoalJuror now demonstrates how to build an a priori corpus for the ngram token suggester based on user input by inverting the suggestion dictionary. That should probably be extracted to a helper class in the future. This makes it faster to query the a priori corpus, but it also means that what the system takes for granted as correct comes from user input, and even if that data is what users point out as a real query goal, it does not have to be correct. It does, however, make the suggester much faster.
Karl Wettin made changes - 03/Mar/07 01:18 PM
Removed the dependencies to LUCENE-626.
Karl Wettin made changes - 03/Mar/07 07:56 PM
Karl Wettin made changes - 03/Mar/07 07:57 PM
Patched contrib/benchmark to support InstantiatedIndex.
Fixed a bug with mergeFactor. Reverted the java 1.5<G> changes in PriorityQueue to (ClassCasting). (This is actually a spell checker thingy, but due to local dependencies the changes are located in this patch). Removed write locks. These had severe bugs and need to be reconsidered. They should be back in the next patch. By using multiple InstantiatedIndex:es as segments on a MultiReader rather than updating the same index, this can be made completely lockless.
Karl Wettin made changes - 13/Mar/07 02:22 AM
A note on, and output from contrib/benchmark:
I'm getting really poor results compared to my own test and live environment stats. At query time I expected InstantiatedIndex to spend at most 1/6th of the time RAMDirectory does, but it turns out that in the benchmarker the speed is almost the same as RAMDirectory. Retrieving documents takes 1/5th of the time rather than at most 1/60th as expected. Investigated the code a bit and noticed that ReadTask creates a new instance of IndexReader and IndexSearcher for each query. Could this be the reason? Memory consumption is 3x that of a RAMDirectory, but half of the memory is spent on keeping the Document instances in heap. Perhaps it would be interesting to use the same persistency for these as in the Directory implementations. The merge factor sweet spot is around 2500, where it turns out to be a little bit faster than the RAMDirectory sweet spot. At the default of 10, InstantiatedIndex consumes about 5x more time than a RAMDirectory. If I fix the locklessness as suggested in the previous comment, it most probably will be much faster than a RAMDirectory at any setting.
I would not pay too much attention to the numbers below until I've got the benchmarker under control, but here are the stats:
Output from InstantiatedIndex:
[java] ------------> Report Sum By (any) Name (19 about 160153 out of 160153)
Output from RAMDirectory:
That's a good point about the task-benchmark, Karl!
All 4 ReaderTasks reuse the reader if it is already open, but if it is not already open, each task opens a private one, and closes it after the task is done. I now see that the javadocs can be improved here - especially in the reader sub-tasks. I will update the documentation to clarify this point. Anyhow, for the running tasks to share a reader, the alg part of the .alg file should have something like this:
OpenReader
ReaderTaskA
CloseReader
This way all three tasks would share the same, already open, reader.
A graph showing performance of hit collection using InstantiatedIndex, RAMDirectory and FSDirectory.
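The reader-sharing behaviour described for the benchmark tasks (reuse an open reader, otherwise open a private one and close it afterwards) can be sketched roughly as follows. This is only an illustration in plain Java: `SketchReader` and `ReaderReuseSketch` are hypothetical names, not classes from contrib/benchmark, and the "search" is a dummy.

```java
/** Hypothetical stand-in for an expensive-to-open index reader. */
final class SketchReader {
    int search(String query) { return query.length(); } // dummy "hit count"
    void close() { }
}

/** Hypothetical sketch of tasks that share a reader when one is open. */
final class ReaderReuseSketch {
    int opened = 0;               // how many readers were opened in total
    private SketchReader shared;  // non-null between openReader() and closeReader()

    private SketchReader open() { opened++; return new SketchReader(); }

    void openReader()  { shared = open(); }
    void closeReader() { if (shared != null) { shared.close(); shared = null; } }

    /** Reuses the shared reader if present, otherwise opens and closes a private one. */
    int runTask(String query) {
        if (shared != null) {
            return shared.search(query);
        }
        SketchReader priv = open();
        try {
            return priv.search(query);
        } finally {
            priv.close();
        }
    }

    public static void main(String[] args) {
        ReaderReuseSketch perQuery = new ReaderReuseSketch();
        for (int i = 0; i < 3; i++) perQuery.runTask("query" + i);

        ReaderReuseSketch sharedRun = new ReaderReuseSketch();
        sharedRun.openReader();
        for (int i = 0; i < 3; i++) sharedRun.runTask("query" + i);
        sharedRun.closeReader();

        System.out.println("readers opened without sharing: " + perQuery.opened);
        System.out.println("readers opened with sharing:    " + sharedRun.opened);
    }
}
```

If opening a reader is costly relative to a single query, the per-query variant would end up measuring mostly the open/close cost, which may be part of what the numbers above show.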
In essence, there is no great win in pure search time when there are more than 7000 documents. However, retrieving documents is still not associated with any cost whatsoever, so in a 250000 sized index that uses Lucene for persistency of fields, I still see a boost of 6-10x or so compared to RAMDirectory.
documents in corpus \t queries per second
org.apache.lucene.store.instantiated.InstantiatedIndex@628704
org.apache.lucene.index.facade.RAMDirectoryIndex@af993e
org.apache.lucene.index.facade.FSDirectoryIndex@4112c0
Karl Wettin made changes - 17/Mar/07 08:11 PM
Karl Wettin made changes - 17/Mar/07 08:18 PM
Karl Wettin made changes - 17/Mar/07 08:34 PM
Karl Wettin made changes - 17/Mar/07 08:35 PM
This is a very interesting benchmark graph! Note that there is just a little mistake in there: the labels of the axes are switched.
And you said that you still have a lot of gain with 250 000 documents because of the retrieval cost. But if I had to make the choice of having everything in memory, I wouldn't put the data of my own model into Lucene. I would keep it in memory without transforming it into stored Lucene Documents. I would just transform it for indexing purposes and keep only an ID in the Lucene store, which would help me map the search results to my own model data. This would avoid the transformation Lucene-Document -> MyModel-Data. (After looking at the UML diagram again:) Unless you allow putting POJO objects in a Document?
> Nicolas Lalevée [18/Mar/07 02:04 AM]
> This a very interesting benchmark graph ! Note that there is just a little mistake in there : the labels of the axes are switched.
The test is sort of crude: a set of queries with variable complexity that for each iteration is placed on a new IndexSearcher and IndexReader. The index is optimized at all measure points.
> And you said that you still have lot of gain with 250 000 documents because
I can only agree.
>(after relooking at the UML diagram) : Unless you allow to put POJO objects in a Document ?
That is the hypothesis. I've actually been a bit baffled by the results I've seen the last days while benchmarking. The application this was originally built for (the one with 250 000 documents) is fairly busy, on average one query every 10ms 24/7. It peaks at one every 2ms. On the single machine setup with 4GB and Solaris the CPU went from 90% busy to 90% idle when switching from RAMDirectory to InstantiatedIndex. I can at this point not say if this is due to bad use of Lucene and compensating for that with a crazy solution. But I don't think so. I think I've missed a bunch of benchmark factors. Since that project, and that was some time ago, I have not implemented any applications with a "normal" corpus using InstantiatedIndex. It is the backbone of the active cache (also available in this patch). I'm sure people have made similar things with MemoryIndex. For each batch of new documents inserted, I apply cached queries on the batch-index to detect if the new data would affect the results associated with the cached query. (The cache does other active things too.) In the didyoumean issue I use InstantiatedIndex as a speedy a priori index, a small index with feature selected text (common user queries known to be correct, very common phrases in document titles, etc.) that is used to build ngrams for token suggestions, build phrase suggestions, rearrange term order in phrases, etc.
As these documents are very small (a small phrase) it is some 10x-20x faster than a RAMDirectory at 50 000 documents.
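The active-cache check mentioned above (apply the cached queries to each batch of new documents and see which cached results would be affected) could look roughly like this. It is a sketch under heavy assumptions: `ActiveCacheSketch` is a hypothetical name, documents are plain strings, and a `Predicate<String>` stands in for a real query run against a small batch index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: detect which cached queries are affected by a batch of new documents. */
final class ActiveCacheSketch {
    /** Cached query name -> predicate standing in for the real query. */
    private final Map<String, Predicate<String>> cachedQueries = new LinkedHashMap<>();

    void cache(String name, Predicate<String> query) {
        cachedQueries.put(name, query);
    }

    /** Returns the names of cached queries that match at least one document in the batch. */
    List<String> affectedBy(List<String> batch) {
        List<String> affected = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> e : cachedQueries.entrySet()) {
            if (batch.stream().anyMatch(e.getValue())) {
                affected.add(e.getKey());
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        ActiveCacheSketch cache = new ActiveCacheSketch();
        cache.cache("lucene", doc -> doc.contains("lucene"));
        cache.cache("solaris", doc -> doc.contains("solaris"));
        List<String> batch = Arrays.asList("apache lucene index", "unrelated text");
        System.out.println("cached results to refresh: " + cache.affectedBy(batch));
    }
}
```

Only the queries reported as affected would need their cached results refreshed; everything else can keep being served from the cache.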
Karl Wettin made changes - 18/Mar/07 03:50 PM
Karl Wettin made changes - 18/Mar/07 03:50 PM
This is a small and completely isolated version of InstantiatedIndex, the results of my "last attempt" thread:
http://www.nabble.com/Last-attempt-tf4153815.html
It requires no changes to the Lucene core but hogs a bit more RAM and probably depends on your JIT to avoid wasting CPU. So the definalization and generalization required before is replaced by aggregation (strategy pattern). I also had to remove all the polymorphic index handling (IndexWriterInterface etc.), and I have removed the IndexWriter in InstantiatedIndex. One now has to create a new InstantiatedIndex and pass down an IndexReader instead. So there is no appending allowed. Also, there are no locks anymore, but they should not be needed anymore. The port of the complete test suite from Lucene to the unison index handling has been removed. I.e. there are no real test cases that demonstrate this patch. Anything but term vectors and payloads should work great though. The code base is over a year old and these are new features I did not have time to implement or test. No new benchmarks. The greatest loss is the loss of features, not CPU and RAM. Perhaps it wastes 15% more resources than the previous patch? As I personally enjoy the features removed in this patch, I will keep on running Lucene 2.0 and the old version, but this should be easier to understand and maintain if anyone else wants to take a look at it.
Karl Wettin made changes - 04/Aug/07 02:28 PM
Grant Ingersoll made changes - 07/Aug/07 10:28 PM
Hey Karl,
I started to look at this, but there are a few stoppers at this point for me:
1. No build file
2. Tests are virtually non-existent
It could also use some documentation, especially on the how and why of the InstantiatedIndex.
Cheers, Grant Ingersoll - 07/Aug/07 06:22 PM
> 1. No build file
> 2. Tests are virtually non-existent
>
> It could also use some documentation, especially on the how and why of the InstantiatedIndex.
I'll come up with some stuff asap. About tests, the new patch is more or less a reduction of the previous patch. The latter contains more or less all tests assimilated to run on instantiated index. With the new patch there is no IndexWriter, so I will have to reassimilate it all. In the old patch there is a test case that compares two index readers - enumerating all parts of an a priori reader and a test reader, comparing the values. It passed in the old patch, so I don't think there is any problem. I'll reintroduce it though. Do you think that would be enough, or do you want the assimilated tests back? Is the payload API fixed? There is a bunch of TODOs and warnings here and there in the code, which is the reason for me not implementing it in this store.
On the Payload question, it is still marked as experimental, but if your patch gets in before anyone changes it, the onus is on that person to make sure the change is functional, so I would think you are fine to assume the current payload is fixed for the time being.
Added support for payloads
Reintroduced InstantiatedIndexWriter (no locks!)
Reintroduced TestIndicesEquals
Introduced build.xml
Introduced pom.xml (this file is missing java 1.5 setting)
Added some silly javadocs
It also hit me that I could have a HashMap<Term, Integer> parallel to the List<Term> orderdTerms. The latter is currently being binary searched in TermEnum, and a HashMap would make it much faster, especially as the index grows. Might speed things up a lot.
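The idea of keeping a hash map parallel to the sorted term list can be sketched in plain Java as below. This is a minimal illustration, assuming terms are plain strings; `TermLookupSketch` and its methods are hypothetical names and not part of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: a sorted term list with a parallel term -> index map. */
final class TermLookupSketch {
    private final List<String> orderedTerms;          // sorted, no duplicates
    private final Map<String, Integer> termIndex;     // term -> position in orderedTerms

    TermLookupSketch(List<String> terms) {
        orderedTerms = new ArrayList<>(new TreeSet<>(terms));
        termIndex = new HashMap<>();
        for (int i = 0; i < orderedTerms.size(); i++) {
            termIndex.put(orderedTerms.get(i), i);
        }
    }

    /** O(log n) lookup over the sorted list; returns -1 when the term is absent. */
    int findByBinarySearch(String term) {
        int i = Collections.binarySearch(orderedTerms, term);
        return i >= 0 ? i : -1;
    }

    /** Expected O(1) lookup via the parallel map; returns -1 when the term is absent. */
    int findByHash(String term) {
        Integer i = termIndex.get(term);
        return i != null ? i : -1;
    }

    public static void main(String[] args) {
        TermLookupSketch lookup = new TermLookupSketch(Arrays.asList("play", "work", "jack", "dull"));
        System.out.println(lookup.findByBinarySearch("play"));
        System.out.println(lookup.findByHash("play"));
    }
}
```

The map costs one extra entry per term, and it only answers exact lookups: finding the next term after a missing one still needs the sorted list.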
Karl Wettin made changes - 08/Aug/07 09:25 PM
> It also hit me that I could have a HashMap<Term, Integer> parallel to the
Just looked into this. There is some performance to gain, but not much. I'll do some benches later on and see if it was worth it. Most binary searches are placed in the IndexWriter, and I honestly don't care too much about making that part faster if it slows down searching or makes it hog more RAM.
Should I wait on this until you figure this out?
Grant Ingersoll - 15/Aug/07 05:17 PM
> Should I wait on this until you figure this out?
Please don't. I'm just thinking really loud. I just found a bug that I can not explain.
While scoring this one specific phrase query in this one specific corpus of mine, the scorer calls TermPositions.nextPosition() more than TermPositions.freq() times. Never seen this error before, and it does not do this when running against a Directory. TestIndicesEquals does however pass, so it must be me that does not reset the currentTermPosition counter, or something along those lines. I have been debugging for hours and hours in the scorer code in order to understand what the difference between II and Directory is, but I can't figure it out. Completely lost in this (read: any) scorer code. It sure is a show stopper if it sometimes does not work, so I'll try to find the bug. This is the first time I've seen it though. I mean, I do use phrase queries in other places in conjunction with this store, and that makes it even more strange. I have tried to come up with an isolated test case, but I can't. I can however pass the corpus and code that produce this error to some specific person, but I'm afraid I can't post it here. There is also a minor TermFreqVector bug that throws an NPE, solved in the next patch.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 12
Previously mentioned problems deloused. The phrase (term position) problem turned out to be the constructor InstantiatedIndex(IndexReader) that had a bug, ending up with an index not equal to one created via InstantiatedIndexWriter.
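For readers unfamiliar with the contract at stake here, nextPosition() may be called at most freq() times for the current document. The sketch below shows that contract and the per-document counter reset in plain Java. `PositionsSketch` is a hypothetical class that only mimics the shape of the TermPositions calls named above; it is not the code from the patch and does not reproduce the actual bug, which turned out to be in the constructor.

```java
/** Hypothetical sketch of a positions enumerator that enforces the freq() limit. */
final class PositionsSketch {
    private final int[][] positionsPerDoc; // positions of one term, one row per matching document
    private int doc = -1;
    private int cursor;                    // index of the next position within the current document

    PositionsSketch(int[][] positionsPerDoc) {
        this.positionsPerDoc = positionsPerDoc;
    }

    /** Advances to the next document and resets the position cursor. */
    boolean next() {
        doc++;
        cursor = 0; // reset per document so freq() and nextPosition() stay in step
        return doc < positionsPerDoc.length;
    }

    /** Number of positions available for the current document. */
    int freq() {
        return positionsPerDoc[doc].length;
    }

    /** Returns the next position, failing loudly instead of running off the end of the array. */
    int nextPosition() {
        if (cursor >= freq()) {
            throw new IllegalStateException("nextPosition() called more than freq() times");
        }
        return positionsPerDoc[doc][cursor++];
    }

    public static void main(String[] args) {
        PositionsSketch tp = new PositionsSketch(new int[][] {{0, 4}, {7}});
        while (tp.next()) {
            for (int i = 0; i < tp.freq(); i++) {
                System.out.println(tp.nextPosition());
            }
        }
    }
}
```

An explicit check like this turns an over-read into a descriptive error rather than an ArrayIndexOutOfBoundsException, which may make such a mismatch easier to spot.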
I also did a bunch of tests on how much it would speed up by replacing the binary searches over lists with hash tables (maps). Gained perhaps 5% speed, but lost quite a bit of RAM, so I reverted those things. Do you want more test cases than TestIndicesEquals? Payloads need to be verified. I never really worked with them, and the Directory-centric test will not be ported easily.
Karl Wettin made changes - 17/Aug/07 10:09 PM
If I understand your test correctly, you have gone through and compared term by term, etc. (vectors, etc.)
I would like to see payloads tested as well. I also think you need a package level javadoc that explains the use cases for this and the basics of using it. Also, I notice the caveat about no locking (in the javadocs for InstantiatedIndex) and I notice a TODO as well saying implement locking. Thoughts on implementing it? Grant Ingersoll - 22/Sep/07 05:52 AM
> I would like to see payloads tested as well.
I'm new to payloads and don't know what makes sense when it comes to populating the a priori/test indices. Any preferences? Or should I just randomly add some payloads to the positions of a couple of terms in a couple of documents?
> package level javadoc
Any comments on how to include graphics in the documentation? (I'm a big fan of UML, you might have noticed there is quite a bit of ASCII class diagram stubs in the javadocs of fields that represent binary associations, association classes and qualifications.) Also, where should I store the XML used to render the graphics? Just pop it all in the src classpath?
> I notice a TODO as well saying implement locking. Thoughts on implementing it?
It used to be a ReentrantLock, but for some reason I can't seem to recall, this was a bad idea. There are TODO: lock and TODO: release lock tags left throughout the code. I should probably take a look at o.a.l.store.Lock. There are three more caveats I know of, but I'm not certain how important they are to fix.
IndexReader:
public Document document(int n, FieldSelector fieldSelector) throws IOException {
  // todo: it does not make too much sense to use field selector using this implementation,
  // todo: so it simply ignores this and return everything.
  return document(n);
}
public Collection getFieldNames(FieldOption fldOption) {
IndexWriter.addDocument does not support readerValue and binaryValue.
if (field.isTokenized()) {
> Any comments on how to include graphics in the documentation? (I'm a big fan of UML,
> you might have noticed there is quite a bit of ASCII class diagram stubs in the javadocs of
> fields that represent binary associations, association classes and qualifications.) Also, where
> should I store the XML used to render the graphics? Just pop it all in the src classpath?
Images that you want to embed in (or files you want to link to from) javadocs should live in a "doc-files" directory in the package....
http://java.sun.com/j2se/javadoc/writingdoccomments/#images
...I would put the XML source for the image in there as well, and put a link to it in the javadocs as well.
New in this patch:
I've noticed that there are some differences in the behavior of IndexWriter and InstantiatedIndexWriter when a document contains multiple fields with the same name but different settings, such as:
d.add(new Field("f", " All work and no play makes Jack a dull boy", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
d.add(new Field("f", " All work and no play makes Jack a dull boy", Field.Store.NO));
d.add(new Field("f", " All work and no play makes Jack a dull boy", Field.Store.YES, Field.Index.NO_NORMS, Field.TermVector.NO));
Would this be considered an invalid document? Should there be a term vector or not? Or perhaps just a term vector for the tokens in the first field?
Karl Wettin made changes - 27/Sep/07 11:49 PM
Oops, the patch is of course granted the ASF license.
In this patch:
http://www.nabble.com/norms%28String-field%2C-byte---bytes%2C-int-offset%29-tf4580460.html#a13075367
Karl Wettin made changes - 08/Oct/07 12:24 AM
In this patch:
Karl Wettin made changes - 17/Oct/07 03:17 AM
In this patch:
Karl Wettin made changes - 19/Oct/07 02:00 PM
In this patch:
Karl Wettin made changes - 21/Oct/07 03:44 PM
Karl Wettin made changes - 23/Oct/07 08:20 PM
Karl Wettin made changes - 23/Oct/07 08:20 PM
Karl Wettin made changes - 23/Oct/07 08:20 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:21 PM
Karl Wettin made changes - 23/Oct/07 08:22 PM
Karl Wettin made changes - 23/Oct/07 08:22 PM
Karl Wettin made changes - 23/Oct/07 08:22 PM
> You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do
> any good. Bit shifting don't take many ticks, so I might just revert that.
Since there are only 256 byte values, many scorers use a simple lookup table Similarity.getNormDecoder()
After I sped up norm decoding, a lookup table was only marginally faster anyway (see comments in SmallFloat class). So I wouldn't expect float[] norms to be measurably faster than byte[] norms in the context of a complete search.
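To make the two options concrete, here is a small plain-Java sketch of decoding a norm byte by bit shifting versus reading it from a precomputed 256-entry table. The decoding constants (3 mantissa bits, zero-exponent point 15) are my assumption of the kind of encoding meant by the SmallFloat reference; `NormTableSketch` is a hypothetical name and nothing here is copied from the patch.

```java
/** Hypothetical sketch: shift-based byte-to-float decoding and an equivalent lookup table. */
final class NormTableSketch {
    /** Decodes one byte with 3 mantissa bits and 5 exponent bits (assumed layout). */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3); // place mantissa and exponent bits
        bits += (63 - 15) << 24;           // re-bias the exponent
        return Float.intBitsToFloat(bits);
    }

    /** One precomputed float per possible byte value. */
    static final float[] TABLE = new float[256];
    static {
        for (int i = 0; i < 256; i++) {
            TABLE[i] = decode((byte) i);
        }
    }

    /** Table lookup; the mask turns the signed byte into an index 0..255. */
    static float lookup(byte b) {
        return TABLE[b & 0xff];
    }

    public static void main(String[] args) {
        byte[] norms = {0, 100, 124, (byte) 200};
        for (byte n : norms) {
            System.out.println((n & 0xff) + " -> " + lookup(n) + " (shifted: " + decode(n) + ")");
        }
    }
}
```

Both paths do very little work per norm, which fits the observation that a table is only marginally faster than decoding, and that keeping float[] instead of byte[] mostly trades memory for a small saving.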