Issue Details (XML | Word | Printable)

Key: LUCENE-626
Type: Improvement Improvement
Status: Open Open
Priority: Minor Minor
Assignee: Unassigned
Reporter: Karl Wettin
Votes: 3
Watchers: 9
Operations

If you were logged in you would be able to see more operations.
Lucene - Java

Extended spell checker with phrase support and adaptive user session analysis.

Created: 13/Jul/06 04:18 PM   Updated: 17/Apr/08 08:02 PM
Return to search
Component/s: Search
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works LUCENE-626_20071023.txt 2007-10-23 08:18 PM Karl Wettin 234 kB

Lucene Fields: Patch Available


 Description  « Hide
Extensive javadocs available in patch, but I also try to keep it compiled here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description

A semi-retarded reinforcement learning thingy backed by algorithmic second level suggestion schemes that learns from and adapts to user behavior as queries change, suggestions are accepted or declined, etc.

Except for detecting spelling errors it considers context, composition/decomposition and a few other things.

heroes of light and magik -> heroes of might and magic
vinci da code -> da vinci code
java docs -> javadocs
blacksabbath -> black sabbath

Depends on LUCENE-550



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Karl Wettin added a comment - 03/Feb/07 02:19 AM
All of the old comments was obsolete, so I re-initialized the whole issue.

Karl Wettin added a comment - 03/Feb/07 02:21 AM
NgramPhraseSuggester is now decoupled from the adaptive layer, but I would like to refactor it even more so it is easy to replace the SpellChecker with any other single token suggester.

Karl Wettin added a comment - 17/Feb/07 07:54 AM
Patch in this issue have no dependencies to anything out of the ordinary.

However, a large refactor and well documented version dependent to LUCENE-550 is available.


Karl Wettin added a comment - 03/Mar/07 07:59 PM
Patch depends on LUCENE-550.

Nicolas Lalevée added a comment - 03/Mar/07 09:04 PM
This feature looks interesting, but why should it depend on LUCENE-550 ?

Karl Wettin added a comment - 03/Mar/07 09:13 PM
Nicolas Lalevée [03/Mar/07 01:04 PM]
> This feature looks interesting, but why should it depend on LUCENE-550 ?

It use the Index (notification, unison index factory methods, et c.) and IndexFacade (cache, fresh reader/searcher et c.) available in that patch. And by doing that, it also enables me to use InstantiatedIndex for the a priori corpus and ngram index to speed up the response time even more.


Karl Wettin added a comment - 17/Aug/07 12:10 AM
As the phrase-suggestion layer on top of contrib/spell in this patch was noted in a bunch of forums the last weeks, I've removed the 550-dependency and brought it up to date with the trunk.

Second level suggesting (ngram token, phrase) can run stand alone. See TestTokenPhraseSuggester. However, I recommend the adaptive dictonary as it will act as a cache on top of second level suggestions. (See docs.)

Output from using adaptive layer only, i.e. suggestions based on how users previously behaved. About half a million user queries analyed to build the dictionary (takes 30 seconds to build on my dual core):

3ms pirates ofthe caribbean -> pirates of the caribbean
2ms pirates of the carribbean -> pirates of the caribbean
0ms pirates carricean -> pirates caribbean
1ms pirates of the carriben -> pirates of the caribbean
0ms pirates of the carabien -> pirates of the caribbean
0ms pirates of the carabbean -> pirates of the caribbean
1ms pirates og carribean -> pirates of the caribbean
0ms pirates of the caribbean music -> pirates of the caribbean soundtrack
0ms pirates of the caribbean soundtrack -> pirates of the caribbean score
0ms pirate of carabian -> pirate of caribbean
0ms pirate of caribbean -> pirates of caribbean
0ms pirates of caribbean -> pirates of caribbean
0ms homm 4 -> homm iv
0ms the pilates -> null

Using the phrase ngram token suggestion using token matrices checked against an apriori index. A lot of queries required for one suggestion. Instantiated index as apriori saves plenty of millis. This is expensive stuff, but works pretty good.

72ms the pilates -> the pirates
440ms heroes of fight and magic -> heroes of might and magic
417ms heroes of right and magic -> heroes of might and magic
383ms heroes of magic and light -> heroes of might and magic
20ms heroesof lightand magik -> null
385ms heroes of light and magik -> heroes of might and magic
0ms heroesof lightand magik -> heroes of might and magic
385ms heroes of magic and might -> heroes of might and magic

(That 0ms is becase previous was cached. One does not have to use this cache.)


Karl Wettin added a comment - 17/Aug/07 02:06 AM
RAMDirectory vs. InstantiatedIndex as apriori index: the latter is 5 to 25 times faster (leave first out).

RAMDirectory:
72ms the pilates -> the pirates
440ms heroes of fight and magic -> heroes of might and magic
417ms heroes of right and magic -> heroes of might and magic
383ms heroes of magic and light -> heroes of might and magic
20ms heroesof lightand magik -> null
385ms heroes of light and magik -> heroes of might and magic
0ms heroesof lightand magik -> heroes of might and magic
385ms heroes of magic and might -> heroes of might and magic

InstantiatedIndex:
171ms the pilates -> the pirates
66ms heroes of fight and magic -> heroes of might and magic
36ms heroes of right and magic -> heroes of might and magic
14ms heroes of magic and light -> heroes of might and magic
6ms heroesof lightand magik -> null
15ms heroes of light and magik -> heroes of might and magic
0ms heroesof lightand magik -> heroes of might and magic
16ms heroes of magic and might -> heroes of might and magic


Karl Wettin added a comment - 16/Oct/07 06:45 PM
New in this patch:
  • Dictionary persistence using BDB/je/3.2.44
  • Lots of refactoring to make the code easier to follow

Next patch will probably focus on:

  • Pruning of dictionary
  • Documentation of non-consumer methods

Karl Wettin added a comment - 23/Oct/07 08:18 PM
In this patch:
  • Updated package javadoc
  • Simplified consumer interface with persistent session management :
SuggestionFacade facade = new SuggestionFacade(new File("data"));
facade.getDictionary().getPrioritesBySecondLevelSuggester().putAll(facade.secondLevelSuggestionFactory());
...
QuerySession session = facade.getQuerySessionManager().sessionFactory();
...
String query = "heros of mght and magik";
Hits hits = searcher.search(queryFactory(query));
String suggested = facade.didYouMean(query);
session.query(query, hits.length(), suggested);
...
facade.getQuerySessionManager().getSessionsByID().put(session);
...
facade.trainExpiredSessions();
...
facade.close();
  • Optimizations. On my MacBook it now takes five minutes for the big unit test to process 3,500,000 queries: training the dictionary and extracts an a priori corpus by inverting the dictionary of the phrases and words people have most problem spelling.
  • Depends on LUCENE-550 by default again. When it took 30 seconds to execute 100,000 span near queries in a RAMDirectory and less than one second to do the same witn an InstantiatedIndex it simply did not make sense to use RAMDirectory as default. Replacing one line of code removes the dependency to InstantiatedIndex.
  • New algorithmic second level suggester for queries containing terms not close enough in the text to be found in the a priori. Added with lowest priority and checks against the system index rather than the a priori index. Soon the second level suggster classes will needs a bit of refactoring.

Karl Wettin added a comment - 17/Apr/08 08:02 PM
If anyone have some rather large query logs with session id, time stamp and preferably click through data that I can test on this, that would be great. It really needs to be adjusted to more than one.