Lucene - Core
LUCENE-626

Extended spell checker with phrase support and adaptive user session analysis.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: Patch Available

      Description

      Extensive javadocs are available in the patch, but I also try to keep a compiled copy here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description

      A (somewhat naive) reinforcement learning thingy backed by algorithmic second level suggestion schemes that learns from and adapts to user behavior as queries change, suggestions are accepted or declined, etc.

      Besides detecting spelling errors, it considers context, composition/decomposition and a few other things, for example:

      heroes of light and magik -> heroes of might and magic
      vinci da code -> da vinci code
      java docs -> javadocs
      blacksabbath -> black sabbath
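
      In rough code terms, the adaptive part boils down to something like the sketch below (class and method names are hypothetical and only illustrate the idea, not the API in the patch): keep per-query suggestion scores, reinforce a suggestion when users accept it or reformulate to it and get hits, and weaken it when they decline it.

      import java.util.HashMap;
      import java.util.Map;

      /** Minimal sketch of an adaptive "did you mean" dictionary (hypothetical names). */
      public class AdaptiveDictionary {

        /** misspelled query -> (suggestion -> accumulated score). */
        private final Map<String, Map<String, Double>> scoresByQuery = new HashMap<String, Map<String, Double>>();

        /** Reinforce: the user accepted the suggestion, or reformulated to it and got hits. */
        public void reinforce(String query, String suggestion, double reward) {
          Map<String, Double> scores = scoresByQuery.get(query);
          if (scores == null) {
            scores = new HashMap<String, Double>();
            scoresByQuery.put(query, scores);
          }
          Double old = scores.get(suggestion);
          scores.put(suggestion, (old == null ? 0d : old.doubleValue()) + reward);
        }

        /** Weaken: the user declined the suggestion. */
        public void decline(String query, String suggestion, double penalty) {
          reinforce(query, suggestion, -penalty);
        }

        /** Best suggestion learned so far, or null if nothing has a positive score. */
        public String didYouMean(String query) {
          Map<String, Double> scores = scoresByQuery.get(query);
          if (scores == null) {
            return null;
          }
          String best = null;
          double bestScore = 0d;
          for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue().doubleValue() > bestScore) {
              best = entry.getKey();
              bestScore = entry.getValue().doubleValue();
            }
          }
          return best;
        }
      }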

      Depends on LUCENE-550

        Activity

        Karl Wettin added a comment -

        All of the old comments were obsolete, so I re-initialized the whole issue.

        Karl Wettin added a comment -

        NgramPhraseSuggester is now decoupled from the adaptive layer, but I would like to refactor it even more so it is easy to replace the SpellChecker with any other single token suggester.
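
        As a hedged illustration of that direction, the single-token layer could be hidden behind a small interface like the one below (the interface itself is hypothetical; only SpellChecker#suggestSimilar(String, int) is the actual contrib/spell API):

        import java.io.IOException;

        /** Hypothetical abstraction for a single token suggester (sketch only). */
        public interface SingleTokenSuggester {
          /** Returns up to maxSuggestions corrections for a single token, best first. */
          String[] suggest(String token, int maxSuggestions) throws IOException;
        }

        /** Adapter around contrib/spell's SpellChecker. */
        class SpellCheckerSuggester implements SingleTokenSuggester {
          private final org.apache.lucene.search.spell.SpellChecker spellChecker;

          SpellCheckerSuggester(org.apache.lucene.search.spell.SpellChecker spellChecker) {
            this.spellChecker = spellChecker;
          }

          public String[] suggest(String token, int maxSuggestions) throws IOException {
            return spellChecker.suggestSimilar(token, maxSuggestions);
          }
        }

        The phrase suggester would then only depend on SingleTokenSuggester, so any other single token implementation could be dropped in.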

        Karl Wettin added a comment -

        The patch in this issue has no dependencies on anything out of the ordinary.

        However, a heavily refactored and well-documented version that depends on LUCENE-550 is also available.

        Karl Wettin added a comment -

        Patch depends on LUCENE-550.

        Nicolas Lalevée added a comment -

        This feature looks interesting, but why should it depend on LUCENE-550?

        Karl Wettin added a comment -

        Nicolas Lalevée [03/Mar/07 01:04 PM]
        > This feature looks interesting, but why should it depend on LUCENE-550 ?

        It uses the Index (notification, unison index factory methods, etc.) and IndexFacade (cache, fresh reader/searcher, etc.) available in that patch. By doing that, it also lets me use InstantiatedIndex for the a priori corpus and ngram index, to speed up response time even more.

        Karl Wettin added a comment -

        As the phrase-suggestion layer on top of contrib/spell in this patch has been mentioned in a number of forums over the last few weeks, I've removed the LUCENE-550 dependency and brought the patch up to date with trunk.

        Second level suggesting (ngram token, phrase) can run stand-alone; see TestTokenPhraseSuggester. However, I recommend the adaptive dictionary, as it acts as a cache on top of the second level suggestions. (See docs.)
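
        Conceptually, the adaptive-dictionary-as-cache behaviour looks roughly like the sketch below (class and method names are hypothetical, not the ones in the patch):

        import java.util.HashMap;
        import java.util.Map;

        /** Sketch: an adaptive dictionary acting as a cache over an expensive second level suggester. */
        public class CachingDidYouMean {

          /** Hypothetical handle to a second level suggester (e.g. the ngram token phrase suggester). */
          public interface SecondLevelSuggester {
            String suggest(String query);
          }

          private final Map<String, String> adaptiveDictionary = new HashMap<String, String>();
          private final SecondLevelSuggester secondLevel;

          public CachingDidYouMean(SecondLevelSuggester secondLevel) {
            this.secondLevel = secondLevel;
          }

          public String didYouMean(String query) {
            // adaptive layer first: essentially a map lookup, which is why the timings below are 0-3 ms
            if (adaptiveDictionary.containsKey(query)) {
              return adaptiveDictionary.get(query);
            }
            // fall back to the expensive algorithmic suggester (tens to hundreds of milliseconds)
            String suggestion = secondLevel.suggest(query);
            // remember the result (even null) so the next identical query is a cache hit
            adaptiveDictionary.put(query, suggestion);
            return suggestion;
          }
        }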

        Output from using the adaptive layer only, i.e. suggestions based on how users previously behaved. About half a million user queries were analyzed to build the dictionary (it takes 30 seconds to build on my dual core):

        3ms pirates ofthe caribbean -> pirates of the caribbean
        2ms pirates of the carribbean -> pirates of the caribbean
        0ms pirates carricean -> pirates caribbean
        1ms pirates of the carriben -> pirates of the caribbean
        0ms pirates of the carabien -> pirates of the caribbean
        0ms pirates of the carabbean -> pirates of the caribbean
        1ms pirates og carribean -> pirates of the caribbean
        0ms pirates of the caribbean music -> pirates of the caribbean soundtrack
        0ms pirates of the caribbean soundtrack -> pirates of the caribbean score
        0ms pirate of carabian -> pirate of caribbean
        0ms pirate of caribbean -> pirates of caribbean
        0ms pirates of caribbean -> pirates of caribbean
        0ms homm 4 -> homm iv
        0ms the pilates -> null

        Output from the phrase ngram token suggester, which checks token matrices against an a priori index. Many queries are required for a single suggestion, so using an InstantiatedIndex as the a priori index saves plenty of milliseconds. This is expensive stuff, but it works pretty well.

        72ms the pilates -> the pirates
        440ms heroes of fight and magic -> heroes of might and magic
        417ms heroes of right and magic -> heroes of might and magic
        383ms heroes of magic and light -> heroes of might and magic
        20ms heroesof lightand magik -> null
        385ms heroes of light and magik -> heroes of might and magic
        0ms heroesof lightand magik -> heroes of might and magic
        385ms heroes of magic and might -> heroes of might and magic

        (That 0ms is because the previous result was cached. One does not have to use this cache.)

        Karl Wettin added a comment -

        RAMDirectory vs. InstantiatedIndex as the a priori index: the latter is 5 to 25 times faster (leaving the first query out).

        RAMDirectory:
        72ms the pilates -> the pirates
        440ms heroes of fight and magic -> heroes of might and magic
        417ms heroes of right and magic -> heroes of might and magic
        383ms heroes of magic and light -> heroes of might and magic
        20ms heroesof lightand magik -> null
        385ms heroes of light and magik -> heroes of might and magic
        0ms heroesof lightand magik -> heroes of might and magic
        385ms heroes of magic and might -> heroes of might and magic

        InstantiatedIndex:
        171ms the pilates -> the pirates
        66ms heroes of fight and magic -> heroes of might and magic
        36ms heroes of right and magic -> heroes of might and magic
        14ms heroes of magic and light -> heroes of might and magic
        6ms heroesof lightand magik -> null
        15ms heroes of light and magik -> heroes of might and magic
        0ms heroesof lightand magik -> heroes of might and magic
        16ms heroes of magic and might -> heroes of might and magic

        Karl Wettin added a comment -

        New in this patch:

        • Dictionary persistence using BDB JE 3.2.44 (a rough sketch of the approach follows at the end of this comment)
        • Lots of refactoring to make the code easier to follow

        Next patch will probably focus on:

        • Pruning of dictionary
        • Documentation of non-consumer methods
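
        As a rough sketch of the BDB JE persistence mentioned above (the database name and the simple query-to-suggestion key/value layout are assumptions for illustration; the patch's actual schema may differ):

        import java.io.File;
        import com.sleepycat.je.Database;
        import com.sleepycat.je.DatabaseConfig;
        import com.sleepycat.je.DatabaseEntry;
        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentConfig;
        import com.sleepycat.je.LockMode;
        import com.sleepycat.je.OperationStatus;

        /** Sketch of a BDB JE backed dictionary store (hypothetical class, illustrative layout). */
        public class BdbDictionaryStore {

          private final Environment environment;
          private final Database database;

          public BdbDictionaryStore(File home) throws Exception {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            environment = new Environment(home, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            database = environment.openDatabase(null, "didyoumean", dbConfig); // name is made up
          }

          /** Stores or replaces a query -> suggestion mapping. */
          public void put(String query, String suggestion) throws Exception {
            database.put(null,
                new DatabaseEntry(query.getBytes("UTF-8")),
                new DatabaseEntry(suggestion.getBytes("UTF-8")));
          }

          /** Returns the stored suggestion, or null if the query is unknown. */
          public String get(String query) throws Exception {
            DatabaseEntry value = new DatabaseEntry();
            OperationStatus status = database.get(null,
                new DatabaseEntry(query.getBytes("UTF-8")), value, LockMode.DEFAULT);
            return status == OperationStatus.SUCCESS ? new String(value.getData(), "UTF-8") : null;
          }

          public void close() throws Exception {
            database.close();
            environment.close();
          }
        }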
        Karl Wettin added a comment -

        In this patch:

        • Updated package javadoc
        • Simplified consumer interface with persistent session management:
        SuggestionFacade facade = new SuggestionFacade(new File("data"));
        facade.getDictionary().getPrioritesBySecondLevelSuggester().putAll(facade.secondLevelSuggestionFactory());
        ...
        QuerySession session = facade.getQuerySessionManager().sessionFactory();
        ...
        String query = "heros of mght and magik";
        Hits hits = searcher.search(queryFactory(query));
        String suggested = facade.didYouMean(query);
        session.query(query, hits.length(), suggested);
        ...
        facade.getQuerySessionManager().getSessionsByID().put(session);
        ...
        facade.trainExpiredSessions();
        ...
        facade.close();
        
        • Optimizations. On my MacBook it now takes five minutes for the big unit test to process 3,500,000 queries: training the dictionary and extracting an a priori corpus by inverting the dictionary of the phrases and words people have the most trouble spelling.
        • Depends on LUCENE-550 by default again. When it took 30 seconds to execute 100,000 span near queries in a RAMDirectory and less than one second to do the same with an InstantiatedIndex, it simply did not make sense to use RAMDirectory as the default (a rough sketch of such a timing comparison follows this list). Replacing one line of code removes the dependency on InstantiatedIndex.
        • New algorithmic second level suggester for queries containing terms not close enough to the text to be found in the a priori index. It is added with the lowest priority and checks against the system index rather than the a priori index. The second level suggester classes will soon need a bit of refactoring.
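
        For context, the span near comparison above could be timed with something like the sketch below, written against the Lucene 2.x API of the era (field name, content and query are made up; swapping the RAMDirectory-backed searcher for one over an InstantiatedIndex from LUCENE-550 gives the other number):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.Hits;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.spans.SpanNearQuery;
        import org.apache.lucene.search.spans.SpanQuery;
        import org.apache.lucene.search.spans.SpanTermQuery;
        import org.apache.lucene.store.RAMDirectory;

        public class SpanNearTiming {

          public static void main(String[] args) throws Exception {
            // build a tiny a priori index in a RAMDirectory
            RAMDirectory directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
            Document document = new Document();
            document.add(new Field("phrase", "heroes of might and magic",
                Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(document);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(directory);
            // slop 2 allows the two unmatched positions ("of", "and") between the terms
            SpanNearQuery query = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("phrase", "heroes")),
                new SpanTermQuery(new Term("phrase", "might")),
                new SpanTermQuery(new Term("phrase", "magic"))
            }, 2, true);

            long start = System.currentTimeMillis();
            for (int i = 0; i < 100000; i++) {
              Hits hits = searcher.search(query);
              hits.length();
            }
            System.out.println((System.currentTimeMillis() - start) + " ms for 100,000 span near queries");
          }
        }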
        Karl Wettin added a comment -

        If anyone has some rather large query logs with session IDs, time stamps and preferably click-through data that I could test this on, that would be great. It really needs to be tuned against more than one log.

        Mikkel Kamstrup Erlandsen added a comment -

        FYI: I have started working on updating Karl's patch. The code is not yet in a compilable state, but it can be found on GitHub: http://github.com/mkamstrup/lucene-didyoumean

        My plans for this are to:

        • Port it to Lucene 3+ API
        • Abstract out the BDB dependency to allow for any storage layer (BDB, JDBC, what have you); a sketch of such an abstraction follows below

        Thanks for the great work Karl!
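
        A hedged sketch of what such a storage abstraction might look like (interface and method names are hypothetical):

        import java.io.IOException;
        import java.util.Iterator;
        import java.util.Map;

        /**
         * Hypothetical storage abstraction so the dictionary can be backed by
         * BDB, JDBC, a plain in-memory map, or anything else (sketch only).
         */
        public interface DictionaryStore extends Iterable<Map.Entry<String, String>> {

          /** Stores or replaces the suggestion for a query. */
          void put(String query, String suggestion) throws IOException;

          /** Returns the stored suggestion, or null if the query is unknown. */
          String get(String query) throws IOException;

          /** Iterates all entries, e.g. for pruning or for inverting into an a priori corpus. */
          Iterator<Map.Entry<String, String>> iterator();

          /** Flushes and releases the underlying resources. */
          void close() throws IOException;
        }

        A BDB JE store, a JDBC store and an in-memory map would then just be alternative implementations of the same interface.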

        Mikkel Kamstrup Erlandsen added a comment -

        @Karl: The test sources refer to a file, http://ginandtonique.org/~kalle/LUCENE-626/queries_grouped.txt.gz, which is no longer online. Is this resource still available somewhere?

        Mikkel Kamstrup Erlandsen added a comment -

        Status update: All tests except two are passing since commit 7e4eb7989d81e50cc81b6f33ac5fa188467f5d3e on http://github.com/mkamstrup/lucene-didyoumean:

        1) TestTokenPhraseSuggester gives me an ArrayIndexOutOfBoundsException roughly halfway through the test cases (which otherwise pass)
        2) The sample query log to import from http://ginandtonique.org/~kalle/LUCENE-626/queries_grouped.txt.gz is still missing

        But be aware that this is still a work in progress!

        Mikkel Kamstrup Erlandsen added a comment -

        Status update: All tests pass now, as of the tag 'milestone3'.

        Missing essentials:

        • An on-disk backend for Dictionary and QuerySessionManager, either via JDBC or some Lucene magic
        • More large scale testing on said on-disk backends

        Missing nice-to-haves:

        • Code cleanup
        • More javadocs
        • Optimizations (there's a lot of low hanging fruit - we sprinkle objects and string copies all over the place!)
        Karl Wettin added a comment -

        Hi Mikkel,

        the test case data set is on an HDD hidden away in an attic 600 km away from me, but I've asked someone in the vicinity to fetch it for me. It might take a little while. Sorry!

        It is, however, extremely cool that you're working on this old beast! I'm super busy as always, but I promise to follow your progress in case there is something you're wondering about. It's been a few years since I looked at the code, though.

        Mikkel Kamstrup Erlandsen added a comment -

        Hi Karl,

        Super that you are still around! Even better that you are willing to try to turn up these files!

        Development is going pretty well, and I finally feel that I can wrap my head around it, but it is most valuable to have you to fall back on!

        Henrik Bitsch Kirk added a comment -

        I have forked Mikkel's project; it is now available (also on GitHub) at http://github.com/hkirk/lucene-didyoumean

        The plan for this is more or less the same as Mikkel's:

        • Keep updated with Lucene trunk (3.0.1+)
        • An on-disk backend for Dictionary and QuerySessionManager + large scale testing

        Ongoing work:

        • Code cleanup
        • JavaDoc
        • Optimizations

          People

          • Assignee: Unassigned
          • Reporter: Karl Wettin
          • Votes: 4
          • Watchers: 14
