Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-3271, I moved the 'good' queries from the queries contrib to new destinations (primarily the queries module). The remnants now need to find their home. As suggested in LUCENE-3271, these classes are not bad per se, just odd. So lets create a sandbox contrib that they and other 'odd' contrib classes can go to. We can then decide their fate at another time.

      1. LUCENE-3381.patch
        44 kB
        Chris Male
      2. LUCENE-3381.patch
        66 kB
        Chris Male

        Activity

        Hide
        Chris Male added a comment -

        Patch which does the following:

        • Establishes a sandbox contrib
        • Moves the queries from the queries contrib to the sandbox. FuzzyLikeThisQuery has grown on me so I've given it a clean and pushed it to the queries module.
        • Removes the queries contrib
        Show
        Chris Male added a comment - Patch which does the following: Establishes a sandbox contrib Moves the queries from the queries contrib to the sandbox. FuzzyLikeThisQuery has grown on me so I've given it a clean and pushed it to the queries module. Removes the queries contrib
        Hide
        Chris Male added a comment -

        Command for patch:

        svn move lucene/contrib/queries/src/java/org/apache/lucene/search/FuzzyLikeThisQuery.java modules/queries/src/java/org/apache/lucene/queries/FuzzyLikeThisQuery.java
        svn move lucene/contrib/queries/src/test/org/apache/lucene/search/FuzzyLikeThisQueryTest.java modules/queries/src/test/org/apache/lucene/queries/FuzzyLikeThisQueryTest.java
        svn move --parents lucene/contrib/queries/src/java/org/apache/lucene/search/* lucene/contrib/sandbox/src/java/org/apache/lucene/sandbox/queries/
        svn move --parents lucene/contrib/queries/lib/* lucene/contrib/sandbox/lib/
        svn move --parents lucene/contrib/queries/src/test/org/apache/lucene/search/*  lucene/contrib/sandbox/src/test/org/apache/lucene/sandbox/queries/
        svn delete dev-tools/idea/lucene/contrib/queries
        svn delete dev-tools/maven/lucene/contrib/queries
        svn delete lucene/contrib/queries
        
        Show
        Chris Male added a comment - Command for patch: svn move lucene/contrib/queries/src/java/org/apache/lucene/search/FuzzyLikeThisQuery.java modules/queries/src/java/org/apache/lucene/queries/FuzzyLikeThisQuery.java svn move lucene/contrib/queries/src/test/org/apache/lucene/search/FuzzyLikeThisQueryTest.java modules/queries/src/test/org/apache/lucene/queries/FuzzyLikeThisQueryTest.java svn move --parents lucene/contrib/queries/src/java/org/apache/lucene/search/* lucene/contrib/sandbox/src/java/org/apache/lucene/sandbox/queries/ svn move --parents lucene/contrib/queries/lib/* lucene/contrib/sandbox/lib/ svn move --parents lucene/contrib/queries/src/test/org/apache/lucene/search/* lucene/contrib/sandbox/src/test/org/apache/lucene/sandbox/queries/ svn delete dev-tools/idea/lucene/contrib/queries svn delete dev-tools/maven/lucene/contrib/queries svn delete lucene/contrib/queries
        Hide
        Mark Harwood added a comment -

        FuzzyLikeThisQuery has grown on me so I've given it a clean and pushed it to the queries module

        I have thousands of people using it daily in dozens of different installations so keen to avoid any significant changes/loss.

        Show
        Mark Harwood added a comment - FuzzyLikeThisQuery has grown on me so I've given it a clean and pushed it to the queries module I have thousands of people using it daily in dozens of different installations so keen to avoid any significant changes/loss.
        Hide
        Robert Muir added a comment -

        But this query has serious problems:

        • the fact it caches 'rewrittenQuery' in the QUERY itself, this means the Query is keeping IndexSearcher-dependent state in the Query: this is broken.
        • the static similarity, which at the moment must extend TFIDFSimilarity (does not work with other scoring systems)

        Sorry, I think these problems should be fixed first!

        Show
        Robert Muir added a comment - But this query has serious problems: the fact it caches 'rewrittenQuery' in the QUERY itself, this means the Query is keeping IndexSearcher-dependent state in the Query: this is broken. the static similarity, which at the moment must extend TFIDFSimilarity (does not work with other scoring systems) Sorry, I think these problems should be fixed first!
        Hide
        Chris Male added a comment -

        I have thousands of people using it daily in dozens of different installations so keen to avoid any significant changes/loss.

        I have no intention of removing it but as Robert has pointed out in a couple of issues there is huge scope for improvement which we cannot ignore. I'll make sure to document any changes.

        • the fact it caches 'rewrittenQuery' in the QUERY itself, this means the Query is keeping IndexSearcher-dependent state in the Query: this is broken.
        • the static similarity, which at the moment must extend TFIDFSimilarity (does not work with other scoring systems)

        Good points. The cached rewrittenQuery seems like an easy fix (albeit a change that should be documented). I see your comments in code about averaging out statistics. That seems like a bigger issue. Moving away from a static similarity seems like its doable at this stage. Is the agnostic scoring model support a showstopper?

        Show
        Chris Male added a comment - I have thousands of people using it daily in dozens of different installations so keen to avoid any significant changes/loss. I have no intention of removing it but as Robert has pointed out in a couple of issues there is huge scope for improvement which we cannot ignore. I'll make sure to document any changes. the fact it caches 'rewrittenQuery' in the QUERY itself, this means the Query is keeping IndexSearcher-dependent state in the Query: this is broken. the static similarity, which at the moment must extend TFIDFSimilarity (does not work with other scoring systems) Good points. The cached rewrittenQuery seems like an easy fix (albeit a change that should be documented). I see your comments in code about averaging out statistics. That seems like a bigger issue. Moving away from a static similarity seems like its doable at this stage. Is the agnostic scoring model support a showstopper?
        Hide
        Robert Muir added a comment -

        Good points. The cached rewrittenQuery seems like an easy fix (albeit a change that should be documented).

        Well its a bug? We document that you can safely reuse Query, so if a system like Solr or ElasticSearch that caches queries caches this query, this stuff won't function correctly?

        Is the agnostic scoring model support a showstopper?

        Yes, in my opinion it is, we have to think this through.

        Show
        Robert Muir added a comment - Good points. The cached rewrittenQuery seems like an easy fix (albeit a change that should be documented). Well its a bug? We document that you can safely reuse Query, so if a system like Solr or ElasticSearch that caches queries caches this query, this stuff won't function correctly? Is the agnostic scoring model support a showstopper? Yes, in my opinion it is, we have to think this through.
        Hide
        Chris Male added a comment -

        Well its a bug? We document that you can safely reuse Query, so if a system like Solr or ElasticSearch that caches queries caches this query, this stuff won't function correctly?

        Yes its a bug in that it doesn't conform to Query's requirements, but its a change in behaviour for this particular Query impl. I'm happy to document the change under the bugs section in CHANGES.txt so we cover both bases.

        Yes, in my opinion it is, we have to think this through.

        To sandbox it goes!

        Show
        Chris Male added a comment - Well its a bug? We document that you can safely reuse Query, so if a system like Solr or ElasticSearch that caches queries caches this query, this stuff won't function correctly? Yes its a bug in that it doesn't conform to Query's requirements, but its a change in behaviour for this particular Query impl. I'm happy to document the change under the bugs section in CHANGES.txt so we cover both bases. Yes, in my opinion it is, we have to think this through. To sandbox it goes!
        Hide
        Mark Harwood added a comment -

        Sorry, I think these problems should be fixed first!

        I acknowledge those are issues to be addressed for more general use and (luckily) do not affect my existing usage of this class. I expect the query-caching issue can be fixed relatively simply. The scoring system issue gets interesting because this fuzzy functionality relies on tweaking IDF in particular. To work across different scoring systems generically I expect IDF-tweakage would need to be made a pluggable aspect of all these scoring strategies e.g. through a common interface. Messy.

        The point of my previous comment was to register that this was one of the queries languishing outside of core that was genuinely in active use. I imagined this would be useful information if folks are in the process of cleaning out some of the "dead wood" (including relegation to a "useless-but-stored-for-historical-purposes" pile).

        Show
        Mark Harwood added a comment - Sorry, I think these problems should be fixed first! I acknowledge those are issues to be addressed for more general use and (luckily) do not affect my existing usage of this class. I expect the query-caching issue can be fixed relatively simply. The scoring system issue gets interesting because this fuzzy functionality relies on tweaking IDF in particular. To work across different scoring systems generically I expect IDF-tweakage would need to be made a pluggable aspect of all these scoring strategies e.g. through a common interface. Messy. The point of my previous comment was to register that this was one of the queries languishing outside of core that was genuinely in active use. I imagined this would be useful information if folks are in the process of cleaning out some of the "dead wood" (including relegation to a "useless-but-stored-for-historical-purposes" pile).
        Hide
        Robert Muir added a comment -

        To work across different scoring systems generically I expect IDF-tweakage would need to be made a pluggable aspect of all these scoring strategies e.g. through a common interface. Messy.

        I don't think think we need to do that?
        I added a comment to the source code:

            // TODO: generalize this query (at least it should not reuse this static sim!
            // a better way might be to convert this into multitermquery rewrite methods.
            // the rewrite method can 'average' the TermContext's term statistics (docfreq,totalTermFreq) 
            // provided to TermQuery, so that the general idea is agnostic to any scoring system...
        

        I don't think this is really that hard, nor messy?
        Then this Query just invokes rewrite() to a BooleanQuery of ordinary fuzzyqueries, setting its custom rewrite methods (it looks like we need to implement 2 here, depending upon configuration) on each.

        The rewrite methods would average docfreq and totaltermfreq (the only two "collection-wide" term statistics lucene supports), and set these in the TermContexts that they pass to TermQuery. Then the concept works for all scoring systems.

        As a side benefit, this would give some performance benefits anyway since by doing this, the term rewrite will become single pass instead of doing wasted seeks per-segment * per-term.

        Show
        Robert Muir added a comment - To work across different scoring systems generically I expect IDF-tweakage would need to be made a pluggable aspect of all these scoring strategies e.g. through a common interface. Messy. I don't think think we need to do that? I added a comment to the source code: // TODO: generalize this query (at least it should not reuse this static sim! // a better way might be to convert this into multitermquery rewrite methods. // the rewrite method can 'average' the TermContext's term statistics (docfreq,totalTermFreq) // provided to TermQuery, so that the general idea is agnostic to any scoring system... I don't think this is really that hard, nor messy? Then this Query just invokes rewrite() to a BooleanQuery of ordinary fuzzyqueries, setting its custom rewrite methods (it looks like we need to implement 2 here, depending upon configuration) on each. The rewrite methods would average docfreq and totaltermfreq (the only two "collection-wide" term statistics lucene supports), and set these in the TermContexts that they pass to TermQuery. Then the concept works for all scoring systems. As a side benefit, this would give some performance benefits anyway since by doing this, the term rewrite will become single pass instead of doing wasted seeks per-segment * per-term.
        Hide
        Mark Harwood added a comment -

        It's more nuanced than averaging IDF of variants (as discussed at length in LUCENE-329).
        To summarise: the original search term is the closest thing we have to the user's intent. If we average its IDF against all fuzzy variants it is most likely to dilute this value with a set of rare terms (most terms in the termEnum are e.g. typos) that happen to share some characters.
        When sitting this sort of expanded fuzzy query alongside other search terms in a BooleanQuery (e.g. robert~ OR muir) we end up making the "robert~" branch of the query look comparatively rare compared to the straight "muir" term thanks to the IDF dilution from a hundred rare "robert" variations. In my view the correct fix is to use the root term's IDF only (assuming "robert" actually exists in the index otherwise we must drop back to the average of variants).

        That's the trick employed by FuzzyLikeThis that stops my customers complaining about "bad fuzzy matches".

        Show
        Mark Harwood added a comment - It's more nuanced than averaging IDF of variants (as discussed at length in LUCENE-329 ). To summarise: the original search term is the closest thing we have to the user's intent. If we average its IDF against all fuzzy variants it is most likely to dilute this value with a set of rare terms (most terms in the termEnum are e.g. typos) that happen to share some characters. When sitting this sort of expanded fuzzy query alongside other search terms in a BooleanQuery (e.g. robert~ OR muir) we end up making the "robert~" branch of the query look comparatively rare compared to the straight "muir" term thanks to the IDF dilution from a hundred rare "robert" variations. In my view the correct fix is to use the root term's IDF only (assuming "robert" actually exists in the index otherwise we must drop back to the average of variants). That's the trick employed by FuzzyLikeThis that stops my customers complaining about "bad fuzzy matches".
        Hide
        Chris Male added a comment -

        New patch that moves FuzzyLikeThisQuery into the sandbox. DuplicateFilter also goes into the sandbox (incorrectly pushed to the queries module).

        Everything passes and is ready to go.

        I'll open an issue for the fixes to FuzzyLikeThisQuery.

        Show
        Chris Male added a comment - New patch that moves FuzzyLikeThisQuery into the sandbox. DuplicateFilter also goes into the sandbox (incorrectly pushed to the queries module). Everything passes and is ready to go. I'll open an issue for the fixes to FuzzyLikeThisQuery.
        Hide
        Chris Male added a comment -

        Command for patch:

        svn move --parents lucene/contrib/queries/src/java/org/apache/lucene/search/* lucene/contrib/sandbox/src/java/org/apache/lucene/sandbox/queries/
        svn move --parents lucene/contrib/queries/src/test/org/apache/lucene/search/* lucene/contrib/sandbox/src/test/org/apache/lucene/sandbox/queries/
        svn move modules/queries/src/java/org/apache/lucene/queries/DuplicateFilter.java lucene/contrib/sandbox/src/java/org/apache/lucene/sandbox/queries/DuplicateFilter.java
        svn move modules/queries/src/test/org/apache/lucene/queries/DuplicateFilterTest.java lucene/contrib/sandbox/src/test/org/apache/lucene/sandbox/queries/DuplicateFilterTest.java
        svn move --parents lucene/contrib/queries/lib/* lucene/contrib/sandbox/lib/
        svn delete dev-tools/idea/lucene/contrib/queries
        svn delete dev-tools/maven/lucene/contrib/queries
        svn delete lucene/contrib/queries
        
        Show
        Chris Male added a comment - Command for patch: svn move --parents lucene/contrib/queries/src/java/org/apache/lucene/search/* lucene/contrib/sandbox/src/java/org/apache/lucene/sandbox/queries/ svn move --parents lucene/contrib/queries/src/test/org/apache/lucene/search/* lucene/contrib/sandbox/src/test/org/apache/lucene/sandbox/queries/ svn move modules/queries/src/java/org/apache/lucene/queries/DuplicateFilter.java lucene/contrib/sandbox/src/java/org/apache/lucene/sandbox/queries/DuplicateFilter.java svn move modules/queries/src/test/org/apache/lucene/queries/DuplicateFilterTest.java lucene/contrib/sandbox/src/test/org/apache/lucene/sandbox/queries/DuplicateFilterTest.java svn move --parents lucene/contrib/queries/lib/* lucene/contrib/sandbox/lib/ svn delete dev-tools/idea/lucene/contrib/queries svn delete dev-tools/maven/lucene/contrib/queries svn delete lucene/contrib/queries
        Hide
        Robert Muir added a comment -

        It's more nuanced than averaging IDF of variants (as discussed at length in LUCENE-329).

        Mark, yeah but when the term exists providing term B term A's docfreq is trivial.

        the averaging is the part thats a pain in the ass with a TopTermsRewrite

        Show
        Robert Muir added a comment - It's more nuanced than averaging IDF of variants (as discussed at length in LUCENE-329 ). Mark, yeah but when the term exists providing term B term A's docfreq is trivial. the averaging is the part thats a pain in the ass with a TopTermsRewrite
        Hide
        Chris Male added a comment -

        Committed revision 1159846.

        Show
        Chris Male added a comment - Committed revision 1159846.

          People

          • Assignee:
            Chris Male
            Reporter:
            Chris Male
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development