Solr / SOLR-3145

Velocity "/browse" config should set mm=100% to behave as in 3.x

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: web gui
    • Labels:
      None

      Description

      After SOLR-1889 was committed, the default for the DisMax "mm" parameter depends on q.op. Since defaultOperator=OR in the example schema.xml, and no "mm" parameter is specified in the "/browse" request handler, DisMax will fall back to mm=0%. To be consistent with 3.x behavior, we should add mm=100% to the "/browse" config.
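
The effect described above can be illustrated with a small Python sketch. This is not Solr code and not part of the patch; the function names are hypothetical, and the percentage handling is simplified to non-negative percentages rounded down. It assumes, as the description states, that an unspecified mm defaults to 100% when q.op is AND and to 0% otherwise.

```python
def default_mm(q_op: str) -> str:
    """Default mm when the request specifies none (behavior described above)."""
    return "100%" if q_op.upper() == "AND" else "0%"


def required_matches(optional_clauses: int, mm: str) -> int:
    """Number of optional clauses that must match for a percentage mm.

    Simplified assumption: only non-negative percentages, rounded down.
    """
    percent = int(mm.rstrip("%"))
    return (optional_clauses * percent) // 100


def matches(query_terms, doc_terms, mm: str) -> bool:
    """True if the document satisfies the query under the given mm.

    A query made only of optional clauses still needs at least one hit,
    so the requirement is never lower than 1.
    """
    hits = sum(1 for term in query_terms if term in doc_terms)
    needed = max(1, required_matches(len(query_terms), mm))
    return hits >= needed


query = ["ipod", "video"]
doc = {"ipod", "nano"}  # contains only one of the two query terms

print(matches(query, doc, default_mm("OR")))   # mm=0%: one matching term is enough
print(matches(query, doc, "100%"))             # mm=100%: every term must match
```

With q.op=OR and no explicit mm, the document containing only one of the two terms matches; with mm=100% it does not, which is the 3.x "/browse" behavior this issue wants to keep.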

      1. SOLR-3145.patch
        0.5 kB
        Jan Høydahl

        Activity

        Jan Høydahl added a comment -

        Clarified title and description.

        Yonik Seeley added a comment -

        > explicit sorting is a no-go for most use cases

        Heh. Seems we live in very different worlds. Perhaps Lucene is only about full-text search (at least it was in the past), but Solr has always been about much more than that. Sorting by other things than "full-text relevance" is extremely common.

        Uwe Schindler added a comment -

        I agree with Jan that "AND" is good for some use cases, but only when the user wants to override scoring and just sort, e.g. by date, which is bogus for full-text search engines altogether. The first thing a full-text search consultant should explain to the company representatives is that explicit sorting is a no-go for most use cases. If the user wants to influence scoring, they can do that e.g. by adding per-document boost factors as a DocValues field or by multiplying in other score factors like geo distance, but never ever simply sort by distance in geo search (simple example: a "cocktail bar" 2 miles away might be a better result than a bar called "cocktail stripper" 100 yards away for users who entered "cocktail bar" into their search engine).
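
A rough Python sketch of the cocktail bar example follows. Everything in it is an assumption for illustration: the text scores, the `distance_factor` decay function and its 3-mile scale are made up, and nothing here is Solr or Lucene API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    text_score: float      # hypothetical full-text relevance for "cocktail bar"
    distance_miles: float  # distance from the user


def distance_factor(distance_miles: float, scale_miles: float = 3.0) -> float:
    """Hypothetical decay: 1.0 at distance 0, 0.5 at scale_miles."""
    return scale_miles / (scale_miles + distance_miles)


def rank_by_distance(hits):
    """Explicit sort: the text score is ignored entirely."""
    return sorted(hits, key=lambda h: h.distance_miles)


def rank_by_boosted_score(hits):
    """Distance is multiplied into the score instead of replacing it."""
    return sorted(
        hits,
        key=lambda h: h.text_score * distance_factor(h.distance_miles),
        reverse=True,
    )


hits = [
    Hit("cocktail stripper", text_score=0.4, distance_miles=0.057),  # about 100 yards
    Hit("cocktail bar", text_score=1.0, distance_miles=2.0),
]

print([h.name for h in rank_by_distance(hits)])
print([h.name for h in rank_by_boosted_score(hits)])
```

Under these made-up numbers the plain distance sort puts the nearer but weaker match first, while the multiplied score keeps the better textual match on top.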

        Uwe Schindler added a comment -

        > Rather strong/blanket statement. It seems roughly true that adding non-trivial words to a google search lowers the number of matches.

        This "seems" to be true in lots of cases. But if you search Google for "google number of results" you will see pages from all over the internet discussing this topic. Even Google states in its FAQ that the number of results is just a guess and depends on various factors that appear quite random; there is no relationship between the counts and adding or removing terms. Even the same search returns largely different counts when you change pages (e.g. going from page 1 to page 2 completely changes the count). The reason for this is of course query preprocessing, different search clusters and user-specific preferences. To get a more Solr-like result, use "wortwörtlich" (German) / "verbatim" (English) in the left sidebar.

        Lots of people simply say: the Google count is just arbitrary and useless for any metrics.

        Yonik Seeley added a comment -

        >> I believe this is how google does it?
        >
        > This is false.

        Rather strong/blanket statement. It seems roughly true that adding non-trivial words to a google search lowers the number of matches.

        I guess we'll continue to disagree with a "lowest common denominator" approach to languages.
        It's too bad that our example has no stopwords or stemming any more because of this philosophy.

        Robert Muir added a comment -

        > But Jan is talking about just changing the default for just an example GUI (/browse), and not any query parsers.

        I think it's pretty important. The problem is that in some languages, someone enters a search query with some useless particle
        or something and misses documents completely only because of grammatical structure.

        Also, for a lot of languages (e.g. Chinese), tokenization into 'query terms' is not even close to completely accurate!

        > That's pretty minor - not a big deal either way, but I do think that from a "finished product" perspective, more people expect all of their query terms to appear in matching documents (and I believe this is how google does it?)

        This is false. Search for 'lucid in imagination' and look at the first result: it does not contain the word 'in'.
        This is just an illustration of my point (it's hard to come up with examples for English), but other examples
        would be simple things like searching for U.S.A-China relations and missing documents that have U.S.-China relations.

        In general most of the stopword lists we have are very incomplete and minimal: I think this is good. But if you choose
        to use AND as a default, you need to be much more aggressive about these things.

        Also, I'm completely failing to mention that use cases that do more natural-language searches (e.g. longer queries) would really
        suffer more here.

        Again I think: don't wire the queryparser to force 100% query-term-importance, lean on the ranking system to do this.
        As I mentioned, it's my opinion that there are serious problems with Lucene's sqrt() tf normalization (it grows too fast and does
        not represent the information gain of additional term occurrences well), causing additional occurrences of only a few terms
        to blow up the score versus documents that actually do contain all terms: but we shouldn't solve that with a hammer like this.

        So from a 'finished product' perspective I think it should work reasonably well for as many languages and use cases as possible out of the box:
        it should be generic. This kind of tuning that's specific to only certain use cases/languages/configurations is well documented
        (it's easy to change the default operator) and not tricky to do.

        Yonik Seeley added a comment -

        > SOLR-1889 was the correct change [...] Changing the queryparser default to AND is very bad

        +1 (but probably for different reasons than you)

        But Jan is talking about just changing the default for just an example GUI (/browse), and not any query parsers. That's pretty minor - not a big deal either way, but I do think that from a "finished product" perspective, more people expect all of their query terms to appear in matching documents (and I believe this is how google does it?)

        Robert Muir added a comment -

        Because I think SOLR-1889 was the correct change: the default Lucene queryparser is OR, and there are many good
        reasons for this.

        Changing the queryparser default to AND is very bad for isolating languages. I strongly disagree with doing this.

        Jan Høydahl added a comment -

        It is no surprise that you get better recall with OR - and thus find certain documents related to one of the terms which do not contain all terms. That's ABC and you don't need to prove that. But that is not the same as assuming that most Solr users prefer OR over AND. People seem to have been happy with "/browse" being AND for the past years, so why change now?

        Robert Muir added a comment -

        I guess I'm not very trendy.

        I can run tests comparing AND and OR for you on standard test collections if you want, I already know the answers.
        For defaults, we should take the conservative approach. Trendy people can change the defaults.

        Jan Høydahl added a comment -

        >> Another thing - ALL applications that want to do sorting should care about the precision of their search.
        >
        > That's not searching, that's matching. I think we should default to good behavior for search.

        Come again? Are you saying people don't build search-driven applications these days? If so, you're just missing out on a big trend in the market... Our customers tend to request a seamless mix of advanced full-text search, navigation and metadata filtering/sorting. Forcing people into either strict metadata matching OR free-text search is artificial.

        Anyway, this is a sidetrack. This issue is about NOT changing the "/browse" behaviour from 3.x to 4.x.

        Robert Muir added a comment -

        > Feel free to open new JIRAs for the other shortcomings you mentioned, like better Similarity defaults - I'm a big fan of that as well!

        The problem is 4.x must still be able to read 3.x indexes and return good results, but 3.x indexes don't have the statistics we need
        to e.g. default to BM25 or something else. So I was hoping to bring this up for 5.0; for 4.0 it seems we should take the conservative
        approach and keep what we have, so that any migrating users don't have bad performance (yes, all those Sims will work in degraded mode
        for preflex indexes, but I don't like that).

        Robert Muir added a comment -

        > Another thing - ALL applications that want to do sorting should care about the precision of their search.

        That's not searching, that's matching. I think we should default to good behavior for search.

        Jan Høydahl added a comment -

        Another thing - ALL applications that want to do sorting should care about the precision of their search. If there are 100 relevant docs for a given query, say q=sports car, and your result set returns 1000 docs since you use q.op=OR, then you may very well get the best sports cars on top, but try sorting by date, price, popularity or anything other than "score" and your results are crap because you only paid attention to search recall, not to precision. It's like a scale - gain one and you lose the other.
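
The arithmetic behind this argument can be sketched in Python. The OR figures (100 relevant docs, 1000 returned) come from the comment above; the assumption that all 100 relevant docs are among the 1000, the AND figures, and the assumption that the sort key is unrelated to relevance are hypothetical and only there to make the trade-off concrete.

```python
def precision(relevant_returned: int, total_returned: int) -> float:
    """Fraction of returned docs that are relevant."""
    return relevant_returned / total_returned


def recall(relevant_returned: int, total_relevant: int) -> float:
    """Fraction of all relevant docs that were returned."""
    return relevant_returned / total_relevant


def expected_relevant_in_top(n: int, relevant_returned: int, total_returned: int) -> float:
    """Expected relevant docs in the top n when sorting by something
    unrelated to relevance (date, price, ...), so that the top n is in
    effect a random sample of the result set. Assumption, see lead-in."""
    return n * precision(relevant_returned, total_returned)


TOTAL_RELEVANT = 100

# q.op=OR: figures from the comment, assuming all relevant docs are returned.
or_relevant, or_returned = 100, 1000
# q.op=AND: hypothetical figures for comparison.
and_relevant, and_returned = 90, 120

for label, rel, ret in [("OR", or_relevant, or_returned), ("AND", and_relevant, and_returned)]:
    print(
        label,
        "precision=%.2f" % precision(rel, ret),
        "recall=%.2f" % recall(rel, TOTAL_RELEVANT),
        "expected relevant in top 10 by date=%.1f" % expected_relevant_in_top(10, rel, ret),
    )
```

Under these assumptions the OR result set has perfect recall but only one in ten returned docs is relevant, so a date-sorted first page is mostly noise; the hypothetical AND result set gives up some recall for much higher precision.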

        Jan Høydahl added a comment -

        Robert, we don't disagree on the fact that search is more difficult than a simple OR or AND. People need to invest in designing a good search experience, taking these factors as well as many others into consideration. There is no silver bullet for recall or relevancy, and advice to use "OR" isn't one either. I have been involved in more than 100 enterprise search installations worldwide, and in perhaps two or three of them we chose "OR" as default. Most often it's a matter of "AND" as default plus a lot of careful design in order to increase recall without sacrificing too much precision. Another key point is that people expect AND-ish behavior from the large public search engines, and are puzzled if they keep getting more results the more words they enter in the search box.

        Feel free to open new JIRAs for the other shortcomings you mentioned, like better Similarity defaults - I'm a big fan of that as well!

        Robert Muir added a comment -

        > Robert, I don't get your comment - what does this have to do with stopwords or Similarity? It sounds more like a general opinion that you like OR better than AND, the more hits the better...

        It's not a general opinion. I don't care how many 'totalHits' are returned. I care about the relevance of the top N.

        And when good results are discarded simply because the query contained a useless word like 'his', that's bad news.

        People are too quick to jump to AND without debugging the real problem. The problem is that they see results that don't contain all of their query terms
        ranked above results that do: this is a direct result of Lucene's sqrt() tf normalization function (which it tries to make up for with coord), as opposed
        to other alternatives that are less aggressive and are known to perform better.

        Forcing everything to AND makes the ranking system extremely fragile in cases like stopwords; this is applying a hammer,
        and it's not the right default.
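
The ranking problem described above can be shown with a toy Python calculation. It is a deliberately stripped-down sketch: idf, length norms, coord and query norm are all ignored, the term frequencies are invented, and the saturating alternative is the BM25 term-frequency component with length normalization switched off and an assumed k1 of 1.2. BM25 is named elsewhere in this thread as a possible future default, but this is not Lucene code.

```python
import math


def sqrt_tf(freq: int) -> float:
    """Square-root term-frequency normalization."""
    return math.sqrt(freq)


def saturating_tf(freq: int, k1: float = 1.2) -> float:
    """BM25-style tf component without length normalization.

    It approaches k1 + 1 as freq grows, so repeated occurrences of one
    term add less and less.
    """
    return freq * (k1 + 1) / (freq + k1)


def score(term_freqs: dict, query_terms: list, tf_norm) -> float:
    """Toy score: sum of normalized term frequencies over the query terms."""
    return sum(tf_norm(term_freqs.get(term, 0)) for term in query_terms)


query = ["sports", "car"]
doc_one_term = {"sports": 9}               # repeats one query term, lacks the other
doc_all_terms = {"sports": 1, "car": 1}    # contains every query term once

for name, tf_norm in [("sqrt", sqrt_tf), ("saturating", saturating_tf)]:
    print(
        name,
        "one-term doc: %.2f" % score(doc_one_term, query, tf_norm),
        "all-terms doc: %.2f" % score(doc_all_terms, query, tf_norm),
    )
```

With sqrt normalization the document that merely repeats one term outscores the document containing both terms; with the saturating function the order flips, without any need to force AND.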

        Jan Høydahl added a comment -

        Robert, I don't get your comment - what does this have to do with stopwords or Similarity? It sounds more like a general opinion that you like OR better than AND, the more hits the better...

        What this is about is letting the example "/browse" GUI stick to its previous mm=100% behavior so 3.x "/browse" users will have a consistent experience. If people want "OR" they can change it. Personally I'd prefer changing defaultOperator in example schema to "AND", but I'm fine with OR there if /browse gets fixed.

        Robert Muir added a comment -

        I think defaulting to AND is very dangerous: especially with more minimal stopword lists
        like Lucene's. Then shorter documents that happen to be missing some useless pronoun
        don't show up in results at all.

        Any problems that this would "Fix" are really problems with Lucene's Similarity: the term
        frequency normalization function grows too fast, etc.

        Why not fix the real problem instead: either default to a Similarity with a stronger coord()
        implementation, or a stronger ranking algorithm altogether.
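
As a hedged illustration of what a "stronger coord()" could mean, the Python sketch below raises the usual overlap/maxOverlap ratio to a power. The `power` parameter, the term frequencies and the simplified score (sqrt tf times coord, with no idf or norms) are assumptions for this example, not a proposal from the thread and not Lucene API.

```python
import math


def coord(overlap: int, max_overlap: int, power: float = 1.0) -> float:
    """Fraction of query terms matched, optionally sharpened by `power`.

    power=1.0 is the plain overlap / max_overlap ratio; a larger power is
    one hypothetical way to penalize missing terms more strongly.
    """
    return (overlap / max_overlap) ** power


def score(term_freqs: dict, query_terms: list, power: float) -> float:
    """Toy score: sum of sqrt(tf) over matched terms, scaled by coord."""
    matched = [t for t in query_terms if term_freqs.get(t, 0) > 0]
    raw = sum(math.sqrt(term_freqs[t]) for t in matched)
    return raw * coord(len(matched), len(query_terms), power)


query = ["sports", "car"]
doc_one_term = {"sports": 25}             # many occurrences of a single term
doc_all_terms = {"sports": 1, "car": 1}   # every query term once

for power in (1.0, 2.0):
    print(
        "power=%.1f" % power,
        "one-term doc: %.2f" % score(doc_one_term, query, power),
        "all-terms doc: %.2f" % score(doc_all_terms, query, power),
    )
```

In this toy setup the plain ratio is not enough to keep the one-term document below the document with all terms, while the sharper variant is; the documents missing a term still match, they just rank lower.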

        Jan Høydahl added a comment -

        Other opinions? If not, I'll prepare a patch changing "/browse" to default to mm=100%

        Yonik Seeley added a comment -

        > B) Add a mm=100% to the requestHandler config of "browse"

        +1

        Jan Høydahl added a comment -

        Two options:

        A) Change schema.xml defaultOperator from OR to AND. That's what the majority of people want anyway, isn't it?

        B) Add a mm=100% to the requestHandler config of "browse"

        What do you prefer?


          People

          • Assignee:
            Jan Høydahl
            Reporter:
            Jan Høydahl
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:
