But Jan is talking about just changing the default for just an example GUI (/browse), and not any query parsers.
I think it's pretty important. The problem is that in some languages, someone enters a search query containing some useless particle
or similar and misses documents completely, purely because of grammatical structure.
Also, for a lot of languages (e.g. Chinese), tokenization into 'query terms' is not even close to completely accurate!
That's pretty minor - not a big deal either way, but I do think that from a "finished product" perspective, more people expect all of their query terms to appear in matching documents (and I believe this is how google does it?
This is false. Search for 'lucid in imagination' and look at the first result: it does not contain the word 'in'.
This is just an illustration of my point (it's hard to come up with examples for English), but other examples
would be simple things like searching for U.S.A-China relations and missing documents that have U.S.-China relations.
In general, most of the stopword lists we have are very incomplete and minimal: I think this is good. But if you choose
to use AND as the default, you need to be much more aggressive about these things.
Also, I'm completely failing to mention that use cases doing more natural-language searches (e.g. longer queries) would
suffer even more here.
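To make the U.S.A-China example concrete, here is a rough sketch of how the default operator changes the outcome. The tokenize() function is a toy stand-in I made up for illustration (it is not Lucene's analysis chain), and matches() and coverage() are hypothetical helpers, not any query parser's API.

```python
import re

def tokenize(text):
    # Toy analyzer (an assumption, not Lucene's): lowercase, drop dots,
    # then split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower().replace(".", ""))

def matches(query, doc, operator):
    # AND requires every query term to be present; OR requires at least one.
    query_terms = set(tokenize(query))
    doc_terms = set(tokenize(doc))
    if operator == "AND":
        return query_terms <= doc_terms
    return bool(query_terms & doc_terms)

def coverage(query, doc):
    # Fraction of query terms found in the document; a ranking function
    # can reward this without making it a hard requirement.
    query_terms = set(tokenize(query))
    doc_terms = set(tokenize(doc))
    return len(query_terms & doc_terms) / len(query_terms)

query = "U.S.A-China relations"
doc = "U.S.-China relations improved this year"

and_hit = matches(query, doc, "AND")   # 'usa' is missing from the doc
or_hit = matches(query, doc, "OR")     # 'china' and 'relations' still match
covered = coverage(query, doc)
```

With AND, the one mismatched token ('usa' vs 'us') drops an obviously relevant document; with OR it still matches and ranking can sort out how good the match is.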
Again I think: don't wire the queryparser to force 100% query-term-importance, lean on the ranking system to do this.
As I mentioned, it's my opinion that there are serious problems with Lucene's sqrt() tf normalization (it grows too fast and does
not represent the information gain of additional term occurrences well), causing additional occurrences of only a few terms
to blow up the score versus documents that actually do contain all terms: but we shouldn't solve that with a hammer like this.
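A small sketch of the tf effect, under heavy simplifying assumptions: it only sums a tf weight per matching query term and ignores idf, length norms, and the coord factor, so it isolates the shape of the tf curve and nothing else. The saturating function is a BM25-style curve with k1=1.2 that I picked purely for comparison; it is not a proposal from this issue, and score() is a hypothetical helper.

```python
import math

def sqrt_tf(freq):
    # Square-root tf weighting: keeps growing with every extra occurrence.
    return math.sqrt(freq)

def saturating_tf(freq, k1=1.2):
    # BM25-style saturation (comparison only): approaches k1 + 1 as freq grows.
    return freq * (k1 + 1) / (freq + k1)

def score(query_terms, doc_freqs, tf_fn):
    # Toy score: sum of tf weights over the query terms present in the doc.
    return sum(tf_fn(doc_freqs[t]) for t in query_terms if doc_freqs.get(t, 0) > 0)

query_terms = ["lucid", "imagination", "search"]

# Doc A repeats a single query term many times; doc B has every term once.
doc_a = {"lucid": 16}
doc_b = {"lucid": 1, "imagination": 1, "search": 1}

sqrt_a = score(query_terms, doc_a, sqrt_tf)        # sqrt(16) = 4.0
sqrt_b = score(query_terms, doc_b, sqrt_tf)        # 3 * sqrt(1) = 3.0
sat_a = score(query_terms, doc_a, saturating_tf)   # a bit over 2.0
sat_b = score(query_terms, doc_b, saturating_tf)   # 3 * 1.0 = 3.0
```

In this toy setup the one-term document wins under sqrt() and loses under the saturating curve, which is the kind of thing I'd rather fix in ranking than by forcing AND.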
So from a 'finished product' perspective I think it should work reasonably well for as many languages and use cases as possible out of the box:
it should be generic. This kind of tuning, which is specific to only certain use cases/languages/configurations, is well documented
(it's easy to change the default operator) and not tricky to do.