Details
-
Improvement
-
Status: Patch Available
-
Major
-
Resolution: Fixed
-
3.0
-
None
-
New
Description
Similarity can only be set per index, but I may want to adjust scoring behaviour at a field level, to faciliate this could we pass make field name available to all score methods.
Currently it is only passed to some such as lengthNorm() but not others such as tf()
Attachments
Attachments
- LUCENE-2236.patch
- 178 kB
- Robert Muir
- LUCENE-2236.patch
- 100 kB
- Robert Muir
- LUCENE-2236.patch
- 104 kB
- Doron Cohen
- LUCENE-2236.patch
- 104 kB
- Doron Cohen
- LUCENE-2236.patch
- 105 kB
- Robert Muir
Activity
Updated patch, with Doron's suggested wording for IndexSearcher.getDefaultSimilarityProvider
I also corrected a javadocs warning in Weight (bad reference to queryNorm).
I think we are ready to commit unless there are any objections.
Hi Robert,
I can update the javadocs to be a little bit more verbose for this method, as I would like to discourage its use... its essentially just like
using Similarity.getDefault() before, which I tried to cleanup across lucene
Yes I was confused by this, it is really an accessor method for the constant, and indeed with this change it is no more possible for users to modify the default behavior regarding similarity JVM wide - as they can still do for e.g. BQ.setMaxClauseCount().
I read it wrongly, as if it said "in general, this setting should not remain... " maybe it is just me.
I agree something as you propose like this would be probably clear on first reading:
/** * Expert: returns a default SimilarityProvider instance. * In general, this method is only called to initialize searchers and writers. * User code and query implementations should respect e.g. * {@link ..... IndexSearcher#getSimilarityProvider()}. * @lucene.internal */
or something similar (I am sure you'll write it better...)
Hi Doron, thanks for the cleanups here!
Would it be cleaner to have a new class DefaultSimilarityProvider which would return DefaultSimilarity? not sure, but thought it's worth mentioning.
I went back and forth on this one... it could go either way? But since the default does the same thing for every field, I thought
it made sense... what do you think? We could always revisit it too later, if we later prefer these to be separate.
javadoc in line 78:
"In general, this should not be used."
Is that so? I would like to believe that in general the default is pretty good.
In this case, I should have been more verbose... what I meant here is that 'in general, code should not call this static method, instead it should
respect the SimilarityProvider set on IndexSearcher [getSimilarityProvider()]...'. This is just the default exposed for internal purposes: just to
have a single immutable default instance for MemoryIndex, InstantiatedIndex, IndexWriterConfig, IndexSearcher... but if someone writes
a custom Query for example, they should be using IndexSearcher.getSimilarityProvider(), and not this static.
I can update the javadocs to be a little bit more verbose for this method, as I would like to discourage its use... its essentially just like
using Similarity.getDefault() before, which I tried to cleanup across lucene.
Attaching fixed patch - these two tests fixed to pass again (I shouldn't have taken nextNorm() out of the loop...) - all Lucene (core) tests pass, Solr ones failed to compile as (from command line) JAVA_HOME pointed to Java 5, - will complete the run later.
note: with last patch both tests fail - TestNorms and TestIndexReaderCloneNorms.
Robert, thanks for renaming back to Similarity.
I reviewed the patch - it looks good to me, just technical simple comments here (it may seem a long list but these are really just simple changes)
- Similarity:
- line 365: should be
- {@link SimilarityProvider#coord(int, int) coord(q,d)}
- line 565: should be:
- @see org.apache.lucene.index.IndexWriterConfig#setSimilarityProvider(SimilarityProvider)
- @see IndexSearcher#setSimilarityProvider(SimilarityProvider)
- line 365: should be
- InstantiatedIndexWriter & MemoryIndex & DocumentWriter & DocState & BooleanQuery
- methods and member should be named similar to IndexSearcher:
- similarityProvider
- getSimilarityProvider()
- setSimilarityProvider(SimilarityProvider similarityProvider)
- methods and member should be named similar to IndexSearcher:
- MIGRATE.txt - I think it should read like this:
- DefaultSimilarity can no longer be set statically (and dangerously) for the entire JVM.
Instead, IndexWriterConfig and IndexSearcher now take a SimilarityProvider.
Similarity can now be configured on a per-field basis.
Similarity retains only the field-specific relevance methods such as tf() and idf().
Methods that apply to the entire query such as coord() and queryNorm() exist in SimilarityProvider.
- DefaultSimilarity can no longer be set statically (and dangerously) for the entire JVM.
- IndexWriterConfig javadocs lines 240-241 should discuss SimilarityProvider (not Similarity)
- Would it be cleaner to have a new class DefaultSimilarityProvider which would return DefaultSimilarity? not sure, but thought it's worth mentioning.
- IndexSearcher
- javadoc in line 78:
"In general, this should not be used."
Is that so? I would like to believe that in general the default is pretty good.
- javadoc in line 78:
- TestIndexReaderCloneNorms the "TODO" added there can be fixed (not too efficiently, but okay for the test I think) like this:
private Document newDoc() { Document d = new Document(); for (int i = 0; i < 10; i++) { String fname = "f" + i; Field f = newField(fname, "v" + i, Store.NO, Index.NOT_ANALYZED); float boost = nextNorm(fname); f.setBoost(boost); d.add(f); } return d; } // return unique norm values that are unchanged by encoding/decoding private float nextNorm(String fname) { float norm = lastNorm + normDelta; do { Similarity sim = new DefaultSimilarity().get(fname);
- TestNorms - can fix the "bogus" TODO similar to the above code snippet.
Here's an updated patch, now that some cleanup work has been committed separately, its easier to manage.
I kept the original name Similarity: I definitely like this better, and tried to clean up the patch to make it committable.
If this patch is ok, I would prefer to make a follow-up issue to make it easier to use per-field Similarity with a Solr schema file.
Currently Solr lets you specify the index-wide Similarity; with this patch, this can be customized per-field, but you have to
manage this in your java implementation.
In my opinion, it would be better to have more declarative schema integration, especially if we get scoring going with more flexibility.
So let's keep that name (Similarity)
OK, I'll fix the patch, to rename FieldSimilarity->Similarity
So Similarity is not only per field but also per query/scorer..
and Query would have an abstract method getSimilarityProvider(fieldName) which would be implemented by each concrete query, neatly separating finding matches from scores computation, and allowing more extendable scoring. Nice.
Also, perhaps what seems to be like an inflation of Similarity objects (per query per field) is one more good reason to keep the field name params for now.
Well I'm not totally sure how we want to do it, but definitely I think we want to split Scorer's calculations and finding matches as you say,
and also split Weight's calculations and "resource management"
For example, TermWeight today has a PerReaderTermState, which contains all the information you need to calculate the "setup" portion
without doing any real I/O (e.g. docFreq, totalTermFreq, totalCollectionFreq, ...) So maybe this is the right thing to pass to Similarity's "query setup".
The Weight then would just be responsible for managing termstate and creating a Scorer...
I think also the Similarity needs to be fully responsible for Explanations... but most users wouldn't have to interact with this I think.
Instead I think typically their "base class" (TFIDFSimilarity or whatever it is) would typically provide this, based on the methods and API
it exposes: tf(), idf(), but this would allow us to also have other fully-fleshed out base classes like BM25Similarity, that you can extend
and tune based on the parameters that make sense to it.
Anyway these are just some thoughts, first I'm going to adjust the patch to keep our existing name "Similarity".
well my concern about the deprecated methods is we get into the hairy backwards compat situation...
It might be ok to essentially fix Similarity to be the way we want for 4.0 (break it) since its an expert API anyway.
This patch was just a quick stab...
I definitely agree with you about the name though, i prefer Similarity.
So let's keep that name (Similarity)
Well honestly I think what you are saying is really needed for the future
Ok one step at a time makes sense.. so it means that fieldName parameters remain, although the Similarity object is created per given field, well, ok, another day...
Similarity would need to be able to 'setup' a query (e.g. things like IDF, building score caches for the query, whatever), and then also score an individual document.
Interesting...
(flexible-scoring and bulk-postings works are still unknowns to me.)
So Similarity is not only per field but also per query/scorer..
and Query would have an abstract method getSimilarityProvider(fieldName) which would be implemented by each concrete query, neatly separating finding matches from scores computation, and allowing more extendable scoring. Nice.
Also, perhaps what seems to be like an inflation of Similarity objects (per query per field) is one more good reason to keep the field name params for now.
Is that too bad?
well my concern about the deprecated methods is we get into the hairy backwards compat situation...
we already had issues with this with Similarity.
It might be ok to essentially fix Similarity to be the way we want for 4.0 (break it) since its an expert API anyway.
This patch was just a quick stab...
I definitely agree with you about the name though, i prefer Similarity.
should Sim be aware of for which field it was created, so that no need to pass it as parameter in its methods in case this is ever important?
Well honestly I think what you are saying is really needed for the future... but I would prefer to actually delay that until a future patch
Making an optimized TermScorer is becoming more and more complicated, see the one in the bulkpostings branch for example. Because of this,
its extremely tricky to customize the scoring with good performance. I think the score caching etc in term scorer needs to be moved out of TermScorer,
instead the responsibility of calculating the score should reside in Similarity, including any caching it needs to do (which is really impl dependent).
Basically Similarity needs to be responsible for score(), but let TermScorer etc deal with enumerating postings etc.
For example, we now have the stats totalTermFreq/totalCollectionFreq by field for a term, but you can't e.g. take these and make a
Language-modelling based scorer, which you should be able to do right now, except for limitations in our APIs.
So in a future issue I would like to propose a patch to do just this, so that TermScorer, for example is more general. Similarity would need to be able
to 'setup' a query (e.g. things like IDF, building score caches for the query, whatever), and then also score an individual document.
In the flexible scoring prototype this is what we did, but we went even further, where a Similarity is also responsible for 'setting up' a searcher, too.
So that means, its responsible for managing norm byte[] (in that patch, you only had a byte[] norms, if you made it in your Similarity yourself).
I think long term that approach is definitely really interesting, but I think we can go ahead and make scoring a lot more flexible in tiny steps
like this without rewriting all of lucene in one enormous patch... and this is safer as we can benchmark performance each step of the way.
I like this flexibility - allowing different scoring parameters for different fields.
Wondering though - would it be nicer to keep original name Similarity and just change the way it is created and used to be field specific...?
For this, should deprecate Similarity methods taking a field/name param and introduce new ones without it - hmm.. the class would then have 3 deprecated methods:
- scorePayload()
- lengthNorm() (which anyway is already deprecated)
- computeNorm()
Is that too bad?
For FieldSimilarity (and for the new method in Similarity if deciding to keep that name after all) can we get rid of the first parameter fieldName from computeNorm() and scorePayload(), or are they used in a way I am missing even when one knows that the similarity is for certain (or all) field?
In the new class FS (if kept) I think the deprecated lengthNorm() can be removed?
Also, since Similarity (or FS) is becoming field aware, perhaps add a method getFieldName() - it can return null or ANY in case there is no dedicated similarity for that field... Hmm... If the logic employed is this "For fields F1, F2 return Sim1, for F3, F4 return Sim2, for all other fields return Sim3, then it is less simple, still, the similarity provider called with a field name can pass it to the Similarity object created and that in turn would be able to return it... just thinking out loud here... should Sim be aware of for which field it was created, so that no need to pass it as parameter in its methods in case this is ever important?
I wonder why SimilarityProvider is an interface while it could be an abstract class
This is an expert API and we already agreed we can "use interfaces without fear".
Nice work robert, while I only looked briefly at this patch but I wonder why SimilarityProvider is an interface while it could be an abstract class. Adding methods later to an abstract class is much easier than to an interface though. I hope I can have a closer look at this patch soon!
also, i removed Similarity.get/setDefault in this patch.
Instead we have a private @lucene.internal method (in IndexSearcher, for lack of a better place).
This is only needed for things like Instantiated, MemoryIndex, etc.
In general things shouldn't rely upon some global static if we want to make scoring more flexible.
attached is a patch to enable per-field similarity.
it includes the prerequisite patch to LUCENE-2869, which i separately plan to commit soon (then update my local).
the patch creates two new classes:
- SimilarityProvider: contains the field-independent methods like coord(). also has a get(String field) that returns a FieldSimilarity
- FieldSimilarity: all the field-specific stuff like tf, etc.
the old Similarity is deprecated, it is a FieldSimilarity and also a SimilarityProvider, that always returns itself.
Maybe it could be removed since i tried to cutover all code, I didnt check.
Removing it would allow us to remove the redundant field arguments from the FieldSimilarity methods easily.
Additionally Scorer's "Similarity" is removed. This is useless and most scorers ignored it or populated it with garbage.
Scorers themselves, know if they need a sim or not (e.g. BooleanScorer uses only the top-level SimilarityProvider for coord).
all lucene/contrib/solr tests pass, i also added a simple unit test.
This issue was moved to GitHub issue: #3312.