Issue Details (XML | Word | Printable)

Key: NUTCH-92
Type: Bug Bug
Status: Open Open
Priority: Major Major
Assignee: Andrzej Bialecki
Reporter: Andrzej Bialecki
Votes: 5
Watchers: 3
Operations

If you were logged in you would be able to see more operations.
Nutch

DistributedSearch incorrectly scores results

Created: 16/Sep/05 04:27 AM   Updated: 03/Feb/09 03:30 PM
Return to search
Component/s: searcher
Affects Version/s: 1.1
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works distributed-idf-v2.patch 2006-11-27 07:23 PM Doğacan Güney 18 kB
Text File Licensed for inclusion in ASF works distributed-idf.patch 2006-11-20 04:58 PM Doğacan Güney 18 kB


 Description  « Hide
When running search servers in a distributed setup, using DistributedSearch$Server and Client, total scores are incorrectly calculated. The symptoms are that scores differ depending on how segments are deployed to Servers, i.e. if there is uneven distribution of terms in segment indexes (due to segment size or content differences) then scores will differ depending on how many and which segments are deployed on a particular Server. This may lead to prioritizing of non-relevant results over more relevant ones.

The underlying reason for this is that each IndexSearcher (which uses local index on each Server) calculates scores based on the local IDFs of query terms, and not the global IDFs from all indexes together. This means that scores arriving from different Servers to the Client cannot be meaningfully compared, unless all indexes have similar distribution of Terms and similar numbers of documents in them. However, currently the Client mixes all scores together, sorts them by absolute values and picks top hits. These absolute values will change if segments are un-evenly deployed to Servers.

Currently the workaround is to deploy the same number of documents in segments per Server, and to ensure that segments contain well-randomized content so that term frequencies for common terms are very similar.

The solution proposed here (as a result of discussion between ab and cutting, patches are coming) is to calculate global IDFs prior to running the query, and pre-boost query Terms with these global IDFs. This will require one more RPC call per each query (this can be optimized later, e.g. through caching). Then the scores will become normalized according to the global IDFs, and Client will be able to meaningfully compare them. Scores will also become independent of the segment content or local number of documents per Server. This will involve at least the following changes:

  • change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This enables us to manipulate scores independently of local IDFs.
  • add a new method to Searcher interface, int[] getDocFreqs(Term[]), which will return document frequencies for query terms.
  • modify getSegmentNames() so that it returns also the total number of documents in each segment, or implement this as a separate method (this will be called once during segment init)
  • in DistributedSearch$Client.search() first make a call to servers to return local IDFs for the current query, and calculate global IDFs for each relevant Term in that query.
  • multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms

This solution should be applicable with only minor changes to all branches, but initially the patches will be relative to trunk/ .

Comments, suggestions and review are welcome!



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Otis Gospodnetic added a comment - 16/Sep/05 05:19 AM
I recall a discussion on lucene-dev list several (6+?) months back about this or very similar issue. Lucene's MultiSearcher has the same problem. Chuck lead the discussion and had some proposed solutions, if I recall correctly, but I don't think they ever made it into Lucene core.

I'm saying this because maybe this can be fixed on a lower (Lucene) level and benefit both Lucene and Nutch.


Doug Cutting added a comment - 16/Sep/05 05:20 AM
A minor detail:

In Searcher, instead of

int[] getDocFreqs(Term[]);

The new method will probably have to be something like

public int[] getDocFreqs(TermSet);

And TermSet can implement Writable, as Nutch can't serialize Lucene Terms.


Doug Cutting added a comment - 16/Sep/05 05:23 AM
Otis, I think this was actually comitted to Lucene, but the solution isn't quite appropriate for Nutch, which does not use Lucene's RMI-based RemoteSearchable, but instead has its own, leaner, RPC mechanism.

Otis Gospodnetic added a comment - 16/Sep/05 05:28 AM
Ah, you are right, I remember this getting in the core. As a matter of fact, it might have been me who committed it in the end. Re RMI vs. Nutch's RPC - right.

byron miller added a comment - 29/Dec/05 11:44 AM
Has there been any advancement on this front?

Doğacan Güney added a comment - 20/Nov/06 04:58 PM
I ended up doing somethings very differently than ab initially suggested.

Here is a breakdown:

  • change NutchSimilarity.idf(Term, Searcher) to always return 1.0f.
  • Add a new method getNumDocs to get total indexed docs per index server. (this will be called once during segment init)
  • add a new method to Searcher interface, MapWritable getDocFreqs(Query), which will return document frequencies for query terms.
  • This is the "very different" part. A simple int[] getDocFreqs(Term[]) doesn't work here because DistributedSearch$Client doesn't know which terms will be generated from the query. And it is not enough to run the query filters on client because each server may run different query filters too. So we send to query as it is, the servers each run their query filters on it, then return a map of form <Term, Frequency> (I ended up implementing WritableStringPair too, I didn't know how else to represent Terms).
  • After each server returns their individual docFreqs, we combine them in the client.
  • in DistributedSearch$Client.search() first make a call to servers to return local IDFs for the current query, and calculate global IDFs for each relevant Term in that query.
  • multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for all of its terms.

This patch ended up being rather larger than I initially concieved. Comments, suggestions, reviews, corrections are welcome.


Doğacan Güney added a comment - 27/Nov/06 07:22 PM
Here is my second attempt at this. Now DistributedSearch$Client keeps a mapping from addresses to numDocs, and in search(), computes total number of documents from live servers.

Andrzej Bialecki added a comment - 03/Feb/09 03:30 PM
Moving to 1.1 - needs further discussion, see also this thread: http://markmail.org/message/xyqdz3go6jwu4ozm