[SOLR-1599] Improve IDF and relevance by separately indexing different entity types sharing a common schema - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Schema and Analysis
Labels: None

Description

In Solr 1.4, the IDF (Inverse Document Frequency) is calculated over all of the documents in an index. This introduces relevance problems when a single schema is used to store multiple entity types, for example to support "search for tracks" and "search for artists". The ranking for a search on the name field of track entities will be more accurate, possibly much more so, if the IDF for the name field does not include counts from artist entities. The effect on ranking would be most pronounced for query terms that have a low document frequency for track entities but a high frequency for artist entities, or vice versa.

The current workaround for making the IDF entity-specific is to use a separate Solr core for each entity type sharing the schema, copying solrconfig.xml and schema.xml to every core. Maintaining a core for artists and a core for tracks on each node becomes more complicated with replication, and more so with sharding.

David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating numDocs after the application of filters. He recognises, however, that the document frequency (DF_t) for each query term in a track search would also need to exclude artist entities in order to get the correct IDF_t = log(N/DF_t). DF_t must be calculated at index time, when Solr does not know what filters will be applied.
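
To make the point about N and DF_t concrete, here is a small Python sketch using the simplified formula IDF_t = log(N/DF_t) from the paragraph above. All document counts are hypothetical, chosen only to illustrate a term that is rare among tracks but common among artists. Lucene's default similarity uses a smoothed variant of this formula, so real scores would differ in detail.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Simplified IDF from the description: log(N / DF_t)."""
    return math.log(num_docs / doc_freq)


# Hypothetical counts for one query term (assumptions, not measured data).
TRACK_DOCS, ARTIST_DOCS = 900_000, 100_000
TRACK_DF, ARTIST_DF = 100, 20_000  # rare among tracks, common among artists

# Whole-index statistics: N and DF_t both count every document.
idf_whole_index = idf(TRACK_DOCS + ARTIST_DOCS, TRACK_DF + ARTIST_DF)

# Adjusting only N: N counts tracks, but DF_t still includes artists.
idf_filtered_n_only = idf(TRACK_DOCS, TRACK_DF + ARTIST_DF)

# Entity-specific statistics: N and DF_t both count tracks only.
idf_tracks_only = idf(TRACK_DOCS, TRACK_DF)

for label, value in [
    ("whole index", idf_whole_index),
    ("filtered N only", idf_filtered_n_only),
    ("tracks only", idf_tracks_only),
]:
    print(f"{label:16s} {value:.3f}")
```

With these counts, correcting N alone leaves the IDF far below the tracks-only value, because the artist documents still dominate DF_t.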

I suggest having a metadata field, entitytype, specified when submitting a batch of documents. The schema would specify a list of allowed entity types and a default entity type. For example, a document could say either entitytype="track" or entitytype="artist". Each entity type has an independent set of document frequencies, so the term "foo" will have one DF for entitytype="artist" and a different DF for entitytype="track". This might be implemented by instantiating a separate Lucene index for each configured entity type. Filtering on entitytype="artist" would be implemented by searching only the artist index, analogous to searching only the artist core in the multi-core workaround.
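
The bookkeeping this proposal implies can be modelled in a few lines of Python. This is a toy sketch, not Solr or Lucene code: the class name EntityTypeStats and its methods are hypothetical, and it keeps per-type counters in memory rather than separate Lucene indices. It again uses the simplified log(N/DF_t) form.

```python
import math
from collections import Counter, defaultdict


class EntityTypeStats:
    """Toy model of per-entity-type document frequencies (hypothetical)."""

    def __init__(self, allowed_types, default_type):
        if default_type not in allowed_types:
            raise ValueError("default type must be one of the allowed types")
        self.allowed_types = set(allowed_types)
        self.default_type = default_type
        self.num_docs = Counter()             # entitytype -> N
        self.doc_freq = defaultdict(Counter)  # entitytype -> term -> DF_t

    def add(self, terms, entitytype=None):
        """Record one document's terms under its entity type."""
        entitytype = entitytype or self.default_type
        if entitytype not in self.allowed_types:
            raise ValueError(f"unknown entitytype: {entitytype!r}")
        self.num_docs[entitytype] += 1
        # set() so a term repeated within one document is counted once.
        self.doc_freq[entitytype].update(set(terms))

    def idf(self, term, entitytype):
        """log(N / DF_t) using only the statistics of one entity type."""
        df = self.doc_freq[entitytype][term]
        if df == 0:
            return 0.0  # unseen term: no contribution in this sketch
        return math.log(self.num_docs[entitytype] / df)


stats = EntityTypeStats(["track", "artist"], default_type="track")
stats.add(["foo", "fighters"], entitytype="artist")
stats.add(["foo"])  # falls back to the default type, "track"
stats.add(["bar"])
stats.add(["baz"])

print(stats.idf("foo", "track"), stats.idf("foo", "artist"))
```

The same term "foo" gets a different IDF for each entity type, because neither N nor DF_t crosses the entity-type boundary.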

With this solution (an entity type metadata field implemented with separate Lucene indices), a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of configuring, replicating and sharding a Solr core for every entity type.

Attachments

Issue Links

blocks

SOLR-1158 Scoring, "numDocs" should be number after applying filters, not entire index

Resolved

Activity

People

Assignee: Unassigned

Reporter: Graham P

Votes: 1

Watchers: 3

Dates

Created: 25/Nov/09 06:26

Updated: 07/Aug/15 11:42

Resolved: 07/Aug/15 11:42

Time Tracking

Estimated: 504h

Remaining: 504h

Logged: Not Specified