Solr / SOLR-1599

Improve IDF and relevance by separately indexing different entity types sharing a common schema

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • None
    • None
    • Component/s: Schema and Analysis
    • None

    Description

      In Solr 1.4, the IDF (Inverse Document Frequency) is calculated over all of the documents in an index. This introduces relevance problems when a single schema stores multiple entity types, for example to support "search for tracks" and "search for artists". Ranking for a search on the name field of track entities should be more accurate if the IDF for the name field does not include counts from artist entities. The effect on ranking would be most pronounced for query terms that have a low document frequency among track entities but a high document frequency among artist entities, or vice versa.

      The current workaround to make the IDF entity-specific is to use a separate Solr core for each entity type sharing the schema, which means copying solrconfig.xml and schema.xml to every core. This becomes more complicated with replication, and more so with sharding, because every node then has to maintain a core for artists and a core for tracks.

      David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed SOLR-1158, where he suggests calculating numDocs after the application of filters. He recognises, however, that the document frequency (DF_t) for each query term in a track search would also need to exclude artist entities in order to get the correct IDF_t = log(N/DF_t). DF_t must be calculated at index time, when Solr does not know what filters will be applied.
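
      To illustrate why adjusting numDocs alone is not enough, here is a minimal sketch using the IDF_t = log(N/DF_t) form quoted above. All counts are invented for illustration, and the variable names are hypothetical; Solr's own similarity implementation may use a slightly different formula.

```python
import math

def idf(num_docs, doc_freq):
    """IDF_t = log(N / DF_t), the form used in the description above."""
    return math.log(num_docs / doc_freq)

# Hypothetical counts for the term "foo" in the name field.
tracks_total, artists_total = 900_000, 100_000
df_tracks, df_artists = 90, 9_910   # rare among tracks, common among artists

n_all = tracks_total + artists_total
df_all = df_tracks + df_artists

# 1. Whole-index statistics (the Solr 1.4 behaviour described above).
idf_whole_index = idf(n_all, df_all)

# 2. N restricted to tracks, but DF_t still counted over every document.
idf_filtered_n_only = idf(tracks_total, df_all)

# 3. Both N and DF_t restricted to track entities.
idf_track_specific = idf(tracks_total, df_tracks)

print(f"whole index:     {idf_whole_index:.3f}")
print(f"filtered N only: {idf_filtered_n_only:.3f}")
print(f"track-specific:  {idf_track_specific:.3f}")
```

      With these made-up numbers, restricting only N moves the IDF slightly further from the track-specific value, because the artist-dominated DF_t is still in the denominator.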

      I suggest having a metadata field entitytype that is specified when submitting a batch of documents. The schema would specify a list of allowed entity types and a default entity type. For example, a document could say either entitytype="track" or entitytype="artist". Each entity type has an independent set of document frequencies, so the term "foo" will have one DF for entitytype="artist" and a different DF for entitytype="track". This might be implemented by instantiating a separate Lucene index for each configured entity type. Filtering on entitytype="artist" would be implemented by searching only the artist index, analogous to searching only the artist core in the multi-core workaround.
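
      The bookkeeping this proposal implies can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Solr or Lucene code: the class EntityPartitionedStats and its methods are hypothetical names, documents are plain strings split on whitespace, and a dictionary keyed by entity type stands in for the separate Lucene indices.

```python
import math
from collections import Counter, defaultdict

class EntityPartitionedStats:
    """Toy per-entity-type document statistics (hypothetical, not a Solr API)."""

    def __init__(self, allowed_types, default_type):
        if default_type not in allowed_types:
            raise ValueError("default type must be one of the allowed types")
        self.allowed_types = set(allowed_types)
        self.default_type = default_type
        self.num_docs = Counter()             # N per entity type
        self.doc_freq = defaultdict(Counter)  # DF_t per entity type

    def add_batch(self, docs, entitytype=None):
        """Index a batch of documents under a single entity type."""
        etype = entitytype or self.default_type
        if etype not in self.allowed_types:
            raise ValueError(f"unknown entity type: {etype!r}")
        for text in docs:
            self.num_docs[etype] += 1
            # Count each term once per document.
            for term in set(text.lower().split()):
                self.doc_freq[etype][term] += 1

    def idf(self, term, entitytype):
        """IDF_t = log(N / DF_t) using only this entity type's statistics."""
        df = self.doc_freq[entitytype][term]
        if df == 0:
            return None  # term never seen for this entity type
        return math.log(self.num_docs[entitytype] / df)

stats = EntityPartitionedStats({"track", "artist"}, default_type="track")
stats.add_batch(["foo fighters", "foo", "bar"], entitytype="artist")
stats.add_batch(["foo song", "bar tune", "baz", "qux"])  # default type: track

print(stats.idf("foo", "artist"))
print(stats.idf("foo", "track"))
```

      Here "foo" gets a lower IDF among artists (2 of 3 documents) than among tracks (1 of 4 documents), which is the per-entity behaviour the proposal is after.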

      With this solution (an entity type metadata field implemented with separate Lucene indices), a single Solr core can support many different entity types that share a common schema but use partially overlapping subsets of fields, instead of configuring, replicating and sharding a Solr core for every entity type.

            People

              Assignee: Unassigned
              Reporter: grahamp Graham P
              Votes: 1
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: 504h
                  Remaining: 504h
                  Logged: Not Specified