Lucene - Core
LUCENE-1360

A Similarity class which has unique length norms for numTerms <= 10

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      A Similarity class which extends DefaultSimilarity and simply overrides lengthNorm. lengthNorm is implemented as a lookup for numTerms <= 10, else as 1/sqrt(numTerms). This avoids term counts below 11 sharing the same lengthNorm once the norm is stored as a single byte in the index.

      This is useful if your search is only on short fields such as titles or product descriptions.
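      The idea can be sketched in a few lines (a sketch only: the lookup values below are borrowed from the FloatEncode generator further down this thread, not read from the attached ShortFieldNormSimilarity.java):

```java
// Sketch of a DefaultSimilarity-style lengthNorm override: distinct norms for
// fields of 1-10 terms, the usual 1/sqrt(numTerms) beyond that.
// NOTE: the table values here are illustrative, not necessarily the attachment's.
public class ShortFieldNormSketch {
    // Index 0 is unused; entries 1..10 are strictly decreasing and all
    // distinct, so short fields of different lengths keep different norms
    // even after the norm is quantized to a single byte.
    private static final float[] NORM_TABLE = {
        0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
        0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f
    };

    public static float lengthNorm(int numTerms) {
        if (numTerms >= 1 && numTerms <= 10) {
            return NORM_TABLE[numTerms];
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

      The table values are deliberately spaced far enough apart that each short length survives the one-byte norm encoding as a distinct value.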

      See mailing list discussion: http://www.nabble.com/How-to-boost-the-score-higher-in-case-user-query-matches-entire-field-value-than-just-some-words-within-a-field-td19079221.html

      1. ShortFieldNormSimilarity.java
        2 kB
        Sean Timm
      2. LUCENE-1380 visualization.pdf
        9 kB
        Lance Norskog
      3. LUCENE-1360.patch
        4 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        The only issue I have with floatToByte52 is that it is a 'trap', so to speak:
        if you use it on a too-long field (or maybe a too-small boost), you end
        up with a norm of zero.

        In my opinion, the whole purpose of per-field support is that you don't
        have to make these sorts of tradeoffs, but I imagine someone could
        use an inappropriate similarity/schema at some point (misconfiguration).

        To degrade better in this case, I suggest the following change, which decodes 0-byte norms
        as if they were 1-byte, so that scores won't be zeroed in the misconfiguration case:

        change:

          static {
            for (int i = 0; i < 256; i++)
              NORM_TABLE[i] = SmallFloat.byte52ToFloat((byte)i);
          }
        

        to:

          static {
            NORM_TABLE[0] = SmallFloat.byte52ToFloat((byte)1);
            for (int i = 1; i < 256; i++)
              NORM_TABLE[i] = SmallFloat.byte52ToFloat((byte)i);
          }
        
        Lance Norskog added a comment -

        Cool! Looks great.

        Robert Muir added a comment -

        Lance, here's a patch with the similarity I suggested, for Lucene's contrib, with a unit test.

        Then, I think as I mentioned earlier (also on LUCENE-2236), we should create a Solr
        issue to make per-field similarity more declarative and add an example short field type.

        Robert Muir added a comment -

        In my opinion, the best thing to do would be to open an issue
        for better per-field Similarity integration into the Solr schema.

        Currently you can pass a SimProvider to the 'global' SimilarityFactory for the entire schema.
        In this Java code you would have to e.g. build a hash set with "smallfield1", "smallfield2", "smallfield3",
        and return SmallFloatSimilarity for those fields.

        Instead, it would be better if the FieldType (not sure this is even the best place)
        could simply declare similarity=SmallFloatSimilarity or the like, so that the specification is more declarative.

        Then Solr could have an example 'short field type' FieldType in the example schema
        (with the tradeoff that floatToByte52 maxes out at 1984, so don't use it for large fields or big boosts).

        This way, people could make their metadata fields of this small type, but their large document fields
        would still use the ordinary text type (e.g. sites like HathiTrust with some enormous fields), and everything in
        their application would work; they would just get quantization that makes sense for each field.
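        A minimal sketch of the hash-set approach just described (the SimProvider/SimilarityFactory types are stubbed with plain Java here; the real Lucene/Solr interfaces and signatures differ):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal stand-ins for the real Similarity classes, just to show the shape
// of a per-field provider backed by a hash set of short-field names.
public class PerFieldSimSketch {
    interface Sim { String name(); }
    static final Sim DEFAULT = () -> "DefaultSimilarity";
    static final Sim SMALL = () -> "SmallFloatSimilarity";

    private final Set<String> shortFields = new HashSet<>();

    public PerFieldSimSketch(String... fields) {
        for (String f : fields) shortFields.add(f);
    }

    // Analogous to a per-field SimilarityProvider lookup: short fields get
    // the small-field similarity, everything else the default.
    public Sim get(String field) {
        return shortFields.contains(field) ? SMALL : DEFAULT;
    }
}
```

        A declarative FieldType attribute, as suggested above, would replace this hard-coded set with schema configuration.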

        Lance Norskog added a comment -

        "Lance, this is a bit misleading. only lengths {3,4}, {6,7}, and {8,9,10} share the same values."

        I thought I got them all the same when I tested with Lucene 2.9, but ok.

        "For most uses, this isn't really that big of a deal that a few numbers quantize to the same bytes."

        The problem is then the shape of the curve by which field norms affect boosting.

        Sure, close this. My goal is to make Solr work smoothly in all environments.

        Lance

        Robert Muir added a comment -

        "Unfortunately, that value is packed in such a way that it gives the same value for 1-10 words in a field."

        Lance, this is a bit misleading. Only lengths {3,4}, {6,7}, and {8,9,10} share the same values.

        For most uses, this isn't really that big of a deal that a few numbers quantize to the same bytes.

        If you care about this, use SmallFloat.floatToByte52/byte52ToFloat as I suggested. Then they are all unique.

        You can also do this on a per-field basis now, e.g. only for your small fields... that's why I recommended we close this issue as obsolete.
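        The grouping above can be reproduced with a simplified quantizer that keeps the float's exponent plus its top stored mantissa bits (a sketch of the idea behind SmallFloat, not its exact implementation: the real floatToByte315/floatToByte52 also rebase the exponent and clamp under/overflow). byte315 stores 2 explicit mantissa bits (3 counting the implicit leading 1); byte52 stores 4 (5 counting it):

```java
public class NormQuantization {
    // Keep the IEEE-754 exponent plus the top `storedMantissaBits` bits of
    // the mantissa; everything shifted out is lost, which is where nearby
    // field lengths can collapse to the same encoded norm.
    static int quantize(float f, int storedMantissaBits) {
        return Float.floatToIntBits(f) >>> (23 - storedMantissaBits);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            float norm = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " -> byte315-style: " + quantize(norm, 2)
                    + ", byte52-style: " + quantize(norm, 4));
        }
    }
}
```

        With 2 stored mantissa bits, lengths {3,4}, {6,7}, and {8,9,10} each quantize to the same code; with 4 stored bits, all of 1..10 stay distinct.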

        Lance Norskog added a comment -

        The current default Solr configuration uses the standard academic formula for indexed field norms. Unfortunately, that value is packed in such a way that it gives the same value for 1-10 words in a field. This makes it useless with short fields like book & movie titles.

        Here's the high-level request: the Solr default configuration should supply field norms that work well with very short fields. We should not need to change the configuration at all.

        Robert Muir added a comment -

        Now that we have custom norm encoders, is this one obsolete?
        You can just use SmallFloat.floatToByte52 to encode/decode your norms.

        Lance Norskog added a comment - edited

        This is a graph of the standard norms against the results of this patch. The orange/red dots at the left are the elevated values for boosting short documents.

        Both displays show the norms after the 8-bit encode/decode process, rather than raw 1/x. Here is the code for the generator:

        import org.apache.lucene.util.SmallFloat;

        public class FloatEncode {
            private static float ARR[] = { 0.0f, 1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
                    0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f };

            public static void main(String[] args) {
                for (int i = 1; i < 100; i++) {
                    float f = 1.0f / i;
                    byte b = SmallFloat.floatToByte315(f);
                    float f2 = SmallFloat.byte315ToFloat(b);
                    float ff = f2;
                    if (i < ARR.length)
                        ff = ARR[i];
                    System.out.println(i + "," + f2 + "," + ff);
                }
            }
        }
        

        (Please pretend I named it LUCENE-1360 instead of LUCENE-1380.)

        Shalin Shekhar Mangar added a comment -

        I'm interested in this issue as well.

        Lance Norskog added a comment -

        Is this code still interesting? That is, are there newer tools in Lucene that handle this problem?

        I have found searching movie titles, product descriptions etc. difficult to manage really well. Mainstream text retrieval research & applied tech is very strongly biased towards bodies of text.


          People

          • Assignee:
            Otis Gospodnetic
            Reporter:
            Sean Timm
          • Votes:
            3
          • Watchers:
            2
