Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I was curious what the performance would be, because it might be a useful option to use packed ints for norms if you have lots of fields and still want good scoring:

      Today the smallest norm per-field-per-doc you can use is a single byte, and if you have f fields with norms enabled and n docs, it uses f * n bytes of space in RAM.

      Especially if you aren't using index-time boosting (or even if you are, but not with ridiculous values), this could be wasting a ton of RAM.

      But then I noticed there was no clean way to allow you to do this in your Similarity: it's a trivial patch.
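
      For a rough sense of the arithmetic (an illustrative sketch only, with made-up numbers; not part of the patch): one byte per field per doc costs f * n bytes, while packed ints only need enough bits per value to cover the largest norm stored in each field.

      // Hypothetical sizing sketch -- all numbers below are invented for illustration.
      public class NormRamEstimate {
        public static void main(String[] args) {
          long docs = 10_000_000L;   // n
          int fields = 20;           // f, all with norms enabled
          long byteNorms = docs * fields;  // today: 1 byte per field per doc = f * n bytes
          int bitsPerValue = 5;      // packed ints: e.g. if the largest value in a field fits in 5 bits
          long packedNorms = (docs * fields * (long) bitsPerValue + 7) / 8;  // ignoring per-field overhead
          System.out.println(byteNorms + " bytes as single bytes vs ~" + packedNorms + " bytes packed");
        }
      }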


          Activity

          Robert Muir added a comment -

          I tried this on that geonames database since my default indexing (just shoving everything in as a TextField)
          creates a huge .nrm file today (150MB: 8M docs * 19 fields). Just as a test I tried a simple similarity
          implementation that uses

          @Override
          public void computeNorm(FieldInvertState state, Norm norm) {
            norm.setPackedLong(state.getLength());
          }
          
          -rw-rw-r--  1 rmuir rmuir  49339454 Nov  5 22:30 _7e_nrm.cfs
          

          If you want to use boosts too, you would have to be careful how you encode, but I think this can be useful.

          In this case it's 1/3 of the RAM, even though document lengths are exact vs. lossy (though most fields are
          shortish, some are huge, like alternate names fields for major countries and cities, which have basically every
          language imaginable shoved in the field: that's why it doesn't save more, I think)
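
          (For reference, the comment's own numbers work out to roughly a third: 8,000,000 docs * 19 fields * 1 byte = 152,000,000 bytes of byte-per-field norms, versus the 49,339,454-byte packed file, i.e. about 0.32 of the original size.)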

          Simon Willnauer added a comment -

          +1 - should we also document that we don't have similarities that can make use of it at this point?

          Michael McCandless added a comment -

          +1, very cool!

          Robert Muir added a comment -

          I don't understand the question Simon: all the ones we provide happen to use Norm.setByte

          I don't think we need to add documentation to Norm.setFloat/Norm.setDouble saying that we don't
          provide any similarities that call these methods: that's not important to anybody.

          Simon Willnauer added a comment -

          I don't understand the question Simon: all the ones we provide happen to use Norm.setByte

          Just to clarify. Currently if we write packed ints and a similarity calls Source#getArray you get an UOE. I think we should document that our current impls won't handle this.

          Robert Muir added a comment -

          I don't see how it's relevant. Issues will happen if you use Norm.setFloat (as they expect a byte).

          I'm not going to confuse the documentation. The built-in Similarities at query-time
          depend upon their index-time norm implementation: this is documented extensively everywhere!

          Simon Willnauer added a comment -

          Fair enough. I just wanted to mention it.

          Robert Muir added a comment -

          If someone changes their similarity to use a different norm type at index-time than at query-time,
          then he or she is an idiot!

          Robert Muir added a comment -

          I plan to revert this for 4.1 to contain the amount of backwards compatibility code we need to implement for LUCENE-4547.

          If someone uses this functionality in its current form, they will easily hit the LUCENE-4547 bug.

          I implemented this more efficiently with the new APIs in the lucene4547 branch anyway: when it would save RAM, and the # of values is small, it dereferences the unique values and packs ords. This is typically the case with our smallfloat encoding.
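
          (Illustrative only, not the lucene4547 branch code: a standalone sketch of the "dereference the unique values and pack ords" idea described above. All values and names here are made up.)

          import java.util.*;

          // With the SmallFloat encoding there are usually only a handful of distinct
          // norm bytes, so each document can store a small ordinal into a value table
          // instead of a full byte.
          public class DerefNormsSketch {
            public static void main(String[] args) {
              byte[] norms = {120, 124, 120, 118, 124, 120};  // per-doc norm bytes, few distinct values

              // collect the distinct values and see how many bits an ordinal needs
              SortedSet<Byte> unique = new TreeSet<>();
              for (byte b : norms) unique.add(b);
              int bitsPerOrd = Math.max(1, 32 - Integer.numberOfLeadingZeros(unique.size() - 1));

              long packedBytes = ((long) norms.length * bitsPerOrd + 7) / 8;
              System.out.println(unique.size() + " unique values -> " + bitsPerOrd
                  + " bit(s) per doc plus a " + unique.size() + "-entry table (~" + packedBytes + " bytes packed)");
            }
          }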

          Commit Tag Bot added a comment -

          [trunk commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1432096

          LUCENE-4540: revert

          Robert Muir added a comment -

          I backed this out of 4.1. When LUCENE-4547 lands, we can resolve it with that implementation.

          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1432100

          LUCENE-4540: revert

          Commit Tag Bot added a comment -

          [branch_4x commit] Robert Muir
          http://svn.apache.org/viewvc?view=revision&revision=1406433

          LUCENE-4540: allow packed ints norms

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee:
              Robert Muir
            • Reporter:
              Robert Muir
            • Votes:
              0
            • Watchers:
              2
