Lucene - Core
  1. Lucene - Core
  2. LUCENE-2846

omitTF is viral, but omitNorms is anti-viral.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      omitTF is viral. if you add document 1 with field "foo" as omitTF, then document 2 has field "foo" without omitTF, they are both treated as omitTF.

      but omitNorms is the opposite. if you have a million documents with field "foo" with omitNorms, then you add just one document without omitting norms,
      now you suddenly have a million 'real norms'.

      I think it would be good for omitNorms to be viral too, just for consistency, and also to prevent huge byte[]'s.
      but another option is to make omitTF anti-viral, which is more "schemaless" i guess.

      1. LUCENE-2846.patch
        58 kB
        Robert Muir
      2. LUCENE-2846.patch
        57 kB
        Robert Muir
      3. LUCENE-2846.patch
        57 kB
        Robert Muir
      4. LUCENE-2846.patch
        30 kB
        Robert Muir

        Activity

        Hide
        Michael McCandless added a comment -

        +1 for omitNorms to be viral.

        Show
        Michael McCandless added a comment - +1 for omitNorms to be viral.
        Hide
        Robert Muir added a comment -

        here's an initial patch hacked up by mike and I... also removed the "multireader norms" method that
        takes a byte[]+offset from IndexReader.

        one oddity is that MultiNorms.norms() always returns a filled byte[] here for non-atomic readers (never null).
        But i think this is ok for MultiNorms, its not used in searching (only for SlowMultiReaderWrapper etc)

        i think somehow it would be good to have more tests that test "doesnt have field" versus "omits norms",
        and also (likely not in this is issue) we should think about IR's norm-setting methods.

        I don't like that these use Similarity.getDefault(): it seems we could require you to pass in the Sim for the float case.
        I also don't like that we expose a public setNorm that takes a byte value either!

        Long-term we should look at pulling this norm-encoding stuff out of Sim... the Sim should just be dealing with floats,
        this encoding stuff belongs somewhere else.

        Show
        Robert Muir added a comment - here's an initial patch hacked up by mike and I... also removed the "multireader norms" method that takes a byte[]+offset from IndexReader. one oddity is that MultiNorms.norms() always returns a filled byte[] here for non-atomic readers (never null). But i think this is ok for MultiNorms, its not used in searching (only for SlowMultiReaderWrapper etc) i think somehow it would be good to have more tests that test "doesnt have field" versus "omits norms", and also (likely not in this is issue) we should think about IR's norm-setting methods. I don't like that these use Similarity.getDefault(): it seems we could require you to pass in the Sim for the float case. I also don't like that we expose a public setNorm that takes a byte value either! Long-term we should look at pulling this norm-encoding stuff out of Sim... the Sim should just be dealing with floats, this encoding stuff belongs somewhere else.
        Hide
        Robert Muir added a comment -

        an alternative to totally clear up the faking here that mike thought of:

        If we can somehow differentiate between omitNorms (null), and 'doesnt have field' (say, exception),
        we wouldn't need to fake. In multinorms we could then safely return null if any reader returns null,
        but throw an exception if all readers throw an exception.

        Show
        Robert Muir added a comment - an alternative to totally clear up the faking here that mike thought of: If we can somehow differentiate between omitNorms (null), and 'doesnt have field' (say, exception), we wouldn't need to fake. In multinorms we could then safely return null if any reader returns null, but throw an exception if all readers throw an exception.
        Hide
        Robert Muir added a comment -

        here's an updated patch:

        • The IR.setNorm(float) is also removed, forcing the user to use the correct similarity versus us using the wrong one (the static)
        • MultiNorms doesn't fake norms anymore, instead it handles the case of non-existent field versus omitted norms.
        • When a document doesnt have a field, its (undefined) norms are written as zero bytes instead of Similarity.getDefault().encodeNorm(1f).
        • All uses of Similarity.get/setDefault are now gone in lucene core, except for in IndexSearcher and IndexWriterConfig.
        Show
        Robert Muir added a comment - here's an updated patch: The IR.setNorm(float) is also removed, forcing the user to use the correct similarity versus us using the wrong one (the static) MultiNorms doesn't fake norms anymore, instead it handles the case of non-existent field versus omitted norms. When a document doesnt have a field, its (undefined) norms are written as zero bytes instead of Similarity.getDefault().encodeNorm(1f). All uses of Similarity.get/setDefault are now gone in lucene core, except for in IndexSearcher and IndexWriterConfig.
        Hide
        Robert Muir added a comment -

        sorry i had a piece of backwards logic in MultiNorms.

        of course all tests pass either way, which is why we need a good mixed-schema test (with RIW)
        for this issue before it can go in (no matter what we do)

        Show
        Robert Muir added a comment - sorry i had a piece of backwards logic in MultiNorms. of course all tests pass either way, which is why we need a good mixed-schema test (with RIW) for this issue before it can go in (no matter what we do)
        Hide
        Robert Muir added a comment -

        ok here's the final patch: with a reasonably good test and docs.

        I think this one is ready to commit: gets rid of fake norms and use of the static Similarity.getDefault()

        Show
        Robert Muir added a comment - ok here's the final patch: with a reasonably good test and docs. I think this one is ready to commit: gets rid of fake norms and use of the static Similarity.getDefault()
        Hide
        Robert Muir added a comment -

        For branch 3.x, i would like to deprecate the IndexReader.setNorm(float) based method, so its no surprise when its removed here.

        here was the changes entry (so i would insert some text like this in the deprecation):

        LUCENE-2846: Remove the deprecated IndexReader.setNorm(int, String, float).
          This method was only syntactic sugar for setNorm(int, String, byte), but
          using the global Similarity.getDefault().encodeNormValue.  Use the byte-based
          method instead to ensure that the norm is encoded with your Similarity.
        
        Show
        Robert Muir added a comment - For branch 3.x, i would like to deprecate the IndexReader.setNorm(float) based method, so its no surprise when its removed here. here was the changes entry (so i would insert some text like this in the deprecation): LUCENE-2846: Remove the deprecated IndexReader.setNorm(int, String, float). This method was only syntactic sugar for setNorm(int, String, byte), but using the global Similarity.getDefault().encodeNormValue. Use the byte-based method instead to ensure that the norm is encoded with your Similarity.
        Hide
        Grant Ingersoll added a comment -

        I get the comparison to omitTF, but this functionality has been around a long time. Why does it have to be all or nothing? Couldn't we investigate a sparse data structure to be used instead? We use the current dense approach when a high percentage contain norms and the sparse when less have that amount? I'm not sure what that data structure is just yet, but over in Mahout we have sparse and dense vectors and we have primitive collections that could be useful.

        Show
        Grant Ingersoll added a comment - I get the comparison to omitTF, but this functionality has been around a long time. Why does it have to be all or nothing? Couldn't we investigate a sparse data structure to be used instead? We use the current dense approach when a high percentage contain norms and the sparse when less have that amount? I'm not sure what that data structure is just yet, but over in Mahout we have sparse and dense vectors and we have primitive collections that could be useful.
        Hide
        Robert Muir added a comment -

        I get the comparison to omitTF, but this functionality has been around a long time.

        Can you explain what functionality you are losing with this patch?

        Show
        Robert Muir added a comment - I get the comparison to omitTF, but this functionality has been around a long time. Can you explain what functionality you are losing with this patch?
        Hide
        Michael McCandless added a comment -

        Couldn't we investigate a sparse data structure to be used instead?

        This would be very interesting to explore. The fact that norms are dense, on disk and in memory, can cause horrific problems like your index taking much much more disk space and RAM on optimize/big merge finishing.

        But I think that's orthogonal to the improvements here? Ie this issue removes fake norms, invalid uses of default Sim, etc.

        Also, I would worry about the lookup cost of sparse vectors in RAM – looking up the norm per doc is a severe hotspot on Lucene.

        Show
        Michael McCandless added a comment - Couldn't we investigate a sparse data structure to be used instead? This would be very interesting to explore. The fact that norms are dense, on disk and in memory, can cause horrific problems like your index taking much much more disk space and RAM on optimize/big merge finishing. But I think that's orthogonal to the improvements here? Ie this issue removes fake norms, invalid uses of default Sim, etc. Also, I would worry about the lookup cost of sparse vectors in RAM – looking up the norm per doc is a severe hotspot on Lucene.
        Hide
        Robert Muir added a comment -

        But I think that's orthogonal to the improvements here? Ie this issue removes fake norms, invalid uses of default Sim, etc.

        Right, after talking to Grant, I think his comments are on the larger issues of norms... I am totally not trying to address
        this here. This one is just about cleaning up things that are "buggy", especially using Similarity.getDefault when we shouldn't.

        The scope of this one (even though its just cleanup) is large enough I think... I'd like to proceed with this unless there are
        any objections. We can address real improvements to norms somewhere else, and hopefully this will help make that easier.

        Show
        Robert Muir added a comment - But I think that's orthogonal to the improvements here? Ie this issue removes fake norms, invalid uses of default Sim, etc. Right, after talking to Grant, I think his comments are on the larger issues of norms... I am totally not trying to address this here. This one is just about cleaning up things that are "buggy", especially using Similarity.getDefault when we shouldn't. The scope of this one (even though its just cleanup) is large enough I think... I'd like to proceed with this unless there are any objections. We can address real improvements to norms somewhere else, and hopefully this will help make that easier.
        Hide
        Robert Muir added a comment -

        Committed revision 1058367.

        I deprecated the dangerous setNorm(float) method in 3.x in revision 1058370,
        instead pointing at setNorm(byte) and using Similarity.encodeNormValue(),
        so you can ensure your Similarity is always used (not Similarity.getDefault)

        Show
        Robert Muir added a comment - Committed revision 1058367. I deprecated the dangerous setNorm(float) method in 3.x in revision 1058370, instead pointing at setNorm(byte) and using Similarity.encodeNormValue(), so you can ensure your Similarity is always used (not Similarity.getDefault)

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development