Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

    The static span and resolution of the 8-bit norms codec may not fit all applications.

    My use case requires that 100f-250f be discretized into 60 bags instead of the default (10?).
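    For illustration, the remapping requested here amounts to a custom byte-to-float table. The following is a minimal standalone sketch of that bucketing math; the class and method names are hypothetical, not Lucene APIs, and in a real index this logic would sit behind Similarity's norm encode/decode hooks discussed in the comments.

```java
// Hypothetical sketch: discretize the range 100f..250f into 60 evenly
// spaced buckets, instead of the default 8-bit norms table. The class
// and method names are made up for illustration.
public class NormBuckets {
    static final float MIN = 100f, MAX = 250f;
    static final int BUCKETS = 60;

    // float -> byte: clamp into range, then pick the nearest bucket
    static byte encode(float f) {
        float clamped = Math.max(MIN, Math.min(MAX, f));
        int bucket = Math.round((clamped - MIN) / (MAX - MIN) * (BUCKETS - 1));
        return (byte) bucket;
    }

    // byte -> float: map the bucket index back to a point on the grid
    static float decode(byte b) {
        return MIN + (b & 0xFF) * (MAX - MIN) / (BUCKETS - 1);
    }

    public static void main(String[] args) {
        // a value round-trips to within one bucket width (150/59, ~2.54)
        System.out.println(decode(encode(175f)));
    }
}
```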

      1. Lucene-1260-2.patch
        30 kB
        Johan Kindgren
      2. Lucene-1260-1.patch
        25 kB
        Johan Kindgren
      3. LUCENE-1260.txt
        5 kB
        Karl Wettin
      4. LUCENE-1260.txt
        12 kB
        Karl Wettin
      5. LUCENE-1260.txt
        23 kB
        Karl Wettin
      6. Lucene-1260.patch
        22 kB
        Johan Kindgren
      7. LUCENE-1260_defaultsim.patch
        4 kB
        Robert Muir

        Activity

        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Robert Muir added a comment -

        (Updating fix-version correctly, also).

        I think it's safe to mark this resolved... the issues are totally cleared up in 4.0,
        and only some (documented) corner cases remain in 3.x where we still use the default sim.

        Yonik Seeley added a comment -

        I think we need to stop faking norms, independent of whether/when we cutover to CSF to store norms / index stats?

        +1, it was only intended to be a short-term thing for back compat (see way back to LUCENE-448)

        Michael McCandless added a comment -

        I think we need to stop faking norms, independent of whether/when we cutover to CSF to store norms / index stats?

        Ie the two issues are orthogonal (and both are important!).

        Robert Muir added a comment -

        hmm, not sure if I understand this correctly. How values are encoded/decoded depends on the DocValues implementation, which can be customized since it is exposed via the codec. That means that users of the API always operate on float, and the encoding and decoding happens inside the codec and per field. So encode/decode in Sim would be obsolete, right?

        the issues remaining here involve mostly "fake norms", for the omitNorms case (also "empty norms", I think).
        So, the stuff I listed must be fixed regardless, to clean up the fake norms case; it does not matter if "real norms" are encoded with CSF or not.

        Doing things like cleaning up how we deal with fake norms, and removing Similarity.get/setDefault, is completely unrelated to DocValues... it's just stuff we must fix.

        As long as we have these statics like Similarity.get/setDefault, it's not even useful to think about things like flexible scoring or per-field Similarity...!

        Simon Willnauer added a comment -

        So, you would have the same problem with DocValues!

        hmm, not sure if I understand this correctly. How values are encoded/decoded depends on the DocValues implementation, which can be customized since it is exposed via the codec. That means that users of the API always operate on float, and the encoding and decoding happens inside the codec and per field. So encode/decode in Sim would be obsolete, right?

        Robert Muir added a comment -

        I didn't follow the entire thread here, but is it worth all the effort Robert is suggesting, or should we simply land the docvalues branch and make norms a DocValues field? The infrastructure is already there, it's integrated into the codec, and it gives users the freedom to use any Type they want.

        Simon, the problem is that encode/decode is in Similarity (instead of somewhere else).

        So, you would have the same problem with DocValues!

        Simon Willnauer added a comment -

        For trunk, here is what i suggest:

        I didn't follow the entire thread here, but is it worth all the effort Robert is suggesting, or should we simply land the docvalues branch and make norms a DocValues field? The infrastructure is already there, it's integrated into the codec, and it gives users the freedom to use any Type they want.

        Robert Muir added a comment -

        Is there no way to remove this stupid static default and deprecate Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for the case of NormsWriter?

        I think this is totally what we should try to do in trunk, especially after LUCENE-2846.

        In this case, I want to fix the issue in a backwards-compatible way for Lucene 3.x.
        The warning is a little crazy, I know; really, people shouldn't rely upon their encoder being used for fake norms.
        But I think it's fair to document the corner case, just because it's not easily fixable in 3.x.

        For trunk, here is what I suggest:

        • LUCENE-2846: remove all uses of fake norms. We never fill fake norms anymore at all, once we fix this issue. If you have a non-atomic reader with two segments, and one has no norms, then the whole norms[] should be null. This is consistent with omitTF. So, for example, MultiNorms would never create fake norms.
        • LUCENE-2854: Mike is working on some issues, I think, where BooleanQuery uses this static or some other silliness with Similarity; I think we can clean that up there.
        • Finally, at this point, I would like to remove Similarity.getDefault/setDefault altogether. I would prefer instead that IndexSearcher has a single 'DefaultSimilarity' that is the default value if you don't provide one, and likewise with IndexWriterConfig.
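        The last point can be sketched abstractly. Sim and Searcher below are hypothetical stand-ins, not Lucene classes; they only show the shape of a per-instance default replacing a mutable global static.

```java
// Hypothetical stand-ins (not Lucene classes) illustrating the proposal:
// instead of a mutable global (Similarity.setDefault), each consumer holds
// its own Similarity-like instance and falls back to a private default.
interface Sim {
    float decodeNorm(byte b);
}

class DefaultSim implements Sim {
    public float decodeNorm(byte b) {
        return (b & 0xFF) / 255f; // toy decoding for the sketch
    }
}

class Searcher {
    private final Sim sim; // per-instance; no global static anywhere

    Searcher() { this(new DefaultSim()); } // default if none provided
    Searcher(Sim sim) { this.sim = sim; }  // caller-supplied override

    float norm(byte b) { return sim.decodeNorm(b); }
}
```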
        Uwe Schindler added a comment -

        Here's a patch for the general case, and it also adds a warning that you should set your similarity with Similarity.setDefault, especially if you omit norms.

        Is there no way to remove this stupid static default and deprecate Similarity.(g|s)etDefault()? Can we not use the Similarity from IndexWriter for the case of NormsWriter?

        Robert Muir added a comment -

        Here's a patch for the general case, and it also adds a warning
        that you should set your similarity with Similarity.setDefault, especially if you omit norms.

        We can backport this to 3.x

        The other cases involve fake norms, which I think we should completely remove in trunk
        with LUCENE-2846, then there is no longer an issue and we can remove the warning in trunk.

        Robert Muir added a comment -

        I think there are serious traps here, that if you supply Similarity to IWConfig etc rather than
        setting the global static Similarity.setDefault, your code will have no effect.

        The biggest offender can be seen in the patch:

              final float norm = docState.similarity.computeNorm(fieldInfo.name, fieldState);
        -      norms[upto] = Similarity.encodeNorm(norm);
        +      norms[upto] = Similarity.getDefault().encodeNormValue(norm);
        

        shouldn't that simply call docState.similarity.encodeNormValue?

        There are other problems with decode too.
        I think we need to review all places where we use the static Similarity.getDefault() carefully.

        Michael McCandless added a comment -

        Thanks Johan!

        Michael McCandless added a comment -

        Patch looks good! Thanks Johan. I'll commit in a day or two...

        Johan Kindgren added a comment -

        I've added the old static methods again, but made them deprecated.

        In contrib/misc there is still a reference to the static encodeNorm method, maybe that should be replaced with Similarity.getDefaultSimilarity().encodeNormValue(f)? This call to the static method is only done if no similarity is passed to the FieldNormModifier.

        I added a short javadoc description to the static methods, not sure if that is enough? (I guess they will be removed, so the relevant javadoc is probably in the instance methods?)

        Johan Kindgren added a comment -

        I haven't had time so far to create a new patch, and I will be away for the next couple of days. Feel free to modify my patch if you like to finish up this issue. 'decodeNormValue' sounds fine by me!
        Otherwise I hope I can come up with a patch by the end of the week (probably late Sunday).

        Michael McCandless added a comment -

        Johan are you working up a new patch here? (to fix the back compat issue)

        Michael McCandless added a comment -

        Would you like me to continue working with this, or do you already have suggestions for new names of the instance methods?

        Yes, please, could you turnaround a new patch?

        Hmm, naming is always the hardest part.... maybe decodeNormValue/encodeNormValue? normToByte/byteToNorm? getEncodedNorm/getDecodedNorm? Something else?

        Uwe Schindler added a comment - edited

        The 3.0 release will not break backwards compatibility for users that upgraded to 2.9.1 and got rid of deprecation warnings. The 3.0 release cycle will start at the weekend; most remaining issues are organizational ones, and the rest will be finished soon.

        I tend to add the deprecated static method and leave this as 3.1.

        Johan Kindgren added a comment -

        Haven't really thought about the back-compat question yet, but that's of course an important aspect. When is the 3.0 release planned? I noticed that there were a couple of issues still open, and that release will already break compatibility...
        Maybe this kind of change should be tested for a couple of weeks before bringing it to a release, if 3.0 is impending.

        Would you like me to continue working with this, or do you already have suggestions for new names of the instance methods?

        Michael McCandless added a comment -

        Performance looks good – I tested with query "1" on a 5M doc wikipedia index and any difference appears to be in the noise.

        But, the current patch breaks back-compat (eg ant test-tag -Dtestcase=TestNorms fails) – I think we have to put back the static methods (mark them deprecated), and then find new names for the instance methods?

        Johan Kindgren added a comment -

        Added 'final' modifier to the Similarity field where it was used.

        The norm-array in Similarity was already made 'final', so there's no change there. I think there could be further refactoring of the use of the Similarity instance, but that is perhaps out of the scope for this issue. I hope this will pass the performance-tests!

        Michael McCandless added a comment -

        Would you like me to create another patch with the above changes?

        Yes, please – then I'll run some basic perf tests. Thanks!

        Johan Kindgren added a comment -

        Regarding the performance of the TermScorer, there could be two things to handle to ensure that the JVM will inline the code:
        1. In the Scorer base class, make the field 'similarity' final. (Shouldn't be any problem since it's immutable?)
        2. In the Similarity, make the internal decoder array final. That's really up to the implementor, but the default implementations should perhaps use a final field. Also add a note in the javadoc about this?
        Would you like me to create another patch with the above changes? Maybe there could be other optimizations; I haven't really looked at optimizing the code yet.
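        As an aside, the final-field point can be illustrated outside Lucene. Decoder below is a made-up class; it only shows the pattern of a final lookup table that the JIT can treat as stable when inlining a hot decode call.

```java
// Made-up illustration of point 2 above: holding the lookup table in a
// final field lets the JIT treat the reference as stable, which helps a
// small method like decode() inline into hot scoring loops.
public class Decoder {
    private final float[] table; // final: never reassigned after construction

    public Decoder() {
        float[] t = new float[256];
        for (int i = 0; i < 256; i++) {
            t[i] = i / 255f; // toy decoding: bytes 0..255 -> floats 0..1
        }
        this.table = t;
    }

    public float decode(byte b) {
        return table[b & 0xFF];
    }
}
```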

        Michael McCandless added a comment -

        I think this is a reasonable change, but we probably should wait for 3.1 as long as 3.0 comes out soonish.

        Michael McCandless added a comment -

        Has anyone tested performance of this last patch? One thing that concerns me is this change to TermScorer:

        -    return norms == null ? raw : raw * SIM_NORM_DECODER[norms[doc] & 0xFF]; // normalize for field
        +    return norms == null ? raw : raw * getSimilarity().decodeNorm(norms[doc]); // normalize for field
        

        though it could easily be in practice that it doesn't matter.

        Karl Wettin added a comment -

        Hi Johan,

        didn't try it out yet but the patch looks nice and clean. +1 from me. Let's try to convince some of the old -1:ers.

        YONIK? See, it's not just me. ; )

        I do however still think it's nice with the serializable codec interface, as in the previous patches, in order for all applications to use the index as intended (Luke and whatnot): 256 bytes stored to a file and by default backed by a binary search or so, unless there is a registered codec that handles it algorithmically. I'll copy and paste that in as an alternative suggestion ASAP.

        (I think the next move should be to allow for per field variable norms resolution, but that is a whole new issue.)

        Johan Kindgren added a comment - edited

        Removed 'static' keyword to enable a pluggable behavior for encoding/decoding norms. Our business-case for this is to fix scoring when using NGrams. If a word is split into three parts, the norm for these parts would then become ~0.3125 (don't remember exactly) in the current implementation. A search for the exact same word would then generate a score of less than 1.0. With a pluggable norm-calculation, we could use a norm-table with values 0-100 and get a better scoring.

        Minor changes in 11 core-classes and some tests. Also minor changes in analyzers, instantiated, memory and miscellaneous.

        Karl Wettin added a comment -

        Wouldn't the simplest solution be to refactor out the static methods, replace them with instance methods and remove the getNormDecoder method? This would enable a pluggable behavior without introducing a new Codec.

        Hi Johan,

        feel free to post a patch!

        Johan Kindgren added a comment -

        Wouldn't the simplest solution be to refactor out the static methods, replace them with instance methods and remove the getNormDecoder method? This would enable a pluggable behavior without introducing a new Codec.
        Would cause minor changes to 11 classes in the core, and would also clean up the code from static stuff.

        As described in LUCENE-1261.

        Karl Wettin added a comment -

        The file is just something secondary I added on "request", personally I use a hardcoded codec. All it does is to allow a simple way in to change the current static norm translation table.

        Yonik Seeley added a comment -

        This solves a particular usecase nicely, but is it really generic enough and durable enough to put in core?
        This essentially adds a new file into the index, but it's not really part of the index. It wouldn't work with any possible upcoming similarity-per-field to give different NormCodecs per field, and it requires the user to handle their own management of the file (using lucene addIndexes to copy from one place to another won't grab this file, etc).

        Karl Wettin added a comment -

        I'd like to see this committed in 2.4, but I don't have core access.

        Karl Wettin added a comment -

        I think I've taken this as far as it can go without refactoring it out of the static scope.

        Karl Wettin added a comment -

        New patch additionally includes:

        • Lots of javadocs with warnings
        • Similarity#readNormCodec(Directory):NodeCodec
        • Similarity#writeNormCodec(Directory, NodeCodec)
        Karl Wettin added a comment -

        This is a retroactive ASL blessing of the patch posted 11/Apr/08 06:01 AM

        Karl Wettin added a comment -

        Fixed some typos and added some tests. Perhaps it needs new javadocs too?

        Karl Wettin added a comment -

        I notice there is a typo in the patch. And there is no test case for SimpleNormCodec. I'll come up with that too.

        Karl Wettin added a comment -

        1) "norms" is a vague term. currently "lengthNorm" is folded in with "field boosts" and "doc boosts" to form a generic "fieldNorm" ... I assumed you were interested in a more general way to improve the resolution of "fieldNorm"

        I still am but mainly because it is the simplest and only way to get better document boost resolution at the moment.

        Hoss Man added a comment -

        My use case is really about document boost and not normalization.

        So another solution to this is to introduce a (variable bit sized?) document boost file and completely separate it from the norms instead...

        1) "norms" is a vague term. currently "lengthNorm" is folded in with "field boosts" and "doc boosts" to form a generic "fieldNorm" ... I assumed you were interested in a more general way to improve the resolution of "fieldNorm"

        2) your description of general purpose variable sized document boosting sounds exactly like LUCENE-1231 ... in the long run utilities using LUCENE-1231 (or something like it) to replace "field boosts" and "length norms" might make the most sense as a way to eliminate the current static Norm encoding and put more flexibility in the hands of users

        Karl Wettin added a comment -

        As long as the norm remains a fixed size (1 byte) then it doesn't really matter whether it's tied to Similarity's or the store itself - it would be nice if the Index could tell you which normDecoder to use, but it's not any more unreasonable to expect the application to keep track of this (if it's not the default encoding) since applications already have to keep track of things like which Analyzer is "compatible" with querying this index.

        If we want norms to be more flexible, so that apps can pick not only the encoding but also the size... then things get more interesting, but it's still feasible to say "if you customize this, you have to make your reading apps and your writing apps smart enough to know about your customization."

        I like the idea of an index that is completely self-aware of norm encoding, what payloads mean, etc.

        I also want to move it to the instance scope so I can have multiple indices with unique norm span/resolutions created from the same classloader.

        My use case is really about document boost and not normalization.

        So another solution to this is to introduce a (variable bit sized?) document boost file and completely separate it from the norms, instead of as now where normalization and document boost are baked together into the same thing. I think there would be no need to touch the norms encoding then, that the default resolution is good enough for /normalization/. It would fix several caveats with norms as I see it.

        Hoss Man added a comment -

        I haven't thought too much about it yet, but it seems to me that norm codec has more to do with the physical store (Directory) than Similarity and should perhaps be moved there instead?

        As long as the norm remains a fixed size (1 byte) then it doesn't really matter whether it's tied to Similarity's or the store itself – it would be nice if the Index could tell you which normDecoder to use, but it's not any more unreasonable to expect the application to keep track of this (if it's not the default encoding) since applications already have to keep track of things like which Analyzer is "compatible" with querying this index.

        If we want norms to be more flexible, so that apps can pick not only the encoding but also the size... then things get more interesting, but it's still feasible to say "if you customize this, you have to make your reading apps and your writing apps smart enough to know about your customization."

        I also want to move it to the instance scope so I can have multiple indices with unique norm span/resolutions created from the same classloader.

        I agree, it's a good goal.
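        [Editorial note: for context, the fixed one-byte norm encoding discussed here is Lucene's "315" SmallFloat scheme, which packs a float into 3 mantissa bits and 5 exponent bits with a zero-exponent point of 15. The sketch below is reconstructed from memory of that scheme and may differ in detail from the code actually shipped in Similarity/SmallFloat:]

```java
// Sketch of a 3-mantissa-bit / 5-exponent-bit one-byte float codec,
// in the style of Lucene's SmallFloat "315" encoding. Reconstructed
// for illustration; not copied from the Lucene source.
public class SmallFloat {

    /** Encode a float into a single byte: keep the top 3 mantissa bits
     *  and rebias the exponent so that byte 0 means zero. */
    public static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3); // drop all but 3 mantissa bits
        if (smallfloat <= ((63 - 15) << 3)) {
            // too small to represent: underflow to 0, or clamp up to 1
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            // too large to represent: clamp to the maximum byte value
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    /** Decode a byte back into the (lossy) float it represents. */
    public static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3); // restore mantissa position
        bits += (63 - 15) << 24;           // restore exponent bias
        return Float.intBitsToFloat(bits);
    }
}
```

With only 256 representable values spread over a huge dynamic range, neighboring encodable norms can differ by over 10%, which is exactly the resolution complaint driving this issue.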

        Karl Wettin added a comment -

        I suppose it would be possible to implement a NormCodec that would listen to encodeNorm(float) while one is creating a subset of the index in order to find all norm resolution sweet spots for that corpus using some appropriate algorithm. Mean shift?

        Perhaps it even would be possible to compress it down to n bags from the start and then allow for it to grow in case new documents with other norm requirements are added to the store.

        I haven't thought too much about it yet, but it seems to me that norm codec has more to do with the physical store (Directory) than Similarity and should perhaps be moved there instead? I have no idea how, but I also want to move it to the instance scope so I can have multiple indices with unique norm span/resolutions created from the same classloader.
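        [Editorial note: the "n bags" idea maps directly onto the use case in the issue description (100f-250f discretized into 60 bags). The class below is a purely illustrative uniform quantizer; the name UniformNormCodec and its methods are hypothetical and not part of any attached patch:]

```java
// Illustrative uniform "n bags" codec: clamp a float to [min, max] and
// discretize it into a fixed number of evenly spaced values, one byte each.
public class UniformNormCodec {
    private final float min, max;
    private final int bags; // number of discrete values; must be <= 256

    public UniformNormCodec(float min, float max, int bags) {
        this.min = min;
        this.max = max;
        this.bags = bags;
    }

    /** Clamp to the configured range, then round to the nearest bag index. */
    public byte encodeNorm(float f) {
        float clamped = Math.min(Math.max(f, min), max);
        int bag = Math.round((clamped - min) / (max - min) * (bags - 1));
        return (byte) bag;
    }

    /** Map a bag index back to its representative float value. */
    public float decodeNorm(byte b) {
        return min + (b & 0xFF) * (max - min) / (bags - 1);
    }
}
```

Growing the codec later, as suggested above, would mean re-encoding existing norms whenever the range or bag count changes, which is one reason a self-describing index is attractive.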

        Karl Wettin added a comment -
        • Similarity#getNormCodec()
        • Similarity#setNormCodec(NormCodec)
        • Similarity$NormCodec
        • Similarity$DefaultNormCodec
        • Similarity$SimpleNormCodec (binsearches over a sorted float[])

        I also deprecated Similarity#getNormsTable() and replaced the only use I could find of it - in TermScorer. Could not spot any problems with performance or anything with that.
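        [Editorial note: the attached patch is not reproduced here, so the following is only a guess at what "binsearches over a sorted float[]" could look like. The class and method names follow the list above, but the body is illustrative, not the patch's actual code:]

```java
import java.util.Arrays;

// Hypothetical sketch of the SimpleNormCodec idea: encode a norm as the
// index of the nearest entry in a sorted table of floats (max 256 entries
// for a one-byte norm), decode by table lookup.
public class SimpleNormCodec {
    private final float[] table; // sorted ascending, at most 256 entries

    public SimpleNormCodec(float[] sortedValues) {
        if (sortedValues.length > 256) {
            throw new IllegalArgumentException("a one-byte norm allows at most 256 values");
        }
        this.table = sortedValues;
    }

    /** Binary-search the table and return the index of the closest value. */
    public byte encodeNorm(float f) {
        int idx = Arrays.binarySearch(table, f);
        if (idx < 0) {
            int insertion = -idx - 1; // first entry greater than f
            if (insertion == 0) {
                idx = 0;
            } else if (insertion == table.length) {
                idx = table.length - 1;
            } else {
                // pick whichever neighbor is closer to f
                float lo = table[insertion - 1];
                float hi = table[insertion];
                idx = (f - lo <= hi - f) ? insertion - 1 : insertion;
            }
        }
        return (byte) idx;
    }

    public float decodeNorm(byte b) {
        return table[b & 0xFF];
    }
}
```

Decoding is an array lookup just like the default codec's 256-entry norm table, so replacing the static table should not change the hot path in TermScorer, which matches the observation above.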


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Karl Wettin
          • Votes:
            7
            Watchers:
            2
