      Description

      This is a first step (nowhere near committable!), implementing the
      design iterated to in the recent "Baby steps towards making Lucene's
      scoring more flexible" java-dev thread.

      The idea is (if you turn it on for your Field; it's off by default) to
      store full stats in the index, in a new _X.sts file, recorded per doc,
      per field.

      And then have FieldSimilarityProvider impls that compute each doc's
      boost bytes (norms) from these stats.

      The patch is able to index the stats, merge them when segments are
      merged, and provides an iterator-only API. It also has a starting point
      for per-field Sims that use the stats iterator API to compute boost
      bytes. But it's not at all tied into actual searching! There's still
      tons left to do, e.g. how one configures via Field/FieldType which
      stats to index.

      All tests pass, and I added one new TestStats unit test.

      The stats I record now are:

      • field's boost
      • field's unique term count (a b c a a b --> 3)
      • field's total term count (a b c a a b --> 6)
      • total term count per-term (sum of total term count for all docs
        that have this term)

      Still need at least the total term count for each field.
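
      For illustration, here is one possible shape for a provider over these
      stats. This is a hedged sketch only: FieldStats, the iterator parameter,
      and both method names are invented here, not the patch's actual API.

        import java.util.Iterator;

        // Hypothetical sketch -- not the patch's real classes.
        class FieldStats {
          float boost;          // field's boost
          int uniqueTermCount;  // a b c a a b --> 3
          int totalTermCount;   // a b c a a b --> 6
        }

        abstract class FieldSimilarityProvider {
          // Walk the per-doc stats once (the iterator-only API) and precompute
          // whatever per-doc representation this sim wants, e.g. boost bytes.
          abstract void process(Iterator<FieldStats> statsByDoc);

          // Per-doc boost served to searching once processing is done.
          abstract float boost(int docID);
        }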

      Attachments

      1. ASF.LICENSE.NOT.GRANTED--LUCENE-2392.patch (70 kB, Michael McCandless)
      2. LUCENE-2392.patch (119 kB, Robert Muir)
      3. LUCENE-2392_take2.patch (103 kB, Robert Muir)
      4. LUCENE-2392.patch (121 kB, Robert Muir)
      5. LUCENE-2392.patch (248 kB, Robert Muir)

          Activity

          Robert Muir added a comment -

          Committed revision 1144158.

          Robert Muir added a comment -

          Attached is a patch, with this CHANGES entry:

          * LUCENE-2392: Decoupled vector space scoring from Query/Weight/Scorer. If you
            extended Similarity directly before, you should extend TFIDFSimilarity instead.
            Similarity is now a lower-level API to implement other scoring algorithms.
            See MIGRATE.txt for more details.
          

          I would like to commit this, and then proceed onward with issues such as LUCENE-3220 and LUCENE-3221.
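
          For anyone migrating: a hedged sketch of what the CHANGES entry means in
          practice, subclassing TFIDFSimilarity (via DefaultSimilarity here). The
          method signatures and package follow the 4.0-era API and may differ
          across versions; MIGRATE.txt is the authoritative reference.

            import org.apache.lucene.search.similarities.DefaultSimilarity;

            // Sketch: customize the vector-space hooks instead of extending
            // Similarity directly (which is now the lower-level API).
            public class MySimilarity extends DefaultSimilarity {
              @Override
              public float tf(float freq) {
                return (float) Math.pow(freq, 0.25); // dampen tf more than the default sqrt
              }

              @Override
              public float idf(long docFreq, long numDocs) {
                return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
              }
            }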

          Simon Willnauer added a comment -

          > I'd like to get this merged in as quickly as possible. I don't think the svn history is interesting, especially given
          > all the frustrations I am having with merging... The easiest way will be to commit a patch; I'll get everything in shape
          > and upload one soon, like, today.

          +1. Even if this is not entirely in shape, we can still iterate on trunk.

          Robert Muir added a comment -

          I think we need to commit the refactoring portions (separating TF-IDF out) to trunk very soon, because it's really difficult
          to keep this branch in sync with trunk, e.g. lots of activity and refactoring going on.

          I'd like to get this merged in as quickly as possible. I don't think the svn history is interesting, especially given
          all the frustrations I am having with merging... The easiest way will be to commit a patch; I'll get everything in shape
          and upload one soon, like, today.

          Robert Muir added a comment -

          Updated patch: I brought the patch up to trunk, cleaned up, and enabled some more of the stats in scoring (e.g. totalTermFreq/sumOfTotalTermFreq).

          In src/test I added a MockLMSimilarity that implements "Bayesian smoothing using Dirichlet priors" from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.8113

          This one is interesting, as it's faster than Lucene's scoring formula today.
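
          For reference, the textbook form of that Dirichlet-prior score, written
          against the stats named above. A hedged sketch: this is the formula from
          the paper, not necessarily MockLMSimilarity's exact code, and the class
          and method names are invented.

            // Per query term t in doc d (Zhai & Lafferty):
            //   score(t, d) = log( (tf + mu * P(t|C)) / (docLen + mu) )
            // where P(t|C) = totalTermFreq(t) / sumOfTotalTermFreq(field).
            class LMDirichletSketch {
              static double score(int tf, long docLen,
                                  long totalTermFreq, long sumTotalTermFreq,
                                  double mu) {
                double pTC = (double) totalTermFreq / sumTotalTermFreq; // collection model
                return Math.log((tf + mu * pTC) / (docLen + mu));
              }
            }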

          I want to get some of this stuff in shape for David (or any other GSoC students) to be able to implement their algorithms, but there is a lot of refactoring (e.g. explains) to do.

          I'll create a branch under https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring with this infrastructure in a bit.

          Tonight I'll see if I can get the avg doc length stuff in the branch too.

          Robert Muir added a comment -

          Here's a really, really rough "take 2" at the problem.

          The general idea is to take a smaller "baby step", as Mike calls it, at the problem.
          Really we have been working our way towards this anyway: exposing additional statistics,
          making Similarity per-field, fixing up inconsistencies... and this is the way I prefer, as we
          get things actually committed and moving.

          So whatever is in this patch (which is full of nocommits, but all tests pass and all queries work with it),
          we could possibly then split up into other issues and continue slowly proceeding, or maybe
          create a branch, whatever.

          My problem with the other patch is it requires a ton more work to make any progress on it...
          and things don't even compile with it, forget about tests.

          The basics here are to:

          1. Split the "matching" and "scoring calculations" of Scorer. All responsibility for calculations belongs
            in the Similarity; the Scorer should be matching positions, working docsEnums, etc. (a sketch of this
            split follows the list).
          2. Similarity as we know it now gets a more low-level API, and TFIDFSimilarity implements this API,
            but exposes its customizations via the tf(), idf(), etc. we know now.
          3. Things like score-caching and specialization of calculations are the responsibility of the Similarity,
            as these depend upon the formula being used. For TFIDFSimilarity, I added some optimizations here;
            for example, it specializes its norms == null case away to remove the per-doc "if".
          4. Since all Weights create PerReaderTermState (<-- this one needs a new name) to separate the
            seeking/stats collection from the calculations, I also optimized PhraseQuery's Weight/Scorer construction
            to be single-pass.
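
          To make point 1 concrete, here is a hedged sketch of the matching/scoring
          split; all class and method names below are hypothetical, not the patch's
          actual API.

            // The Scorer owns "matching" (postings iteration); the per-doc score
            // math lives behind a small object handed out by the Similarity.
            abstract class SimScorer {
              // Pure calculation: freq (plus whatever norms/stats the Similarity
              // holds privately) -> score contribution.
              abstract float score(int docID, int freq);
            }

            class SketchTermScorer {
              private final SimScorer docScorer; // created by the per-field Similarity

              SketchTermScorer(SimScorer docScorer) {
                this.docScorer = docScorer;
              }

              // The matching half stays here: advance the docsEnum, read the freq,
              // then delegate all calculation.
              float scoreDoc(int docID, int freq) {
                return docScorer.score(docID, freq);
              }
            }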

          Also, I like to benchmark every step of the way, so we don't come up with
          a design that won't be performant. Here are the scores for Lucene's default Sim with the patch:

          Query                               QPS trunk   QPS patch   Pct diff
          spanNear([unit, state], 10, true)        3.04        2.92      -4.0%
          doctitle:.[Uu]nited.                     4.00        3.99      -0.1%
          +unit +state                             8.11        8.12       0.2%
          united~2.0                               4.36        4.40       1.0%
          united~1.0                              18.70       18.93       1.2%
          unit~2.0                                 8.54        8.71       2.1%
          spanFirst(unit, 5)                      11.35       11.59       2.2%
          unit~1.0                                 8.69        8.91       2.6%
          unit state                               7.03        7.23       2.8%
          "unit state"~3                           3.74        3.86       3.2%
          u*d                                     16.72       17.30       3.5%
          state                                   19.24       20.04       4.1%
          un*d                                    49.42       51.55       4.3%
          "unit state"                             5.99        6.31       5.3%
          +nebraska +state                       140.74      151.85       7.9%
          uni*                                    10.66       11.55       8.4%
          unit*                                   18.77       20.41       8.7%
          doctimesecnum:[10000 TO 60000]           6.97        7.70      10.4%

          All Lucene/Solr tests pass, but there are lots of nocommits, especially:

          1. No Javadocs
          2. Explains need to be fixed: in general the explanation of "matching" belongs where it is now,
            but the explanation of "score calculations" belongs in the Similarity.
          3. Need to refactor more out of Weight; currently we pass it to the docscorer, but
            it's the wrong object, as it can only "hold" a single float.

          Anyway, it's gonna take some time to rough all this out I'm sure, but I wanted
          to show some progress/invite ideas, and also show we can do this stuff
          without losing performance.

          I have separate patches that need to be integrated/relevance-tested, e.g.
          for average doc length... maybe I'll do that next so we can get some concrete
          alternate sims in here before going any further.

          Robert Muir added a comment -

          I brought the patch up to trunk; didn't fix any problems
          (e.g. some tests still fail, and things outside of core Lucene won't even compile).

          Robert Muir added a comment -

          > Really, maybe somehow we should be using an attr about the token
          > itself? Instead of posIncr == 0? I mean a broken synonym injector
          > could conceivably provide the synonyms first (w/ the first one having
          > posIncr 1), followed by the real term (w/ posIncr 0)?

          Right, it's been my opinion all along (others disagree with me) that since
          we have this ordered (incrementToken) AttributeSource / Token*Stream*,
          this sort of broken filter isn't a valid equivalent... it's definitely a different
          TokenStream, even if it's treated in some ways today as the same... we gotta
          break away from this for reasons like this.

          Otherwise why have it ordered at all?

          Shai Erera added a comment -

          I'd like to withdraw my request from above. I had missed that the stats I need are stored per-field, per-doc, so that will allow me to compute the docLength as I want.

          Michael McCandless added a comment -

          > I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right?

          I'm calling the norms "boost bytes". This was Marvin's term... I
          like it.

          This patch makes boost byte computation completely private to the
          sim (see the *FieldSimProvider). Ie, the sim providers walk the stats
          and do whatever they want to "prepare" for real searching. EG, if you
          have the RAM, maybe you want to use a true float[], not boost bytes. Or
          if you really don't have the RAM, maybe you use only 4 bits per doc,
          not 8. The FieldSim just provides a "float boost(int docID)", so what
          it does under the hood is private.

          > Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed.

          So... I'm actually trying to avoid extensibility on the first go, here
          (this is the "baby steps" part of the original thread).

          Ie, the IR world seems to have converged on a smallish set of "stats"
          that are commonly required, so I'd like to make those initial stats
          work well, for starters. Commit that (it enables all sorts of
          state-of-the-art scoring models), and perhaps cut over to the default
          Robert created in LUCENE-2187 (which needs stats to work correctly).
          And then (phase 2) work out pluggability so you can put your own
          stats in...

          > Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today?

          Right, this is the DefaultSimProvider in my current patch – it simply
          computes the same thing Lucene does today, but does so using the stats
          at IR open time (once it's hooked up), instead of during indexing.

          > I think though that it's not a field-level setting, but an IW one?

          It's field-level now and I think we should keep it that way. EG,
          Terrier was apparently document-oriented in the past but has now
          deprecated that and moved to per-field.

          You can always make a catch-all field if you "really" want aggregated
          stats across the entire doc?
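
          To illustrate the "what it does under the hood is private" point, here
          are two hypothetical providers behind the same boost(int docID)
          contract; the class names are invented, though SmallFloat's byte
          encoding is the real utility Lucene's classic norms use.

            import org.apache.lucene.util.SmallFloat;

            // 8 bits per doc: quantized boost bytes, cheap on RAM.
            class ByteBackedBoosts {
              private final byte[] boostBytes;
              ByteBackedBoosts(byte[] bytes) { this.boostBytes = bytes; }
              float boost(int docID) {
                return SmallFloat.byte315ToFloat(boostBytes[docID]);
              }
            }

            // Full floats: more RAM, no quantization error.
            class FloatBackedBoosts {
              private final float[] boosts;
              FloatBackedBoosts(float[] boosts) { this.boosts = boosts; }
              float boost(int docID) { return boosts[docID]; }
            }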

          Michael McCandless added a comment -

          > Mike, I don't think overlapTermCount should really exist in the Stats.

          OK, I will remove it – I was unsure whether it was overkill. So
          it's purely an index-time decision whether the posIncr 0 tokens
          "count".

          Hmm, but... we have a problem, which is that these posIncr 0 tokens
          are now counted in the unique token count. Have to mull how to avoid
          that... hmm... to do it correctly, I think, means "count this token as
          +1 on the unique tokens for this doc if ever it has non-zero posIncr"?

          Really, maybe somehow we should be using an attr about the token
          itself? Instead of posIncr == 0? I mean a broken synonym injector
          could conceivably provide the synonyms first (w/ the first one having
          posIncr 1), followed by the real term (w/ posIncr 0)?

          BTW the cost of storing the stats isn't that bad – it increases index
          size by 1.5% on a 10M-doc wikipedia index where each doc is 1KB of text
          (~171 words per doc on avg).
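
          A hedged sketch of the counting rule under discussion (illustrative only,
          not the patch's indexing code): overlap tokens are excluded from the
          total, and a term counts as unique only if it ever appears with a
          non-zero position increment.

            import java.io.IOException;
            import java.util.HashSet;
            import java.util.Set;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
            import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

            class StatsCountSketch {
              // Returns {uniqueTermCount, totalTermCount} for one field of one doc.
              static int[] countTerms(TokenStream ts) throws IOException {
                CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
                PositionIncrementAttribute posAtt =
                    ts.addAttribute(PositionIncrementAttribute.class);
                Set<String> unique = new HashSet<String>();
                int total = 0;
                ts.reset();
                while (ts.incrementToken()) {
                  if (posAtt.getPositionIncrement() > 0) { // skip overlaps (e.g. injected synonyms)
                    total++;
                    unique.add(termAtt.toString());
                  }
                }
                ts.end();
                return new int[] { unique.size(), total };
              }
            }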

          Shai Erera added a comment -

          Mike - it'll also be great if we can store the length of the document in a custom way. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right? Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today?
          I think though that it's not a field-level setting, but an IW one?
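
          Made concrete, the proposal might look something like this; the interface
          is hypothetical (it does not exist in Lucene), just the idea given a shape.

            // Apps plug in an impl; the default would reproduce what Lucene
            // does today when computing the length behind the norm.
            public interface DocLengthProvider {
              long docLength(String field, int uniqueTermCount, int totalTermCount);
            }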

          Robert Muir added a comment -

          Mike, I don't think overlapTermCount should really exist in the Stats.
          Trying to implement some concrete FieldSimProviders, it starts getting messy.
          When using a term-unique pivoted length norm, I don't want to count these positionIncrement=0 terms either...
          so do we need a uniqueOverlapTermCount?
          Even when using a non-unique (BM25) pivoted length norm, we don't want to count these when summing to
          calculate averages either.

          So I know you probably see this as "baking something into the index", but I think positionIncrement=0 means
          "doesn't contribute to the document length" by definition, and discountOverlaps=false (no longer the default)
          should be considered deprecated compatibility behavior.

          Michael McCandless added a comment -

          Rough first patch attached....


            People

            • Assignee: Robert Muir
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 12
