Lucene - Core
  1. Lucene - Core
  2. LUCENE-3722

make similarities/term/collectionstats take long (for > 2B docs)

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As noted by Yonik and Andrzej on SOLR-1632, this would be useful for distributed scoring.

      we can also add a sugar method add() to both of these to make it easier to sum.

      1. LUCENE-3722.patch
        18 kB
        Robert Muir
      2. LUCENE-3722.patch
        18 kB
        Robert Muir
      3. LUCENE-3722.patch
        17 kB
        Robert Muir
      4. LUCENE-3722.patch
        19 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        updated patch with improvements for
        the preflex case (when they return -1).

        Show
        Robert Muir added a comment - updated patch with improvements for the preflex case (when they return -1).
        Hide
        Andrzej Bialecki added a comment -

        I think the add() methods as implemented in this patch are of limited usefulness... The reason I created a pair of similar classes in SOLR-1632 was to avoid creating new objects by allowing modification of int/long members in place, which is useful when aggregating partial stats. So I think a more useful implementation of add() could look like this:

        public void add(CollectionStatistics other) {
          assert this.field.equals(other.field);
          this.maxDoc += other.maxDoc;
          this.docCount += other.docCount;
          this.sumTotalTermFreq += other.sumTotalTermFreq;
          this.sumDocFreq += other.sumDocFreq;
        }
        

        Regarding the handling of -1: this code doesn't handle the case when only one stats is -1, which may happen in distributed scenario. I think in such case a -1 value should be treated as 0, i.e. it should not "reset" the accumulated value from other shards to -1, right?

        Show
        Andrzej Bialecki added a comment - I think the add() methods as implemented in this patch are of limited usefulness... The reason I created a pair of similar classes in SOLR-1632 was to avoid creating new objects by allowing modification of int/long members in place, which is useful when aggregating partial stats. So I think a more useful implementation of add() could look like this: public void add(CollectionStatistics other) { assert this .field.equals(other.field); this .maxDoc += other.maxDoc; this .docCount += other.docCount; this .sumTotalTermFreq += other.sumTotalTermFreq; this .sumDocFreq += other.sumDocFreq; } Regarding the handling of -1: this code doesn't handle the case when only one stats is -1, which may happen in distributed scenario. I think in such case a -1 value should be treated as 0, i.e. it should not "reset" the accumulated value from other shards to -1, right?
        Hide
        Robert Muir added a comment -

        So I think a more useful implementation of add() could look like this:

        But I don't think we should just immediately make the stats classes mutable without thinking things through more, I'm not sure there is really any problem creating new objects (we should benchmark this, before making things mutable).

        maybe we should not have add() here then at all and let the consumer take care of this themselves.

        should be treated as 0, i.e. it should not "reset" the accumulated value from other shards to -1, right?

        i think it must not be treated as 0, the same way its handled by default in TermContext if some segments are 3.x: it means this statistic is simply not supported.

        In other words, if any segment across any shard returns -1, the cumulative stat should always be -1: it means its not supported by the codec.

        The similarities that use these new statistics already look for -1 and have fallback mechanisms for this case.

        Show
        Robert Muir added a comment - So I think a more useful implementation of add() could look like this: But I don't think we should just immediately make the stats classes mutable without thinking things through more, I'm not sure there is really any problem creating new objects (we should benchmark this, before making things mutable). maybe we should not have add() here then at all and let the consumer take care of this themselves. should be treated as 0, i.e. it should not "reset" the accumulated value from other shards to -1, right? i think it must not be treated as 0, the same way its handled by default in TermContext if some segments are 3.x: it means this statistic is simply not supported. In other words, if any segment across any shard returns -1, the cumulative stat should always be -1: it means its not supported by the codec. The similarities that use these new statistics already look for -1 and have fallback mechanisms for this case.
        Hide
        Andrzej Bialecki added a comment -

        maybe we should not have add() here then at all and let the consumer take care of this themselves.

        Fair point, I'd rather remove it.

        i think it must not be treated as 0

        Ok, it makes sense in local (multi reader) situations but in distributed scenario it may be still acceptable to lose just a part of the statistics from one shard while keeping the stats from other shards.

        Which would be yet another reason to drop the add() methods

        Show
        Andrzej Bialecki added a comment - maybe we should not have add() here then at all and let the consumer take care of this themselves. Fair point, I'd rather remove it. i think it must not be treated as 0 Ok, it makes sense in local (multi reader) situations but in distributed scenario it may be still acceptable to lose just a part of the statistics from one shard while keeping the stats from other shards. Which would be yet another reason to drop the add() methods
        Hide
        Robert Muir added a comment -

        Ok, it makes sense in local (multi reader) situations but in distributed scenario it may be still acceptable to lose just a part of the statistics from one shard while keeping the stats from other shards.

        I really think we should not do this: it can result in NaN/Inf/negative scores to have 'invalid'
        statistics (this causes serious problems!), but we have (and test) that all the sims fallback
        gracefully for the -1 case.

        -1 means 'preflex codec does not support the stat'. Once we no longer have to worry about 3.x
        indexes, we no longer need to worry about -1.

        Show
        Robert Muir added a comment - Ok, it makes sense in local (multi reader) situations but in distributed scenario it may be still acceptable to lose just a part of the statistics from one shard while keeping the stats from other shards. I really think we should not do this: it can result in NaN/Inf/negative scores to have 'invalid' statistics (this causes serious problems!), but we have (and test) that all the sims fallback gracefully for the -1 case. -1 means 'preflex codec does not support the stat'. Once we no longer have to worry about 3.x indexes, we no longer need to worry about -1.
        Hide
        Robert Muir added a comment -

        updated patch: this is just int->long conversion... at least we should be able to agree on this

        we should still figure out how to make this summation easier and safer (especially for the "preflex-returns-1" case)

        Show
        Robert Muir added a comment - updated patch: this is just int->long conversion... at least we should be able to agree on this we should still figure out how to make this summation easier and safer (especially for the "preflex-returns-1" case)
        Hide
        Robert Muir added a comment -

        just a simple example for what i meant about the -1 case:

        Lets assume you have two shards, and one returns -1 for totalTermFreq().

        If you were using BasicModelIF, which scores as:

        tf_norm * log(1 + (maxdoc + 1)/(totalTermFreq + 0.5))
        

        its far better to actually use -1 than a 'partial/incorrect' totalTermFreq,
        because in that case the formula will fall back to totalTermFreq=docFreq...
        it also must do this in case frequencies are omitted (omitTF), and for that
        case the formula is still correct: but either way its falling back nicely to
        IDF:

        tf_norm * log(1 + (maxdoc + 1)/(docFreq + 0.5))
        

        Yeah, i totally forgot about this being -1 in the omitTF case, so we should
        still really think this summation through and make it easy to prevent mistakes,
        because i gather omitTF isn't going anywhere... grrr

        Show
        Robert Muir added a comment - just a simple example for what i meant about the -1 case: Lets assume you have two shards, and one returns -1 for totalTermFreq(). If you were using BasicModelIF, which scores as: tf_norm * log(1 + (maxdoc + 1)/(totalTermFreq + 0.5)) its far better to actually use -1 than a 'partial/incorrect' totalTermFreq, because in that case the formula will fall back to totalTermFreq=docFreq... it also must do this in case frequencies are omitted (omitTF), and for that case the formula is still correct: but either way its falling back nicely to IDF: tf_norm * log(1 + (maxdoc + 1)/(docFreq + 0.5)) Yeah, i totally forgot about this being -1 in the omitTF case, so we should still really think this summation through and make it easy to prevent mistakes, because i gather omitTF isn't going anywhere... grrr
        Hide
        Robert Muir added a comment -

        Patch synced up to trunk.

        as far as the unrelated -1/add()/summation stuff i added assertions to do basic bounds checking of these stats (found a bug in simpletext), so I think thats a good step for now about concerns of summation: at least we will probably catch any bad bugs.

        But this patch just changes int->long and I think we can move forward with it.

        Show
        Robert Muir added a comment - Patch synced up to trunk. as far as the unrelated -1/add()/summation stuff i added assertions to do basic bounds checking of these stats (found a bug in simpletext), so I think thats a good step for now about concerns of summation: at least we will probably catch any bad bugs. But this patch just changes int->long and I think we can move forward with it.
        Hide
        Robert Muir added a comment -

        Committed the switch of int->long.

        For now the summation stuff is still up to the caller... seems unfriendly, but maybe it should stay that way, but at least we have asserts that the stats aren't out of bounds.

        Show
        Robert Muir added a comment - Committed the switch of int->long. For now the summation stuff is still up to the caller... seems unfriendly, but maybe it should stay that way, but at least we have asserts that the stats aren't out of bounds.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development