Lucene - Core
  1. Lucene - Core
  2. LUCENE-2492

Make PulsingCodec (wrapping StandardCodec) the default codec

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Duplicate
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.1
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      PulsingCodec can provides good gains, by inlining the postings into the terms dict for rare terms. This is especially helpful for primary key like fields, since every term is rare and batch lookups are common (see http://chbits.blogspot.com/2010/06/lucenes-pulsingcodec-on-primary-key.html for a simple perf test), but it should also be a gain for ordinary fields, thanks to Zipf's law.

      I think we should make it the default....

        Activity

        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.
        Hide
        Robert Muir added a comment -

        This is implemented over on LUCENE-4498 as an optimization, storing the docid directly instead of filepointer+docid when docfreq=1, and using the already-stored totalTermFreq as the freq for that singleton document, since its redundant. Any positions/payloads or anything else still go to their usual place, it just saves the wasted seek, the useless file pointer, and optimizes the primary key case.

        As a default it doesn't have the potential traps of Pulsing (see the issue). If someone wants more flexibility (e.g. they want to store positions and such in the term dictionary), they can still use Pulsing.

        Show
        Robert Muir added a comment - This is implemented over on LUCENE-4498 as an optimization, storing the docid directly instead of filepointer+docid when docfreq=1, and using the already-stored totalTermFreq as the freq for that singleton document, since its redundant. Any positions/payloads or anything else still go to their usual place, it just saves the wasted seek, the useless file pointer, and optimizes the primary key case. As a default it doesn't have the potential traps of Pulsing (see the issue). If someone wants more flexibility (e.g. they want to store positions and such in the term dictionary), they can still use Pulsing.
        Hide
        Michael McCandless added a comment -

        We can encode whether the posting is embedded or not by storing a byte or a negative pointer for example. There are ways to do it with minimal to no more space.

        Remember than vInt/Long don't handle negative numbers well (they take max # bytes, I think).

        The thing is - there is a performance penalty to storing too many bytes in the terms dict because it may affect terms lookup. docFreq may not be a very good decision.

        True, but I'd expect "typically" rare terms (occurring in 1 or 2 docs across the corpus) also generally tend to have low frequency within that document. Hmm, or maybe not – maybe there's only a single article about Dr. Froobalaz, but in that article Froobalaz is mentioned many many times.

        For example, a term may have one posting element with a huge payload.

        True, though such apps (the exception not the rule) could override the codec.

        Fixed #bytes might also allow for faster scanning, ie if we always leave a 20 byte slot we know we can then seek +20 bytes ahead, vs pulsing codec which must decode the postings for the term when scanning over it. (Though if we thought this mattered we could also write the #bytes up front).

        Net/net I think we should pursue this; we should probably keep both options available and then we can test.

        Show
        Michael McCandless added a comment - We can encode whether the posting is embedded or not by storing a byte or a negative pointer for example. There are ways to do it with minimal to no more space. Remember than vInt/Long don't handle negative numbers well (they take max # bytes, I think). The thing is - there is a performance penalty to storing too many bytes in the terms dict because it may affect terms lookup. docFreq may not be a very good decision. True, but I'd expect "typically" rare terms (occurring in 1 or 2 docs across the corpus) also generally tend to have low frequency within that document. Hmm, or maybe not – maybe there's only a single article about Dr. Froobalaz, but in that article Froobalaz is mentioned many many times. For example, a term may have one posting element with a huge payload. True, though such apps (the exception not the rule) could override the codec. Fixed #bytes might also allow for faster scanning, ie if we always leave a 20 byte slot we know we can then seek +20 bytes ahead, vs pulsing codec which must decode the postings for the term when scanning over it. (Though if we thought this mattered we could also write the #bytes up front). Net/net I think we should pursue this; we should probably keep both options available and then we can test.
        Hide
        Andrzej Bialecki added a comment -

        How about adding some metadata to SegmentInfos ... if we figure out how to proceed with LUCENE-2491 then SegmentInfos could keep the list of codecs per file plus their init args.

        Show
        Andrzej Bialecki added a comment - How about adding some metadata to SegmentInfos ... if we figure out how to proceed with LUCENE-2491 then SegmentInfos could keep the list of codecs per file plus their init args.
        Hide
        Shai Erera added a comment -

        The thing is - there is a performance penalty to storing too many bytes in the terms dict because it may affect terms lookup. docFreq may not be a very good decision. For example, a term may have one posting element with a huge payload. Or a term may be assoicated with few documents whose IDs are successive, thus they are compressed much better than a term with one doc whose ID is 1M.

        #bytes is also something you can measure. Lucene should behave the same if the entries are 20 bytes total, which is not a collection specific setting. Point is, if you've measured term dict lookup when entries Re 20 bytes in length, you know how it performs, and it will perform like that for every collection. But if you perf test with docFreq=3 it willperform differently on different collections ...

        Also #bytes limit makes it easy to compute the size consumed.

        Show
        Shai Erera added a comment - The thing is - there is a performance penalty to storing too many bytes in the terms dict because it may affect terms lookup. docFreq may not be a very good decision. For example, a term may have one posting element with a huge payload. Or a term may be assoicated with few documents whose IDs are successive, thus they are compressed much better than a term with one doc whose ID is 1M. #bytes is also something you can measure. Lucene should behave the same if the entries are 20 bytes total, which is not a collection specific setting. Point is, if you've measured term dict lookup when entries Re 20 bytes in length, you know how it performs, and it will perform like that for every collection. But if you perf test with docFreq=3 it willperform differently on different collections ... Also #bytes limit makes it easy to compute the size consumed.
        Hide
        Shai Erera added a comment -

        We can encode whether the posting is embedded or not by storing a byte or a negative pointer for example. There are ways to do it with minimal to no more space.

        Show
        Shai Erera added a comment - We can encode whether the posting is embedded or not by storing a byte or a negative pointer for example. There are ways to do it with minimal to no more space.
        Hide
        Michael McCandless added a comment -

        oday it defaults to wrapped StandardCodec and 1 as docFreq cutoff. These two should be parameters IMO.

        +1

        Also, the cut-off itself should allow to also base it on #bytes consumed, and not just doc-freq.

        So really the cutoff should be handled by means of extension, w/ a default impl DocFreqPulsing/Cutoff (whatever) that lets you specify the doc-freq cutoff, with the ability for someone else to extend and provide his own cutoff logic.

        That sounds interesting!

        Though one challenge now is the codec does not write into the terms dict whether the term was inlined or not; instead, it checks the docFreq (when reading). So if we let a subclass make this decision, somehow it'd have to store this bit into the terms dict (since "# bytes consumed" isn't available at read-time).

        Show
        Michael McCandless added a comment - oday it defaults to wrapped StandardCodec and 1 as docFreq cutoff. These two should be parameters IMO. +1 Also, the cut-off itself should allow to also base it on #bytes consumed, and not just doc-freq. So really the cutoff should be handled by means of extension, w/ a default impl DocFreqPulsing/Cutoff (whatever) that lets you specify the doc-freq cutoff, with the ability for someone else to extend and provide his own cutoff logic. That sounds interesting! Though one challenge now is the codec does not write into the terms dict whether the term was inlined or not; instead, it checks the docFreq (when reading). So if we let a subclass make this decision, somehow it'd have to store this bit into the terms dict (since "# bytes consumed" isn't available at read-time).
        Hide
        Shai Erera added a comment -

        +1 !

        I think though, we should make PC more extensible. Today it defaults to wrapped StandardCodec and 1 as docFreq cutoff. These two should be parameters IMO. Also, the cut-off itself should allow to also base it on #bytes consumed, and not just doc-freq.

        So really the cutoff should be handled by means of extension, w/ a default impl DocFreqPulsing/Cutoff (whatever) that lets you specify the doc-freq cutoff, with the ability for someone else to extend and provide his own cutoff logic.

        I'm still new to Codecs, so perhaps this doesn't make much sense.

        Show
        Shai Erera added a comment - +1 ! I think though, we should make PC more extensible. Today it defaults to wrapped StandardCodec and 1 as docFreq cutoff. These two should be parameters IMO. Also, the cut-off itself should allow to also base it on #bytes consumed, and not just doc-freq. So really the cutoff should be handled by means of extension, w/ a default impl DocFreqPulsing/Cutoff (whatever) that lets you specify the doc-freq cutoff, with the ability for someone else to extend and provide his own cutoff logic. I'm still new to Codecs, so perhaps this doesn't make much sense.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development