Details

    • Type: Wish Wish
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, Trunk
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently FieldsConsumer/PostingsConsumer/etc is a "push" oriented api, e.g. FreqProxTermsWriter streams the postings at flush, and the default merge() takes the incoming codec api and filters out deleted docs and "pushes" via same api (but that can be overridden).

      It could be cleaner if we allowed for a "pull" model instead (like DocValues). For example, maybe FreqProxTermsWriter could expose a Terms of itself and just passed this to the codec consumer.

      This would give the codec more flexibility to e.g. do multiple passes if it wanted to do things like encode high-frequency terms more efficiently with a bitset-like encoding or other things...

      A codec can try to do things like this to some extent today, but its very difficult (look at buffering in Pulsing). We made this change with DV and it made a lot of interesting optimizations easy to implement...

      1. LUCENE-5123.patch
        60 kB
        Michael McCandless
      2. LUCENE-5123.patch
        95 kB
        Michael McCandless
      3. LUCENE-5123.patch
        113 kB
        Michael McCandless
      4. LUCENE-5123.patch
        202 kB
        Michael McCandless
      5. LUCENE-5123.patch
        233 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.

          Show
          Anshum Gupta added a comment - Bulk close after 5.0 release.
          Hide
          ASF subversion and git services added a comment -

          Commit 1640808 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1640808 ]

          LUCENE-5123: fix changes

          Show
          ASF subversion and git services added a comment - Commit 1640808 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1640808 ] LUCENE-5123 : fix changes
          Hide
          ASF subversion and git services added a comment -

          Commit 1640807 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1640807 ]

          LUCENE-5123: fix changes

          Show
          ASF subversion and git services added a comment - Commit 1640807 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1640807 ] LUCENE-5123 : fix changes
          Hide
          Michael McCandless added a comment -

          Thanks Rob!

          Show
          Michael McCandless added a comment - Thanks Rob!
          Hide
          ASF subversion and git services added a comment -

          Commit 1620252 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1620252 ]

          LUCENE-5123, LUCENE-5268: move CHANGES 5.0 -> 4.11

          Show
          ASF subversion and git services added a comment - Commit 1620252 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1620252 ] LUCENE-5123 , LUCENE-5268 : move CHANGES 5.0 -> 4.11
          Hide
          ASF subversion and git services added a comment -

          Commit 1620250 from Robert Muir in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1620250 ]

          LUCENE-5123, LUCENE-5268: invert codec postings api (backport from trunk)

          Show
          ASF subversion and git services added a comment - Commit 1620250 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1620250 ] LUCENE-5123 , LUCENE-5268 : invert codec postings api (backport from trunk)
          Hide
          Robert Muir added a comment -

          This has baked for a year. Lets backport it to 4.x for auto-prefix terms and other feature that can make use of it.

          Show
          Robert Muir added a comment - This has baked for a year. Lets backport it to 4.x for auto-prefix terms and other feature that can make use of it.
          Hide
          Robert Muir added a comment -

          The only reason merge() exists there is so they can implement some "bulk merging" optimizations?

          Can we remove these optimizations? Has there ever been a benchmark showing they help at all?

          We shouldnt have such scary code in lucene because it "looks faster". Every time I look at infostreams from merge, its completely dominated by postings and other things.

          Show
          Robert Muir added a comment - The only reason merge() exists there is so they can implement some "bulk merging" optimizations? Can we remove these optimizations? Has there ever been a benchmark showing they help at all? We shouldnt have such scary code in lucene because it "looks faster". Every time I look at infostreams from merge, its completely dominated by postings and other things.
          Hide
          Michael McCandless added a comment -

          Thanks Han, I do like the new API better ...

          I don't think we need go get rid of merge() for stored fields / term vectors, at least not yet ...

          Show
          Michael McCandless added a comment - Thanks Han, I do like the new API better ... I don't think we need go get rid of merge() for stored fields / term vectors, at least not yet ...
          Hide
          Han Jiang added a comment -

          Nice change! Although PushFieldsConsumer is still using the old API, I like the migrating
          of flush() logic from FreqProxTermsWriterPerField to PushFieldsConsumer, the calling chain is
          more clear in codec level now.

          Also, I'm quite curious whether StoredFields and TermVectors will get rid of 'merge()' later.

          Show
          Han Jiang added a comment - Nice change! Although PushFieldsConsumer is still using the old API, I like the migrating of flush() logic from FreqProxTermsWriterPerField to PushFieldsConsumer, the calling chain is more clear in codec level now. Also, I'm quite curious whether StoredFields and TermVectors will get rid of 'merge()' later.
          Hide
          ASF subversion and git services added a comment -

          Commit 1524840 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1524840 ]

          LUCENE-5123: invert postings writing API

          Show
          ASF subversion and git services added a comment - Commit 1524840 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1524840 ] LUCENE-5123 : invert postings writing API
          Hide
          Michael McCandless added a comment -

          Thanks Rob, I'll commit soon.

          The change to TestConcurrentMergeScheduler looks leftover/accidental?

          It is separate, but intentional: it adds some fangs to that test. I'll put a comment ...

          Show
          Michael McCandless added a comment - Thanks Rob, I'll commit soon. The change to TestConcurrentMergeScheduler looks leftover/accidental? It is separate, but intentional: it adds some fangs to that test. I'll put a comment ...
          Hide
          Robert Muir added a comment -

          +1 to commit: patch looks awesome!

          The change to TestConcurrentMergeScheduler looks leftover/accidental?

          Show
          Robert Muir added a comment - +1 to commit: patch looks awesome! The change to TestConcurrentMergeScheduler looks leftover/accidental?
          Hide
          Michael McCandless added a comment -

          New patch, resolving all nocommits. Tests and ant precommit pass. I
          think it's ready!

          I moved all the MappingMulti* from oal.codecs to oal.index and made
          them package private.

          We can later tackle cutting over different postings formats to the
          pull API...

          Show
          Michael McCandless added a comment - New patch, resolving all nocommits. Tests and ant precommit pass. I think it's ready! I moved all the MappingMulti* from oal.codecs to oal.index and made them package private. We can later tackle cutting over different postings formats to the pull API...
          Hide
          Michael McCandless added a comment -

          New patch, moving write method to FieldsConsumer, and creating a PushFieldsConsumer subclass that overrides write with a final impl that switches to the "push" API (Fields/Terms/PostingsConsumer).

          I changed the BloomPF to require a PushFieldsConsumer, and I also removed custom comparators from Terms/Enum.

          Still a few nocommits left but I think this is getting close-ish.

          Show
          Michael McCandless added a comment - New patch, moving write method to FieldsConsumer, and creating a PushFieldsConsumer subclass that overrides write with a final impl that switches to the "push" API (Fields/Terms/PostingsConsumer). I changed the BloomPF to require a PushFieldsConsumer, and I also removed custom comparators from Terms/Enum. Still a few nocommits left but I think this is getting close-ish.
          Hide
          Michael McCandless added a comment -

          1. move write() from PostingsFormat to FieldsConsumer
          2. make the "push" api a subclass of FieldsConsumer that has a final implementation of write() and exposes the abstract api it has today (e.g. addField)

          I started down this path (moved the write method to FieldsConsumer, and created a PushFieldsConsumer subclass that impls final write, exposing the current API) but ... this causes problems for wrapping/delegating PostingsConsumers (e.g. AssertingPF, BloomPF, PulsingPF) since suddenly they must be strongly typed to accept only PushFieldsConsumer. Either that or I guess we could cut each of these over to write().

          I mean, it exposes a real issue w/ the current patch: you cannot wrap SimpleTextPF (or any future PF that uses the pull API) inside these PFs that use the push API. Not sure what to do ...

          Show
          Michael McCandless added a comment - 1. move write() from PostingsFormat to FieldsConsumer 2. make the "push" api a subclass of FieldsConsumer that has a final implementation of write() and exposes the abstract api it has today (e.g. addField) I started down this path (moved the write method to FieldsConsumer, and created a PushFieldsConsumer subclass that impls final write, exposing the current API) but ... this causes problems for wrapping/delegating PostingsConsumers (e.g. AssertingPF, BloomPF, PulsingPF) since suddenly they must be strongly typed to accept only PushFieldsConsumer. Either that or I guess we could cut each of these over to write(). I mean, it exposes a real issue w/ the current patch: you cannot wrap SimpleTextPF (or any future PF that uses the pull API) inside these PFs that use the push API. Not sure what to do ...
          Hide
          Robert Muir added a comment -

          Maybe we should:

          1. move write() from PostingsFormat to FieldsConsumer
          2. make the "push" api a subclass of FieldsConsumer that has a final implementation of write() and exposes the abstract api it has today (e.g. addField)

          or we can keep the names the same and have the new one be "underneath" FieldsConsumer (e.g. RawFieldsConsumer).

          But I am not sure we should focus on nuking the old api or even cutting over impls that dont "need" the visitor api, since the existing api works pretty well
          for a lot of implementations and if you dont need to do fancy stuff it might be easier and less error-prone.

          Show
          Robert Muir added a comment - Maybe we should: move write() from PostingsFormat to FieldsConsumer make the "push" api a subclass of FieldsConsumer that has a final implementation of write() and exposes the abstract api it has today (e.g. addField) or we can keep the names the same and have the new one be "underneath" FieldsConsumer (e.g. RawFieldsConsumer). But I am not sure we should focus on nuking the old api or even cutting over impls that dont "need" the visitor api, since the existing api works pretty well for a lot of implementations and if you dont need to do fancy stuff it might be easier and less error-prone.
          Hide
          Michael McCandless added a comment -

          New patch, adding a test case that "exercises" this API a bit...

          Show
          Michael McCandless added a comment - New patch, adding a test case that "exercises" this API a bit...
          Hide
          Michael McCandless added a comment -

          New patch, cutting over SimpleText to the new inverted API.

          I also had to cutover PerFieldPF (otherwise it would not be able to embed SimpleText), and a couple of tests.

          Tests pass, at least once!

          Show
          Michael McCandless added a comment - New patch, cutting over SimpleText to the new inverted API. I also had to cutover PerFieldPF (otherwise it would not be able to embed SimpleText), and a couple of tests. Tests pass, at least once!
          Hide
          Robert Muir added a comment -

          This is exciting!

          One idea is to just keep the old API (at least for now)?
          Then, we dont have to cutover tons of code at once and we just have a new low level api (and back compat by accident).

          I think it would be good if we wrote or converted a 'demo' codec (simpletext is ok for example, or a new simple one) to the new api first, just to see if we are happy with it.

          Like maybe its just fine that if you are implementing the new API you have to compute the stats in your codec yourself, maybe its simple, or maybe we just plan on keeping the higher level API and not deprecating it.

          Show
          Robert Muir added a comment - This is exciting! One idea is to just keep the old API (at least for now)? Then, we dont have to cutover tons of code at once and we just have a new low level api (and back compat by accident). I think it would be good if we wrote or converted a 'demo' codec (simpletext is ok for example, or a new simple one) to the new api first, just to see if we are happy with it. Like maybe its just fine that if you are implementing the new API you have to compute the stats in your codec yourself, maybe its simple, or maybe we just plan on keeping the higher level API and not deprecating it.
          Hide
          Michael McCandless added a comment -

          Here's a starting patch with lots of nocommits, but tests are passing ...

          This patch adds a postings (Fields) impl during flush (FreqProxFields), reading the postings from FreqProxTermWriter's RAM buffers, and another for merging (MappedMultiFields).

          It also adds a new PostingsFormat.write to write the postings ... the idea is this will eventually replace PostingsFormat.fieldsConsumer, but for now, so we can cutover existing postings formats iteratively, I made a default impl for write that takes a Fields and steps through Fields/Terms/Postings for the FieldsConsumer API.

          Show
          Michael McCandless added a comment - Here's a starting patch with lots of nocommits, but tests are passing ... This patch adds a postings (Fields) impl during flush (FreqProxFields), reading the postings from FreqProxTermWriter's RAM buffers, and another for merging (MappedMultiFields). It also adds a new PostingsFormat.write to write the postings ... the idea is this will eventually replace PostingsFormat.fieldsConsumer, but for now, so we can cutover existing postings formats iteratively, I made a default impl for write that takes a Fields and steps through Fields/Terms/Postings for the FieldsConsumer API.
          Hide
          Michael McCandless added a comment -

          +1

          I think this would allow for more innovation in PostingsFormats ...

          Show
          Michael McCandless added a comment - +1 I think this would allow for more innovation in PostingsFormats ...

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development