Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7456

PerField(DocValues|Postings)Format do not call the per-field merge methods

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.2.1
    • Fix Version/s: 7.0, 6.3
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      While porting some old codec code from Lucene 4.3.1, I couldn't get the per-field formats to call upon the per-field merge methods; the default merge method was always being called.

      I think this is a side-effect of LUCENE-5894.

      Attached is a patch with a test that reproduces the error and an associated fix that pass the unit tests.

      1. LUCENE-7456.patch
        25 kB
        Julien MASSENET
      2. LUCENE-7456-v2.patch
        28 kB
        Julien MASSENET

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Wow, nice catch! Thank you.

        Our default codec postings & doc values format doesn't override the merge method, so I guess the impact here on a default Lucene usage is minor. But if you customize your codec then it could matter if you have a better merge impl than the naive default.

        I'll review the patch.

        Show
        mikemccand Michael McCandless added a comment - Wow, nice catch! Thank you. Our default codec postings & doc values format doesn't override the merge method, so I guess the impact here on a default Lucene usage is minor. But if you customize your codec then it could matter if you have a better merge impl than the naive default. I'll review the patch.
        Hide
        mikemccand Michael McCandless added a comment -

        It's a little spooky how sneaky this patch needs to be, temporarily
        overwriting MergeState members, adding a FilterFieldsProducer,
        FilterFieldsInfos.

        Can we improve those latter two classes to e.g. reject a field that
        was not in the restricted set, if you call
        FilterFieldInfos.fieldInfo or FilterFieldsProducer.terms on an
        invalid field name?

        The FilterFieldsInfos is also in a precarious state, having to include
        all incoming FieldInfo instances so the numbers are consistent, yet
        only overriding the iterator.

        Methods like FilterFieldInfos.hasProx, etc., are also wrong, which can
        result in sneaky future bugs for codecs that rely on this.

        I don't really like the complexity in this patch: I think this is a little too much
        sneakiness. Yet, I don't know of a cleaner way to fix the bug.

        Stepping back a bit, can you describe the use case motivating allowing
        your custom codec to override the default merging for doc values /
        postings?

        Show
        mikemccand Michael McCandless added a comment - It's a little spooky how sneaky this patch needs to be, temporarily overwriting MergeState members, adding a FilterFieldsProducer, FilterFieldsInfos. Can we improve those latter two classes to e.g. reject a field that was not in the restricted set, if you call FilterFieldInfos.fieldInfo or FilterFieldsProducer.terms on an invalid field name? The FilterFieldsInfos is also in a precarious state, having to include all incoming FieldInfo instances so the numbers are consistent, yet only overriding the iterator. Methods like FilterFieldInfos.hasProx , etc., are also wrong, which can result in sneaky future bugs for codecs that rely on this. I don't really like the complexity in this patch: I think this is a little too much sneakiness. Yet, I don't know of a cleaner way to fix the bug. Stepping back a bit, can you describe the use case motivating allowing your custom codec to override the default merging for doc values / postings?
        Hide
        jmassenet-rakuten Julien MASSENET added a comment -

        I agree that being this sneaky is not ideal, but the only alternative I see would be change the Postings API.

        In this patch, I tried to keep modifications constrained to the org.apache.lucene.codecs.perfield package, but I can give a shot at trying to a come up with a cleaner implementation that updates the other APIs if you want.

        I've uploaded an updated version of the patch which does not remove the sneakiness but makes the PerFieldMergeState more robust:

        • The FilterFieldInfos class now correctly computes and exposes the hasXXX properties.
        • Calls to FilterFieldInfos.fieldInfo() and FilterFieldsProducer.terms() now throw IllegalArgumentException if called with fields outside if the current merge context
        • I've tweaked the unit tests to work with the latest 6.2.1 since the Legacy* field types are not accessible in this module anymore.

        As for why we're overriding merge() calls in our codec:

        • We're using a customized codec to emulate nested documents.
        • It works with the same idea as BlockJoin, but is less generic (it's tailored to our use case).
        • The main difference is that the maxDoc() for segments remain equal to the number of "parent" documents, with only the nested fields having larger posting lists.
        • For it to work, when merging, we need to rebase correctly the "docIds" for the nested documents (using the same idea as the docMap in the general use case).
        Show
        jmassenet-rakuten Julien MASSENET added a comment - I agree that being this sneaky is not ideal, but the only alternative I see would be change the Postings API. In this patch, I tried to keep modifications constrained to the org.apache.lucene.codecs.perfield package, but I can give a shot at trying to a come up with a cleaner implementation that updates the other APIs if you want. I've uploaded an updated version of the patch which does not remove the sneakiness but makes the PerFieldMergeState more robust: The FilterFieldInfos class now correctly computes and exposes the hasXXX properties. Calls to FilterFieldInfos.fieldInfo() and FilterFieldsProducer.terms() now throw IllegalArgumentException if called with fields outside if the current merge context I've tweaked the unit tests to work with the latest 6.2.1 since the Legacy* field types are not accessible in this module anymore. As for why we're overriding merge() calls in our codec: We're using a customized codec to emulate nested documents. It works with the same idea as BlockJoin, but is less generic (it's tailored to our use case). The main difference is that the maxDoc() for segments remain equal to the number of "parent" documents, with only the nested fields having larger posting lists. For it to work, when merging, we need to rebase correctly the "docIds" for the nested documents (using the same idea as the docMap in the general use case).
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Julien MASSENET, I think this patch is a good step forwards, and we can try to simplify the approach in future issues (progress not perfection!).

        I'll fixup the minor failures from ant precommit and push.

        Show
        mikemccand Michael McCandless added a comment - Thanks Julien MASSENET , I think this patch is a good step forwards, and we can try to simplify the approach in future issues (progress not perfection!). I'll fixup the minor failures from ant precommit and push.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit a6a8032c7f079ea59daea0c95e48f69b2986d918 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a6a8032 ]

        LUCENE-7456: PerFieldPostings/DocValuesFormat was failing to delegate the merge method

        Show
        jira-bot ASF subversion and git services added a comment - Commit a6a8032c7f079ea59daea0c95e48f69b2986d918 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a6a8032 ] LUCENE-7456 : PerFieldPostings/DocValuesFormat was failing to delegate the merge method
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 796ed508f39683c626d4870a7ab583a222b2c64c in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=796ed50 ]

        LUCENE-7456: PerFieldPostings/DocValuesFormat was failing to delegate the merge method

        Show
        jira-bot ASF subversion and git services added a comment - Commit 796ed508f39683c626d4870a7ab583a222b2c64c in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=796ed50 ] LUCENE-7456 : PerFieldPostings/DocValuesFormat was failing to delegate the merge method
        Hide
        mikemccand Michael McCandless added a comment -

        Thank you Julien MASSENET!

        Show
        mikemccand Michael McCandless added a comment - Thank you Julien MASSENET !

          People

          • Assignee:
            Unassigned
            Reporter:
            jmassenet-rakuten Julien MASSENET
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development