Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think Lucene's current indexing chain has too many classes /
      hierarchy / abstractions, making it look much more complex than it
      really should be, and discouraging users from experimenting/innovating
      with their own indexing chains.

      Also, if it were easier to understand/approach, then new developers
      would more likely try to improve it ... it really should be simpler.

      So I'm exploring a pared back indexing chain, and have a starting patch
      that I think is looking ok: it seems more approachable than the
      current indexing chain, or at least has fewer strange classes.

      I also thought this could give some speedup for tiny documents (a more
      common use of Lucene lately), and it looks like, with the evil
      optimizations, this is a ~25% speedup for Geonames docs. Even without
      those evil optos it's a bit faster.

      This is very much a work in progress / nocommits, and there are some
      behavior changes e.g. the new chain requires all fields to have the
      same TV options (rather than auto-upgrading all fields by the same
      name that the current chain does)...

      1. LUCENE-5611.patch
        189 kB
        Michael McCandless
      2. LUCENE-5611.patch
        176 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch... tests sort of pass except for nocommits ... but it's a work in progress.

        Show
        Michael McCandless added a comment - Patch... tests sort of pass except for nocommits ... but it's a work in progress.
        Hide
        Robert Muir added a comment -

        Overall I do like the simplification of the abstractions: some comments, a lot of which probably dont need to be dealt with on this issue, but stuff to think about.

        I think the specializations in the default chain just work around lack of field reuse? Maybe we should rethink this for Lucene 5, some way that makes it easier and more intuitive so that this reuse isn't necessary for good performance.

        As far as the LuceneTestCase nocommit, we have some similar situations elsewhere, like RandomPF/RandomCodec where we "remember" for a field for that test class and are consistent. I think thats enough for good coverage? If we want to mix things up, a test can do that manually.

        I keep going back and forth on the StoredFieldsWriter codec api change: I can live with it (assuming javadocs are fixed, heh), and I think its ok for a step (to prevent bogus passes on the fields), but it reminds me of the old postings API... perhaps a pull model is warranted, where the writer actually just uses the visitor API or something simple like that. It might actually make it cleaner, for example uncompressed stored fields wouldn't need to buffer up in a RAMOutputStream, it could just do the bogus pass IW was doing before.

        As far as the vectors change, I think its an ok tradeoff. If there are concerns maybe o.a.l.document could help. But i dont think it makes sense to use conflicting vectors values for the same field name... in the same doc.

        Are the new checks in field mandatory? What happens if a custom IndexableField does this (tries to index vectors when not indexed)?

        Show
        Robert Muir added a comment - Overall I do like the simplification of the abstractions: some comments, a lot of which probably dont need to be dealt with on this issue, but stuff to think about. I think the specializations in the default chain just work around lack of field reuse? Maybe we should rethink this for Lucene 5, some way that makes it easier and more intuitive so that this reuse isn't necessary for good performance. As far as the LuceneTestCase nocommit, we have some similar situations elsewhere, like RandomPF/RandomCodec where we "remember" for a field for that test class and are consistent. I think thats enough for good coverage? If we want to mix things up, a test can do that manually. I keep going back and forth on the StoredFieldsWriter codec api change: I can live with it (assuming javadocs are fixed, heh), and I think its ok for a step (to prevent bogus passes on the fields), but it reminds me of the old postings API... perhaps a pull model is warranted, where the writer actually just uses the visitor API or something simple like that. It might actually make it cleaner, for example uncompressed stored fields wouldn't need to buffer up in a RAMOutputStream, it could just do the bogus pass IW was doing before. As far as the vectors change, I think its an ok tradeoff. If there are concerns maybe o.a.l.document could help. But i dont think it makes sense to use conflicting vectors values for the same field name... in the same doc. Are the new checks in field mandatory? What happens if a custom IndexableField does this (tries to index vectors when not indexed)?
        Hide
        Michael McCandless added a comment -

        I think the specializations in the default chain just work around lack of field reuse? Maybe we should rethink this for Lucene 5, some way that makes it easier and more intuitive so that this reuse isn't necessary for good performance.

        Likely Field re-use would get most of that performance gain too? But
        Field reuse is a hassle for apps ...

        This patch is already big enough, and I'd like to focus on simplifying
        the indexing chain, so I'll remove the specializations here and open a
        followon issue ...

        As far as the LuceneTestCase nocommit, we have some similar situations elsewhere, like RandomPF/RandomCodec where we "remember" for a field for that test class and are consistent. I think thats enough for good coverage? If we want to mix things up, a test can do that manually.

        Ahh right, I'll pull that same logic over.

        I keep going back and forth on the StoredFieldsWriter codec api change: I can live with it (assuming javadocs are fixed, heh), and I think its ok for a step (to prevent bogus passes on the fields), but it reminds me of the old postings API... perhaps a pull model is warranted, where the writer actually just uses the visitor API or something simple like that. It might actually make it cleaner, for example uncompressed stored fields wouldn't need to buffer up in a RAMOutputStream, it could just do the bogus pass IW was doing before.

        OK I'll fix the javadocs and open a new issue that we should try the
        visitor API for stored fields?

        As far as the vectors change, I think its an ok tradeoff. If there are concerns maybe o.a.l.document could help. But i dont think it makes sense to use conflicting vectors values for the same field name... in the same doc.

        Yeah I think it's really strange how Lucene auto-upgrades all TV
        settings for all field instances by the same field name ... this is
        probably unexpected and users on upgrading would see this is happening
        and have to be explicit about it themselves. I think that's a good
        thing ...

        Are the new checks in field mandatory? What happens if a custom IndexableField does this (tries to index vectors when not indexed)?

        Good question, I'll add a test & make sure indexer catches it for
        a custom IF.

        Show
        Michael McCandless added a comment - I think the specializations in the default chain just work around lack of field reuse? Maybe we should rethink this for Lucene 5, some way that makes it easier and more intuitive so that this reuse isn't necessary for good performance. Likely Field re-use would get most of that performance gain too? But Field reuse is a hassle for apps ... This patch is already big enough, and I'd like to focus on simplifying the indexing chain, so I'll remove the specializations here and open a followon issue ... As far as the LuceneTestCase nocommit, we have some similar situations elsewhere, like RandomPF/RandomCodec where we "remember" for a field for that test class and are consistent. I think thats enough for good coverage? If we want to mix things up, a test can do that manually. Ahh right, I'll pull that same logic over. I keep going back and forth on the StoredFieldsWriter codec api change: I can live with it (assuming javadocs are fixed, heh), and I think its ok for a step (to prevent bogus passes on the fields), but it reminds me of the old postings API... perhaps a pull model is warranted, where the writer actually just uses the visitor API or something simple like that. It might actually make it cleaner, for example uncompressed stored fields wouldn't need to buffer up in a RAMOutputStream, it could just do the bogus pass IW was doing before. OK I'll fix the javadocs and open a new issue that we should try the visitor API for stored fields? As far as the vectors change, I think its an ok tradeoff. If there are concerns maybe o.a.l.document could help. But i dont think it makes sense to use conflicting vectors values for the same field name... in the same doc. Yeah I think it's really strange how Lucene auto-upgrades all TV settings for all field instances by the same field name ... this is probably unexpected and users on upgrading would see this is happening and have to be explicit about it themselves. I think that's a good thing ... Are the new checks in field mandatory? What happens if a custom IndexableField does this (tries to index vectors when not indexed)? Good question, I'll add a test & make sure indexer catches it for a custom IF.
        Hide
        Michael McCandless added a comment -

        New patch, I think it's ready. I fixed all nocommits, javadocs.

        I removed the specialization for String/NumericField; these gave decent performance gains, but we should pursue it separately.

        Show
        Michael McCandless added a comment - New patch, I think it's ready. I fixed all nocommits, javadocs. I removed the specialization for String/NumericField; these gave decent performance gains, but we should pursue it separately.
        Hide
        Robert Muir added a comment -

        In StoredFieldsWriter:

        - *   <li>For every document, {@link #startDocument(int)} is called,
        + *   <li>For every document, {@link #startDocument()} is called,
          *       informing the Codec how many fields will be written.
        

        This javadoc "compiles" but now does not make sense because we don't pass numFields as a parameter anymore.

        The attribute handling in the indexing chain got more confusing and complicated. Can we factor this into FieldInvertState?

        Its bogus we call hasAttribute + getAttribute, besides making the code more complicated, its two hashmap lookups for 2 atts. We should add a method to attribute source that acts like map.get (returns an attribute, or null if it doesnt exist). Or simple change the semantics of getAttribute to do that. This can be a followup issue.

        I will keep reviewing, i only got thru the first 3 or 4 files in the patch.

        Show
        Robert Muir added a comment - In StoredFieldsWriter: - * <li>For every document, {@link #startDocument(int)} is called, + * <li>For every document, {@link #startDocument()} is called, * informing the Codec how many fields will be written. This javadoc "compiles" but now does not make sense because we don't pass numFields as a parameter anymore. The attribute handling in the indexing chain got more confusing and complicated. Can we factor this into FieldInvertState? Its bogus we call hasAttribute + getAttribute, besides making the code more complicated, its two hashmap lookups for 2 atts. We should add a method to attribute source that acts like map.get (returns an attribute, or null if it doesnt exist). Or simple change the semantics of getAttribute to do that. This can be a followup issue. I will keep reviewing, i only got thru the first 3 or 4 files in the patch.
        Hide
        ASF subversion and git services added a comment -

        Commit 1590720 from Michael McCandless in branch 'dev/branches/lucene5611'
        [ https://svn.apache.org/r1590720 ]

        LUCENE-5611: make branch

        Show
        ASF subversion and git services added a comment - Commit 1590720 from Michael McCandless in branch 'dev/branches/lucene5611' [ https://svn.apache.org/r1590720 ] LUCENE-5611 : make branch
        Hide
        ASF subversion and git services added a comment -

        Commit 1590721 from Michael McCandless in branch 'dev/branches/lucene5611'
        [ https://svn.apache.org/r1590721 ]

        LUCENE-5611: put current patch on branch

        Show
        ASF subversion and git services added a comment - Commit 1590721 from Michael McCandless in branch 'dev/branches/lucene5611' [ https://svn.apache.org/r1590721 ] LUCENE-5611 : put current patch on branch
        Hide
        ASF subversion and git services added a comment -

        Commit 1590731 from Robert Muir in branch 'dev/branches/lucene5611'
        [ https://svn.apache.org/r1590731 ]

        LUCENE-5611: fix the crazy getAttribute API to prevent double lookups and extra code everywhere

        Show
        ASF subversion and git services added a comment - Commit 1590731 from Robert Muir in branch 'dev/branches/lucene5611' [ https://svn.apache.org/r1590731 ] LUCENE-5611 : fix the crazy getAttribute API to prevent double lookups and extra code everywhere
        Hide
        ASF subversion and git services added a comment -

        Commit 1590747 from Robert Muir in branch 'dev/branches/lucene5611'
        [ https://svn.apache.org/r1590747 ]

        LUCENE-5611: move attribute juggling to a fieldinvertstate setter

        Show
        ASF subversion and git services added a comment - Commit 1590747 from Robert Muir in branch 'dev/branches/lucene5611' [ https://svn.apache.org/r1590747 ] LUCENE-5611 : move attribute juggling to a fieldinvertstate setter
        Hide
        ASF subversion and git services added a comment -

        Commit 1590858 from Robert Muir in branch 'dev/branches/lucene5611'
        [ https://svn.apache.org/r1590858 ]

        LUCENE-5611: indexing optimizations, dont compute CRC for internal-use of RAMOutputStream, dont do heavy per-term stuff in skipper until we actually must buffer skipdata

        Show
        ASF subversion and git services added a comment - Commit 1590858 from Robert Muir in branch 'dev/branches/lucene5611' [ https://svn.apache.org/r1590858 ] LUCENE-5611 : indexing optimizations, dont compute CRC for internal-use of RAMOutputStream, dont do heavy per-term stuff in skipper until we actually must buffer skipdata
        Hide
        Robert Muir added a comment -

        I think another followup for this issue should be to do something about all the conflicting term vector option possibilities. Maybe it should have something more like IndexOptions. Just something to think about.

        Anyway I did benchmarking and reviewing, +1 to commit the change. its way simpler and easier to work with.

        Show
        Robert Muir added a comment - I think another followup for this issue should be to do something about all the conflicting term vector option possibilities. Maybe it should have something more like IndexOptions. Just something to think about. Anyway I did benchmarking and reviewing, +1 to commit the change. its way simpler and easier to work with.
        Hide
        Michael McCandless added a comment -

        Thanks Rob, I'll merge back to trunk and commit, and work on backporting to 4.x ... will take time because there is no separate StoredDocument there.

        Show
        Michael McCandless added a comment - Thanks Rob, I'll merge back to trunk and commit, and work on backporting to 4.x ... will take time because there is no separate StoredDocument there.
        Hide
        ASF subversion and git services added a comment -

        Commit 1591025 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1591025 ]

        LUCENE-5611: simplify the default indexing chain

        Show
        ASF subversion and git services added a comment - Commit 1591025 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1591025 ] LUCENE-5611 : simplify the default indexing chain
        Hide
        ASF subversion and git services added a comment -

        Commit 1591116 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591116 ]

        LUCENE-5611: simplify the default indexing chain

        Show
        ASF subversion and git services added a comment - Commit 1591116 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591116 ] LUCENE-5611 : simplify the default indexing chain
        Hide
        ASF subversion and git services added a comment -

        Commit 1591391 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591391 ]

        LUCENE-5611: always call PerField.finish even on non-aborting exc

        Show
        ASF subversion and git services added a comment - Commit 1591391 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591391 ] LUCENE-5611 : always call PerField.finish even on non-aborting exc
        Hide
        ASF subversion and git services added a comment -

        Commit 1591399 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1591399 ]

        LUCENE-5611: always call PerField.finish even on non-aborting exc

        Show
        ASF subversion and git services added a comment - Commit 1591399 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1591399 ] LUCENE-5611 : always call PerField.finish even on non-aborting exc
        Hide
        ASF subversion and git services added a comment -

        Commit 1591587 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1591587 ]

        LUCENE-5611: fix dv exceptions to be non-aborting

        Show
        ASF subversion and git services added a comment - Commit 1591587 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1591587 ] LUCENE-5611 : fix dv exceptions to be non-aborting
        Hide
        ASF subversion and git services added a comment -

        Commit 1591588 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591588 ]

        LUCENE-5611: test that dv exceptions are non-aborting

        Show
        ASF subversion and git services added a comment - Commit 1591588 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591588 ] LUCENE-5611 : test that dv exceptions are non-aborting
        Hide
        ASF subversion and git services added a comment -

        Commit 1591616 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591616 ]

        LUCENE-5611: always abort if we hit exc in StoredFieldsWriter.start/FinishDocument

        Show
        ASF subversion and git services added a comment - Commit 1591616 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591616 ] LUCENE-5611 : always abort if we hit exc in StoredFieldsWriter.start/FinishDocument
        Hide
        ASF subversion and git services added a comment -

        Commit 1591657 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1591657 ]

        LUCENE-5611: always abort if we hit exc in StoredFieldsWriter.start/FinishDocument

        Show
        ASF subversion and git services added a comment - Commit 1591657 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1591657 ] LUCENE-5611 : always abort if we hit exc in StoredFieldsWriter.start/FinishDocument
        Hide
        ASF subversion and git services added a comment -

        Commit 1591660 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591660 ]

        LUCENE-5611: increment lastStoredDocID in one plase; add Asserting.toString

        Show
        ASF subversion and git services added a comment - Commit 1591660 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591660 ] LUCENE-5611 : increment lastStoredDocID in one plase; add Asserting.toString
        Hide
        ASF subversion and git services added a comment -

        Commit 1591700 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1591700 ]

        LUCENE-5611: don't abort segment if term vector settings are wrong

        Show
        ASF subversion and git services added a comment - Commit 1591700 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1591700 ] LUCENE-5611 : don't abort segment if term vector settings are wrong
        Hide
        ASF subversion and git services added a comment -

        Commit 1591701 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1591701 ]

        LUCENE-5611: nuke dup buggy test

        Show
        ASF subversion and git services added a comment - Commit 1591701 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1591701 ] LUCENE-5611 : nuke dup buggy test
        Hide
        ASF subversion and git services added a comment -

        Commit 1591702 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591702 ]

        LUCENE-5611: nuke dup buggy test

        Show
        ASF subversion and git services added a comment - Commit 1591702 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591702 ] LUCENE-5611 : nuke dup buggy test
        Hide
        ASF subversion and git services added a comment -

        Commit 1591706 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591706 ]

        LUCENE-5611: don't abort segment if term vector settings are wrong

        Show
        ASF subversion and git services added a comment - Commit 1591706 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591706 ] LUCENE-5611 : don't abort segment if term vector settings are wrong
        Hide
        ASF subversion and git services added a comment -

        Commit 1591807 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1591807 ]

        LUCENE-5611: lazy-init the stored fields writer like before to prevent leaks

        Show
        ASF subversion and git services added a comment - Commit 1591807 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1591807 ] LUCENE-5611 : lazy-init the stored fields writer like before to prevent leaks
        Hide
        ASF subversion and git services added a comment -

        Commit 1591808 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1591808 ]

        LUCENE-5611: lazy-init the stored fields writer like before to prevent leaks

        Show
        ASF subversion and git services added a comment - Commit 1591808 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1591808 ] LUCENE-5611 : lazy-init the stored fields writer like before to prevent leaks

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development