Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1
    • Fix Version/s: 2.9
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      As noted in these threads...

      http://www.nabble.com/Order-of-fields-returned-by-Document.getFields%28%29-to21034652.html
      http://www.nabble.com/Order-of-fields-within-a-Document-in-Lucene-2.4%2B-to24210597.html

      somewhere prior to Lucene 2.4.1 a change was introduced that prevents the Stored fields of a Document from being returned in same order that they were originally added in. This can cause serious performance problems for people attempting to use LoadFirstFieldSelector or a custom FieldSelector with the LOAD_AND_BREAK, or the SIZE_AND_BREAK options (since the fields don't come back in the order they expect)

      Speculation in the email threads is that the origin of this bug is code introduced by LUCENE-1301 – but the purpose of that issue was refactoring, so if it really is the cause of the change this would seem to be a bug, and not a side affect of a conscious implementation change.

      Someone who understands indexing internals should investigate this. At a minimum, if it's decided that this is not actual a bug, then prior to resolving this bug the wiki docs and some of the FIeldSelector javadocs should be updated to make it clear what order Fields will be returned in.

      1. LUCENE-1727.patch
        15 kB
        Michael McCandless

        Activity

        Hoss Man created issue -
        Michael McCandless made changes -
        Field Original Value New Value
        Assignee Michael McCandless [ mikemccand ]
        Michael McCandless made changes -
        Fix Version/s 2.9 [ 12312682 ]
        Hide
        Yonik Seeley added a comment -

        If we start guaranteeing that fields get returned in the same order as they were added, what are the costs? (I'm not sure there ever was such a guarantee... just because someone added it to the FAQ doesn't make it so).

        AFAIK, sorting the fields is necessary to group multiple values for the same field, and it also ensured that segments with the same fields had the same field numbers, which enables faster segment merging?

        Show
        Yonik Seeley added a comment - If we start guaranteeing that fields get returned in the same order as they were added, what are the costs? (I'm not sure there ever was such a guarantee... just because someone added it to the FAQ doesn't make it so). AFAIK, sorting the fields is necessary to group multiple values for the same field, and it also ensured that segments with the same fields had the same field numbers, which enables faster segment merging?
        Hide
        Yonik Seeley added a comment -

        just because someone added it to the FAQ doesn't make it so

        Of course, I tried to find out when it was added, and it's been there since the original Apache version.... that's a long time. I guess we do need to treat it like a guarantee at this point (which we could break in 3.0 if there is a benefit to doing so).

        It's a good reminder to avoid documenting how something works as a guarantee that it will always work that way.

        Show
        Yonik Seeley added a comment - just because someone added it to the FAQ doesn't make it so Of course, I tried to find out when it was added, and it's been there since the original Apache version.... that's a long time. I guess we do need to treat it like a guarantee at this point (which we could break in 3.0 if there is a benefit to doing so). It's a good reminder to avoid documenting how something works as a guarantee that it will always work that way.
        Hide
        Hoss Man added a comment -

        It's a good reminder to avoid documenting how something works as a guarantee that it will always work that way.

        I remember it being pretty heavily advertised as a feature for as long as i can remember – it was the entire basis for adding the FieldSelector API.

        Based on McCandless comments in email, it sounds like order was only ever maintained for fields that don't use term vectors – in which case the documentation was only ever partially correct. If it's possible to fix it to work at least as well as it use to that seems worth while considering how much FieldSelector depends on it.

        (I admit however: it's kind of scary that it's been such an implicit assumption but apparently never had a unit test)

        Show
        Hoss Man added a comment - It's a good reminder to avoid documenting how something works as a guarantee that it will always work that way. I remember it being pretty heavily advertised as a feature for as long as i can remember – it was the entire basis for adding the FieldSelector API. Based on McCandless comments in email, it sounds like order was only ever maintained for fields that don't use term vectors – in which case the documentation was only ever partially correct. If it's possible to fix it to work at least as well as it use to that seems worth while considering how much FieldSelector depends on it. (I admit however: it's kind of scary that it's been such an implicit assumption but apparently never had a unit test)
        Hide
        Yonik Seeley added a comment -

        it was the entire basis for adding the FieldSelector API.

        Yes, something I'm not particularly fond of. IMO, speeding up loading certain fields should be left to Lucene. For example, one can think of a simple way to improve the performance of loading only certain fields... instead of

        [fieldnum][fieldlength][fieldvalue] [fieldnum][fieldlength][fieldvalue]

        store it instead as

        [fieldnum][fieldlength][fieldnum][fieldlength] [fieldvalue] [fieldvalue]

        Show
        Yonik Seeley added a comment - it was the entire basis for adding the FieldSelector API. Yes, something I'm not particularly fond of. IMO, speeding up loading certain fields should be left to Lucene. For example, one can think of a simple way to improve the performance of loading only certain fields... instead of [fieldnum] [fieldlength] [fieldvalue] [fieldnum] [fieldlength] [fieldvalue] store it instead as [fieldnum] [fieldlength] [fieldnum] [fieldlength] [fieldvalue] [fieldvalue]
        Hide
        Michael McCandless added a comment -

        If we start guaranteeing that fields get returned in the same order as they were added, what are the costs?

        I'm not yet sure, but I expect it to be a minor added cost; I'll know more as I dig in.

        AFAIK, sorting the fields is necessary to group multiple values for the same field, and it also ensured that segments with the same fields had the same field numbers, which enables faster segment merging?

        Actually the mapping of field name -> number happens before the sort, so presently we rely on the docs having the same order of fields, to enable bulk merging of stored fields & term vectors. Bulk merging is really a rather brittle optimization. Actually we could improve it by only checking for matched name -> numbers for fields that are stored or have term vectors enabled (right now we check that all fields match), and by pre-sorting the field names when doing the mapping to number.

        I plan to just move the stored fields writer up in the indexing chain, so that it receives the in-order list of fields, not the coalesced & sorted list.

        Based on McCandless comments in email, it sounds like order was only ever maintained for fields that don't use term vectors - in which case the documentation was only ever partially correct.

        Actually, order was correctly maintained prior to 2.3. In 2.3, it was maintained only if you had no term vectors fields (ie, we only sorted when there was at least 1 field w/ term vectors enabled). In 2.4 we always sort and order was never maintained. For 2.9 I think we should fix it again so that order is fully maintained.

        For example, one can think of a simple way to improve the performance of loading only certain fields

        I think that'd be a good improvement to how fields are stored!

        Show
        Michael McCandless added a comment - If we start guaranteeing that fields get returned in the same order as they were added, what are the costs? I'm not yet sure, but I expect it to be a minor added cost; I'll know more as I dig in. AFAIK, sorting the fields is necessary to group multiple values for the same field, and it also ensured that segments with the same fields had the same field numbers, which enables faster segment merging? Actually the mapping of field name -> number happens before the sort, so presently we rely on the docs having the same order of fields, to enable bulk merging of stored fields & term vectors. Bulk merging is really a rather brittle optimization. Actually we could improve it by only checking for matched name -> numbers for fields that are stored or have term vectors enabled (right now we check that all fields match), and by pre-sorting the field names when doing the mapping to number. I plan to just move the stored fields writer up in the indexing chain, so that it receives the in-order list of fields, not the coalesced & sorted list. Based on McCandless comments in email, it sounds like order was only ever maintained for fields that don't use term vectors - in which case the documentation was only ever partially correct. Actually, order was correctly maintained prior to 2.3. In 2.3, it was maintained only if you had no term vectors fields (ie, we only sorted when there was at least 1 field w/ term vectors enabled). In 2.4 we always sort and order was never maintained. For 2.9 I think we should fix it again so that order is fully maintained. For example, one can think of a simple way to improve the performance of loading only certain fields I think that'd be a good improvement to how fields are stored!
        Hide
        Michael McCandless added a comment -

        Attached patch.

        I moved StoredFieldsWriter up in the chain, by calling it directly
        from DocFieldProcessor/PerThread, and added a test case showing the
        problem & showing the fix.

        Show
        Michael McCandless added a comment - Attached patch. I moved StoredFieldsWriter up in the chain, by calling it directly from DocFieldProcessor/PerThread, and added a test case showing the problem & showing the fix.
        Michael McCandless made changes -
        Attachment LUCENE-1727.patch [ 12412662 ]
        Michael McCandless made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Mark Miller made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12467243 ] Default workflow, editable Closed status [ 12562908 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12562908 ] jira [ 12583958 ]

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Hoss Man
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development