Lucene - Core
  1. Lucene - Core
  2. LUCENE-5703

Don't allocate/copy bytes all the time in binary DV producers

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Our binary doc values producers keep on creating new byte[] arrays and copying bytes when a value is requested, which likely doesn't help performance. This has been done because of the way fieldcache consumers used the API, but we should try to fix it in 5.0.

      1. LUCENE-5703.patch
        220 kB
        Robert Muir
      2. LUCENE-5703.patch
        219 kB
        Robert Muir
      3. LUCENE-5703.patch
        204 kB
        Robert Muir
      4. LUCENE-5703.patch
        203 kB
        Robert Muir
      5. LUCENE-5703.patch
        180 kB
        Adrien Grand
      6. LUCENE-5703.patch
        181 kB
        Adrien Grand
      7. LUCENE-5703.patch
        178 kB
        Adrien Grand

        Activity

        Hide
        Adrien Grand added a comment -

        On a side note, I think that the fact that the buffer needs to be supplied by the consumer void get(int docId, BytesRef result) is a bit error-prone since the BytesRef is created by the consumer while its content is owned by the BinaryDocValues. Maybe we should make BinaryDocValues responsible for managing the BytesRef entirely by changing the method signature to BytesRef get(int docId) similarly to BytesRef next() in TermsEnum?

        Show
        Adrien Grand added a comment - On a side note, I think that the fact that the buffer needs to be supplied by the consumer void get(int docId, BytesRef result) is a bit error-prone since the BytesRef is created by the consumer while its content is owned by the BinaryDocValues . Maybe we should make BinaryDocValues responsible for managing the BytesRef entirely by changing the method signature to BytesRef get(int docId) similarly to BytesRef next() in TermsEnum ?
        Hide
        Michael McCandless added a comment -

        Maybe we should make BinaryDocValues responsible for managing the BytesRef entirely by changing the method signature to BytesRef get(int docId) similarly to BytesRef next() in TermsEnum?

        +1, also like DocsAndPositionsEnum.getPayload().

        Then it's the caller's job to make a deep copy of the BytesRef if they need to retain it.

        Show
        Michael McCandless added a comment - Maybe we should make BinaryDocValues responsible for managing the BytesRef entirely by changing the method signature to BytesRef get(int docId) similarly to BytesRef next() in TermsEnum? +1, also like DocsAndPositionsEnum.getPayload(). Then it's the caller's job to make a deep copy of the BytesRef if they need to retain it.
        Hide
        Adrien Grand added a comment -

        Here is a patch that switches BinaryDocValues to the discussed API, as well as Sorted(Set)DocValues.lookupOrd for consistency.

        • the default codec as well as memory, direct and disk don't allocate the byte[] anymore in BinaryDocValues.get.
        • the default codec takes advantage of the maximum length of binary terms, which is exposed in the metadata to never have to resize the BytesRef that stores the term.
        • old codecs (lucene40, lucene42) have moved to the new API but still allocate the byte[] on the fly
        • fixed grouping and comparators to not assume they own the bytes
        • removed the two tests from BaseDocValuesFormatTestCase that ensured that each return value had its own bytes

        Tests pass (I ran the whole suite 6 times already) and I'll run benchmarks soon to make sure that doesn't introduce a performance regression.

        Show
        Adrien Grand added a comment - Here is a patch that switches BinaryDocValues to the discussed API, as well as Sorted(Set)DocValues.lookupOrd for consistency. the default codec as well as memory, direct and disk don't allocate the byte[] anymore in BinaryDocValues.get. the default codec takes advantage of the maximum length of binary terms, which is exposed in the metadata to never have to resize the BytesRef that stores the term. old codecs (lucene40, lucene42) have moved to the new API but still allocate the byte[] on the fly fixed grouping and comparators to not assume they own the bytes removed the two tests from BaseDocValuesFormatTestCase that ensured that each return value had its own bytes Tests pass (I ran the whole suite 6 times already) and I'll run benchmarks soon to make sure that doesn't introduce a performance regression.
        Hide
        Robert Muir added a comment -

        Thanks for tackling this! I will help with the reviews, its a tricky one.

        I didn't yet look, do SORTED/SORTED_SET TermsEnums also have the new behavior? This was another source of stupid stuff, I think they should be consistent with other termsenums (and now lookupOrd).

        Show
        Robert Muir added a comment - Thanks for tackling this! I will help with the reviews, its a tricky one. I didn't yet look, do SORTED/SORTED_SET TermsEnums also have the new behavior? This was another source of stupid stuff, I think they should be consistent with other termsenums (and now lookupOrd).
        Hide
        Adrien Grand added a comment -

        The term returned by TermsEnum.next() or TermsEnum.term() always comes from Sorted(Set)DocValues.lookupOrd. It doesn't allocate its own terms.

        Show
        Adrien Grand added a comment - The term returned by TermsEnum.next() or TermsEnum.term() always comes from Sorted(Set)DocValues.lookupOrd. It doesn't allocate its own terms.
        Hide
        Robert Muir added a comment -

        The term returned by TermsEnum.next() or TermsEnum.term() always comes from Sorted(Set)DocValues.lookupOrd. It doesn't allocate its own terms.

        Well only in the base (simplistic) implementation. At least for the default codec, the codec overrides TermsEnum and implements it special for performance reasons. and lookupOrd in this case actually uses the termsenum (i think)

        Show
        Robert Muir added a comment - The term returned by TermsEnum.next() or TermsEnum.term() always comes from Sorted(Set)DocValues.lookupOrd. It doesn't allocate its own terms. Well only in the base (simplistic) implementation. At least for the default codec, the codec overrides TermsEnum and implements it special for performance reasons. and lookupOrd in this case actually uses the termsenum (i think)
        Hide
        Adrien Grand added a comment -

        I missed that one. Here is an updated patch that removes the setTerm paranoia from the terms enum of Lucene45DocValuesFormat.

        Show
        Adrien Grand added a comment - I missed that one. Here is an updated patch that removes the setTerm paranoia from the terms enum of Lucene45DocValuesFormat.
        Hide
        Robert Muir added a comment -

        This looks pretty good. Can we improve the Sorted/SortedSetTermsEnum.seek* ? I don't think its the right tradeoff at all to call lookupOrd() (which may incur I/O) to save copying bytes!

        Show
        Robert Muir added a comment - This looks pretty good. Can we improve the Sorted/SortedSetTermsEnum.seek* ? I don't think its the right tradeoff at all to call lookupOrd() (which may incur I/O) to save copying bytes!
        Hide
        Adrien Grand added a comment -

        Here is an updated patch. Sorted(Set)TermsEnum copies the supplied BytesRef when a match is found instead of looking up the ord.

        Show
        Adrien Grand added a comment - Here is an updated patch. Sorted(Set)TermsEnum copies the supplied BytesRef when a match is found instead of looking up the ord.
        Hide
        Robert Muir added a comment -

        +1 to commit, thank you for taking care of this!

        Show
        Robert Muir added a comment - +1 to commit, thank you for taking care of this!
        Hide
        Robert Muir added a comment -

        I kicked this around all i could with nightly tests, running tests over and over, etc. I'm seeing this reproducible fail:

        ant test  -Dtestcase=TestDistributedMissingSort -Dtests.method=testDistribSearch -Dtests.seed=6B475C36C0EF9CD5 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=ar -Dtests.timezone=Africa/Windhoek -Dtests.file.encoding=UTF-8
        
        Show
        Robert Muir added a comment - I kicked this around all i could with nightly tests, running tests over and over, etc. I'm seeing this reproducible fail: ant test -Dtestcase=TestDistributedMissingSort -Dtests.method=testDistribSearch -Dtests.seed=6B475C36C0EF9CD5 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=ar -Dtests.timezone=Africa/Windhoek -Dtests.file.encoding=UTF-8
        Hide
        Robert Muir added a comment -

        I have a fix, I will update the patch in a bit (also with a test).

        Show
        Robert Muir added a comment - I have a fix, I will update the patch in a bit (also with a test).
        Hide
        Robert Muir added a comment -

        Updated to trunk. Added TestFieldCacheSortRandom. Fixed bug in the original patch with FC (it cannot share here), so instead getTermsIndex returns a light iterator over the "real thing" just like docTermsOrds. Because of this, I had to also fix any bad test assumptions around 'same'. Also cleaned up the termsenum in the default codec a bit: the methods like doSeek/doNext are stupid and I removed them.

        Ill beast and review some more, but this is all looking good.

        Show
        Robert Muir added a comment - Updated to trunk. Added TestFieldCacheSortRandom. Fixed bug in the original patch with FC (it cannot share here), so instead getTermsIndex returns a light iterator over the "real thing" just like docTermsOrds. Because of this, I had to also fix any bad test assumptions around 'same'. Also cleaned up the termsenum in the default codec a bit: the methods like doSeek/doNext are stupid and I removed them. Ill beast and review some more, but this is all looking good.
        Hide
        Robert Muir added a comment -

        Updated patch: folds in an unrelated test bug fix from beasting slow+nightly...

        Show
        Robert Muir added a comment - Updated patch: folds in an unrelated test bug fix from beasting slow+nightly...
        Hide
        Robert Muir added a comment -

        Upon final review: I am unhappy about a few things with the latest patch, mostly doing with safety:

        • DocValues.EMPTY_XXX is now unsafe, it uses a static mutable thing (BytesRef). We should make these methods instead of constants. This won't ever be performance critical so its ok to me.
        • Memory and so on should do an array copy instead of returning singleton stuff. If there is a bug in someone's code, it could corrupt the data and get merged into index corruption.

        I'm ok with someone's bug in their code corrupting their threadlocal code-private byte[], but not the index. We have to draw the line there.

        Show
        Robert Muir added a comment - Upon final review: I am unhappy about a few things with the latest patch, mostly doing with safety: DocValues.EMPTY_XXX is now unsafe, it uses a static mutable thing (BytesRef). We should make these methods instead of constants. This won't ever be performance critical so its ok to me. Memory and so on should do an array copy instead of returning singleton stuff. If there is a bug in someone's code, it could corrupt the data and get merged into index corruption. I'm ok with someone's bug in their code corrupting their threadlocal code-private byte[], but not the index. We have to draw the line there.
        Hide
        Robert Muir added a comment -

        Ill start tackling the EMPTY issue. it shouldn't be controversial at all, but the safety here is mandatory because this constant is used by SegmentMerger.

        As far as the all-in-ram ones exposing the ability to corrupt the same stuff thats merged, we can think of a number of compromises / solutions, but something must be done:

        • System.arraycopy
        • big fat warnings on these that they are "unsafe" (they are not part of the official index format, so maybe thats ok).
        • they could keep ahold of their file descriptors and override merge() to stream the data from the file.
        Show
        Robert Muir added a comment - Ill start tackling the EMPTY issue. it shouldn't be controversial at all, but the safety here is mandatory because this constant is used by SegmentMerger. As far as the all-in-ram ones exposing the ability to corrupt the same stuff thats merged, we can think of a number of compromises / solutions, but something must be done: System.arraycopy big fat warnings on these that they are "unsafe" (they are not part of the official index format, so maybe thats ok). they could keep ahold of their file descriptors and override merge() to stream the data from the file.
        Hide
        Robert Muir added a comment -

        Updated patch fixing most EMPTY stuff.

        TermOrdValComparator.MISSING_BYTESREF and other unsafe stuff like that still needs to be fixed.

        And I did nothing with the in-RAM dv providers.

        Show
        Robert Muir added a comment - Updated patch fixing most EMPTY stuff. TermOrdValComparator.MISSING_BYTESREF and other unsafe stuff like that still needs to be fixed. And I did nothing with the in-RAM dv providers.
        Hide
        Robert Muir added a comment -

        I tried to just make TermOrdValComparator.MISSING_BYTESREF an instance variable instead of static as an easy-fix, but this uncovered a real bug: it break searchAfter. We have to assume users will want to deserialize results there... we can't rely upon them providing back that exact same bytesref as the missing value. Having it as a static only hid that in tests.

        Show
        Robert Muir added a comment - I tried to just make TermOrdValComparator.MISSING_BYTESREF an instance variable instead of static as an easy-fix, but this uncovered a real bug: it break searchAfter. We have to assume users will want to deserialize results there... we can't rely upon them providing back that exact same bytesref as the missing value. Having it as a static only hid that in tests.
        Hide
        Robert Muir added a comment -

        patch fixing the comparator (well, to be no worse than today). It uses MISSING_BYTES instead.

        Show
        Robert Muir added a comment - patch fixing the comparator (well, to be no worse than today). It uses MISSING_BYTES instead.
        Hide
        Adrien Grand added a comment -
        • System.arraycopy
        • big fat warnings on these that they are "unsafe" (they are not part of the official index format, so maybe thats ok).
        • they could keep ahold of their file descriptors and override merge() to stream the data from the file.

        I'm hesitating between the 1st and 2nd idea. I'm not a fan of the 3rd idea as it would make merging a bit more complex although I really like how simple it is today.

        Since the default codec doesn't have the issue and since this issue already exists in other index formats (eg. the payload of DirectPostingsFormat), I think it would be fine to just have a warning?

        Show
        Adrien Grand added a comment - System.arraycopy big fat warnings on these that they are "unsafe" (they are not part of the official index format, so maybe thats ok). they could keep ahold of their file descriptors and override merge() to stream the data from the file. I'm hesitating between the 1st and 2nd idea. I'm not a fan of the 3rd idea as it would make merging a bit more complex although I really like how simple it is today. Since the default codec doesn't have the issue and since this issue already exists in other index formats (eg. the payload of DirectPostingsFormat), I think it would be fine to just have a warning?
        Hide
        ASF subversion and git services added a comment -

        Commit 1600688 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1600688 ]

        LUCENE-5703: BinaryDocValues producers don't allocate or copy bytes on each access anymore

        Show
        ASF subversion and git services added a comment - Commit 1600688 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1600688 ] LUCENE-5703 : BinaryDocValues producers don't allocate or copy bytes on each access anymore
        Hide
        ASF subversion and git services added a comment -

        Commit 1600716 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1600716 ]

        LUCENE-5703: fix safety bug for FC's BINARY too

        Show
        ASF subversion and git services added a comment - Commit 1600716 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1600716 ] LUCENE-5703 : fix safety bug for FC's BINARY too
        Hide
        ASF subversion and git services added a comment -

        Commit 1600741 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1600741 ]

        LUCENE-5703: BinaryDocValues producers don't allocate or copy bytes on each access anymore

        Show
        ASF subversion and git services added a comment - Commit 1600741 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1600741 ] LUCENE-5703 : BinaryDocValues producers don't allocate or copy bytes on each access anymore
        Hide
        Robert Muir added a comment -

        Thanks Adrien!

        Show
        Robert Muir added a comment - Thanks Adrien!

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development