Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Environment: Linux
    • Lucene Fields: New

      Description

      We have a large 50 GB index which is optimized as one segment, with a 66 MB .tii file. This index has no norms and no field cache.

      It takes about 5 seconds to load this index; profiling reveals that 60% of the time is spent in GrowableWriter.set(index, value), and most of the time in set(...) is spent resizing the PackedInts.Mutable field, current.

      In the constructor for TermInfosReaderIndex, you initialize the writer with the line,

      GrowableWriter indexToTerms = new GrowableWriter(4, indexSize, false);

      For our index, using 4 as the bit estimate results in 27 resizes.

      The last value in indexToTerms is going to be ~ tiiFileLength, and if instead you use,

      int bitEstimate = (int) Math.ceil(Math.log10(tiiFileLength) / Math.log10(2));
      GrowableWriter indexToTerms = new GrowableWriter(bitEstimate, indexSize, false);

      Load time improves to ~ 2 seconds.
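
      For illustration, a rough worked example of that estimate using the .tii size reported later in this thread (numbers here are only for demonstration):

      long tiiFileLength = 69508193L;  // ~66 MB .tii, as reported below
      double log2 = Math.log10(tiiFileLength) / Math.log10(2);  // ~26.05
      int bitEstimate = (int) Math.ceil(log2);  // 27 bits, enough for the largest offset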

        Attachments

      1. LUCENE-3932.trunk.patch (5 kB) - Michael McCandless
      2. perf.csv (4 kB) - Sean Bridges

        Activity

        Sean Bridges created issue -
        Michael McCandless added a comment -

        I agree net/net that change is good; we know the in-RAM image will be at least as large as the tii file so we should make a better guess up front.

        3.x is currently in code freeze (for the 3.6.0 release), but I'll commit to trunk's preflex codec.

        Can you describe more about your index...? If your tii file is 66 MB, how many terms do you have...? 5 seconds is also a long startup time... what's the IO system like?

        Sean Bridges added a comment -

        I was doing tests on my local machine with an SSD, and loading is definitely CPU-bound.

        Our index has 600,000,000 terms. This is an index of 10,000,000 emails, with associated attachments. We generate a lot of garbage terms when parsing, things like time stamps, malformed attachments which parse badly, etc.

        After the change the big time waste is converting the terms from utf8 to utf16 when reading from the .tii file, and then back to utf8 when writing to the in-memory store.

        Michael McCandless added a comment -

        Nice. I'd love to know how trunk handles all these terms (we have a more memory efficient terms dict/index in 4.0).

        After the change the big time waste is converting the terms from utf8 to utf16 when reading from the .tii file, and then back to utf8 when writing to the in-memory store.

        What percentage of the time is spent on the decode/encode (after fixing the initial bitEstimate)?

        That is very silly... fixing that is a somewhat deeper change though. I guess we'd need to read the .tii file directly (not use SegmentTermEnum), and then copy the UTF8 bytes straight without going through UTF16...

        Do you have comparisons with pre-3.5 (before we cut over to this more RAM-efficient (but CPU-heavy on load) terms index)? Probably that's less CPU on init, but more RAM held for the lifetime of the reader...?

        Robert Muir added a comment -

        Our index has 600,000,000 terms. This is an index of 10,000,000 emails, with associated attachments. We generate a lot of garbage terms when parsing, things like time stamps, malformed attachments which parse badly, etc.

        For an index like that, have you tried specifying termInfosIndexDivisor to your IndexReader as well?
        If it works with ok performance, then you could remove it and instead adjust termIndexInterval at write-time to have a smaller .tii.
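
        For illustration, a minimal sketch of those two knobs, assuming the Lucene 3.x IndexReader.open overload that takes a termInfosIndexDivisor and IndexWriterConfig.setTermIndexInterval (dir and analyzer stand for an existing Directory and Analyzer; double-check the signatures against your version):

        // Read-time: only load every 4th .tii entry into RAM.
        IndexReader reader = IndexReader.open(dir, null /* deletionPolicy */,
            true /* readOnly */, 4 /* termInfosIndexDivisor */);

        // Write-time alternative: index every 512th term instead of every 128th
        // (the default), producing a smaller .tii in the first place.
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        conf.setTermIndexInterval(512);
        IndexWriter writer = new IndexWriter(dir, conf);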

        Sean Bridges made changes -
        Attachment: perf.csv [ 12520487 ]
        Sean Bridges added a comment - edited

        What percentage of the time is spent on the decode/encode (after fixing the initial bitEstimate)?

        I've attached a CSV of a profiling session with the bitEstimate fix. The third column is the important one.

        utf8 -> utf16 is 7% of the time
        utf16 -> utf8 is 16% of the time

        writing vLongs is also 16% of the time
        TermBuffer.read() is 17% of the time (24% if you include the call to utf8ToUtf16)

        Sean Bridges added a comment -

        Do you have comparisons with pre-3.5 (before we cut over to this more RAM-efficient (but CPU-heavy on load) terms index)? Probably that's less CPU on init, but more RAM held for the lifetime of the reader...?

        Trying with 3.4 gives a 4 second load time, most of the time spent in SegmentTermEnum.next().

        For an index like that, have you tried specifying termInfosIndexDivisor to your IndexReader as well?
        If it works with ok performance, then you could remove it and instead adjust termIndexInterval at write-time to have a smaller .tii.

        Thanks, I will try that.

        Michael McCandless added a comment -

        Patch for trunk; I factored out the int-math-only log function into a new static class, oal.util.MathUtil, and re-used it from one other place.
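
        For illustration, a rough sketch of what such an integer-only log might look like (the committed MathUtil may differ in detail):

        // Integer-math floor(log_base(x)): e.g. log(69508193, 2) == 26,
        // so a caller wanting "bits needed" would add 1.
        public static int log(long x, int base) {
          if (base <= 1) {
            throw new IllegalArgumentException("base must be > 1");
          }
          int ret = 0;
          while (x >= base) {
            x /= base;
            ret++;
          }
          return ret;
        }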

        Michael McCandless made changes -
        Attachment: LUCENE-3932.trunk.patch [ 12520602 ]
        Sean Bridges added a comment -

        Using the patch on trunk, load time goes from ~5 to ~2 seconds.

        Michael McCandless added a comment -

        utf8 -> utf16 is 7% of the time
        utf16 -> utf8 is 16% of the time

        writing vLongs is also 16% of the time
        TermBuffer.read() is 17% of the time (24% if you include the call to utf8ToUtf16)

        Seems like if we made a direct "decode tii file and write in-memory format" path (instead of going through SegmentTermEnum), we could get some of this back. The vLongs unfortunately need to be decoded/re-encoded because they are deltas in the file but absolutes in memory. But, e.g., the vInt docFreq could use a "copyVInt" method instead of readVInt then writeVInt, which should save a bit.
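
        For illustration, a hypothetical copyVInt along those lines (not part of the committed patch), assuming trunk's DataInput/DataOutput (IndexInput/IndexOutput in 3.x) and Lucene's vInt encoding, where a set high bit means another byte follows:

        static void copyVInt(DataInput in, DataOutput out) throws IOException {
          byte b;
          do {
            b = in.readByte();   // read one raw byte of the vInt
            out.writeByte(b);    // copy it through without decoding/re-encoding
          } while ((b & 0x80) != 0);  // high bit set: another byte follows
        }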

        Trying with 3.4 gives a 4 second load time, most of the time spent in SegmentTermEnum.next().

        OK, a bit faster than 3.5. But presumably 3.4 uses much more RAM after startup...?

        Using the patch on trunk, load time goes from ~5 to ~2 seconds.

        Awesome, thanks for testing!

        Sean Bridges added a comment - edited

        Seems like if we made a direct "decode tii file and write in-memory format" path (instead of going through SegmentTermEnum), we could get some of this back. The vLongs unfortunately need to be decoded/re-encoded because they are deltas in the file but absolutes in memory. But, e.g., the vInt docFreq could use a "copyVInt" method instead of readVInt then writeVInt, which should save a bit.

        Are the space savings of delta encoding worth the processing time? You could write the .tii file to disk such that on open you could read it straight into a byte[]. As a test, reading a random 69 MB file into a byte[] takes ~250 ms.
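
        For illustration, a minimal sketch of such a read test (the path is a placeholder):

        RandomAccessFile raf = new RandomAccessFile("/path/to/segment.tii", "r");
        byte[] buf = new byte[(int) raf.length()];
        long start = System.nanoTime();
        raf.readFully(buf);  // one straight read into the array
        long elapsedMs = (System.nanoTime() - start) / 1000000L;
        System.out.println("read " + buf.length + " bytes in " + elapsedMs + " ms");
        raf.close();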

        Michael McCandless added a comment -

        Are the space savings of delta encoding worth the processing time? You could write the .tii file to disk such that on open you could read it straight into a byte[].

        This is actually what we do in 4.0's default codec (the index is an FST).

        It is tempting to do that in 3.x (if we were to do another 3.x release after 3.6) ... we'd need to alter other things as well, e.g. the term bytes are also delta-coded in the file but not in RAM.

        I'm curious how much larger it'd be if we stopped delta coding... for your case, how large is the byte[] in RAM (just call dataPagedBytes.getPointer(), just before we freeze it, and print that result) vs the tii on disk...?
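
        For illustration, the kind of one-off print being asked for (placed wherever TermInfosReaderIndex freezes its paged bytes; dataPagedBytes is the field named above):

        // Temporary debug output: in-RAM size of the terms index bytes, just before freeze.
        System.out.println("in-RAM terms index bytes: " + dataPagedBytes.getPointer());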

        Sean Bridges added a comment - edited

        I'm curious how much larger it'd be if we stopped delta coding... for your case, how large is the byte[] in RAM (just call dataPagedBytes.getPointer(), just before we freeze it, and print that result) vs the tii on disk...?

        dataPagedBytes.getPointer() == 124973970

        On disk the .tii file is 69508193 bytes

        The entire index is ~50 gigs.

        Michael McCandless added a comment -

        OK I committed this to trunk (thanks Sean!).

        dataPagedBytes.getPointer() == 124973970

        On disk the .tii file is 69508193 bytes

        OK, ~80% bigger... but in the overall index it's a minor increase (~0.1%).

        But I think we should hold off on any more 3.x work until/unless we decide to do another release off of it....

        Sean Bridges added a comment -

        Thanks!

        Michael McCandless made changes -
        Resolution: Fixed [ 1 ]
        Status: Open [ 1 ] → Resolved [ 5 ]
        Assignee: Michael McCandless [ mikemccand ]
        Fix Version/s: 4.0 [ 12314025 ]
        Sean Bridges added a comment -

        Can this be ported to 3.6.1?

        Michael McCandless added a comment -

        Can this be ported to 3.6.1?

        I don't think so: it should only be bug fixes in the 3.6.x series...

        If we somehow did a 3.7 (which we're hoping not to: hopefully we get the 4.0 alpha out instead), then this could be backported for that...

        Uwe Schindler made changes -
        Status: Resolved [ 5 ] → Closed [ 6 ]

          People

          • Assignee: Michael McCandless
          • Reporter: Sean Bridges
          • Votes: 0
          • Watchers: 3
