  Lucene - Core
  LUCENE-1043

Speedup merging of stored fields when field mapping "matches"

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.3
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Robert Engels suggested the following idea, here:

      http://www.gossamer-threads.com/lists/lucene/java-dev/54217

      When merging in the stored fields from a segment, if the field name ->
      number mapping is identical then we can simply bulk copy the entire
      entry for the document rather than re-interpreting and then re-writing
      the actual stored fields.

      I've pulled the code from the above thread and got it working on the
      current trunk.
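
      A rough sketch of the idea (illustrative only, not the attached patch;
      the FieldInfos/FieldsWriter usage below is an assumption about how the
      check and the raw copy could look):

        // Illustrative sketch, not the attached patch: stored fields can be
        // copied as raw bytes only if this segment's field name -> number
        // mapping matches the mapping of the merged index.
        boolean sameMapping = true;
        for (int i = 0; i < segmentFieldInfos.size(); i++) {
          FieldInfo fi = segmentFieldInfos.fieldInfo(i);
          FieldInfo merged = mergedFieldInfos.fieldInfo(fi.name);
          if (merged == null || merged.number != fi.number) {
            sameMapping = false;   // numbering differs: fall back to decode/re-encode
            break;
          }
        }

        for (int doc = 0; doc < reader.maxDoc(); doc++) {
          if (reader.isDeleted(doc)) continue;
          if (sameMapping) {
            copyRawStoredFields(doc);   // hypothetical helper: byte-for-byte copy of the doc
          } else {
            fieldsWriter.addDocument(reader.document(doc));   // old path: re-interpret fields
          }
        }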

      Attachments

      1. LUCENE-1043.patch
        8 kB
        Michael McCandless
      2. LUCENE-1043.take2.patch
        12 kB
        Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        Initial patch. All tests pass.

        yseeley@gmail.com Yonik Seeley added a comment -

        You're fast!

        Future optimizations could include bulk copying multiple documents at once (all ranges between deleted docs). The speedup would probably be greatest for small docs, but I'm not sure if it would be worth it or not.

        More controversial: maybe even expand the number of docs that can be bulk copied by not bothering removing deleted docs if it's some very small number (unless it's an optimize). This is probably not worth it.
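
        A hypothetical sketch of that first idea, finding maximal runs of
        non-deleted documents so each run could be copied in one shot
        (copyDocRange is an invented helper, not anything in the patch):

          // Hypothetical sketch: walk the segment and bulk copy each contiguous
          // run of live (non-deleted) documents in a single operation.
          int doc = 0;
          while (doc < reader.maxDoc()) {
            if (reader.isDeleted(doc)) {
              doc++;                      // skip deleted docs one at a time
              continue;
            }
            int start = doc;
            while (doc < reader.maxDoc() && !reader.isDeleted(doc)) {
              doc++;                      // extend the run to the next deleted doc (or the end)
            }
            copyDocRange(start, doc);     // invented helper: bulk copy docs [start, doc)
          }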

        mikemccand Michael McCandless added a comment -

        Future optimizations could include bulk copying multiple documents at once (all ranges between deleted docs). The speedup would probably be greatest for small docs, but I'm not sure if it would be worth it or not.

        Ooh, I like that idea! I'll explore that.

        More controversial: maybe even expand the number of docs that can be bulk copied by not bothering removing deleted docs if it's some very small number (unless it's an optimize). This is probably not worth it.

        That's a neat idea too but I agree likely not worth it.

        Another idea: we can almost just concatenate the posting lists
        (frq/prx) for each term, because they are "delta coded" (we write the
        delta between docIDs). The only catch is you have to "stitch up" the
        boundary: you have to read the docID from the start of the next
        segment, write the delta-code, then you can copy the remaining bytes.
        I think this could be a big win especially when merging larger
        segments.
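
        A hypothetical illustration of that stitching (the decode/copy helpers
        are assumptions, and the freq flag folded into the real .frq delta
        encoding is ignored for clarity):

          // Posting lists store docID deltas, so when appending the next
          // segment's list only the first entry needs re-encoding; adding a
          // constant docBase leaves every later delta unchanged.
          int firstDocInNext = readDocDelta(nextIn);         // assumed decode helper
          int stitchedDelta = (firstDocInNext + docBase) - lastDocWritten;
          writeDocDelta(out, stitchedDelta);                 // re-encode just the boundary entry
          copyRemainingBytes(nextIn, out);                   // assumed raw copy of the rest of the list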

        rengels@ix.netcom.com robert engels added a comment -

        You have to be careful with the bulk copy because you will need to also copy the offsets from the source, fixing them up, and then write them in bulk.

        With large enough input and output buffers I am not sure the bulk copy will make that large a difference.

        yseeley@gmail.com Yonik Seeley added a comment -

        re bulk copying: Ideally, read a group of docs into a buffer big enough that it triggers the IndexInput to read directly into it, and write directly from it. The field index needs to be done int by int, but it's just adding a constant to all of them and probably isn't worth optimizing (trying to not fully encode/decode); just loop over them ahead of time, fixing them up. The total size of the stored fields to write is simply the difference between the indices (the end of the index needs a slight special case, of course).

        Another idea: we can almost just concatenate the posting lists
        (frq/prx) for each term, because they are "delta coded" (we write the
        delta between docIDs)

        Nice! new JIRA issue?

        rengels@ix.netcom.com robert engels added a comment -

        When bulk copying the documents, I think you need to:

        read array of long from index (8 * (ndocs+1)) in long[ndocs+1] offsets;
        calculate length = offset[ndocs]-offset[0];
        read bytes of length from document file
        startoffset = current output document stream position
        write bytes to output document
        modify offset[] adding startoffset-offset[0] to each entry
        write offset[] in bulk to index output
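
        A rough Java rendering of those steps (illustrative only; the stream
        and array names are assumptions, not the actual FieldsReader/FieldsWriter
        internals):

          // Illustrative rendering of the steps above.
          long[] offsets = new long[nDocs + 1];
          for (int i = 0; i <= nDocs; i++) {
            offsets[i] = indexIn.readLong();            // read ndocs+1 offsets from the field index
          }
          long length = offsets[nDocs] - offsets[0];    // total bytes spanned by these docs

          long startOffset = docOut.getFilePointer();   // current position in the output document stream
          docIn.seek(offsets[0]);                       // position the source at the first doc
          docOut.copyBytes(docIn, length);              // copy the raw stored-field bytes

          long delta = startOffset - offsets[0];        // rebase the offsets into the output file
          for (int i = 0; i < nDocs; i++) {
            indexOut.writeLong(offsets[i] + delta);     // write the adjusted index entries in bulk
          }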

        mikemccand Michael McCandless added a comment -

        Nice! new JIRA issue?

        Yes, though I need to mull it over some ... I think it would require a
        change to the index file format to have each term record the last
        docID in its posting list, which then increases the size of the
        index. Maybe we could do it only when the posting list is long. So
        there are tricky tradeoffs....

        mikemccand Michael McCandless added a comment -

        Attached new rev of the patch, using the bulk copy approach instead.

        I changed the approach slightly: I allocate an array to store the
        length (in bytes) of each doc's stored fields, but then I do a low
        level byte copy (added IndexOutput.copyBytes) from the FieldsReader's
        fieldsStream to the FieldsWriter's fieldsStream.

        All tests pass. I plan to commit in a day or two.
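
        For reference, a copy helper like the one described could look roughly
        like this (a sketch only; the committed IndexOutput.copyBytes may
        differ in signature and buffer handling):

          // Sketch of a low level byte copy from an IndexInput to an IndexOutput
          // (a guess at what such a helper could look like, not the committed code).
          public void copyBytes(IndexInput input, long numBytes) throws IOException {
            byte[] buffer = new byte[4096];             // assumed buffer size
            long remaining = numBytes;
            while (remaining > 0) {
              int chunk = (int) Math.min(buffer.length, remaining);
              input.readBytes(buffer, 0, chunk);        // read raw bytes from the source stream
              writeBytes(buffer, chunk);                // write them unchanged to this output
              remaining -= chunk;
            }
          }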

        mikemccand Michael McCandless added a comment -

        I just committed this. Thanks Robert!


          People

          • Assignee:
            mikemccand Michael McCandless
          • Reporter:
            mikemccand Michael McCandless