Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In order to offer users near-realtime search, without incurring
      an indexing performance penalty, we can implement search on
      IndexWriter's RAM buffer. This is the buffer that is filled in
      RAM as documents are indexed. Currently the RAM buffer is
      flushed to the underlying directory (usually disk) before being
      made searchable.

      Today's Lucene-based NRT systems must incur the cost of merging
      segments, which can slow indexing.

      Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
      https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

      The area that isn't fully fleshed out is the terms dictionary,
      which needs to be sorted prior to queries executing. Currently
      IW implements a specialized hash table. Michael B has a
      suggestion here:
      https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

      1. LUCENE-2312-FC.patch
        4 kB
        Jason Rutherglen
      2. LUCENE-2312.patch
        71 kB
        Jason Rutherglen
      3. LUCENE-2312.patch
        102 kB
        Jason Rutherglen
      4. LUCENE-2312.patch
        149 kB
        Jason Rutherglen

        Issue Links

          Activity

          Jason Rutherglen added a comment -

          In regards to the terms dictionary, keeping it sorted or not, I think it's best to sort it on demand because otherwise there will be yet another parameter to pass into IW (i.e. sortRAMBufTerms or something like that).
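
          A minimal sketch of the sort-on-demand idea, assuming a hypothetical
          RAMTermsDict helper (the names are illustrative, not actual
          IndexWriter internals): terms are appended unsorted as documents are
          indexed, and the sorted view is only built the first time a query
          asks for it, so no extra IW parameter is needed.

          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          class RAMTermsDict {
            private final List<String> terms = new ArrayList<String>(); // filled by the indexing thread
            private String[] sorted;                                    // built lazily; null when stale

            // indexing path: append the term and invalidate any previously built sorted view
            synchronized void add(String term) {
              terms.add(term);
              sorted = null;
            }

            // query path: sort on demand, so the cost is paid per query that needs it,
            // not per indexed token
            synchronized String[] sortedTerms() {
              if (sorted == null) {
                sorted = terms.toArray(new String[terms.size()]);
                Arrays.sort(sorted);
              }
              return sorted;
            }
          }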

          Jason Rutherglen added a comment -

          I set out implementing a simple method DocumentsWriter.getTerms
          which should return a sorted array of terms over the current RAM
          buffer. While I think this can be implemented, there's a lot of
          code in the index package to handle multiple threads, which is
          fine, except I'm concerned the interleaving of postings won't
          perform well. So I think we'd want to implement what's been
          discussed in LUCENE-2293, per thread ram buffers. With that
          change, it seems implementing this issue could be
          straightforward.

          Jason Rutherglen added a comment -

          From LUCENE-2293:

          (b-tree, or, simply sort-on-demand the
          first time a query needs it, though that cost increases the
          larger your RAM segments get, ie, not incremental to the # docs
          you just added)

          For the terms dictionary, perhaps a terms array (this could be a
          RawPostingList[], or an array of objects with pointers to a
          RawPostingList with some helper methods like getTerm and
          compareTo), is kept in sorted order, we then binary search and
          insert new RawPostingLists/terms into the array. We could
          implement a 2 dimensional array, allowing us to make a per
          reader copy of the 1st dimension of array. This would maintain
          transactional consistency (ie, a reader's array isn't changing
          as a term enum is traversing in another thread).
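
          A small sketch of the sorted-array idea above, under the assumption
          that the writer replaces (never mutates) the published array so a
          reader's snapshot stays stable; the names are illustrative, not IW
          internals.

          import java.util.Arrays;

          class SortedTermArray {
            private String[] terms = new String[0];    // always kept in sorted order

            // writer thread: binary search, then insert only if the term is new
            synchronized void insertIfAbsent(String term) {
              int idx = Arrays.binarySearch(terms, term);
              if (idx >= 0) {
                return;                                // already present
              }
              int ins = -idx - 1;                      // insertion point from binarySearch
              String[] next = new String[terms.length + 1];
              System.arraycopy(terms, 0, next, 0, ins);
              next[ins] = term;
              System.arraycopy(terms, ins, next, ins + 1, terms.length - ins);
              terms = next;                            // publish a brand new array
            }

            // reader thread: grab the current array; it never changes underneath the
            // reader because the writer only swaps in new arrays
            synchronized String[] snapshot() {
              return terms;
            }
          }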

          Also, we have to solve what happens to a reader using a
          RAM segment that's been flushed. Perhaps we don't reuse RAM at
          that point, ie, rely on GC to reclaim once all readers using
          that RAM segment have closed.

          I don't think we have a choice here?

          Michael McCandless added a comment -

          For the terms dictionary, perhaps a terms array (this could be a
          RawPostingList[], or an array of objects with pointers to a
          RawPostingList with some helper methods like getTerm and
          compareTo), is kept in sorted order, we then binary search and
          insert new RawPostingLists/terms into the array. We could
          implement a 2 dimensional array, allowing us to make a per
          reader copy of the 1st dimension of array. This would maintain
          transactional consistency (ie, a reader's array isn't changing
          as a term enum is traversing in another thread).

          I don't think we can do term insertion into an array – that's O(N^2)
          insertion cost – we should use a btree instead.

          Also, we could store the first docID stored into the term, too – this
          way we could have a ordered collection of terms, that's shared across
          several open readers even as changes are still being made, but each
          reader skips a given term if its first docID is greater than the
          maxDoc it's searching. That'd give us point in time searching even
          while we add terms with time...
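
          A sketch of the firstDocID idea: the ordered terms collection is
          shared by all readers, and each reader simply skips terms whose
          firstDocID is at or beyond its point-in-time maxDoc. Entry and
          RAMTermsEnum are made-up names for illustration.

          import java.util.Iterator;

          class RAMTermsEnum {
            static class Entry {
              final String term;
              final int firstDocID;   // docID of the first document that contained this term
              Entry(String term, int firstDocID) {
                this.term = term;
                this.firstDocID = firstDocID;
              }
            }

            private final Iterator<Entry> it;    // iterator over the shared ordered collection
            private final int readerMaxDoc;      // maxDoc captured when this reader was opened

            RAMTermsEnum(Iterator<Entry> orderedTerms, int readerMaxDoc) {
              this.it = orderedTerms;
              this.readerMaxDoc = readerMaxDoc;
            }

            // returns the next term visible to this reader, skipping terms added later
            String next() {
              while (it.hasNext()) {
                Entry e = it.next();
                if (e.firstDocID < readerMaxDoc) {
                  return e.term;
                }
              }
              return null;
            }
          }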

          Also, we have to solve what happens to a reader using a RAM segment that's been flushed. Perhaps we don't reuse RAM at that point, ie, rely on GC to reclaim once all readers using that RAM segment have closed.

          I don't think we have a choice here?

          I think we do have a choice.

          EG we could force the reader to cutover to the newly flushed segment
          (which should be identical to the RAM segment), eg by making [say] a
          DelegatingSegmentReader.

          Still... we'd probably have to not re-use in that case, since there
          can be queries in-flight stepping through the RAM postings, and, we
          have no way to accurately detect they are done. But at least with
          this approach we wouldn't tie up RAM indefinitely...

          Or maybe we simply state that the APP must aggressively close NRT
          readers with time else memory use grows and grows... but I don't
          really like that. We don't have such a restriction today...

          Jason Rutherglen added a comment -

          Mike, Why does DocFieldConsumers have DocFieldConsumer one and two? How is this class used? Thanks.

          Michael Busch added a comment -

          Also, we could store the first docID stored into the term, too - this
          way we could have a ordered collection of terms, that's shared across
          several open readers even as changes are still being made, but each
          reader skips a given term if its first docID is greater than the
          maxDoc it's searching. That'd give us point in time searching even
          while we add terms with time...

          Exactly. This is what I meant in my comment:
          https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

          But I mistakenly said lastDocID; of course firstDocID is correct.

          Michael Busch added a comment -

          I'll try to tackle this one!

          Jason Rutherglen added a comment -

          A few notes so far:

          • IW flush could become thread dependent (eg, it'll only flush
            for the current doc writer) or maybe it should flush all doc
            writers? Close will shut down and flush all doc writers.
          • A new term will first check the hash table for existence (as
            currently), if it's not in the term hash table only then will it
            be added to the btree (btw, binary search is O(log N) on
            average?) This way we're avoiding the somewhat costlier btree
            existence check per token.
          • The algorithm for flushing doc writers based on RAM
            consumption can simply be: on exceeding the limit, flush the doc
            writer consuming the most RAM? (See the sketch after this list.)
          • I gutted the PerThread classes, then realized, it's all too
            intertwined. I'd rather get something working, than spend an
            excessive amount of time rearranging code that already works.
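
          A sketch of the "flush the doc writer consuming the most RAM" policy
          from the notes above; bytesUsed() and flush() are assumed hooks here,
          not necessarily the real DocumentsWriter method names.

          import java.io.IOException;
          import java.util.List;

          class FlushByRAMPolicy {
            interface DocWriter {                 // stand-in for a per-thread doc writer
              long bytesUsed();
              void flush() throws IOException;
            }

            private final long ramBufferBytes;

            FlushByRAMPolicy(long ramBufferBytes) {
              this.ramBufferBytes = ramBufferBytes;
            }

            // called after a document is added; flushes only the largest doc writer
            // so the others can keep indexing
            void maybeFlush(List<DocWriter> docWriters) throws IOException {
              long total = 0;
              DocWriter biggest = null;
              long biggestUsed = -1;
              for (DocWriter dw : docWriters) {
                long used = dw.bytesUsed();
                total += used;
                if (used > biggestUsed) {
                  biggestUsed = used;
                  biggest = dw;
                }
              }
              if (total > ramBufferBytes && biggest != null) {
                biggest.flush();
              }
            }
          }
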
          Jason Rutherglen added a comment -
          • IW commitMerge calls docWriter's remapDeletes, a synchronized method to prevent concurrent updates. I'm not sure how we should efficiently block calls to the different DW's.
          • _mergeInit calls docWriter getDocStoreSegment - unsure what to change
          • Some of the config settings (such as maxBufferedDocs) can simply be removed from DW, and instead accessed via WriterConfig
          Michael McCandless added a comment -

          Michael are you also going to [first] tackle truly separating the RAM segments? I think we need this first ...

          Mike, Why does DocFieldConsumers have DocFieldConsumer one and two? How is this class used? Thanks.

          This is so we can make a "tee" in the indexing chain. Here's the default chain (copied out of comment in DW):

          DocConsumer / DocConsumerPerThread
            --> code: DocFieldProcessor / DocFieldProcessorPerThread
              --> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField
                --> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField
                  --> code: DocInverter / DocInverterPerThread / DocInverterPerField
                    --> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
                      --> code: TermsHash / TermsHashPerThread / TermsHashPerField
                        --> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField
                          --> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField
                          --> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField
                    --> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
                      --> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField
                  --> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField
          

          The tee is so the doc fields can go to both DocInvert (for creating postings & term vectors) and to stored fields writer.

          Michael McCandless added a comment -

          IW flush could become thread dependent

          Right, we want this – different RAM segments should be flushed at different times. This gives us better concurrency since IO/CPU resource consumption will now be more interleaved. While one RAM segment is flushing, the others are still indexing.

          A new term will first check the hash table for existence (as
          currently), if it's not in the term hash table only then will it
          be added to the btree (btw, binary search is O(log N) on
          average?) This way we're avoiding the somewhat costlier btree
          existence check per token.

          Yes, we could have btree on-the-side but still use hash for mapping (vs using btree alone). Hash will be faster lookups... btree could be created/updated on demand first time something needs to .next() through the TermsEnum.

          The algorithm for flushing doc writers based on RAM
          consumption can simply be: on exceeding the limit, flush the doc
          writer consuming the most RAM

          Sounds good. The challenge will be balancing things... eg if during the time 1 RAM segment is flushed, the others are able to consume more RAM that was freed up by flushing this one RAM segment, you've got a problem... or maybe at that point you go and flush the next one now using the most RAM, so it'd self balance with time.

          This will mean the RAM usage is able to flare up above the high water mark...

          I gutted the PerThread classes, then realized, it's all too
          intertwined. I'd rather get something working, than spend an
          excessive amount of time rearranging code that already works.

          For starters I would keep the *PerThread, but create multiple DWs? Ie, removing the PerThread layer doesn't have to happen at first.

          Or we could do the nuclear option – make a new indexing chain.

          Michael McCandless added a comment -

          IW commitMerge calls docWriter's remapDeletes, a synchronized method to prevent concurrent updates. I'm not sure how we should efficiently block calls to the different DW's.

          Yeah this is because when we buffer a delete Term/Query, the docID we store against it is absolute. It seems like it could/should be relative (ie, within the RAM segment), then remapping wouldn't be needed when a merge commits. I think?
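
          A minimal sketch of the relative-docID idea: each RAM segment buffers
          its deletes against docIDs numbered within that segment, so nothing
          needs remapping when a merge commits elsewhere. Names are
          illustrative, not the real DocumentsWriter fields.

          import java.util.HashMap;
          import java.util.Map;

          class PerSegmentBufferedDeletes {
            // delete term text -> highest docID *within this RAM segment* it applies to
            private final Map<String, Integer> deleteTerms = new HashMap<String, Integer>();

            // called when a delete arrives while this segment is still buffering
            synchronized void bufferDelete(String term, int relativeDocIDUpto) {
              deleteTerms.put(term, relativeDocIDUpto);
            }

            // applied when this segment is flushed; because the docIDs were always
            // relative to this segment's own numbering, a merge commit elsewhere
            // never forces a remap
            synchronized Map<String, Integer> takeAll() {
              Map<String, Integer> copy = new HashMap<String, Integer>(deleteTerms);
              deleteTerms.clear();
              return copy;
            }
          }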

          _mergeInit calls docWriter getDocStoreSegment - unsure what to change

          It wouldn't anymore once we have private RAM segments: we would no longer share doc stores across segments, meaning merging will always merge doc stores and there's no need to call that method nor have all the logic in SegmentMerger to determine whether doc store merging is required.

          This will necessarily be a perf hit when building a large index from
          scratch in a single IW session. Today that index creates one large
          set of doc stores and never has to merge it while building. This is
          the biggest perf downside to this change, I think.

          But maybe the perf loss will not be so bad, because of bulk merging,
          in the case when all docs always add the same fields in the same
          order. Or... if we could fix Lucene to always bind the same field
          name to the same field number (LUCENE-1737) then we'd always
          bulk-merge regardless of which fields, and in which order, the app
          adds them to docs.

          Some of the config settings (such as maxBufferedDocs) can simply be removed from DW, and instead accessed via WriterConfig

          Ahh, you mean push IWC down to DW? That sounds great.

          Jason Rutherglen added a comment -

          Mike, rollback is pausing all threads and calling doc writer abort. This should probably happen across all (per thread) doc writers?

          Michael Busch added a comment -

          Well, we need to keep our transactional semantics. So I assume while a flush will happen per doc writer independently, a commit will trigger all (per thread) doc writers to flush. Then a rollback also has to abort all per thread doc writers.
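
          A sketch of those semantics, with DocWriter as a stand-in interface:
          flush may act on one per-thread doc writer at a time, but commit and
          rollback always span all of them.

          import java.io.IOException;
          import java.util.List;

          class MultiDocWriterControl {
            interface DocWriter {
              void flush() throws IOException;
              void abort();
            }

            private final List<DocWriter> docWriters;

            MultiDocWriterControl(List<DocWriter> docWriters) {
              this.docWriters = docWriters;
            }

            // commit: every per-thread buffer must be flushed (then synced, omitted here)
            void commit() throws IOException {
              for (DocWriter dw : docWriters) {
                dw.flush();
              }
            }

            // rollback: every per-thread buffer is discarded
            void rollback() {
              for (DocWriter dw : docWriters) {
                dw.abort();
              }
            }
          }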

          Michael Busch added a comment -

          Michael are you also going to [first] tackle truly separating the RAM segments? I think we need this first ...

          Yeah I agree. I started working on a patch for separating the doc writers already.

          I also have a separate indexing chain prototype working with searchable RAM buffer (single-threaded), but slightly different postinglist format (some docs nowadays only have 140 characters ). It seems really fast. I spent a long time thinking about lock-free algorithms and data structures, so indexing performance should be completely independent of the search load (in theory). I need to think a bit more about how to make it work with "normal" documents and Lucene's current in-memory format.

          Jason Rutherglen added a comment -

          I got the basics of the term enum working, it can be completed
          fairly easily. So I moved on to term docs... There we've got some
          work to do, because we're not storing the skip lists in the RAM
          buffer currently. I guess we'll need a new
          FreqProxTermsWriterPerField that stores the skip lists as
          they're being written? How will that work? Doesn't the
          multi-level skip list assume a set number of docs?

          Michael McCandless added a comment -

          Yes, commit should flush & sync all doc writers, and rollback must abort all of them.

          I also have a separate indexing chain prototype working with searchable RAM buffer (single-threaded)

          Yay!

          but slightly different postinglist format (some docs nowadays only have 140 characters ).

          New sponsor, eh?

          But, yes, I suspect an indexer chain optimized to tiny docs can get sizable gains.

          What change to the postings format? Is the change only in the RAM
          buffer or also in the index? If it's in the index... we should
          probably do this under flex.

          It seems really fast. I spent a long time thinking about lock-free algorithms and data structures, so indexing performance should be completely independent of the search load (in theory). I need to think a bit more about how to make it work with "normal" documents and Lucene's current in-memory format.

          Sounds like awesome progress!! Want some details over here

          Michael McCandless added a comment -

          I got the basics of the term enum working, it can be completed
          fairly easily. So I moved on to term docs... There we've got some
          work to do, because we're not storing the skip lists in the RAM
          buffer currently. I guess we'll need a new
          FreqProxTermsWriterPerField that stores the skip lists as
          they're being written? How will that work? Doesn't the
          multi-level skip list assume a set number of docs?

          Sounds like you & Michael should sync up!

          Good question on skipping – for first cut we can have no skipping
          (and just scan)? Skipping may not be that important in practice,
          unless RAM buffer becomes truly immense. Of course, the tinier the
          docs the more important skipping will be...

          Jason Rutherglen added a comment -

          Good question on skipping - for first cut we can have no
          skipping (and just scan)?

          True.

          One immediate thought is to have a set skip interval (what was
          it before when we had single level?), and for now at least have
          a single level skip list. That way we can grow the posting list
          with docs, and the skip list at the same time. If the interval is
          constant there won't be a need to rebuild the skip list.
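
          A sketch of such a single-level skip list with a constant interval,
          growing alongside the postings (illustrative only; SkipEntry and the
          pointer fields are assumptions, not existing classes).

          import java.util.ArrayList;
          import java.util.List;

          class GrowableSkipList {
            static class SkipEntry {
              final int docID;         // doc at this skip point
              final long freqPointer;  // position in the freq slice at this skip point
              SkipEntry(int docID, long freqPointer) {
                this.docID = docID;
                this.freqPointer = freqPointer;
              }
            }

            private final int skipInterval;   // constant, so existing entries never need rebuilding
            private final List<SkipEntry> entries = new ArrayList<SkipEntry>();
            private int docCount;

            GrowableSkipList(int skipInterval) {
              this.skipInterval = skipInterval;
            }

            // called as each posting is appended for this term
            void onDoc(int docID, long freqPointer) {
              if (++docCount % skipInterval == 0) {
                entries.add(new SkipEntry(docID, freqPointer));
              }
            }

            // readers scan (or binary search) for the last skip point <= target
            SkipEntry skipTo(int targetDocID) {
              SkipEntry best = null;
              for (SkipEntry e : entries) {
                if (e.docID > targetDocID) {
                  break;
                }
                best = e;
              }
              return best;
            }
          }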

          Jason Rutherglen added a comment -

          Apologies in advance for permanently damaging (well, I guess it
          can be deleted) the look and feel of this issue with a thwack of
          code; however, I don't want to post the messy patch, and I'm
          guessing there's something small as to why the postings
          iteration on the freq byte slice reader isn't happening
          correctly (ie, it's returning 0).

          public class DWTermDocs implements TermDocs {
              final FreqProxTermsWriterPerField field;
              final int numPostings;
              final CharBlockPool charPool;
              FreqProxTermsWriter.PostingList posting;
              char[] text;
              int textOffset;
              private int postingUpto = -1;
              final ByteSliceReader freq = new ByteSliceReader();
              final ByteSliceReader prox = new ByteSliceReader();
          
              int docID;
              int termFreq;
              
              DWTermDocs(FreqProxTermsWriterPerField field, FreqProxTermsWriter.PostingList posting) throws IOException {
                this.field = field;
                this.charPool = field.perThread.termsHashPerThread.charPool;
                //this.numPostings = field.termsHashPerField.numPostings;
                this.numPostings = 1;
                this.posting = posting;
                // nextTerm is called only once to 
                // set the term docs pointer at the 
                // correct position
                nextTerm();
              }
              
              boolean nextTerm() throws IOException {
                postingUpto++;
                if (postingUpto == numPostings)
                  return false;
          
                docID = 0;
          
                text = charPool.buffers[posting.textStart >> DocumentsWriter.CHAR_BLOCK_SHIFT];
                textOffset = posting.textStart & DocumentsWriter.CHAR_BLOCK_MASK;
          
                field.termsHashPerField.initReader(freq, posting, 0);
                if (!field.fieldInfo.omitTermFreqAndPositions)
                  field.termsHashPerField.initReader(prox, posting, 1);
          
                // Should always be true
                boolean result = nextDoc();
                assert result;
          
                return true;
              }
              
              public boolean nextDoc() throws IOException {
                if (freq.eof()) {
                  if (posting.lastDocCode != -1) {
                    // Return last doc
                    docID = posting.lastDocID;
                    if (!field.omitTermFreqAndPositions)
                      termFreq = posting.docFreq;
                    posting.lastDocCode = -1;
                    return true;
                  } else
                    // EOF
                    return false;
                }
                final int code = freq.readVInt();
                if (field.omitTermFreqAndPositions)
                  docID += code;
                else {
                  docID += code >>> 1;
                  if ((code & 1) != 0)
                    termFreq = 1;
                  else
                    termFreq = freq.readVInt();
                }
                assert docID != posting.lastDocID;
                return true;
              }
          
          Michael McCandless added a comment -

          I don't see anything obviously wrong – you excised this code from the same code that's used when merging the postings during flush?

          Jason Rutherglen added a comment -

          The code is from FreqProxFieldMergeState, which accepts a
          FreqProxTermsWriterPerField in its constructor. One difference is
          that instead of operating on an array of posting lists, the code
          above assumes one posting list.

          The numPostings was always 0 when testing

          this.numPostings = field.termsHashPerField.numPostings;

          In the code above it's hard coded to 1.

          Maybe there's some initialization that's not happening correctly?

          Michael McCandless added a comment -

          Ahh, I think it's because you're not calling compactPostings/sortPostings in the THPF, right?

          Those methods collapse the hash table in-place (ie move all the nulls out), and sort.

          So you have to re-work the code to not do that and instead use whatever structure you have for visiting terms in sorted order. Then stepping through the docs should just work, but, you gotta stop at the max docID, right?

          Hmm... what does JMM say about byte arrays? If one thread is writing to the byte array, can any other thread see those changes?

          Jason Rutherglen added a comment -

          Ahh, I think it's because you're not calling
          compactPostings/sortPostings in the THPF, right?

          Those methods collapse the hash table in-place (ie move all the
          nulls out), and sort.

          Yep, got that part.

          So you have to re-work the code to not do that and
          instead use whatever structure you have for visiting terms in
          sorted order. Then stepping through the docs should just work,
          but, you gotta stop at the max docID, right?

          Right, the terms in sorted order is working... The freq
          ByteSliceReader is reading nothing however (zeroes). Either it's
          init'ed to the wrong position, or there's nothing in there? Or
          something else.

          Jason Rutherglen added a comment -

          Also wanted to add that the PostingList lastDocID is correct.

          Jason Rutherglen added a comment -

          I have a test case showing the term docs working... I'm going to try to add the term positions methods.

          Jason Rutherglen added a comment -

          Basic term positions working, need to figure out how to do lazy loading payloads...

          Jason Rutherglen added a comment -

          In thinking about the terms dictionary, we're going to run into concurrency issues, right, if we just use TreeMap? Can't we simply use the lock-free ConcurrentSkipListMap? Yeah, it's part of Java 6, however why reinvent the wheel?
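
          A tiny sketch of that option, assuming the per-term posting list is
          some type P (placeholder): ConcurrentSkipListMap keeps the terms
          sorted and allows lock-free concurrent inserts and (weakly
          consistent) iteration.

          import java.util.concurrent.ConcurrentNavigableMap;
          import java.util.concurrent.ConcurrentSkipListMap;

          class SkipListTermsDict<P> {      // P stands in for the per-term posting list type
            private final ConcurrentNavigableMap<String, P> terms =
                new ConcurrentSkipListMap<String, P>();

            // indexing thread: concurrent insert, no external locking
            void addIfAbsent(String term, P postings) {
              terms.putIfAbsent(term, postings);
            }

            // reader threads: iterate in sorted order while the writer keeps inserting;
            // note the iterator is weakly consistent, not point-in-time
            Iterable<String> sortedTerms() {
              return terms.keySet();
            }

            // eg for positioning a term enum at or after a given term
            Iterable<String> termsFrom(String term) {
              return terms.tailMap(term).keySet();
            }
          }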

          Jason Rutherglen added a comment -

          Payloads work (non-lazy loading); however, ByteSliceReader doesn't implement a seek method, so I think we simply need to load each payload as we increment nextPosition? The cost shouldn't be too much because we're simply copying small byte arrays (in the heap).

          Michael Busch added a comment -

          Sounds like awesome progress!! Want some details over here

          Sorry for not being very specific. The prototype I'm experimenting with has a fixed length postings format for the in-memory representation (in TermsHash). Basically every posting has 4 bytes, so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are used for an absolute docID (not delta-encoded). This limits the max in-memory segment size to 2^24 docs. The 1 remaining byte is used for the position. With a max doc length of 140 characters you can fit every possible position in a byte - what a luxury! If a term occurs multiple times in the same doc, then the TermDocs just skips multiple occurrences with the same docID and increments the freq. Again, the same term doesn't occur often in super short docs.

          The int[] slices also don't have forward pointers, like in Lucene's TermsHash, but backwards pointers. In real-time search you often want a strongly time-biased ranking. A PostingList object has a pointer that points to the last posting (this statement is not 100% correct for visibility reasons across threads, but we can imagine it this way for now). A TermDocs can now traverse the postinglists in opposite order. Skipping can be done by following pointers to previous slices directly, or by binary search within a slice.
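
          A sketch of just the packing described above (3 bytes of absolute
          docID plus 1 byte of position in a single int); illustrative only,
          not the prototype's actual code.

          final class PackedPosting {
            static final int MAX_DOC = (1 << 24) - 1;  // 3 docID bytes => max in-memory segment size of 2^24 docs
            static final int MAX_POS = 0xFF;           // positions 0..255 fit the 140-character doc limit

            static int encode(int docID, int position) {
              assert docID >= 0 && docID <= MAX_DOC;
              assert position >= 0 && position <= MAX_POS;
              return (docID << 8) | position;
            }

            static int docID(int posting) {
              return posting >>> 8;
            }

            static int position(int posting) {
              return posting & 0xFF;
            }
          }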

          Michael Busch added a comment -

          Hmm... what does JMM say about byte arrays? If one thread is writing
          to the byte array, can any other thread see those changes?

          This is exactly the right question to ask here. Thread-safety is
          really by far the most complicated aspect of this feature. Jason,
          I'm not sure if you already figured out how to ensure visibility
          of changes made by the writer thread to the reader threads?

          Thread-safety in our case boils down to safe publication. We don't need
          locking to coordinate writing of multiple threads, because of LUCENE-2324. But
          we need to make sure that the reader threads see all changes they need to see
          at the right time, in the right order. This is IMO very hard, but we all like
          challenges

          The JMM gives no guarantee whatsoever about what changes a thread
          will see that another thread made - or whether it will ever see the
          changes, unless proper publication is ensured by either
          synchronization or volatile/atomic variables.

          So e.g. if a writer thread executes the following statements:

          public static int a, b;
          
          ...
          
          a = 1; b = 2;
          
          a = 5; b = 6;
          

          and a reader threads does:

          System.out.println(a + "," + b);
          

          The thing to remember is that the output might be: 1,6! Another reader thread
          with the following code:

          while (b != 6) {
            .. do something 
          }
          

          might further NEVER terminate without synchronization/volatile/atomic.

          The reason is that the JVM is allowed to perform any reorderings to utilize
          modern CPUs, memory, caches, etc. if not forced otherwise.

          To ensure safe publication of data written by a thread we could use
          synchronization, but my goal here is to implement a non-blocking and
          lock-free algorithm. So my idea was to make use of a very subtle
          behavior of volatile variables. I will take a simple explanation of
          the JMM from Brian Goetz' awesome book "Java Concurrency in
          Practice", in which he describes the JMM in simple happens-before
          rules. I will mention only three of those rules, because they are
          enough to describe the volatile behavior I'd like to mention here
          (p. 341):

          Program order rule: Each action in a thread happens-before every action in
          that thread that comes later in the program order.

          Volatile variable rule: A write to a volatile field happens-before every
          subsequent read of that same field.

          Transitivity: If A happens-before B, and B happens-before C, then A
          happens-before C.

          Based on these three rules you can see that writing to a volatile variable v
          by one thread t1 and subsequent reading of the same volatile variable v by
          another thread t2 publishes ALL changes of t1 that happened-before the write
          to v and the change of v itself. So this write/read of v means crossing a
          memory barrier and forcing everything that t1 might have written to caches to
          be flushed to the RAM. That's why a volatile write can actually be pretty
          expensive.
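
          A minimal sketch of that pattern applied to a shared byte[] (not the
          actual DocumentsWriter code; a single writer thread is assumed): the
          writer publishes by writing a volatile watermark after its plain byte
          writes, and a reader reads the volatile watermark first and only
          touches bytes below it.

          class PublishedByteBuffer {
            private final byte[] bytes = new byte[1 << 20];   // growth/bounds handling omitted
            private volatile int published = 0;               // the volatile "v" from the rules above

            // single writer thread
            void append(byte b) {
              int pos = published;
              bytes[pos] = b;        // plain write...
              published = pos + 1;   // ...published by the volatile write (happens-before)
            }

            // any reader thread
            int copyTo(byte[] dest) {
              int limit = published;                  // volatile read: everything written before it is visible
              int n = Math.min(limit, dest.length);
              System.arraycopy(bytes, 0, dest, 0, n);
              return n;
            }
          }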

          Note that this behavior is actually only working like I just described since
          Java 1.5. Behavior of volatile variables was a very very subtle change from
          1.4->1.5!

          The way I'm trying to make use of this behavior is actually similar to how we
          lazily sync Lucene's files with the filesystem: I want to delay the cache->RAM
          write-through as much as possible, which increases the probability of getting
          the sync for free! Still fleshing out the details, but I wanted to share these
          infos with you guys already, because it might invalidate a lot of assumptions
          you might have when developing the code. Some of this stuff was actually new
          to me, maybe you all know it already. And if anything that I wrote here is
          incorrect, please let me know!

          Btw: IMO, if there's only one java book you can ever read, then read Goetz'
          book! It's great. He also says in the book somewhere about lock-free
          algorithms: "Don't try this at home!" - so, let's do it!

          Jason Rutherglen added a comment -

          Just to clarify, I think Mike's referring to ParallelArray?

          http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/ParallelArray.html

          There's AtomicIntegerArray:
          http://www.melclub.net/java/_atomic_integer_array_8java_source.html
          which underneath uses the sun.Unsafe class for volatile array
          access. Could this be reused for an AtomicByteArray class (why
          isn't there one of these already?).
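
          For reference, a small illustration of AtomicIntegerArray
          (java.util.concurrent.atomic), which gives volatile-style
          per-element visibility without locking; the int-per-slot layout is
          just for illustration since, as noted, there is no AtomicByteArray
          in the JDK.

          import java.util.concurrent.atomic.AtomicIntegerArray;

          class SharedIntPool {
            private final AtomicIntegerArray slots = new AtomicIntegerArray(1 << 16);

            void write(int index, int value) {
              slots.set(index, value);   // volatile-style write: visible to subsequent reads
            }

            int read(int index) {
              return slots.get(index);   // volatile-style read
            }
          }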

          A quick and easy way to solve this is to use a read write lock
          on the byte pool? Remember when we'd sync on each read bytes
          call to the underlying random access file in FSDirectory (eg,
          now we're using NIOFSDir which can be a good concurrent
          throughput improvement). Let's try the RW lock and examine the
          results? I guess the issue is we're not writing in blocks of
          bytes, we're actually writing byte by byte and need to read byte
          by byte concurrently? This sounds like a fairly typical thing to
          do?
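
          As a strawman, the RW lock idea would look roughly like the sketch
          below (hypothetical class, not the actual ByteBlockPool/DocumentsWriter
          code): the writer takes the write lock to append, readers take the
          read lock to do random-access reads.

          import java.util.concurrent.locks.ReadWriteLock;
          import java.util.concurrent.locks.ReentrantReadWriteLock;

          // Sketch only: a byte block pool guarded by a read-write lock.
          class LockedBytePool {
            private static final int BLOCK_SIZE = 32768;
            private final ReadWriteLock lock = new ReentrantReadWriteLock();
            private byte[][] buffers = new byte[10][];
            private int bufferUpto = -1;
            private int byteUpto = BLOCK_SIZE;

            // Writer thread: append one byte under the write lock.
            void writeByte(byte b) {
              lock.writeLock().lock();
              try {
                if (byteUpto == BLOCK_SIZE) {
                  if (++bufferUpto == buffers.length) {
                    byte[][] grown = new byte[buffers.length * 2][];
                    System.arraycopy(buffers, 0, grown, 0, buffers.length);
                    buffers = grown;
                  }
                  buffers[bufferUpto] = new byte[BLOCK_SIZE];
                  byteUpto = 0;
                }
                buffers[bufferUpto][byteUpto++] = b;
              } finally {
                lock.writeLock().unlock();
              }
            }

            // Reader thread: random access under the read lock.
            byte readByte(long pos) {
              lock.readLock().lock();
              try {
                return buffers[(int) (pos / BLOCK_SIZE)][(int) (pos % BLOCK_SIZE)];
              } finally {
                lock.readLock().unlock();
              }
            }
          }

          Per-byte lock acquisition is clearly heavy; batching per document
          would amortize it, but the contention concern below still applies.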

          Michael Busch added a comment -

          A quick and easy way to solve this is to use a read write lock
          on the byte pool?

          If you use a RW lock then the writer thread will block all reader threads while it's making changes. The writer thread will be making changes all the time in a real-time search environment. The contention will kill performance I'm sure. RW lock is only faster than mutual exclusion lock if writes are infrequent, as mentioned in the javadocs of ReadWriteLock.java

          Jason Rutherglen added a comment -

          but my goal is it here to implement a non-blocking and
          lock-free algorithm. So my idea was it to make use of a very
          subtle behavior of volatile variables.

          You're talking about having a per thread write buffer byte
          array, that on search gets copied into a read only array, or
          gets transformed magically into a volatile byte array? (Do
          volatile byte arrays work? I couldn't find a clear answer on the
          net, maybe it's stated in the Goetz book). If volatile byte
          arrays do work, an option to test would be a byte buffer pool
          that uses volatile byte arrays?

          Michael Busch added a comment - - edited

          Do volatile byte arrays work

          I'm not sure what you mean by volatile byte arrays?

          Do you mean this?

          volatile byte[] array;
          

          This makes the reference to the array volatile, not the slots in the array.
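
          A small self-contained illustration of the difference (sketch only):
          assigning a new array to a volatile field safely publishes everything
          written to that array before the assignment, while writes into the
          slots of an already-published array carry no visibility guarantee.
          AtomicIntegerArray is the JDK's way to get per-slot volatile semantics.

          import java.util.concurrent.atomic.AtomicIntegerArray;

          class VolatileArrayExample {
            // Only the *reference* is volatile: 'array = newArray' is safely
            // published, but 'array[5] = 42' on an already-published array is
            // not guaranteed to become visible to reader threads.
            volatile byte[] array = new byte[1024];

            // Per-slot volatile semantics: set() is a volatile write,
            // get() is a volatile read.
            final AtomicIntegerArray slots = new AtomicIntegerArray(1024);

            void publishNewArray(byte[] filled) {
              // Safe publication of everything written to 'filled' before this line.
              array = filled;
            }

            void writeSlot(int i, int value) {
              // Visible to any thread that subsequently calls slots.get(i).
              slots.set(i, value);
            }
          }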

          Jason Rutherglen added a comment -

          This makes the reference to the array volatile, not the
          slots in the array

          That's no good!

          If you use a RW lock then the writer thread will block
          all reader threads while it's making changes

          We probably need to implement more fine grained locking, perhaps
          using volatile booleans instead of RW locks. Fine grained
          meaning on the byte array/block level. I think this would imply
          that changes are not visible until a given byte block is more or
          less "flushed"? This is different from the design that's been
          implied, where we'd read from byte arrays as they're being
          written to. We probably don't need to read from and write to the
          same byte array concurrently (that might not be feasible?).

          The performance win here is probably going to be the fact that
          we avoid segment merges.

          Michael Busch added a comment - - edited

          The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks).

          I think I have a lock-free solution working, which only syncs (i.e. does volatile writes) in certain intervals to not prevent JVM optimizations - but I need more time for thinking about all the combinations and corner cases.

          It's getting late now - need to sleep!

          Michael McCandless added a comment -

          The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks).

          Right, I'm just not familiar specifically with what JMM says about one thread writing to a byte[] and another thread reading it.

          In general, for our usage, the reader threads will never read into an area that has not yet been written to. So that works in our favor (they can't cache those bytes if they didn't read them). EXCEPT the CPU will have loaded the bytes on a word boundary and so if our reader thread reads only 1 byte, and no more (because this is now the end of the posting), the CPU may very well have pulled in the following 7 bytes (for example) and then illegally (according to our needs) cache them.

          We better make some serious tests for this... including reader threads that just enum the postings for a single rarish term over and over while writer threads are indexing docs that occasionally have that term. I think that's the worst case for JMM violation since the #bytes cached is small.

          It's too bad there isn't higher level control on the CPU caching via java. EG, in our usage, if we could call a System.flushCPUCache whenever a thread enters a newly reopened reader.... because, when accessing postings via a given Reader we want point-in-time searching anyway and so any bytes cached by the CPU are perfectly fine. We only need CPU cache flush when a reader is reopened....

          Michael McCandless added a comment -

          The prototype I'm experimenting with has a fixed length postings format for the in-memory representation (in TermsHash). Basically every posting has 4 bytes, so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are used for an absolute docID (not delta-encoded). This limits the max in-memory segment size to 2^24 docs. The 1 remaining byte is used for the position. With a max doc length of 140 characters you can fit every possible position in a byte - what a luxury! If a term occurs multiple times in the same doc, then the TermDocs just skips multiple occurrences with the same docID and increments the freq. Again, the same term doesn't occur often in super short docs.

          The int[] slices also don't have forward pointers, like in Lucene's TermsHash, but backwards pointers. In real-time search you often want a strongly time-biased ranking. A PostingList object has a pointer that points to the last posting (this statement is not 100% correct for visibility reasons across threads, but we can imagine it this way for now). A TermDocs can now traverse the postinglists in opposite order. Skipping can be done by following pointers to previous slices directly, or by binary search within a slice.

          This sounds nice!

          This would be a custom indexing chain for docs guaranteed not to be over 255 positions in length right?
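
          For reference, the 3-byte docID + 1-byte position packing described
          above could be encoded/decoded along these lines (a sketch of the
          layout only, not the actual prototype code):

          // One posting = 24-bit absolute docID + 8-bit position, packed into an int.
          final class PackedPosting {
            static final int MAX_DOC = (1 << 24) - 1;   // 2^24 - 1 docs per in-memory segment
            static final int MAX_POS = (1 << 8) - 1;    // positions 0..255

            static int encode(int docID, int position) {
              assert docID <= MAX_DOC && position <= MAX_POS;
              return (docID << 8) | position;
            }

            static int docID(int posting) {
              return posting >>> 8;
            }

            static int position(int posting) {
              return posting & 0xFF;
            }
          }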

          Michael McCandless added a comment -

          In thinking about the terms dictionary, we're going to run into concurrency issues right if we just use TreeMap?

          Right, we need a concurrent data structure here. It's OK if there've been changes to this shared data structure since a reader was opened – that reader knows its max doc id and so it can skip a term if the first doc id in that term is > that max.

          Jason Rutherglen added a comment -

          I thought we're moving away from byte block pooling and we're
          going to try relying on garbage collection? Does a volatile
          object[] publish changes to all threads? Probably not, again
          it'd just be the pointer.

          In the case of posting/termdocs iteration, I'm more concerned
          that the lastDocID be volatile than with the byte array
          containing extra data. Extra docs are OK in the byte array
          because we'll simply stop iterating when we've reached the last
          doc. Though with our system, we shouldn't even run into this
          either, meaning a byte array is copied and published, perhaps
          the master byte array is still being written to and the same
          byte array (by id or something) is published again? Then we'd
          have multiple versions of byte arrays. That could be bad.

          Because there is one DW per thread, there's only one document
          being indexed at a time. There's no writer concurrency. This
          leaves reader concurrency. However after each doc, we could
          simply flush all bytes related to the doc. Any new docs must
          simply start writing to new byte arrays? The problem with this
          is, unless the byte arrays are really small, we'll have a lot of
          extra data around, well, unless the byte arrays are trimmed
          before publication. Or we can simply RW lock (or some other
          analogous thing) individual byte arrays, not publish them after
          each doc, then only publish them when get reader is called. To
          clarify, the RW lock (or flag) would only be per byte array, in
          fact, all writing to the byte array would cease on
          flush, and new byte arrays would be allocated. The published byte array
          could point to the next byte array.

          I think we simply need a way to publish byte arrays to all
          threads? Michael B. can you post something of what you have so
          we can get an idea of how your system will work (ie, mainly what
          the assumptions are)?

          We do need to strive for correctness of data, and perhaps
          performance will be slightly impacted (though compared with our
          current NRT we'll have an overall win).

          Jason Rutherglen added a comment -

          The tricky part is to make sure that a reader always sees
          a consistent snapshot of the index. At the same time a reader
          must not follow pointers to non-published locations (e.g. array
          blocks).

          Right. In what case in the term enum, term docs chain of doc
          scoring would a reader potentially try to follow a pointer to a
          byte array that doesn't exist? I think we're strictly preventing
          it via last doc ids? Also, when we flush, I think we need to
          block further doc writing (via an RW lock?) and wait for any
          currently writing docs to complete, then forcibly publish the
          byte arrays, then release the write lock? This way we always
          have published data that's consistent for readers (eg, the
          inverted index can be read completely, and there won't be any
          wild writes still occurring to a byte array that's been
          published).

          Michael Busch added a comment -

          I thought we're moving away from byte block pooling and we're
          going to try relying on garbage collection? Does a volatile
          object[] publish changes to all threads? Probably not, again
          it'd just be the pointer.

          We were so far only considering moving away from pooling of (Raw)PostingList objects. Pooling byte blocks might have more performance impact - they're more heavy-weight.

          Jason Rutherglen added a comment -

          To clarify the above comment, DW's update doc method would acquire a mutex. The flush bytes method would also acquire that mutex when it copies existing writeable bytes over to the readable bytes thing (pool?).

          Michael Busch added a comment -

          think we simply need a way to publish byte arrays to all
          threads? Michael B. can you post something of what you have so
          we can get an idea of how your system will work (ie, mainly what
          the assumptions are)?

          It's kinda complicated to explain and currently differs from Lucene's TermHash classes a lot. I'd prefer to wait a little bit until I have verified that my solution works.

          I think here we should really tackle LUCENE-2324 first - it's a prereq. Wanna help with that, Jason?

          Jason Rutherglen added a comment -

          I think the easiest way to test out the concurrency is to add a
          flush method to ByteBlockPool. Then allocate a read only version
          of the buffers array (not copying the byte arrays, just the 1st
          dimension pointers). The only issue is to rework the code to
          read from the read only array, and write to the write only
          array...
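
          Something like the following sketch is what I mean (a hypothetical
          flush/snapshot method, not the current ByteBlockPool API): the spine,
          i.e. the byte[][] of block pointers, is copied per flush while the
          byte blocks themselves are shared, not copied.

          // Sketch only: a flush() that snapshots the spine of a byte block pool.
          class SnapshotByteBlockPool {
            static final int BLOCK_SIZE = 32768;

            byte[][] buffers = new byte[10][];
            int bufferUpto = -1;

            // Writer thread: allocate the next block, growing the spine if needed.
            void nextBuffer() {
              if (bufferUpto + 1 == buffers.length) {
                byte[][] grown = new byte[buffers.length * 2][];
                System.arraycopy(buffers, 0, grown, 0, buffers.length);
                buffers = grown;
              }
              buffers[++bufferUpto] = new byte[BLOCK_SIZE];
            }

            // Called at the end of a document / on getReader: readers get a copy
            // of the spine; the byte[] blocks are shared, not copied.
            // NOTE: handing this snapshot to a reader thread still needs a safe
            // publication point (volatile field, synchronized block, etc.).
            byte[][] flush() {
              byte[][] readOnlySpine = new byte[bufferUpto + 1][];
              System.arraycopy(buffers, 0, readOnlySpine, 0, bufferUpto + 1);
              return readOnlySpine;
            }
          }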

          Jason Rutherglen added a comment -

          Mike, can you clarify why intUptos and intUptoStart are member
          variables in TermsHashPerField? Can't the accessors simply refer
          to IntBlockPool for these? I'm asking because in IntBlockPool flush,
          for now I'm simply calling nextBuffer to shuffle the current
          writable array into a read only state (ie, all of the arrays
          being written to prior to flush will now be readonly).

          Jason Rutherglen added a comment -

          I think the DW index reader needs to create a new fields reader on demand if the field infos have changed since the last field reader instantiation.

          Michael McCandless added a comment -

          intUptoStart is used in THPF.writeByte which is very much a hotspot when indexing, so I added it as a direct member in THPF to avoid an extra deref through the intPool. Could be this is harmless in practice though...

          Jason Rutherglen added a comment -

          Previously there was a discussion about DW index readers that
          stay open, but could refer to byte arrays that are
          recycled? Can't we simply throw away the doc writer after a
          successful segment flush (the IRs would refer to it, however
          once they're closed, the DW would close as well)? Then start
          with a new DW for the next batch of indexing for that thread?

          Michael McCandless added a comment -

          Can't we simply throw away the doc writer after a
          successful segment flush (the IRs would refer to it, however
          once they're closed, the DW would close as well)?

          I think that should be our first approach. It means no pooling whatsoever. And it means that an app that doesn't aggressively close its old NRT readers will consume more RAM.

          Though... the NRT readers will be able to search an active DW right? Ie, it's only when that DW needs to flush, when the NRT readers would be tying up the RAM.

          So, when a flush happens, existing NRT readers will hold a reference to that now-flushed DW, but when they reopen they will cutover to the on-disk segment.

          I think this will be an OK limitation in practice. Once NRT readers can search a live (still being written) DW, flushing of a DW will be a relatively rare event (unlike today where we must flush every time an NRT reader is opened).

          Jason Rutherglen added a comment -

          For the skip list, we could reuse what we have (ie,
          DefaultSkipListReader), though, we'd need to choose a default
          number of docs, pulled out of thin air, as there's no way to
          guesstimate per term beforehand. Or we can have a single level
          skip list (more like an index) and binary search to find the
          value (assuming we have an int array instead of storing vints)
          in the skip list we're looking for.
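
          The single level variant could be as simple as a sorted int[] of
          sampled docIDs that we binary search (sketch only, with an arbitrary
          skip interval):

          import java.util.Arrays;

          // Sketch: a one-level skip "index" sampling every 16th docID of a term's postings.
          final class SingleLevelSkipIndex {
            static final int SKIP_INTERVAL = 16;   // arbitrary choice for the sketch
            private final int[] skipDocs;          // sorted docIDs, one per skip point

            SingleLevelSkipIndex(int[] skipDocs) {
              this.skipDocs = skipDocs;
            }

            // Returns the posting offset to resume scanning from for targetDocID.
            int skipTo(int targetDocID) {
              int idx = Arrays.binarySearch(skipDocs, targetDocID);
              if (idx < 0) {
                idx = -idx - 2;   // last skip point with docID < target
              }
              return Math.max(idx, 0) * SKIP_INTERVAL;
            }
          }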

          Brian Goetz added a comment -

          @JasonR: There is a good reason there is no AtomicByteArray. Atomic classes require the underlying hardware to provide a CAS-like operation, and real-world hardware does not provide CAS on bytes. So an AtomicByteArray class would almost certainly have to declare an underlying array of ints anyway.
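
          In other words, such a class would end up packing four bytes per int
          and CAS'ing the whole 32-bit word, roughly like this sketch:

          import java.util.concurrent.atomic.AtomicIntegerArray;

          // Sketch: byte-granularity volatile/CAS access layered over an int array.
          final class AtomicByteArraySketch {
            private final AtomicIntegerArray ints;

            AtomicByteArraySketch(int length) {
              ints = new AtomicIntegerArray((length + 3) / 4);
            }

            byte get(int i) {
              int word = ints.get(i >>> 2);          // volatile read of the containing word
              return (byte) (word >>> ((i & 3) << 3));
            }

            void set(int i, byte b) {
              final int idx = i >>> 2;
              final int shift = (i & 3) << 3;
              final int mask = 0xFF << shift;
              while (true) {                          // CAS loop on the containing word
                int oldWord = ints.get(idx);
                int newWord = (oldWord & ~mask) | ((b & 0xFF) << shift);
                if (ints.compareAndSet(idx, oldWord, newWord)) {
                  return;
                }
              }
            }
          }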

          Jason Rutherglen added a comment -

          Brian, thanks for the explanation!

          Michael Busch added a comment -

          Hi Brian - good to see you on this list!

          In my previous comment I actually quoted some sections of the concurrency book:
          https://issues.apache.org/jira/browse/LUCENE-2312?focusedCommentId=12845712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845712

          Did I understand correctly that a volatile write can be used to enforce a cache->RAM write-through of all updates a thread made that came before the volatile write in the thread's program order?

          My idea here was to use this behavior to avoid volatile writes for every document, but instead to periodically do such a volatile write (say e.g. every 100 documents). I implemented a class called MemoryBarrier, which keeps track of when the last volatile write happened. A reader thread can ask the MemoryBarrier what the last successfully processed docID before crossing the barrier was. The reader will then never attempt to read beyond that document.

          Of course there are tons of details regarding safe publication of all involved fields and objects. I was just wondering if this general "memory barrier" approach seems right and if indeed performance gains can be expected compared to doing volatile writes for every document?
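
          Roughly, the core of such a MemoryBarrier could be as simple as the
          sketch below (simplified; the real class would track more state): one
          volatile field whose periodic write publishes everything the writer
          thread did up to that docID, and whose read tells a reader how far it
          may safely go.

          // Sketch of the periodic-publication idea: the writer indexes documents
          // with plain (non-volatile) writes and only crosses a memory barrier
          // every SYNC_INTERVAL docs; readers never look past lastSyncedDocID.
          final class MemoryBarrierSketch {
            static final int SYNC_INTERVAL = 100;    // e.g. every 100 documents

            private volatile int lastSyncedDocID = -1;

            // Writer thread: call after finishing docID.
            void maybeSync(int docID) {
              if (docID - lastSyncedDocID >= SYNC_INTERVAL) {
                // Volatile write: everything the writer wrote before this line
                // happens-before a reader's subsequent read of lastSyncedDocID.
                lastSyncedDocID = docID;
              }
            }

            // Reader thread: the maxDoc it is allowed to search.
            int lastSyncedDocID() {
              return lastSyncedDocID;                // volatile read
            }
          }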

          Michael McCandless added a comment -

          So.. what does this mean for allowing an IR impl to directly search IW's RAM buffer?

          Michael Busch added a comment - - edited

          So.. what does this mean for allowing an IR impl to directly search IW's RAM buffer?

          The main idea is to have an approach that is lock-free. Then write performance will not suffer no matter how big your query load is.

          When you open/reopen a RAMReader it would first ask the MemoryBarrier for the last sync'ed docID (volatile read). This would be the maxDoc for that reader and it's safe for the reader to read up to that id, because it can be sure that all changes the writer thread made up to that maxDoc are visible to the reader.

          If we called MemoryBarrier.sync() let's say every 100 docs, then the max. search latency would be the amount of time it takes to index 100 docs. Doing no volatile/atomic writes and not going through explicit locks for 100 docs will allow the JVM to do all its nice optimizations. I think this will work, but honestly I have not really a good feeling for how much performance this approach would gain compared to writing to volatile variables for every document.

          Michael Busch added a comment -

          I think sync'ing after every doc is probably the better option. We'll still avoid the need to make all variables downstream of DocumentsWriter volatile/atomic, which should be a nice performance gain.

          The problem with the delayed sync'ing (after e.g. 100 docs) is that if you don't have a never-ending stream of twee... err documents, then you might want to force an explicit sync at some point. But that's very hard, because you would have to force the writer thread to make e.g. a volatile write via an API call. And if that's an IndexWriter API that has to trigger the sync on multiple DocumentsWriter instances (i.e. multiple writer threads) I don't see how that's possible unless Lucene manages its own pool of threads.

          Jason Rutherglen added a comment -

          We need to fill in the blanks on the terms dictionary
          implementation. Michael B. has some good ideas on implementing it
          using parallel arrays and dynamically updating a linked list
          implemented as a parallel AtomicIntegerArray. A question we have
          is regarding the use of a btree to quickly find the point of
          insertion for a new term. The btree would replace the term index
          which is binary searched and the term dictionary linearly
          scanned. Perhaps there's a better data structure for concurrent
          update and lookups?

          Another use of the AtomicIntegerArray could be the deletes
          sequence id int[]. However is it needed and would the lookup be
          fast enough?

          Jason Rutherglen added a comment -

          In regards to implementing the terms dictionary index using a btree, here is a useful link on lock-free btree/esque solutions.

          http://stackoverflow.com/questions/256511/skip-list-vs-binary-tree

          Jason Rutherglen added a comment -

          We could use ConcurrentSkipListMap however there's a bit of
          overhead for the node object pointers. Michael had mentioned
          implementing the sorted terms using parallel arrays. Maybe for
          the first implementation we can store a sorted parallel array of
          BytesRefs? Or flatten a concurrent skip list into parallel
          arrays?
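
          A first cut with ConcurrentSkipListMap might look like this sketch
          (String keys for brevity; the real dictionary would key on the term
          bytes and carries the per-node object overhead mentioned above):

          import java.util.concurrent.ConcurrentNavigableMap;
          import java.util.concurrent.ConcurrentSkipListMap;

          // Sketch: sorted term -> termID map that a single writer updates and
          // many readers seek/iterate concurrently without locks.
          final class SkipListTermsDict {
            private final ConcurrentSkipListMap<String, Integer> terms =
                new ConcurrentSkipListMap<String, Integer>();

            void addTerm(String term, int termID) {
              terms.putIfAbsent(term, termID);
            }

            Integer seekExact(String term) {
              return terms.get(term);
            }

            // Range/prefix style access for queries that need sorted terms.
            ConcurrentNavigableMap<String, Integer> rangeFrom(String term) {
              return terms.tailMap(term, true);
            }
          }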

          Jason Rutherglen added a comment -

          This wikipedia article illustrates the use of parallel arrays we can use for the terms dictionary: http://en.wikipedia.org/wiki/Linked_list#Linked_lists_using_arrays_of_nodes

          Where the next and previous arrays would be AtomicIntegerArrays for concurrency (we may not need to implement the previous array because the TermsEnum only goes forward). If we can somehow implement the concurrent skip list on top of the concurrent linked list, we're probably good to go.
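
          Concretely, the array-of-nodes idea with an AtomicIntegerArray for the
          next pointers could be sketched as below, assuming a single writer
          thread (one DWPT) and many reader threads:

          import java.util.concurrent.atomic.AtomicIntegerArray;

          // Sketch: sorted singly-linked list over parallel arrays, keyed by termID.
          // Single writer, many readers; readers only follow 'head' and 'next'.
          final class ArrayLinkedTermList {
            static final int NO_NEXT = -1;

            final int[] textStarts;            // termID -> offset of the term's bytes
            final AtomicIntegerArray next;     // termID -> next termID in sorted order
            volatile int head = NO_NEXT;       // first termID in sorted order

            ArrayLinkedTermList(int maxTerms) {
              textStarts = new int[maxTerms];
              next = new AtomicIntegerArray(maxTerms);
            }

            // Writer: insert newTermID after prevTermID (or at the head if
            // prevTermID == NO_NEXT). Any per-term parallel data (textStarts etc.)
            // must be written before the publishing set()/head write below.
            void insertAfter(int prevTermID, int newTermID) {
              if (prevTermID == NO_NEXT) {
                next.set(newTermID, head);         // link first...
                head = newTermID;                  // ...then publish via volatile write
              } else {
                next.set(newTermID, next.get(prevTermID));
                next.set(prevTermID, newTermID);   // volatile write makes the node reachable
              }
            }

            // Readers: walk the list in sorted order.
            int first() { return head; }
            int nextOf(int termID) { return next.get(termID); }
          }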

          Jason Rutherglen added a comment -

          We need to figure out a way to concurrently read and write from the *BlockPool classes. This will block implementing the reading of the primary index information (ie, postings).

          Jason Rutherglen added a comment -

          I think we can change the *BlockPool classes to not expose the
          underlying array and instead hide the implementation, perhaps
          even implement a random access file like api. At the end of a
          document add, we'd flush the RAMRandomAccessFile like class, at
          which point the underlying data would become available for
          reading. Unless there's something I'm missing?

          Michael McCandless added a comment -

          Note that many queries don't need sorted terms (TermQuery, PhraseQuery, SpanTermQuery). Others do (NumericRangeQuery, TermRangeQuery, AutomatonQuery (wildcard, fuzzy, regexp), PrefixQuery). At flush time we also need the terms sorted.

          Jason Rutherglen added a comment - - edited

          Note that many queries don't need sorted terms

          Right, for starters we could offer a limited terms enum, terms docs API that only allows seeking to a specific term, without iteration of the terms dictionary, and iteration of the term's postings. Then some of these basic queries could be enabled.

          Jason Rutherglen added a comment -

          The patches I've been submitting to LUCENE-2575 probably should go here. Once the new byte block pool that records the slice level at the beginning of the slice is finished, the skip list can be completed, and then the basic functionality for searching on the RAM buffer will be done. At that point the concurrency and memory efficiency may be focused on and tested. In addition the deletes must be implemented.

          Jason Rutherglen added a comment -

          The term dictionary's sorted int[] term ids can periodically be merged to create larger arrays. We can use a log merge policy like algorithm to merge the arrays, while the iteratively added terms will be placed into a concurrent skip list. Once the terms CSL reaches a given threshold (ie, 10,000) then it'll be converted into an int[] and periodically merged using log level steps.
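
          The merge step itself is an ordinary two-way merge of termID arrays
          that are each already sorted by term text (sketch only; compareTerms
          is assumed to compare the term bytes the IDs refer to):

          // Sketch: merge two termID arrays, each sorted by term text,
          // into one larger sorted array (the "log merge" step for the terms dict).
          final class TermIDArrayMerger {

            interface TermComparator {
              int compareTerms(int termIDA, int termIDB);
            }

            static int[] merge(int[] a, int[] b, TermComparator cmp) {
              final int[] merged = new int[a.length + b.length];
              int i = 0, j = 0, k = 0;
              while (i < a.length && j < b.length) {
                merged[k++] = cmp.compareTerms(a[i], b[j]) <= 0 ? a[i++] : b[j++];
              }
              while (i < a.length) merged[k++] = a[i++];
              while (j < b.length) merged[k++] = b[j++];
              return merged;
            }
          }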

          Jason Rutherglen added a comment -

          An interesting question for this issue is how will norms be handled? It's an issue because norms requires the total number of terms, which we can compute per reader, however as we add readers, we can easily generate too much garbage (ie, a new norms byte[] per field per reader). Perhaps we can relax the accuracy of the calculation for RT?

          Jason Rutherglen added a comment -

          I wanted to get everything built to prove out some of the ideas. We're ready to concurrency test this. We're nearly memory efficient except for the terms dict and term freqs.

          • A single pass test of most basic RAM reader functionality
          • Working deletes via sequence IDs
          • We're only copying the spine of the byte block buffers and the int pool buffers per reader.
          • Flushing and loading of doc stores and term vectors
          • We're only making a full copy per reader of the term freqs in FreqProxPostingsArray. This'll need to use paged ints.
          • Incorporated LUCENE-2575's slice allocation
          • Terms dictionary using a CSL. Each terms enum is unique to a reader.
          • The RAM reader implements all IndexReader methods except norms.
          • Not committable
          Jason Rutherglen added a comment -

          We don't need to create a new spine array for the byte and int blocks per reader, these are append only spine arrays so we'll only copy a reference to them per reader.

          Jason Rutherglen added a comment -

          FreqProxTermsWriterPerField writes the prox posting data as terms are seen, however for the freq data we wait until we're on to the next doc (to accurately record the doc freq), and seeing a previously analyzed term before writing to the freq stream. Because the last doc code array in the posting array should not be copied per reader, when a document is finished, we need to flush the freq info out per term seen for that doc. This way, on reader instigated flush, the reader may always read all necessary posting data from the byte slices, and not rely partially on the posting array. I don't think this will affect indexing performance.

          Jason Rutherglen added a comment -

          I think this can be implemented on the single DWPT case we have today. I'm not
          sure why it needs to wait on DWPT other than we thought DWPT would make the
          synchronization easier. Given that we're simply making pointer references in
          sync'd blocks to achieve basic RT, the single DW in 3.x and trunk should work
          fine.

          I'll start by making this patch to apply cleanly to trunk. There are two issues
          which will prevent all tests from passing, field caches and norms. Field caches
          will require custom dynamic growing of the field cache arrays, perhaps we'll
          need to add a .grow method to them? We don't really have a good way to address
          norms and for now I'll leave them faked.

          Jason Rutherglen added a comment -

          I thought about opening a separate issue for RT FieldCache. In trunk it looks
          like we need to add a grow method to CachedArray. However this doesn't quite
          solve the problem of filling in the field cache values either on demand or as
          documents are added. I'm not yet sure how the on-demand case'll work, which is
          probably the most logical to implement.

          The difficult use case is terms index, as it returns the term for a docid from
          an ord value. I don't think maintaining a terms index is possible on a rapidly
          changing index, eg, efficiently keeping an ordered terms array. Additionally,
          we can't easily tap into the CSML terms dictionary as it'll be changing and
          doesn't offer ord access.

          Perhaps we'd need to hardcode the terms index to return the term for a docid,
          ie, force calls to getTermsIndex to return DocTerms. The underlying doc terms
          field cache can be iteratively built on-demand. I wonder if there are
          compatibility issues with returning a DocTermsIndex that only effectively
          implements DocTerms?

          Jason Rutherglen added a comment -

          After further thought, there really isn't a choice here, the field cache values
          must be added as documents are added, or inefficiently generated from scratch
          per reader. The latter could be used to allow all tests to pass, however the
          excessive garbage generated could even preclude that.

          I think this'll mean introspecting for existing RT field cache entries and add
          values to them as documents are added.

          Jason Rutherglen added a comment -

          Simple patch that's added a new CachedArray setValue method. It can be used to specify field cache values while indexing with RT turned on modally. A potential problem is whether the public values variable needs to be volatile due to multithreaded access?
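
          To make the volatile question concrete, a growable per-field cache
          column could rely on a volatile write of the array reference to
          publish both the (possibly grown) array and the value written just
          before it, along the lines of this sketch (hypothetical class, not
          the actual CachedArray API):

          // Sketch: an int field-cache column that the indexing thread extends
          // docID by docID and reader threads consume without locks.
          final class GrowableIntFieldCache {
            private volatile int[] values = new int[16];

            // Writer thread only.
            void setValue(int docID, int value) {
              int[] cur = values;
              if (docID >= cur.length) {
                int[] grown = new int[Math.max(cur.length * 2, docID + 1)];
                System.arraycopy(cur, 0, grown, 0, cur.length);
                cur = grown;
              }
              cur[docID] = value;
              values = cur;   // volatile write publishes the slot (and the new array if grown)
            }

            // Reader threads: only read up to the maxDoc their reader was opened with.
            int get(int docID) {
              return values[docID];   // volatile read of the reference
            }
          }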

          Jason Rutherglen added a comment -

          For TermOrdValComparator we'll have an issue when comparing the other segment's DocTermsIndex against the RAM reader's DocTerms for a given field, eg, it's going to fail because we won't have the ordinal. We could add a flag that forces TOVC to compare RAM reader(s)' field caches by value?

          Michael McCandless added a comment -

          We could add a flag that forces TOVC to compare RAM reader(s)' field caches by value?

          Looks like we'll have to do something like that. It's sort of like mixing StringOrdValComp with StringValComp on a segment by segment basis. Ie for flushed segments it can use ord but for still-in-RAM segments it cannot. Maybe we add a supportsOrd() to the FC entry?

          Michael McCandless added a comment -

          Simple patch that adds a new CachedArray setValue method

          Hmmm – how will we incrementally update the FC entry on opening a new RT reader? Don't we have to fully re-invert?

          Maybe... we only support comparators using docvalues on RT segments, instead? DocValues are much easier to update incrementally since they are set during indexing. (But: we still need to figure out how DocValues intersects w/ RT!).

          Jason Rutherglen added a comment -

          how will we incrementally update the FC entry on opening a new RT reader? Don't we have to fully re-invert?

          Right, fully re-inverting is inefficient, so we need to add field cache values as we're indexing. That could itself be inefficient in some other way, though I think it's somewhat unlikely (maybe the conversion from string -> numbers?).

          DocValues are much easier to update incrementally since they are set during indexing. (But: we still need to figure out how DocValues intersects w/ RT!).

          What's wrong with incrementally adding to the FC? Also where are DocValues set during indexing? In trunk they're only referenced in the o.a.l.s.function package.

          Jason Rutherglen added a comment -

          Maybe we add a supportsOrd() to the FC entry?

          Or mark the reader as requiring ord only in the non-RAM reader case. Either that, or we'd have an instanceof check in FieldCache, which is perhaps ugly.

          Yonik Seeley added a comment -

          We shouldn't put too much magic down in the guts of anything here. Obtaining a normal fieldcache entry should work the same on an RT reader as any other reader.

          If we want a hybrid that can work either over docvalues or a normal FieldCache entry, then that should be a different type of FieldComparatorSource. TOVC should continue to work as it does today.

          Jason Rutherglen added a comment -

          Obtaining a normal fieldcache entry should work the same on an RT reader as any other reader

          Yes. I'm still confused as to how DocValues fits into all of this.

          TOVC should continue to work as it does today

          It should, otherwise there'll be performance considerations. The main proposal here is incrementally updating FC values and how to continue to use DocTermsIndex for non-RT readers mixed with DocTerms for RT readers, either in TOVC or somewhere else.

          Simon Willnauer added a comment -

          Yes. I'm still confused as to how DocValues fits into all of this.

          DocValues == column stride fields

          does that help ?

          simon

          Jason Rutherglen added a comment -

          DocValues == column stride fields

          Ok, that makes sense!

          I'm going to leave this alone for now; however, I agree that ideally we'd leave
          TOVC alone and intermix the ord and non-ord doc terms at a higher level. It's
          hard to immediately determine how that'd work given the slot concept, which
          seems to be an ord or value per reader that's directly comparable? Is there an
          example of mixing multiple comparators for a given field?
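          For what it's worth, here is a hedged sketch of the slot idea (names are illustrative, not Lucene's actual FieldComparator API): the comparator keeps one copied value per competing hit, so hits from different readers remain comparable by value even when ords aren't available:

            // Illustrative slot-based string comparator: slots hold copied values so that
            // hits coming from different (RAM or flushed) readers remain comparable.
            final class ByValueStringComparator {
              private final String[] slots;                // one slot per competing hit in the queue
              private String[] currentReaderValues;        // per-doc values of the reader being scored

              ByValueStringComparator(int numHits) {
                this.slots = new String[numHits];
              }

              void setNextReader(String[] docValues) {     // called once per (sub-)reader
                this.currentReaderValues = docValues;
              }

              int compareSlots(int slot1, int slot2) {
                return slots[slot1].compareTo(slots[slot2]);
              }

              int compareBottomTo(int bottomSlot, int doc) {
                return slots[bottomSlot].compareTo(currentReaderValues[doc]);
              }

              void copy(int slot, int doc) {               // record the competing hit's value
                slots[slot] = currentReaderValues[doc];
              }
            }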

          Earwin Burrfoot added a comment -

          Some questions to align myself with impending reality.

          Is it right that future RT readers are no longer immutable snapshots (in the sense that they have a variable maxDoc)?
          If so, are you keeping the current NRT mode, with fast turnaround yet immutable readers?

          Michael McCandless added a comment -

          I believe the goal for RT readers is still "point in time" reader semantics.

          Jason Rutherglen added a comment -

          Is it right that future RT readers are no longer immutable snapshots
          (in the sense that they have a variable maxDoc)?

          The RT readers'll be point-in-time. There are many mechanisms to make this
          happen; they mainly revolve around a static maxDoc per reader while allowing
          some of the underlying data structures to change during indexing. There are two
          overall design issues right now: how to handle norms, and the System.arraycopy
          per getReader that creates static, read-only parallel upto arrays.

          I think System.arraycopy should be fast enough given it's an optimized native
          intrinsic on Intel. And for norms we may need to relax their accuracy in order
          to create less garbage. That would involve either using a byte[][] for
          point-in-timeness or a byte[] that is recalculated only as it's grown (meaning
          newer readers created since the last array growth may see a slightly inaccurate
          norm value). The norm byte[] would essentially be grown every N docs.
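          As a rough sketch of that "grow every N docs" idea (class and method names below are made up), a norms buffer that is republished only in block-sized steps might look like:

            import java.util.Arrays;

            // Hypothetical norms buffer that is republished only every BLOCK docs,
            // trading a little accuracy for far less garbage.
            final class GrowableNorms {
              private static final int BLOCK = 1024;
              private volatile byte[] norms = new byte[BLOCK];

              synchronized void setNorm(int docID, byte norm) {
                byte[] current = norms;
                if (docID >= current.length) {
                  // grow in BLOCK-sized steps; only then do newly opened readers see a fresh array
                  current = Arrays.copyOf(current, ((docID / BLOCK) + 1) * BLOCK);
                }
                current[docID] = norm;
                norms = current;
              }

              // A point-in-time reader captures the reference once at open time.
              byte[] snapshot() {
                return norms;
              }
            }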

          Michael Busch added a comment -

          I believe the goal for RT readers is still "point in time" reader semantics.

          True. At Twitter our RT solution also guarantees point-in-time readers (with one exception; see below). We have to provide at least a fixed maxDoc per query to guarantee consistency across terms (posting lists). Eg. imagine your query is 'a AND NOT b'. Say a occurs in doc 100. Now you don't find a posting in b's posting list for doc 100. Did doc 100 not have term b, or is doc 100 still being processed and that particular posting hasn't been written yet? If, however, the reader's maxDoc is set to 99 (the last completely indexed document), you can't get into this situation.

          Before every query we reopen the readers, which effectively simply updates the maxDoc.

          The one exception to point-in-time-ness are the df values in the dictionary, which for obvious reasons is tricky. I think a straightforward way to solve this problem is to count the df by iterating the corresponding posting list when requested. We could add a special counting method that just uses the skip lists to perform this task. Here the term buffer becomes even more important, and we should also document that docFreq() can be expensive in RT mode, ie. not O(1) as in non-RT mode, but rather O(log indexSize), assuming we can get multi-level skip lists working in RT.
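          A hedged sketch of computing df on demand against the reader's maxDoc (the PostingsCursor interface and class names are invented for illustration; a real version would hop through the skip list rather than scan linearly):

            // Hypothetical cursor over an in-RAM postings list.
            interface PostingsCursor {
              int nextDoc();   // returns Integer.MAX_VALUE when exhausted
            }

            final class OnDemandDocFreq {
              // Count documents up to and including maxDoc; with multi-level skip lists
              // this walk could be reduced to O(log indexSize) hops instead of a full scan.
              static int docFreq(PostingsCursor postings, int maxDoc) {
                int df = 0;
                for (int doc = postings.nextDoc(); doc <= maxDoc; doc = postings.nextDoc()) {
                  df++;
                }
                return df;
              }
            }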

          Jason Rutherglen added a comment -

          The one exception to point-in-time-ness are the df values in the
          dictionary, which for obvious reasons is tricky.

          Right, forgot about those. I think we'd planned on using a multi-dimensional
          array, eg int[][]. However, we'd need to test how that affects indexing
          performance. If that doesn't work then we'll need to think about other
          solutions like building them on demand, which just offloads the problem
          somewhere else. It looks like docFreq is used only for phrase queries? Still,
          I think paying a potentially small penalty during indexing (only when RT is on)
          is better than a somewhat random penalty during querying.
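          Illustrative sketch of the int[][] idea (all names hypothetical): counters live in fixed-size pages, so growth allocates one new page rather than copying the whole array:

            // Hypothetical paged counter array: one page per PAGE_SIZE term ids.
            final class PagedIntCounters {
              private static final int PAGE_SIZE = 4096;
              private volatile int[][] pages = new int[0][];

              synchronized void increment(int termID) {
                int page = termID / PAGE_SIZE;
                int[][] current = pages;
                if (page >= current.length) {
                  current = java.util.Arrays.copyOf(current, page + 1);
                  for (int i = 0; i < current.length; i++) {
                    if (current[i] == null) current[i] = new int[PAGE_SIZE]; // allocate missing pages
                  }
                  pages = current; // publish the grown page table
                }
                current[page][termID % PAGE_SIZE]++;
              }

              // Assumes termID has been counted at least once before.
              int get(int termID) {
                return pages[termID / PAGE_SIZE][termID % PAGE_SIZE];
              }
            }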

          Jason Rutherglen added a comment -

          In the current patch, I'm copying the parallel array for the end of a term's postings per reader [re]open. However, in the case where we're opening a reader after each document is indexed, this is wasteful. We can simply queue the term ids from the last indexed document, and only copy the newly updated values over to the read-only, consistent parallel array.
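          Roughly, the idea (hypothetical names, assuming a single indexing thread per buffer) is to remember which term ids changed since the last reader and copy only those slots:

            import java.util.HashSet;
            import java.util.Set;

            // Hypothetical: track dirty term ids so a per-document reader reopen only
            // copies the slots that actually changed since the previous reopen.
            final class UptoSnapshotter {
              private final Set<Integer> dirtyTermIDs = new HashSet<>(); // single indexing thread assumed

              void markChanged(int termID) {          // called while indexing a document
                dirtyTermIDs.add(termID);
              }

              // Copy only the changed slots from the live array into the read-only copy.
              void applyTo(int[] liveUptos, int[] readOnlyUptos) {
                for (int termID : dirtyTermIDs) {
                  readOnlyUptos[termID] = liveUptos[termID];
                }
                dirtyTermIDs.clear();
              }
            }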

          Jason Rutherglen added a comment -

          I had been testing out an alternative skip list. I think it's a bit too esoteric at this point. I'm resuming work on this issue, using Java's CSLM for the terms dict. There really isn't a good way to break up the patch; it's just going to be large, eg, we can't separate out the terms dict from the RT postings, etc.
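          A bare-bones sketch of using Java's ConcurrentSkipListMap as the RT terms dictionary (term text mapped to a term id); readers get a sorted, weakly consistent view while indexing continues. The names are illustrative, not the patch's actual classes:

            import java.util.concurrent.ConcurrentSkipListMap;
            import java.util.concurrent.atomic.AtomicInteger;

            // Hypothetical RT terms dictionary: sorted, concurrently readable while writing.
            final class RTTermsDict {
              private final ConcurrentSkipListMap<String, Integer> terms = new ConcurrentSkipListMap<>();
              private final AtomicInteger nextTermID = new AtomicInteger();

              // Concurrent racing inserts may burn an id; acceptable for a sketch.
              int termID(String term) {
                return terms.computeIfAbsent(term, t -> nextTermID.getAndIncrement());
              }

              // Readers see a weakly consistent, always term-sorted view.
              Iterable<String> sortedTerms() {
                return terms.keySet();
              }
            }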

          Jason Rutherglen added a comment -

          This is a revised version of the LUCENE-2312 patch. The following are various miscellaneous notes pertaining to the patch and what still needs to happen before it can be committed.

          Feel free to review the approach taken, eg, we're getting around non-realtime structures through the use of array copies (the arrays can be pooled at some point).

          • A copy of FreqProxPostingsArray.termFreqs is made per new reader. That array can be pooled. This is no different from the deleted docs BitVector, which is created anew per segment for any deletes that have occurred.
          • FreqProxPostingsArray's freqUptosRT, proxUptosRT, lastDocIDsRT, and lastDocFreqsRT are copied into per new reader (as opposed to instantiating entirely new arrays for each new reader); this is a slight optimization in object allocation.
          • For deleting, a DWPT is wrapped in an abstract class that exposes the necessary methods from segment info, so that deletes may be applied to the RT RAM reader. The deleting is still performed in BufferedDeletesStream. BitVectors are cloned as well. There is room for improvement, eg, pooling the BV byte[]s.
          • Documents (FieldsWriter) and term vectors are flushed on each getReader call, so that reading will be able to load the data. We will need to test whether this is performant. We are not creating new files, so this way of doing things may well be efficient.
          • We need to measure the cost of the native system array copy. It could very well be fast enough.
          • Full postings functionality should be working, including payloads.
          • Field caching may be implemented as a new field cache that is growable and enables locked replacement of the underlying array.
          • String-to-string ordinal comparison caches need to be figured out. The RAM readers cannot maintain a sorted terms index the way statically sized segments do.
          • When a field cache value is first being created, it needs to obtain the indexing lock on the DWPT. Otherwise documents will continue to be indexed and new values created, while the array will miss the new values. The downside is that while the array is initially being created, indexing will stop. This can probably be solved at some point by only locking during the creation of the field cache array, and then notifying the DWPT of the new array. New values would then accumulate into the array from the point of the max doc of the reader the values creator is working from.
          • The terms dictionary is a ConcurrentSkipListMap. We can periodically convert it into a sorted-by-term int[] that has an FST on top (see the sketch after this list).
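          As referenced in the last bullet above, a hedged sketch of periodically freezing the live skip-list dictionary into a sorted-by-term int[] of term ids (the FST layer is omitted; all names are illustrative):

            import java.util.ArrayList;
            import java.util.List;
            import java.util.concurrent.ConcurrentSkipListMap;

            // Hypothetical: freeze the live skip-list dictionary into a sorted int[] of term ids.
            // Lookups on the frozen part can then binary-search by the term text behind each id.
            final class FrozenTermsBuilder {
              static int[] freeze(ConcurrentSkipListMap<String, Integer> liveTerms) {
                // values() iterates in key (term) order; the view is weakly consistent
                List<Integer> ids = new ArrayList<>(liveTerms.values());
                int[] sortedTermIDs = new int[ids.size()];
                for (int i = 0; i < ids.size(); i++) {
                  sortedTermIDs[i] = ids.get(i);
                }
                return sortedTermIDs;
              }
            }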

          Have fun reviewing!

          Jason Rutherglen added a comment -

          A benchmark plan: compare the speed of NRT vs. RT.

          Index documents in a single thread; in a 2nd thread, open a reader and perform a query. It would be nice to synchronize the point / max doc at which RT and NRT open new readers, to additionally verify the correctness of the directly comparable search results. To make the test fair, the concurrent merge scheduler should be turned off in the NRT test.

          The hypothesis is that array copying, even on large [RT] indexes, is no big deal compared with the excessive segment merging under NRT.
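          Very roughly, such a benchmark could be wired up like this (the indexing and search bodies are stand-ins for the real IndexWriter / getReader calls):

            import java.util.concurrent.atomic.AtomicBoolean;

            // Rough two-thread benchmark skeleton: one indexing thread, one thread that
            // repeatedly reopens a reader and runs a fixed query, timing each reopen+search.
            public final class RTvsNRTBench {
              public static void main(String[] args) throws Exception {
                final AtomicBoolean done = new AtomicBoolean(false);

                Thread indexer = new Thread(() -> {
                  for (int i = 0; i < 1_000_000; i++) {
                    // indexWriter.addDocument(makeDoc(i));  // real indexing call goes here
                  }
                  done.set(true);
                });

                Thread searcher = new Thread(() -> {
                  while (!done.get()) {
                    long t0 = System.nanoTime();
                    // reader = reopen(); run the fixed query; record hit counts for later comparison
                    long elapsedMs = (System.nanoTime() - t0) / 1_000_000;
                    System.out.println("reopen+search: " + elapsedMs + " ms");
                  }
                });

                indexer.start();
                searcher.start();
                indexer.join();
                searcher.join();
              }
            }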

          Jason Rutherglen added a comment -

          Here's a new patch that incrementally adds field cache and norms values, meaning that as documents are added / indexed, norms and field cache values are automatically created. The field cache values are only added to if they have already been created.

          The field cache functionality needs to be completed for all types.

          We probably need to get the indexing lock while the field cache value is initially being created (eg, the terms enumeration).

          We're more or less feature complete now.

          Simon Willnauer added a comment -

          Jason, I will look at this patch soon, I hope. Busy times here right now, so gimme some time.

          thanks

          Jason Rutherglen added a comment -

          I'll post a new patch shortly that fixes bugs and adds a bit more to the
          functionality.

          The benchmark results are interesting. Array copies are very fast; I don't see
          any problems with that, the median time is 2 ms. The concurrent skip list map
          is expensive when adding tens of thousands of terms, which I think is to be
          expected. The strategy of amortizing that cost by creating sorted-by-term
          int[]s will probably be more performant than the CSLM.

          The sorted int[] terms can be merged just like segments; thus RT becomes a way
          to remove the [NRT] cost of merging [numerous] postings lists. The int[] terms
          can be merged in the background so that raw indexing speed is not affected.

          Jason Rutherglen added a comment -

          There are many important use cases for immediate / zero delay index readers.

          I'm not sure if people realize it, but one of the major gains from this issue is the ability to obtain a reader after every indexed document.

          In this case, instead of performing an array copy of the RT data structures, we will queue the changes and then apply them to the new reader. For arrays like term freqs, we will use a temp hash map of the changes made since the main array was created (when the hash map grows too large we can perform a full array copy).
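          Sketch of that overlay idea (hypothetical names): a reader consults a small map of recent changes first and falls back to the shared base array, with a full array copy deferred until the map grows too large:

            import java.util.HashMap;
            import java.util.Map;

            // Hypothetical read view: a small delta map layered over a shared base array.
            final class OverlayIntArray {
              private final int[] base;                  // shared, created at the last full copy
              private final Map<Integer, Integer> delta; // changes made since the base was taken

              OverlayIntArray(int[] base, Map<Integer, Integer> delta) {
                this.base = base;
                this.delta = new HashMap<>(delta); // snapshot the delta for this reader
              }

              int get(int index) {
                Integer changed = delta.get(index);
                return changed != null ? changed : base[index];
              }
            }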


            People

            • Assignee: Michael Busch
            • Reporter: Jason Rutherglen
            • Votes: 1
            • Watchers: 14
