|
Jason Rutherglen made changes - 22/Jun/08 05:06 PM
lucene-1313.patch
Removed http://reader.imagero.com/uio/
Jason Rutherglen made changes - 24/Jun/08 12:21 AM
lucene-1313.patch
Depends on Will remove writing a snapshot.xml file per transaction in favor of a human readable log. Creating and deleting these small files is a bottleneck for update speed. This way a transaction writes to 2 files only. The merges happen in the background and so never affect the transaction update speed. I am not sure how useful it would be, but it is possible to have a priority based IO system that favors transactions over merges. If a transaction is coming in and a merge is happening to disk, the merge is stopped and the transaction IO runs, then the merge IO continues. I am not sure how to handle Documents with Fields that have a TokenStream as the value as I believe these cannot be serialized. For now I assume it will be unsupported. Also not sure how to handle analyzers, are these generally serializable? It would be useful to serialize them for a more automated log recovery process.
Jason Rutherglen made changes - 24/Jun/08 09:31 PM
lucene-1313.patch
Began work on LargeBatch functionality, needs test case. Large batches allow adding documents in bulk (also performing deletes) in a transaction that goes straight to an index bypassing the transaction log. This provides the same speed as using IndexWriter to perform bulk Document processing in Ocean. Started OceanDatabase which will offer a Java API inspired by GData. Will offer optimistic concurrency (something required in a realtime search system) and dynamic object mapping (meaning types such as long, date, double will be mapped to a string term using some Solr code). A file sync is performed after each transaction, will add an option to allow syncing after N transactions like mysql. This will improve realtime update speeds. Future:
Jason Rutherglen made changes - 17/Jul/08 02:35 PM
Hi Jason,
I took an inital look at your code last night. Didn't actually execute anything, just followed method calls around to see what it was up to. My first comment is sort of boring, but there are virtually no javadocs for the core classes such as TransactionSystem, Batch and Index. It would be great if there was a bit at the class level exaplaining what classes they interact with and how. It would also be very helpful if there was method level javadocs for at least the top level commit related logic. One thing that early cought my attention is this method in TransactionSystem: public OceanSearcher getSearcher() throws IOException { Snapshot snapshot = snapshots.getLatestSnapshot(); if (searcherPolicy instanceof SingleThreadSearcherPolicy) { return new OceanSearcher(snapshot); } else { return new OceanMultiThreadSearcher(snapshot, searchThreadPool); } } Am I supposed to call this method for each query (as suggested by the method name) or is this a factory method used to update my own Searcher instance after committing documents to the index (as suggested by the code)? It's not such a big deal, but I personally think you should refactor the instanceOf to a Policy.searcherFactory method, or perhaps even a SearcherPolicyVisitor. Actually, this goes for a few other places in the module too: you have used instanceOf and unchecked casting a bit more extensive to solve problems than what I would have. But as it does not seem to be used in places where it would be a costrly thing to do these comments are mearly about code readability and gut feelings about future problems. I'm a bit concerned about the potential loss of data while documents only resides in InstantiatedIndex or RAMDirectory. I think I'd like an option on some sort of transaction log that could be played up in case of a crash. I think the easiset way would be to convert all documents to be pre-analyzed (field.tokenStream) before passing them on to the instantiated writer. I don't know how much resources that might consume, but it would make me feel safer. karl Hi Karl,
Thanks for taking a look at the code! Yes the methods need javadoc, I was waiting to see if I had settled on them, and because I started building new code on top I guess the methods have settled so I need to add javadoc to them. If you are using TransactionSystem then the getSearcher method would be called for each query. I have developed OceanDatabase which makes the searching transparent and implements optimistic concurrency (version number stored in the document). I believe most systems will want to use OceanDatabase, however the raw TransactionSystem which is more like IndexWriter will be left as well. I have been working on OceanDatabase and have neglected the javadocs of TransactionSystem. I modeled the searcherPolicy instanceof code on the MergeScheduler type of system where there is a marker interface that the subclasses implement. I don't mind changing it, or if you want to you can as well. I considered it a minor detail though and admittedly did not spend much time on it. You are welcome to change it. The transaction log is replayed on a restart of the system. It repopulates a RamIndex (uses RAMDirectory) on startup based on the max snapshot id of the existing indexes, and replays the transaction log from there. I looked at converting documents to a token stream, the problem is, if the field is stored, it creates redundant storage of the data in the transaction log. Ultimately I could not find anything to be gained from storing a token stream. Also if it was converted, what would happen with stored fields? The issue with replaying the document later though is not having the Analyzer. In the distributed object code patch LUCENE-1336 I made Analyzer Serializable. I think it's best to serialize the Analyzer, or create a small database of serialized analyzers that can be called upon during the transaction log recovery process. Because I am not entirely sure about the ramifications of serializing the Analyzer, for example, how much data a serialized Analyzer may have. Perhaps other have some ideas or feedback about serializing analyzers. In conclusion, I'll add more javadocs. Please feel free to ask more questions! Jason Is there a good place to place the javadocs on the Apache website once they are more complete?
LUCENE-1313.patch
Added javadocs. Still needs the
Jason Rutherglen made changes - 01/Oct/08 06:08 PM
Jason Rutherglen made changes - 01/Apr/09 09:06 PM
The patch includes RealtimeIndex a basic class for performing atomic
transactional realtime indexing and search. A single thread periodically flushes to disk the ram index. It relies on We need to benchmark this, specifically 1) realtime w/ramdir Long term I believe we need to implement searching over the
Jason Rutherglen made changes - 01/Apr/09 10:48 PM
Jason, your last patch looks like it's taking the "flush first to RAM Dir" approach I just described as the next step (on the java-dev thread), right? Which is great!
So this has no external dependencies, right? And it simply layers on top of I'd be very interested to compare (benchmark) this approach vs solely Could we change this class so that instead of taking a Transaction object, holding adds & deletes, it simply mirrors IndexWriter's API? Ie, I'd like to decouple the performance optimization of "let's flush small segments ithrough a RAMDir first" from the transactional semantics of "I process a transaction atomically, and lock out other thread's transactions". Ie, the transactional restriction could/should layer on top of this performance optimization for near-realtime search?
Yes.
Is the .alg using the NearRealtimeReader from
The transactional system should be able to support both methods. We'll need to integrate the RAM based indexer into IndexWriter
to carry over the deletes to the ram index while it's copied to disk. This is similar to IndexWriter.commitMergedDeletes carrying deletes over at the segment reader level based on a comparison of the current reader and the cloned reader. Otherwise there's redundant deletions to the disk index using IW.deleteDocuments which can be unnecessarily expensive. To make external we would need to do the delete by doc id genealogy.
So far, I think so? You get to set an update rate (delete + add) docs, eg 50 docs/sec, and a pause time between NRT reopens. Still, it's synthetic. If you guys (LinkedIn) have a way to fold in some realism into the test, that'd be great, if only "our app ingests at X docs(MB)/sec and reopens the NRT reader X times per second" to set our ballback.
Sorry, what are both "modes" of operation? I think there are two different "layers" here – first layer optimizes NRT by flushing small segments to RAMDir first. This seems generally useful and in theory has no impact to the API IndexWriter exposes (it's "merely" an internal optimization). The second layer adds this new Transaction object, such that N adds/deletes/commit/re-open NRT reader can be done atomically wrt other pending Transaction objects.
Right, I think the RAMDir optimization would go directly into IW, if we can separate it out from Transaction. It could naturally derive from the existing RAMBufferSizeMB, ie if NRT forces a flush, so long as its tiny, put it into the local RAMDir instead of the actual Dir, then "deduct" that size from the allowed budget of DW's ram usage. When RAMDIr + DW exceeds RAMBufferSizeMB, we then merge all of RAMDir's segments into a "real" segment in the directory. Latest realtime code, transactions are removed.
Jason Rutherglen made changes - 08/Apr/09 12:22 AM
The test we need to progress to is running the indexing side The ram dir portion of the NRT indexing increases in speed when It might be good to add some default charting to
Do you have a sense of what we'd need to add to contrib/benchmark to make this test possible?
Right – that's the point I made on java-dev about the "freedom" we have wrt NRT's performance.
Well, single threaded indexing speed is also improved by using RAM dir. Ie the use of RAM dir is orthogonal to the app's use of threads for indexing?
I've switched to Google's visualization API (http://code.google.com/apis/visualization/ I added an IndexWriter.getRAMIndex method that returns a
RAMIndex object that can be updated and flushed to the underlying writer. I think this is better than adding more methods to IndexWriter and it separates out the logic of the RAM based near realtime index and the rest of IW. Package protected IW.addIndexesNoOptimize(DirectoryIndexReader[] I think RAMIndex.flush to the underlying writer is not IW.getReader returns the normal IW reader and the RAMIndex The RAMIndex writer can be obtained and modified directly as
Jason Rutherglen made changes - 17/Apr/09 08:27 PM
Jason Rutherglen made changes - 20/Apr/09 09:09 PM
For this patch I'm debating whether to add a package protected
IndexWriter.addIndexWriter method. The problem is, the RAMIndex blocks on the write to disk during IW.addIndexesNoOptimize which if we're using ConcurrentMergeScheduler shouldn't happen? Meaning in this proposed solution, if segments keep on piling up in RAMIndex, we simply move them over to the disk IW which will in the background take care of merging them away and to disk. I don't think it's necessary to immediately write ram segments Just thinking out loud about this.
I agree: it should be fine from IndexWriter's standpoint if some In fact, early versions of I think we could adopt a simple criteria: you flush the new segment to A near realtime reader would also happily mix "real" Dir and RAMDir This should work well I think, and should not require a separate
So we're ok with the blocking that occurs when the ram buffer is
This is pretty much like resolveExternalSegments which would be
Agreed, however the IW.getReader MultiSegmentReader removes
Well... we don't have a choice (unless/until we implement IndexReader impl to directly search the RAM buffer). Still, this should be a good improvement over the blocking when flushing to a real dir.
Similar... the difference is I'd prefer to do a merge of the RAM segments vs the straight one-for-one copy that resolveExternalSegments does. commit would only become more time consuming in the NRT case? IE we'd only flush-to-RAMdir if it's getReader that's forcing the flush? In which case, I think it's fine that commit gets more costly. Also, I wouldn't expect it to be much more costly: we are doing an in-memory merge of N segments, writing one segment to the "real" directory. Vs writing each tiny segment as a real one. In fact, commit could get cheaper (when compared to not making this change) since there are fewer new files to fsync.
Or, fix that filtering to also accept IndexWriter's RAMDir. I'm confused as to how we make DocumentsWriter switch from
writing to disk vs the ramdir? It seems like a fairly major change to the system? One that's hard to switch later on after IW is instantiated? Perhaps the IW.addWriter method is easier in this regard?
Yeah I implemented it this way in the IW.addWriter code. I agree I started working on the IW.addWriter(IndexWriter, boolean
When we create SegmentWriteState (which is supposed to contain all
OK. Maybe we modify resolveExternalSegments to accept a "doMerge"?
I don't think we need two writers? I think one writer, sometimes One separate optimization we should make with NRT is to not close the doc store (stored fields, term vector) files when flushing for an NRT reader.
We do close them now, which then makes merging quite a bit more costly. The trickiness with this optimization is we'd need to be able to somehow share an IndexInput & IndexOutput; or, perhaps we can open an IndexInput even though an IndexOutput has the same file open (Windows may prevent this, though I think I've seen that it will in fact allow it). Once we do that optimization, then with this RAMDir optimization we should try to have the doc store files punch straight through to the real directory, ie bypass the RAMDir. The doc stores are space consuming, and since with autoCommit=false we can bypass merging them, it makes no sense to store them in the RAMDir. We should probably do this optimization as a "phase 2", after this one.
Agreed that DW can write the segment to the RAMDir. I started If we have a disk IW and ram IW, I'm not sure how the docstores
Sounds good.
Agreed, I think this feature is a must otherwise we're doing
I ran into problems with this before, I was trying to reuse
To implement this functionality in parallel (and perhaps make
Yonik Seeley made changes - 27/Apr/09 08:49 PM
Hmm, right. We could exclude RAMDir segments from consideration by EG we do in fact want some merging of the RAM segments if they get too
OK, let's do this as a separate issue/optimization for NRT. There are
Hmm... seems like we need to investigate further. We could either
Yes, separate issue ( So let's leave this issue focused on sometimes using RAMDir for newly created segments.
Is this over complicating things? Sometimes we want a mixture of
I don't think we want to mix RAM & disk merging? EG when RAM is full, we want to quickly flush it to disk as a single
This is functionally the same as not mixing RAM vs disk merging, Yonik raised a good question on
It would be nice if we could approximately measure this before putting more work into this issue – if the gains are not "decent" this optimization may not be worthwhile. Of course, we are talking about 100s of milliseconds for the turnaround time to add docs & open an NRT reader, so if the time for writing/opening many tiny files in RAMDir vs FSDir differs by say 10s of msecs then we should pursue this. We should also consider that the IO system may very well be quite busy (doing merge(s), backups, etc.) and that'd make it slower to have to create tiny files. A simpler optimization might be to allow using CFS for tiny files (even when CFS is turned off), but built the CFS in RAM (ie, write tiny files first to RAMFiles, then make the CFS file on disk). That might get most of the gains since the FSDir sees only one file created per tiny segment, not N.
I raised it more because of the direction the discussion was veering (write through caching to a RAMDirectory, and RAMDirectory being faster to search). I do believe that RAMDirectory can probably improve NRT, but it would be due to avoiding waiting for file open/write/close/open/read (as Mike also said)... and not any difference during IndexSearcher.search(), which should be irrelevant due to the relative size differences of the RAMDirectory and the FSDirectory. Small file creation speeds will also be heavily dependent on the exact file system used.
I assume it's ok for the IW.mergescheduler to be used which may Also a random thought, it seems like ConcurrentMergeScheduler
Only if we "accept" requiring MergePolicy to be aware that some I think this approach is probably best: you're right that allowing CMS It does mean, though, that the RAM usage semantics of IW is no longer Though one trickiness is... if a large RAM -> RAM merge takes place, This'd also mean advanced users that implement their own MergePolicy
Right, this is one of the strong reasons to do the "internal" approach
By far the biggest win of CMS over SMS is in the first merge, because Beyond the first merge...in theory, modern IO systems have concurrency I believe SSDs handle concurrent requests very well since under the A merge policy may be more optimal for ram segments vs disk
segments. Perhaps the best way to make this clean is to keep the ram merge policy and real dir merge policies different? That way we don't merge policy implementations don't need to worry about ram and non-ram dir cases. Perhaps an IW.updatePendingRamMerges method should be added that
I think having the ram merge policy should cover the reasons I
OK tentatively this feels like a good approach. Would you re-use Would we use the same MergeScheduler to then execute the selected How would we handle the "it's time to flush some RAM to disk" case?
Yes?
No?
OK good.
MergePolicy is used as is with a special IW method that handles The patch is not committable, however I am posting it to show a
There is one error that occurs regularly in testMergeRamExceeded MergePolicy selected non-contiguous segments to merge (_bo:cx83 _bm:cx4 _bn:cx2 _bl:cx1->_bj _bp:cx1->_bp _bq:cx1->_bp _c2:cx1->_c2 _c3:cx1->_c2 _c4:cx1->_c2 vs _5x:c120 _6a:c8 _6t:c11 _bo:cx83** _bm:cx4** _bn:cx2** _bl:cx1->_bj** _bp:cx1->_bp** _bq:cx1->_bp** _c1:c10 _c2:cx1->_c2** _c3:cx1->_c2** _c4:cx1->_c2**), which IndexWriter (currently) cannot handle
Jason Rutherglen made changes - 30/Apr/09 08:25 PM
Jason Rutherglen made changes - 30/Apr/09 09:59 PM
Fixed and cleaned up more.
All tests pass Added entry in CHANGES.txt I'm going to integrate
Jason Rutherglen made changes - 30/Apr/09 11:19 PM
Patch looks good! Some comments:
In the patch if ramdir is not passed in, the behavior of IW
The attached patch uses FileSwitchDirectory, where these files
SegmentInfo.files was being cleared while sizeInBytes was called
Isn't this handled well enough in updatePendingMerges or is
Jason Rutherglen made changes - 01/May/09 06:38 PM
I don't like how "deep" the dichotomy of "RAMDir vs FSDir" is being
Why did this require changes to FSD?
I think we're unlikely to gain from more than 1 BG thread for RAM
I think actually hardwire it, not just default. Building CFS in RAM On merging to disk it can then respect the user's CFS setting.
I don't understand what's wrong here?
Right. This should be "under the hood".
I think I'd rather not open up that option just yet. This really is a Way back when, IW used a RAMDir internally for buffering; then, with [BTW: once we have this machinery online, it's conceivable that we'd
OK, though I'd like to simply always use FSD, even if primary &
But... why did we have one thread asking for size while another was The size consumed by the RAM segments should be carefully computed Also, this ram size should be used not only for deciding when it's
There is more that needs to be done, because MergePolicy must Ooh: maybe a better approach is to disallow the merge if the expected
Agreed, it's a bit awkward but I don't see another way to do
Hmm... I think for benchmarking it would be good to allow
I think this is fixed in the patch, where compound files are not
Ok, this will require some reworking of the patch.
How will always using FSD work? Doesn't it assume writing to two
In the new patch this is fixed.
One issue is the ram buffer flush doubles the ram used (because
I think on creating IW the user should state (via new expert ctor) Then we could pass IFD either an FSD (when in NRT mode) or the normal SegmentInfos.hasExternalSegments, and MultiSegmentReader ctor, should Likewise all the switching in DW to handle two dirs should be rolled
I think we must keep transient RAM usage below the specified limit, so Or... we could flush the new segment directly to the real dir as one
I don't see where this is taken into account? Did you mean to attach
Yeah it's unclear what the best policy is here. Do we want to
Jason Rutherglen made changes - 05/May/09 12:32 AM
Did we decide to simply add a boolean param in the ctor to turn
on NRT instead of relying on getReader. Using getReader could cause problems with switching directories midstream.
Yes, let's switch to that.
Sounds like we create a subclass of RAMDirectory with this
Because we don't want the user customizing this?
Pass in FSD in the constructor of DocumentsWriter (and others) as
Where does the thread come from for this if we're using max threads?
Yeah, I noticed this, I'll change it. MergeSpecification.segString is
If we remove getFlushDirectory, are you saying getDirectory should getDestinationDirs is used by the ram merge scheduler, which if we I'm looking at how to get RAMLogMergePolicy to take into account the I'm not sure we have the right model yet for deciding when to
flush the ram buffer and/or ram segments. Perhaps we can simply divide the ram buffer size in half, allocating one part to the ram buffer, the other to the ram segments. When one exceeds it's (rambuffersize/2) allotment, it's flushed to disk. This way if the ram buffer size is 32MB, we will always safely flush 16MB to disk. The more ram allotted, greater the size of what's flushed to disk. We may eventually want to offer an expert method to set the ram buffer size and ram dir max size individually. Put another way I think we need a balanced upper limit for the I'd like to stay away from flushing the ram buffer to disk when If when merging ram segments (using the specialized
I don't think that's needed. I think whenever IW makes a change to
That, and because it's only used to determine CFS or not, which we've
Right. All these places could care less if they are dealing w/ FSD or
The thread is simply launched w/o checking maxThreadCount, if the Right, with JDK 1.5 we can make CMS better about pooling threads.
Do the usual back-compat dance – deprecate it and add the new one.
I agree, we should leave getDirectory() as is (returns whatever Dir We can keep getFlushDirectory, but it should not have duality inside it Then, nothing outside of IW should ever know there are two directories On the "when to flush to RAM" question... I agree it's tricky. This
Because we know the IW.ramdir is a RAMDirectory implementation,
So we let the user set the RAMMergePolicy but not get it?
Hmm... We can't just create threads and let them be garbage
We should definitely just use the sizeInBytes() method. I'm saying that IW knows when it writes new files to the RAMDir
Oh, we should add a getter (getRAMMergePolicy, not getLogMergePolicy)
This is how CMS has always been. It launches threads relatively In the patch the merge policies are split up which requires some
of the RAM NRT logic to be in updatePendingMerges. One solution is to have a merge policy that manages merging to Does IW.optimize and IW.expungeDeletes operate on the ramdir as Something in the DocumentsWriter API we may need to change is to
allow passing a directory through the IndexingChain. In the RAM NRT case, which directory we write to can change depending on if a ram buffer has exceeded it's maximum available size. If it is under half the available ram it will to go the ram dir, if not the new segment will be written to disk. For this reason we can't simply pass a directory into the constructor of DocumentsWriter, nor can we rely on calling IW.getFlushDirectory. We should be able to rely on the directory in SegmentWriteState?
I think IW.optimize should mean all RAM segments are merged into the single on-disk segment? And IW.expungeDeletes should also apply to RAM segments, ie if RAM segments have pending deletes, they are merged away (possibly entirely in RAM, ie the RAM merge policy could simply merge to a new RAM segment)?
It looks like it's the "is it time to flush to disk" logic, right? Why can't we make that the purview of the RAM MergePolicy? We may need to extend MergePolicy API to tell it how much RAM is free in the budget.
I think we should fix the indexing chain to always use SegmentWriteState's Directory and not pass Directory to the ctors? Does something go wrong if we take that approach?
Yes.
Yes.
Yep. The next patch will have these features.
Jason Rutherglen made changes - 12/May/09 03:20 AM
I think the easiest way to handle the ram buf size vs. the ram
dir size is the allow each to grow on request. I have some code I need to test that implements it. This way we're growing based on demand and availability. The only thing we may want to add is a way to grow and perhaps automatically flush based on the growth requested and perhaps prioritizing requests?
Jason Rutherglen made changes - 19/May/09 09:59 PM
Jason Rutherglen made changes - 20/May/09 04:31 AM
I think generally we are close. I have lots of little comments from
looking through the patch:
On the "how to share RAM" between RAMDir & DW's RAM buffer... instead The RAMDir only alters its ram usage when 1) we flush a new segment to For starters, and we can optimize this later, I don't think DW should So what happens is... each time getReader() is called, we make a new At some point, DW detects that the RAMDir size plus its own buffer is One challenge we face is ensuring that while we are flushing all ram I started on the pooled ram model after the last patch because
it is cleaner. Bytes are allocated up to the given limit set by IW.setRAMBufferSizeMB. As mentioned below, we may want to add a setting for the max ram temporarily used. I'm reusing the DocumentsWriter.numBytesAlloc/numBytesUsed and
You're talking about the synchronization in IW.doFlushInternal We may want to make some assumptions about usage of getReader I'll include the comments in the next patch.
Jason Rutherglen made changes - 28/May/09 07:53 PM
I had forgotten about concurrency with the docstores (keeping an
open IndexInput and IndexOutput) when using an FSDirectory. I wrote a test (which fails) so getting this functionality to work could require some reworking of FSDirectory internals? (Something on the order of auto updating IndexInput's buffers and file length as IndexOutput is flushed?) We need simultaneous IndexInput and IndexOutput ops to work as
That's a big (and good) change; I think we should save that one for another issue, and leave this one focusing on flushing segments through a RAMDir? I guess I'm confused about the doc stores. We keep one open on
disk for multiple segments being created in the via FSD, which means the IndexOutput isn't closed, however for each new SegmentReader that's opened, we're creating a new IndexInput only after the segment is flushed (to FSD, docstore to disk). So the concurrent docstores may work without too much changing
That's the crucial question: can we open a new IndexInput, while an IndexOutput is still writing to the file? I vaguely remember being surprised that this worked fine on Windows, but I'm not sure. If that's fine across all OS's, then, yes we could avoid closing the docStores when flushing a new segment. If it's not fine, then we'd need a way to make an IndexOutputInput, which is a bigger change. We also should [separately] consider having multiple SegmentReaders that share the same docStores, share a single set of IndexInputs (cloned). Ie if the RAM MergePolicy allows many segments in RAM at once, we are still opening real file descriptors to read the doc stores, so without such sharing we could start running out of descriptors. I want to run the Lucene unit tests in NRT mode without creating and/or
modifying all the test cases. In lieu of adding a System.property that turns NRT on, have we settled on a different mechanism for global settings? Perhaps the back compat type of system can be used here? Or for now a static variable on IW?
I ran a test case successfully that writes to a file while Closing docstores for every flush would seem to cause a lot of
Doesn't FSDir open only one FD per file?
You can just temporarily set the default then run all tests? We never reached consensus on a back compat sort of setting...
Excellent; let's take this up under a new issue? If this holds true across platform (and Windows is the most challenging one) then it'll make sharing much easier.
Yes it adds overhead, though, it's not on the critical path for opening a new reader since it happens in the BG. So... things like LUCENE-1526 (making deletions, norms incrementally copy-on-write) I think are higher priority.
No, it opens a real file every time openInput is called. I guess we could think about having it share/clone internally?
Or have separate parallel classes inherited classes that set the
Yeah we should, I got this confused with how we're cloning in
The issue would only be the unit test for now, or should it be a With flushToRAM=true, when IW.commit is called, do we still want
to not have concurrent merges execute synchronously? Or only wait for the merges to complete that are from ramDir to primaryDir?
I meant the issue of FSDir sharing a single IndexInput if the same file is opened more than once (you opened LUCENE-1671 for this).
This (failing to close an IndexInput that was obtained from Directory.openInput) should be rare in Lucene – we try not to do that. Failing to close clones does happen...
I think only ramDir -> primaryDir? commit() today doens't block on BG merges.
LogMergePolicy.findMergesForOptimize I assumed would merge When I created a mergeAllSegments to one, the
It should, but it will respect mergeFactor in the process (ie do multiple merges if necessary). I'm thinking we should just merge all the RAM segments in one go, and not consult merge policy, in resolveRAMSegments.
Why does this throw an exception? Ie you passed in the in-order RAM segments?
I think the only thing that'd break is IndexWriter now assumes continuity when it removes the old segments and puts the new one in. LUCENE-1076 explores allowing MergePolicy to select non-contiguous merges. But: with autoCommit=false, in order to avoid merging the doc stores, the segments (even RAM segments) must be contiguous. This is a sizable performance gain when building a large index in one IndexWriter session. Just so we don't forget, we need to test flushToRAM with a custom IndexDeletionPolicy which could be a bit tricky.
Jason Rutherglen made changes - 05/Jun/09 04:44 AM
Whats the verdict on this one Mike? Got the impression this was a likely 3.1 ...
OK let's push it to 3.1. It's very much in progress, but 1) the iterations are slow (it's a big patch), 2) it's a biggish change so I'd prefer to it shortly after a release, not shortly before, so it has plenty of time to "bake" on trunk.
Michael McCandless made changes - 15/Jun/09 03:35 PM
Just wanted to give an update on this, I'm running the unit
tests with flushToRAM=true, the ones that fail are (mostly) tests that look for files when they're now in RAM (temporarily) and the like. I'm not sure what to do with these tests, 1) ignore them (kind of hard to not run specific methods, I think) 2) or conditionalize them to run only if flushToRAM=false.
Jason Rutherglen made changes - 15/Jun/09 05:45 PM
TestThreadedOptimize is throwing a ensureContiguousMerge
exception. I think this is highlighting the change to merging all ram segments to a single primaryDir segment can sometimes lead to choosing segments that are non-contiguous? I'm not sure of the best way to handle this.
I don't see why that results in a non-contiguous merge? Ie, at all times the RAM segments should be on the tail of SegmentInfos? So if you merge all of them, in order, that merge should be contiguous?
That seems good? The patch is cleaned up. A static variable IndexWriter.GLOBALNRT
is added, which allows all the tests to be run with flushToRAM=true. I reran the tests which hopefully still work as intended. Tests that looked for specific file names were changed to work with NRT. Some of the tests are skipped entirely and need to be written specifically for flushToRAM.
I need to go through and mark the tests that can be converted to
Jason Rutherglen made changes - 18/Jun/09 12:55 AM
The other previous notes are still valid.
Jason Rutherglen made changes - 18/Jun/09 10:52 PM
Using a single segmentInfos in IW seems to be problematic as
we'll always potentially have different dir non-contiguous infos. I'm seeing the error off and on in different test cases. I will put together a patch separating the two dir infos in IW. On second thought, the previous idea would require quite a bit
of work. Perhaps OneMerge can have the segmentInfos (ramDir or primaryDir) they were selected from and the ensureContiguous can verify that? Then we'd adjust commitMerge to remove the newly merged segments individually. I'll give this a try. It's progressing. Randomly some tests fail such as the one noted below.
Jason Rutherglen made changes - 22/Jun/09 10:16 PM
Expected tests fail, the previous tests that were failing pass
on 2 runs. I didn't make any changes though so am suspicious!
Jason Rutherglen made changes - 30/Jun/09 09:28 PM
Jason Rutherglen made changes - 14/Jul/09 06:38 PM
Jason Rutherglen made changes - 15/Jul/09 12:50 AM
Jason Rutherglen made changes - 15/Jul/09 12:50 AM
Jason Rutherglen made changes - 15/Jul/09 01:21 AM
Jason Rutherglen made changes - 28/Aug/09 05:24 PM
Benchmark of LUCENE-1313
Patch is updated to trunk and compiles. Unit tests fail because
they're hardcoded for specific numbers of files etc, otherwise I don't think there's anything glaring. The included benchmark indexes 1 document, queries with a sort, This patch's benchmark fairs much better, I'll publish it's | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Patch includes libraries: