|
[
Permalink
| « Hide
]
Sean Timm added a comment - 13/Sep/07 09:09 PM
Patch against trunk revision 575451.
Sean Timm made changes - 13/Sep/07 09:09 PM
Sean Timm made changes - 13/Sep/07 09:11 PM
Here are some additional details on the changes.
New files: Extends HitCollector and detects timeouts resulting in a TimeLimitedCollector.TimeExceeded exception being thrown. TimerThread.java TimerThread provides a pseudo-clock service to all searching threads, so that they can count elapsed time with less overhead than repeatedly calling System.currentTimeMillis. A single thread should be created to be used for all searches. Modified Files: Added partial result flag. IndexSearcher.java Catches TimeLimitedCollector.TimeExceeded, sets partial results flag on TopDocs and estimates the total hit count (if we hadn't timed out partway through). Returns TopDocs with partial results. Searcher.java Added methods to set and get the timeout parameters. This implementation decision has the limitation of only permitting a single timeout value per Searcher instance (of which there is usually only one). However, this greatly minimizes the number of search methods that would need to be added. In practice, I have not needed the functionality to change the timeout settings on a per query basis. TopFieldDocCollector.java Uses TimeLimitedCollector functionality. TopDocCollector.java Uses TimeLimitedCollector functionality and exposes it to child class TopFieldDocCollector. TopDocs.java Added partial results flag. Note, TopFieldDocs extends this class and inherits the new functionality. Thanks for the patch. I didn't have a very close look, just one small thing: it's probably no good idea to catch and ignore the InterruptedException. See http://www-128.ibm.com/developerworks/java/library/j-jtp05236.html
http://www-128.ibm.com/developerworks/java/library/j-jtp05236.html
TimerThread Now follows Brian Goetz's best practice for a noncancelable task that restores interrupted status before returning rather than ignoring the InterruptedException.
Sean Timm made changes - 17/Sep/07 09:54 PM
Sean Timm made changes - 17/Sep/07 10:06 PM
Two issues are addressed in this latest patch:
1) Timeout support was not added to: public TopFieldDocs search(Weight weight, Filter filter, final int nDocs, Sort sort) 2) getCounter() in TimerThread was replaced by getMilliseconds()
Sean Timm made changes - 19/Oct/07 07:41 PM
I think this is a nice feature to have.
But I am not sure about the propsed API - the application creates a TimerThread, starts it, and the timer thread is then passed to the searcher with setTimeOut(timer,ticks). Not so simple. I think my preference for the API and implementation would be in HitCllector.collect() - in other words, we consider this new feature as an advanced one, and so only allow applications to provide their "timed" hitCollector. The modified collect() would either throw a TimeoutException or return a timedOut indication. If this is a (subclass of) RunTimeException (thuogh I am not crazy about this alternative) then there's no API change (a plus) but we need to verify that the code below propagates the RuntimeException gracefully and closes all the streams and everything (which I believe it does with all last careful changes by Mike and Michael). If RuntimeEXception is not acceptable, then this is an API change (a minus) and also many (simple) changes will be required in scorers (callers to collect). The application's timedColletor will have all the logic in that collector for both announcing and detecting the timeout. Next we can add a TimedCollector for the benefit of applications, and last, consider adding search() methods with timeOut, but I doubt that last step. > allow applications to provide their "timed" hitCollector
+1 I just requested a more fancy feature in the Solr Jira. My apologies, I did not think to search the Lucene Jira. 1) timeout: stop searching after N milliseconds and return results using only those hits already found Here is the complete request: I stumbled above; I do not yet know Jira
This request is inspired by a public search engine with millions of records. The HitCollector implementation mentioned above would allow all three of these control options. If they could be pipelined together we could use any or all of them. > I think my preference for the API and implementation would be in HitCollector.collect()
This would be simpler, but I don't see how it would be possible to estimate the total number of results and return partial results in that case. I think that is an important feature. If the concern is complexity for the application, perhaps it is possible to hide the TimerThread altogether. The TimerThread could be created and started via a searcher setTimeOut(tick, numTicks) method. To simplify it further, ticks could be fixed at a reasonable number, e.g., 100 ms, and a timeout in milliseconds could be passed in: setTimeout(milliseconds).
Is this really faster than calling System.currentTimeMillis()? I think one benefit of a dedicated timer thread is not being affected
by clock shift on the machine. System.currentTimeMillis() is not guaranteed to be monotonic. System.nanoTime() (1.5 only!) tries to be (I think), but it's still not guaranteed. Without a monotonic clock, if the clock shifts forward then it could But: what happens when TimerThread overflows the int (a Nice, I didn't think of this.
So with this understanding the timer thread (with long vs int) The suggested patch relied on collect() being called, and Also important to understand is what happens with IO I'd like to add my 2 cents on this issue.
The more I use Lucene in various ways, the more I'm getting convinced that the "Hits" API should be de-emphasized (if not outright depracated), and users should be encouraged to use the HitCollector API (search() taking a hitcollector, TopDocCollector, and so on) - especially for advanced usage. I believe that your TimeLimitedCollector is an interesting idea. However, there is simply no justification to change TopDocCollector, TopFieldDocCollector, Topdocs, Hits and Searcher. There's a MUCH simpler, and in my opinion cleaner, approach: Just make TimeLimitedCollector be a front-end for another collector (for example, TopDocCollector) which will be passed to TimeLimitedCollector's constructor. For every document that is presented to TimeLimitedCollector's collect(), it would call the inner collector's collect(). This way, without any changes to Lucene's core APIs, and just the addition of a new class, you can add the new functionality that you wanted. This, I think, is the beauty of the HitCollector interface over the "monolithic" Hits approach, and I think we should encourage this way of thinking instead of adding more and more features to Hits.
+1 Sean, can you revise your patch to follow the suggestions above?
That is, create a TimeLimitedCollector that takes and Collector parameter Then we can add to either core search or under contrib. On a related point - I am not happy with programming by a IMHO it definitely should be part of the core. Being able to control the runtime of queries/your ressources is crucial for every live system and I really wonder this it has taken so long to address this.
Otherwise I totally agree with Navdav: that Hits thingie is nice and fine for simple full-text queries but as soon as things become somewhat more complex you don't get around writing your own HitCollector (and do stuff like Facets). I also strongly agree that the timeout HC should be implemented as an decorator (what's been called "front-end" here), I just quickly wrote an example and attached it (no, I'm not happy throwing an runtime exception either): MyHitCollector hc = new MyHitCollector(); And finally, why return partial results? I don't think that this is reasonable. BTW I'm not sure whether volatile in the timer thread is really working reliably in 1.4... Example of the timeout HitCollector implemented as an decorator.
Timo Nentwig made changes - 01/Jan/08 12:23 PM
Example HitCollector to be decorated by HitCollectorTimeoutDecorator.
Timo Nentwig made changes - 01/Jan/08 12:24 PM
By using only a HitCollector it cannot be guaranteed that searches will not take too long. The reason is that there are searches that take a long time but do not collect any docs.
For example consider the case where every even doc has term A and every odd doc has term B, and the query requires both A and B. This is going to take an amount of time that is linear in the size of the index but no document will be collected. This means that every conjunction (boolean queries with more than one required clause, phrase queries, span near queries) will need to check for timeout. Also a HitCollector with a timeout facility will need a way True, unfortunately, but still better than nothing (->current situation). This approach isn't very precise in matters of timing either. Also, throwing a RuntimeException feels more like a hack than well thought out code...
I don't know Lucene's code good enough in order to estimate whether it's possible to build a real timeout machanism at all/without changing the API/rewriting a lot of code but it's incredibly important to be able to cancel running queries. You don't want to servers under high load suffering from lucene queries running up to multiple minutes at the same time consuming quite a lot of memory. And it makes no sense either because nobody is waiting so long for results... Created a patch using Timo's HitCollectorTimeoutDecorator. This patch just has a few mostly minor changes. The biggest changes are that it uses a long instead of an int for the counter in TimerThread and the TimerThread interval is fixed at 10 ms. It also throws a TimeLimitedCollector.TimeExceeded exception on a timeout.
Sean Timm made changes - 25/Jan/08 08:02 PM
Sean Timm made changes - 25/Jan/08 08:04 PM
Updated bad patch with good copy.
Created a patch using Timo's HitCollectorTimeoutDecorator. This patch just has a few mostly minor changes. The biggest changes are that it uses a long instead of an int for the counter in TimerThread and the TimerThread interval is fixed at 10 ms. It also throws a TimeLimitedCollector.TimeExceeded exception on a timeout.
Sean Timm made changes - 25/Jan/08 08:11 PM
Sean Timm made changes - 25/Jan/08 08:12 PM
The TimerThreadTest illustrates the accuracy of the TimerThread under load. On my 2GHz Xeon 4 CPU dual core RH AS 4 Linux box, it get a 20% difference between the TimerThread implementation and System.currentTimeMillis() and is independent of load.
java TimerThreadTest 8 With my single core single CPU Windows XP laptop I see a 20% difference at load, but when adding additional threads, I see an increasing difference. java TimerThreadTest 0 java TimerThreadTest 2
Sean Timm made changes - 25/Jan/08 08:34 PM
I believe this version of the patch won't work properly unless you add synchronization - writes to long values are non-atomic (Java Language Specification 17.7, as the comment says), that's why Nutch uses an int there. Perhaps using AtomicLong would be an answer, if you really need a long value.
AtomicLong is a Java 1.5 feature, so it doesn't fit either.
After
Once all conjunctions are in the same place, it would make sense to add a timeout there. Indeed, thanks for the correction - I forgot about the special treatment of volatile values.
@Sean: "The biggest changes are that it uses a long instead of an int for the counter in TimerThread" - didn't I use a volatile long already?
I hope this will become part of the next release. IMHO this is Priority Major or above... In the timeout.patch, instead of this:
time += resolution; I'd rather have this: time = System.currentTimeMillis();
because all of the wait() methods can become unreliable, especially at high loads. With (or without) this change, 100 msecs or even 200 msecs could be used as the By computing the time out moment up front, one subtraction can be saved at each document collection. Then only TIMER_THREAD.getMilliseconds() method is needed at document collection time, and the getElapsedMilliseconds() method is superfluous. This is a minor update to timeout.patch which fixes the comment about updates to 32-bit-sized variables being atomic and instead talks about volatile longs, as pointed out by Andrzej. It also computes the time out moment up front to save a subtraction on each document collection as suggested by Paul.
Sean Timm made changes - 29/Jan/08 07:18 PM
I could go either way on the System.currentTimeMillis() versus a TimerThread issue. I agree nanoTime is the correct implementation when using 1.5.
It doesn't seem like on Linux running ntp it matters much either way. NTP tries to perform smoothing and makes clock changes slowly over a longer period of time when it can rather than have an abrupt change, but YMMV if your system is having clock issues. On a really overloaded Windows box, the TimerThread implementation will not behave well as demonstrated above. I can't speak to any other platforms. The idea of System.currentTimeMillis() is to guard against misbehaviour of the java wait() method and against unexpected delays because of thread scheduling during the jump back for the loop around the wait() call.
One way to reduce such misbehaviour under heavy load is by increasing the scheduling priority of the timing thread, but I don't think that is necessary. Also System.currentTimeMillis() is obviously correct, whereas timeout += resolution is never more than an assumption about correct wait() behaviour. Clock changes by NTP are normally so slow that they don't really matter for query time outs. Sean, can you add a Junit test to timeout.patch?
I think such test should check (1) search correctness (regardless of timeout), (2) expected timeout behavior, and (3) some sanity test with multiple searching threads. This can also serve as an example for using this new collector. Cheers. IIRC I did time=System.currentTimeMillis() first but was surprised how slow this method actally is.
Would that still be a problem with a 200ms resolution?
Sean Timm made changes - 05/Feb/08 05:47 PM
200ms? No, probably not. I don't recall what resolution I used in my test but actually the timeout check took more time than the Lucene query...
Paul,
I think that if we were to use System.currentTimeMillis(), we would eschew the TimerThread as Doron suggests in his Dec. 15 comment. I haven't seen any performance issues with System.currentTimeMillis(). As far as 200ms, I think that is too large of a default resolution (and with the current implementation it is not configurable). With a 200 ms resolution, a query with a 1 second time allowed could timeout in 800 ms, and one with a time allowed of 500 ms could timeout in 300 ms. I think it is much worse to timeout a query early than to timeout late. Sean thanks for adding the test.
In the attached I tightened the check of allowed elapsed time until timeout. As for System.currentTimeMillis() vs. Timer thread - IMHO Mike's comment on 'system clock changes' makes the timer thread favorable. I checked this with up to 10,000 threads and with that number the test sometimes fails because it is quite tight on the max elapsed time required comparing to the timeout, so I don't see this is a problem. In the attached N_THREADS = 50 and this number of threads always passes for me. If there are no more major concerns I think this is now ready to go in, question is where to - under core o.a.l.search or under contrib (query or misc).
Doron Cohen made changes - 06/Feb/08 12:43 PM
Oh wrote comment that was before I decided to change the default...
Thanks for catching this. Attached patch corrects default resolution comment.
Doron Cohen made changes - 06/Feb/08 04:16 PM
Doron Cohen made changes - 06/Feb/08 04:21 PM
> My preference would be for core o.a.l.search.
+1
Doron Cohen made changes - 12/Feb/08 07:02 PM
Committed (under core o.a.l.search).
Thanks Sean!
Doron Cohen made changes - 12/Feb/08 09:02 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||