Lucene - Core
  1. Lucene - Core
  2. LUCENE-2822

TimeLimitingCollector starts thread in static {} with no way to stop them

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      See the comment in LuceneTestCase.

      If you even do Class.forName("TimeLimitingCollector") it starts up a thread in a static method, and there isn't a way to kill it.

      This is broken.

      1. LUCENE-2822.patch
        15 kB
        Simon Willnauer
      2. LUCENE-2822.patch
        19 kB
        Simon Willnauer
      3. LUCENE-2822.patch
        26 kB
        Simon Willnauer
      4. LUCENE-2822.patch
        21 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Shai Erera added a comment -

          Same issue. LUCENE-2237 includes a patch we can apply to allow shutting down the thread.

          Show
          Shai Erera added a comment - Same issue. LUCENE-2237 includes a patch we can apply to allow shutting down the thread.
          Hide
          Shai Erera added a comment -

          For reference, TimeLimitedCollector was added in LUCENE-997, where TimerThread's functionality is discussed.

          When we resolve this one, I think SOLR-1735 can be resolved as well.

          From LUCENE-997:

          TimerThread provides a pseudo-clock service to all searching threads, so that they can count elapsed time with less overhead than repeatedly calling System.currentTimeMillis. A single thread should be created to be used for all searches.
          

          The opinions about the effectiveness of TimerThread went both ways. For example, in LUCENE-1720, the timeout mechanism works differently, and I wonder if we can apply the same logic here, or something very similar.

          Instead of having a static TimerThread, we can have start()/finish() API on TimeLimitingCollector(). start() will start a Thread that will sleep in nanosecs granularity until timeout exceeds. When it exceeds, it raises a flag. In collect() (and in setNextReader/Scorer too, to increase accuracy) all we do is check the flag and if it's on, we fail, just like collect() is impled today. App can call finish() that will signal the thread to stop.

          I think it's simple and shouldn't reduce accuracy of current impl (maybe only improve it). Assuming apps don't set timeout to 20 msec, Thread.sleep w/ nanosec granularity should be very accurate. Also, today timeout is counted from the moment you create TLC, but the app may do few more things before IndexSearcher.search() is actually called. start/finish will allow the app to time the search() method (or whatever it wants).

          Also, TimerThread was intended to serve all search requests, hence why it is static, but TLC itself cannot be shared across searches at all ...

          This will solve this issue, as well as simplify TLC IMO.

          Show
          Shai Erera added a comment - For reference, TimeLimitedCollector was added in LUCENE-997 , where TimerThread's functionality is discussed. When we resolve this one, I think SOLR-1735 can be resolved as well. From LUCENE-997 : TimerThread provides a pseudo-clock service to all searching threads, so that they can count elapsed time with less overhead than repeatedly calling System.currentTimeMillis. A single thread should be created to be used for all searches. The opinions about the effectiveness of TimerThread went both ways. For example, in LUCENE-1720 , the timeout mechanism works differently, and I wonder if we can apply the same logic here, or something very similar. Instead of having a static TimerThread, we can have start()/finish() API on TimeLimitingCollector(). start() will start a Thread that will sleep in nanosecs granularity until timeout exceeds. When it exceeds, it raises a flag. In collect() (and in setNextReader/Scorer too, to increase accuracy) all we do is check the flag and if it's on, we fail, just like collect() is impled today. App can call finish() that will signal the thread to stop. I think it's simple and shouldn't reduce accuracy of current impl (maybe only improve it). Assuming apps don't set timeout to 20 msec, Thread.sleep w/ nanosec granularity should be very accurate. Also, today timeout is counted from the moment you create TLC, but the app may do few more things before IndexSearcher.search() is actually called. start/finish will allow the app to time the search() method (or whatever it wants). Also, TimerThread was intended to serve all search requests, hence why it is static, but TLC itself cannot be shared across searches at all ... This will solve this issue, as well as simplify TLC IMO.
          Hide
          Hoss Man added a comment -

          linking related issues

          Show
          Hoss Man added a comment - linking related issues
          Hide
          Hoss Man added a comment -

          Having read through all of the comments in LUCENE-997 I fail to see why any Threads are needed in TimeLimitedCollector at all – repeatedly folks mention that the use of a Timer thread is purely because System.currentTimeMillis isn't reliable enough and/or not efficient enough, but if we could use Java 1.5, System.nanoTime would be exactly what we need.

          It's 2011.
          We can use Java 1.5 now in core Lucene.
          so why don't we just rip out the TimerThread and use System.nanoTime() ?

          Show
          Hoss Man added a comment - Having read through all of the comments in LUCENE-997 I fail to see why any Threads are needed in TimeLimitedCollector at all – repeatedly folks mention that the use of a Timer thread is purely because System.currentTimeMillis isn't reliable enough and/or not efficient enough, but if we could use Java 1.5, System.nanoTime would be exactly what we need. It's 2011. We can use Java 1.5 now in core Lucene. so why don't we just rip out the TimerThread and use System.nanoTime() ?
          Hide
          Uwe Schindler added a comment - - edited

          It's 2011.
          We can use Java 1.5 now in core Lucene.
          so why don't we just rip out the TimerThread and use System.nanoTime() ?

          One of our large customers used System.currentTimeMillies() in their own TLC collector implementation, which led to a immense slowdown as then on every hit you get a quite expensive system call to the kernel. This would not be more efficient with nanoTime, as its still a system call (at least on lot's of JVMs like on Windows).

          So the thread is much more effective (its only reading a volatile field), so we need some shutdown. Another idea is to change this collector to not call nanoTime or currentTimeMillies on every hit (as hit collection should be fast), to maybe do this only every 1000 hit. The granularity could be a ctor param. This saves the thread and its still controlable how exact the measurement should be. A default of 1000 or maybe 10000 should be fine. Of course on the first collect hit (modulo 0) it should already check the timeout (because lots of queries do most work before the collection of hits).

          Show
          Uwe Schindler added a comment - - edited It's 2011. We can use Java 1.5 now in core Lucene. so why don't we just rip out the TimerThread and use System.nanoTime() ? One of our large customers used System.currentTimeMillies() in their own TLC collector implementation, which led to a immense slowdown as then on every hit you get a quite expensive system call to the kernel. This would not be more efficient with nanoTime, as its still a system call (at least on lot's of JVMs like on Windows). So the thread is much more effective (its only reading a volatile field), so we need some shutdown. Another idea is to change this collector to not call nanoTime or currentTimeMillies on every hit (as hit collection should be fast), to maybe do this only every 1000 hit. The granularity could be a ctor param. This saves the thread and its still controlable how exact the measurement should be. A default of 1000 or maybe 10000 should be fine. Of course on the first collect hit (modulo 0) it should already check the timeout (because lots of queries do most work before the collection of hits).
          Hide
          Shai Erera added a comment -

          I don't think we should check for time every N hits – finding the next hit can take some time, and currently TimeLimitingCollector is not accurate b/c of that. I.e., it guarantees that when the time threshold has elapsed, it will stop the query processing, but it doesn't guarantee how accurate it is.

          The TimerThread was indeed included because a system call on every hit was too expensive. I think that my proposal above, introducing start/finish API on TLC will solve the thread kept alive issue, won't affect performance, and keep TLC's accuracy as it is today (not perfect, but better than 'check every N hits').

          Show
          Shai Erera added a comment - I don't think we should check for time every N hits – finding the next hit can take some time, and currently TimeLimitingCollector is not accurate b/c of that. I.e., it guarantees that when the time threshold has elapsed, it will stop the query processing, but it doesn't guarantee how accurate it is. The TimerThread was indeed included because a system call on every hit was too expensive. I think that my proposal above, introducing start/finish API on TLC will solve the thread kept alive issue, won't affect performance, and keep TLC's accuracy as it is today (not perfect, but better than 'check every N hits').
          Hide
          Michael McCandless added a comment -

          I think we should stick with our private timer thread (and we should definitely make it stop-able).

          I've seen too many problems associated with relying on the system's time for "important" things like timing out queries, eg when daylight savings time strikes, or the clock is being "aggressively corrected", and suddenly a bunch of queries are truncated. In theory System.nanoTime should be immune to this (it's the system's timer and not any notion of "wall clock time"), but in practice, I don't think we should risk it.

          Show
          Michael McCandless added a comment - I think we should stick with our private timer thread (and we should definitely make it stop-able). I've seen too many problems associated with relying on the system's time for "important" things like timing out queries, eg when daylight savings time strikes, or the clock is being "aggressively corrected", and suddenly a bunch of queries are truncated. In theory System.nanoTime should be immune to this (it's the system's timer and not any notion of "wall clock time"), but in practice, I don't think we should risk it.
          Hide
          Robert Muir added a comment -

          I think we should stick with our private timer thread (and we should definitely make it stop-able).

          And no private thread should start in the static initializer... its fine for all instances to share a single private timer thread but this should be lazy-loaded.

          Show
          Robert Muir added a comment - I think we should stick with our private timer thread (and we should definitely make it stop-able). And no private thread should start in the static initializer... its fine for all instances to share a single private timer thread but this should be lazy-loaded.
          Hide
          Uwe Schindler added a comment - - edited

          I think we should stick with our private timer thread (and we should definitely make it stop-able).

          I think this is still the best variant, as both System.nanoTime() and currentTimeMillies use system calls that are really expensive. nanoTime() has no wallclock problems, thats true, but is still a no-go for every collected hit!

          Show
          Uwe Schindler added a comment - - edited I think we should stick with our private timer thread (and we should definitely make it stop-able). I think this is still the best variant, as both System.nanoTime() and currentTimeMillies use system calls that are really expensive. nanoTime() has no wallclock problems, thats true, but is still a no-go for every collected hit!
          Hide
          Mark Harwood added a comment -

          FYI - I visited a site today using Lucene 1720 live on a large index (>2 billion docs, sharded with 5 minute update intervals). They haven't noticed any significant degrading of search performance as a result of using this approach.

          Show
          Mark Harwood added a comment - FYI - I visited a site today using Lucene 1720 live on a large index (>2 billion docs, sharded with 5 minute update intervals). They haven't noticed any significant degrading of search performance as a result of using this approach.
          Hide
          Uwe Schindler added a comment -

          Mark: But LUCENE-1720 does not use a System.nanoTime()/System.currentTimeMillis(), so what is your comment about?

          Show
          Uwe Schindler added a comment - Mark: But LUCENE-1720 does not use a System.nanoTime()/System.currentTimeMillis(), so what is your comment about?
          Hide
          Mark Harwood added a comment -

          does not use a System.nanoTime()/System.currentTimeMillis(), so what is your comment about?

          There's already a solution that was designed to avoid any overhead related to making either of these calls.

          Show
          Mark Harwood added a comment - does not use a System.nanoTime()/System.currentTimeMillis(), so what is your comment about? There's already a solution that was designed to avoid any overhead related to making either of these calls.
          Hide
          Robert Muir added a comment -

          I think this is still the best variant, as both System.nanoTime() and currentTimeMillies use system calls that are really expensive.

          Sorry its too funny, playing with LUCENE-2948 I saw a big slowdown on windows that mike didn't see on linux... finally tracked it down to an uncommented nanoTime

          Show
          Robert Muir added a comment - I think this is still the best variant, as both System.nanoTime() and currentTimeMillies use system calls that are really expensive. Sorry its too funny, playing with LUCENE-2948 I saw a big slowdown on windows that mike didn't see on linux... finally tracked it down to an uncommented nanoTime
          Hide
          Robert Muir added a comment -

          bulk move 3.2 -> 3.3

          Show
          Robert Muir added a comment - bulk move 3.2 -> 3.3
          Hide
          Torsten Krah added a comment -

          Some workaround - until this one is resolved - to stop the thread (e.g. in a container shutdown callback) may be this one using reflection:

          Thread t = (Thread) FieldUtils.readDeclaredStaticField(TimeLimitingCollector.class, "TIMER_THREAD", true);
          if (t != null && t.isAlive())

          { t.interrupt(); t.join(1000); }

          Little bit ugly but theres no other way yet.

          Show
          Torsten Krah added a comment - Some workaround - until this one is resolved - to stop the thread (e.g. in a container shutdown callback) may be this one using reflection: Thread t = (Thread) FieldUtils.readDeclaredStaticField(TimeLimitingCollector.class, "TIMER_THREAD", true); if (t != null && t.isAlive()) { t.interrupt(); t.join(1000); } Little bit ugly but theres no other way yet.
          Hide
          Simon Willnauer added a comment -

          catching up here... I wonder if we can make this TLC simply use a o.a.l.utils.Counter so people can implement this on to of TLC. there could be a ThreadedCounter in TLC people can pull and use their own static variable?

          Show
          Simon Willnauer added a comment - catching up here... I wonder if we can make this TLC simply use a o.a.l.utils.Counter so people can implement this on to of TLC. there could be a ThreadedCounter in TLC people can pull and use their own static variable?
          Hide
          Simon Willnauer added a comment -

          here is a first version of what I have in mind. I'd like completely divorce the thread from the collector eventually and let the application create the thread. Solr should take care of this stuff itself. For now I used a little workaround for this.

          Show
          Simon Willnauer added a comment - here is a first version of what I have in mind. I'd like completely divorce the thread from the collector eventually and let the application create the thread. Solr should take care of this stuff itself. For now I used a little workaround for this.
          Hide
          Simon Willnauer added a comment -

          next iteration

          Show
          Simon Willnauer added a comment - next iteration
          Hide
          Simon Willnauer added a comment -

          I added a changes entry and opened up setting a baseline for the collector manually. Now by default we set the baseline once the first reader is set instead of during construction time. I think we are close here?!

          Show
          Simon Willnauer added a comment - I added a changes entry and opened up setting a baseline for the collector manually. Now by default we set the baseline once the first reader is set instead of during construction time. I think we are close here?!
          Hide
          Simon Willnauer added a comment -

          If nobody objects I am going to commit this later today and backport to 3.x - this will break bw compat for time limiting collector but I think we should do that in this particular case

          Show
          Simon Willnauer added a comment - If nobody objects I am going to commit this later today and backport to 3.x - this will break bw compat for time limiting collector but I think we should do that in this particular case
          Hide
          Simon Willnauer added a comment -

          update patch to trunk

          Show
          Simon Willnauer added a comment - update patch to trunk
          Hide
          Simon Willnauer added a comment -

          committed to trunk & backported to 3.x

          Show
          Simon Willnauer added a comment - committed to trunk & backported to 3.x
          Hide
          Uwe Schindler added a comment -

          Bulk close after release of 3.5

          Show
          Uwe Schindler added a comment - Bulk close after release of 3.5

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development