[SOLR-1482] Solr master and slave freeze after query

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment:

      Description

      We're having issues with the deployment of 2 master-slave setups.

      One of the master-slave setups is OK (so far), but on the other, both the master and the slave keep freezing, and only after I send a query to them. By freezing I mean indefinite hanging, with almost no output to the log, no errors, nothing. It's as if there's some sort of deadlock. The hanging servers need to be killed with -9; otherwise they keep hanging.

      The query I send queries all instances at the same time using the ?shards= syntax.
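
      A query of that shape looks roughly like this (hostnames and ports below are placeholders, not our actual topology):

          curl 'http://solr-master:8080/solr/select?q=*:*&shards=solr-master:8080/solr,solr-slave:8080/solr'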

      On the slave, the logs just stop - nothing shows up anymore after the query is issued. On the master, they're a bit more descriptive. This information seeps through very, very slowly, as you can see from the timestamps:

      SEVERE: java.lang.OutOfMemoryError: PermGen space

      Oct 1, 2009 2:16:00 PM org.apache.solr.common.SolrException log
      SEVERE: java.lang.OutOfMemoryError: PermGen space

      Oct 1, 2009 2:19:37 PM org.apache.catalina.connector.CoyoteAdapter service
      SEVERE: An exception or error occurred in the container during the request processing
      java.lang.OutOfMemoryError: PermGen space
      Oct 1, 2009 2:19:37 PM org.apache.coyote.http11.Http11Processor process
      SEVERE: Error processing request
      java.lang.OutOfMemoryError: PermGen space
      Oct 1, 2009 2:19:39 PM org.apache.catalina.connector.CoyoteAdapter service
      SEVERE: An exception or error occurred in the container during the request processing
      java.lang.OutOfMemoryError: PermGen space
      Exception in thread "ContainerBackException in thread "pool-29-threadOct 1, 2009 2:21:47 PM org.apache.catalina.connector.CoyoteAdapter service
      SEVERE: An exception or error occurred in the container during the request processing
      java.lang.OutOfMemoryError: PermGen space
      Oct 1, 2009 2:21:47 PM org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler process
      SEVERE: Error reading request, ignored
      java.lang.OutOfMemoryError: PermGen space
      Oct 1, 2009 2:21:47 PM org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler process
      SEVERE: Error reading request, ignored
      java.lang.OutOfMemoryError: PermGen space
      -22" java.lang.OutOfMemoryError: PermGen space
      Oct 1, 2009 2:21:47 PM org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler process
      SEVERE: Error reading request, ignored
      java.lang.OutOfMemoryError: PermGen space
      Exception in thread "http-8080-42" Oct 1, 2009 2:21:47 PM org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler process
      SEVERE: Error reading request, ignored
      java.lang.OutOfMemoryError: PermGen space
      Oct 1, 2009 2:21:47 PM org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler process
      SEVERE: Error reading request, ignored
      java.lang.OutOfMemoryError: PermGen space
      Oct 1, 2009 2:21:47 PM org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler process
      SEVERE: Error reading request, ignored
      java.lang.OutOfMemoryError: PermGen space
      Exception in thread "http-8080-26" Exception in thread "http-8080-32" Exception in thread "http-8080-25" Exception in thread "http-8080-22" Exception in thread "http-8080-15" Exception in thread "http-8080-45" Exception in thread "http-8080-13" Exception in thread "http-8080-48" Exception in thread "http-8080-7" Exception in thread "http-8080-38" Exception in thread "http-8080-39" Exception in thread "http-8080-28" Exception in thread "http-8080-1" Exception in thread "http-8080-2" Exception in thread "http-8080-12" Exception in thread "http-8080-44" Exception in thread "http-8080-47" Exception in thread "http-8080-29" Exception in thread "http-8080-33" Exception in thread "http-8080-27" Exception in thread "http-8080-36" Exception in thread "http-8080-113" Exception in thread "http-8080-112" Exception in thread "http-8080-37" Exception in thread "http-8080-18" java.lang.OutOfMemoryError: PermGen space
      java.lang.OutOfMemoryError: PermGen space
      java.lang.OutOfMemoryError: PermGen space
      java.lang.OutOfMemoryError: PermGen space
      java.lang.OutOfMemoryError: PermGen space
      Exception in thread "http-8080-34" java.lang.OutOfMemoryError: PermGen space
      java.lang.OutOfMemoryError: PermGen space
      Exception in thread "http-8080-103"

      So the problem seems to be related to PermGen space. I found http://www.nabble.com/Number-of-webapps-td22198080.html and tried -XX:MaxPermSize=256m, but it didn't fix the problem. The current CATALINA_OPTS looks like this:

      export CATALINA_OPTS="-XX:MaxPermSize=256m -Xmx6500m -XX:+UseConcMarkSweepGC"
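
      For reference, PermGen usage can be watched while the JVM is up; a minimal sketch, assuming a Sun JDK 6 with jstat available (the pgrep pattern is just illustrative):

          # Find the Tomcat JVM's pid (adjust the pattern to your setup)
          TOMCAT_PID=$(pgrep -f org.apache.catalina.startup.Bootstrap | head -n 1)

          # Print heap and permanent generation utilization every 5 seconds;
          # the P column is PermGen usage as a percentage of its current capacity
          jstat -gcutil "$TOMCAT_PID" 5000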

      Is the only solution at this point going multicore, as Noble suggested (is Noble your first name? I always assumed it was Paul and Noble was part of the nickname)? Will multicore get rid of the problem, so it's worth doing before we spend more time digging into this? For multicore, would the existing data dirs be compatible, or would a complete reindex be needed?

      I'm willing to provide any information to you guys, just not sure what at the moment. I'm also open to communicating outside of JIRA, at artem [_aT_] plaxo {dot} com.

      Thanks.

      Attachments

      1. catalina2.out (167 kB, Artem Russakovskii)
      2. catalina.out (87 kB, Artem Russakovskii)

        Activity

        Artem Russakovskii created issue -
        Artem Russakovskii added a comment -

        I'm getting an error even just trying to access a single shard's admin interface, even after adjusting -XX:MaxPermSize=512m

        ==> catalina.out <==
        Oct 1, 2009 6:47:06 PM org.apache.coyote.http11.Http11Processor process
        SEVERE: Error processing request
        java.lang.OutOfMemoryError: PermGen space
        at java.lang.Throwable.getStackTraceElement(Native Method)
        at java.lang.Throwable.getOurStackTrace(Throwable.java:591)
        at java.lang.Throwable.printStackTrace(Throwable.java:510)
        at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:72)
        at org.apache.juli.FileHandler.publish(FileHandler.java:129)
        at java.util.logging.Logger.log(Logger.java:458)
        at java.util.logging.Logger.doLog(Logger.java:480)
        at java.util.logging.Logger.logp(Logger.java:680)
        at org.apache.juli.logging.DirectJDKLog.log(DirectJDKLog.java:167)
        at org.apache.juli.logging.DirectJDKLog.error(DirectJDKLog.java:135)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:324)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
        at java.lang.Thread.run(Thread.java:619)

        :-/

        Bill Au added a comment -

        You probably want to take a JVM thread dump (kill -3) while the JVM is hung to find out what's going on.

        Is your webapp being reloaded? You can check the appserver log file to see if that's happening. One common way of running out of PermGen space is a classloader leak, which occurs when a webapp is reloaded.
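
        A minimal sketch of taking that dump, assuming a default Tomcat setup where stdout goes to catalina.out (the pid is a placeholder):

            # Send SIGQUIT; HotSpot writes the full thread dump to the
            # process's stdout, i.e. catalina.out in a stock Tomcat install
            kill -3 <tomcat-pid>

            # Alternatively, on a Sun JDK 6, jstack prints the same dump to its own stdout
            jstack <tomcat-pid> > threaddump.txt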

        Artem Russakovskii added a comment -

        One thing I forgot to mention: when the hang occurs on the slave, 1 out of the 8 CPUs on the machine goes to 100%, which might point in the direction of a bug rather than a Java memory issue. Remember, the slave never throws those Java errors to the log; only the master does. The slave just hangs. Using htop, I can see one of the child java processes using that 100% CPU.

        Bill, the appserver log is catalina.out, right? In any case, I'm tailing every file in the tomcat log dir and that's the log I've been pasting and talking about.

        I've attached two full thread dumps taken with kill -3 (they're quite verbose), one from each slave (both slaves are affected now).

        The first one, catalina.out, is from the slave that had the PermGen limit raised to 512MB; the second one, catalina2.out, is from the server without any changes to the PermGen limit.
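
        One way to tie the 100%-CPU thread back to those dumps, assuming a Linux box with top and a HotSpot JVM (pid and tid below are placeholders):

            # Show per-thread CPU usage for the java process; note the busy TID
            top -H -p <tomcat-pid>

            # HotSpot dumps list native thread ids in hex as nid=0x..., so convert
            # the decimal TID and search for it in the kill -3 output
            printf 'nid=0x%x\n' <busy-tid>
            grep -A 20 'nid=0x<hex-tid>' catalina.out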

        Artem Russakovskii made changes -
        Attachment added: catalina.out [ 12421140 ]
        Attachment added: catalina2.out [ 12421141 ]
        Artem Russakovskii added a comment - edited

        Also, just saw this on the first slave:

        INFO: Closing Searcher@3efceb09 main
                fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
        Oct 2, 2009 11:43:27 AM org.apache.solr.handler.SnapPuller doCommit
        INFO: Force open index writer to make sure older index files get deleted
        Oct 2, 2009 11:43:35 AM org.apache.solr.update.SolrIndexWriter finalize
        SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
        
        Gus Heck added a comment -

        I've seen a lock-up similar to this with just a single stand-alone instance, no master-slave relationship, so that may be a red herring.

        Sep 29, 2011 7:14:41 AM org.apache.solr.update.SolrIndexWriter finalize
        SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
        Sep 29, 2011 9:22:16 AM org.apache.catalina.core.AprLifecycleListener init
        INFO: Loaded APR based Apache Tomcat Native library 1.1.20.
        

        (Solr 1.4.1, Tomcat 7, Windows in my case, on JDK 1.6)

        The server was completely unresponsive and the Tomcat service wasn't even responding to a restart request. The machine had to be rebooted to get Tomcat going again.

        Shalin Shekhar Mangar added a comment -

        This is a very old issue. I haven't heard of similar hangs happening recently. Please let us know if anyone thinks it should be re-opened.

        Shalin Shekhar Mangar made changes -
        Status: Open [ 1 ] → Resolved [ 5 ]
        Resolution: Cannot Reproduce [ 5 ]

          People

          • Assignee: Unassigned
          • Reporter: Artem Russakovskii
          • Votes: 1
          • Watchers: 4
