James Server
  1. James Server
  2. JAMES-592

OOM caused by unbounded cache in InetAddress (was James leaks memory slowly)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 2.2.0, 2.3.0
    • Fix Version/s: 2.3.1
    • Component/s: None
    • Labels:
      None

      Description

      Noel wrote on list:
      I do not know where in the application it is happening, but after running
      JAMES non-stop since Fri Aug 11 03:29:57 EDT 2006, this morning the JVM
      started to throw OutOfMemoryError exceptions, such as:

      21/08/06 08:39:47 WARN mailstore: Exception retrieving mail:
      java.lang.RuntimeException: Exception caught while retrieving an object,
      cause: java.lang.OutOfMemoryError, so we're deleting it.

      That did not recover, so it wasn't just due to a transient large allocation
      (which I limit, anyway), so there is definitely something leaking, albeit
      slowly. Keep in mind that the store was one of the victims, but not
      necessarily the cause.

      The JVM process size had steadily grown from a somewhat stable 114MB to
      130MB last night. I did not look at it this morning before restarting the
      server.

      — Noel

        Issue Links

          Activity

          Hide
          Danny Angus added a comment -

          Closing issue fixed in released version.

          Show
          Danny Angus added a comment - Closing issue fixed in released version.
          Hide
          Stefano Bagnara added a comment -

          Now james uses a default 300 seconds expiration for the positive dns results cache.
          The expiration is tunable via system property -Dnetworkaddress.cache.ttl
          Setting it to -1 will revert to the default "cache forever" JVM behaviour.
          Setting it to 0 will remove caching at all.
          Applied to trunk 4 days ago and backported to v2.3 (for 2.3.1) now.

          Show
          Stefano Bagnara added a comment - Now james uses a default 300 seconds expiration for the positive dns results cache. The expiration is tunable via system property -Dnetworkaddress.cache.ttl Setting it to -1 will revert to the default "cache forever" JVM behaviour. Setting it to 0 will remove caching at all. Applied to trunk 4 days ago and backported to v2.3 (for 2.3.1) now.
          Hide
          Stefano Bagnara added a comment -

          I reopen this while we decide what kind of fix backport to each of our branches.
          Also removing Trunk from fix versions as everything we have currenlty in trunk will be in next-major so it is more clean if we use only next-major for versions. Like we did previously for v2.3 vs trunk.

          Show
          Stefano Bagnara added a comment - I reopen this while we decide what kind of fix backport to each of our branches. Also removing Trunk from fix versions as everything we have currenlty in trunk will be in next-major so it is more clean if we use only next-major for versions. Like we did previously for v2.3 vs trunk.
          Hide
          Noel J. Bergman added a comment -

          Issue was caused by a single InetAddress.getHostName() call in SMTPHandler.java (also present in each of the other handlers). Resolved by replacing with a call to the DNSServer, which uses dnsjava, and doesn't suffer from the same problems as Sun's InetAddress implementation.

          Show
          Noel J. Bergman added a comment - Issue was caused by a single InetAddress.getHostName() call in SMTPHandler.java (also present in each of the other handlers). Resolved by replacing with a call to the DNSServer, which uses dnsjava, and doesn't suffer from the same problems as Sun's InetAddress implementation.
          Hide
          Stefano Bagnara added a comment -

          Updated the subject of the issue to be more specific now that Noel found out the source of the problem.

          Show
          Stefano Bagnara added a comment - Updated the subject of the issue to be more specific now that Noel found out the source of the problem.
          Hide
          Noel J. Bergman added a comment -

          Yes. We discussed Serge's e-mail at the time (the InetAddres part was old news, even then), have aleady made changes related to the dnsjava cache, and were supposed to have eliminated all uses of InetAddress for resolution. It just seems that in v2.3.0, we missed a few places that still populated the InetAddress cache instead of calling dnsjava.

          We have known about this problem for years, which is why we had previously set about removing all uses of InetAddress for resolution from the code. We just missed some. See also http://www.rgagnon.com/javadetails/java-0445.html, which explains the problem you had (as also noted by http://java.sun.com/j2se/1.4.2/docs/api/java/net/InetAddress.html for JDK 1.4 and http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html for JDK 5, the non-Sun properties are security policy properties).

          The proper solution is to use dnsjava for everything, and avoid anything that impacts the InetAddress cache.

          Show
          Noel J. Bergman added a comment - Yes. We discussed Serge's e-mail at the time (the InetAddres part was old news, even then), have aleady made changes related to the dnsjava cache, and were supposed to have eliminated all uses of InetAddress for resolution. It just seems that in v2.3.0, we missed a few places that still populated the InetAddress cache instead of calling dnsjava. We have known about this problem for years, which is why we had previously set about removing all uses of InetAddress for resolution from the code. We just missed some. See also http://www.rgagnon.com/javadetails/java-0445.html , which explains the problem you had (as also noted by http://java.sun.com/j2se/1.4.2/docs/api/java/net/InetAddress.html for JDK 1.4 and http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html for JDK 5, the non-Sun properties are security policy properties). The proper solution is to use dnsjava for everything, and avoid anything that impacts the InetAddress cache.
          Hide
          Stefano Bagnara added a comment -

          I think this is related to what Serge reported here:
          http://mail-archives.apache.org/mod_mbox/james-server-dev/200604.mbox/%3C140e061f0604251100s3820ba63t583200f0fa679043@mail.gmail.com%3E
          http://www.lokitech.com/~sergek/james/dnsleak/index

          Maybe you should try starting phoenix with this property:
          -Dsun.net.inetaddr.ttl=10
          (the documented property for jvm 14 is "-Dnetworkaddress.cache.ttl=10" but it seems this won't work)

          Show
          Stefano Bagnara added a comment - I think this is related to what Serge reported here: http://mail-archives.apache.org/mod_mbox/james-server-dev/200604.mbox/%3C140e061f0604251100s3820ba63t583200f0fa679043@mail.gmail.com%3E http://www.lokitech.com/~sergek/james/dnsleak/index Maybe you should try starting phoenix with this property: -Dsun.net.inetaddr.ttl=10 (the documented property for jvm 14 is "-Dnetworkaddress.cache.ttl=10" but it seems this won't work)
          Hide
          Stefano Bagnara added a comment -

          If you believe the problem is in the log rotation you may want to try replacing the DefaultLoggerManager with ExtendedLog4jLoggerManager in your kernel.xml.
          If the problem is in logkit or in its manager it won't leak anymore.

          Show
          Stefano Bagnara added a comment - If you believe the problem is in the log rotation you may want to try replacing the DefaultLoggerManager with ExtendedLog4jLoggerManager in your kernel.xml. If the problem is in logkit or in its manager it won't leak anymore.
          Hide
          Noel J. Bergman added a comment -

          Changing priority based upon JIRA definitions. This would be either critical or major, depending upon one's interpretation of the severity of the memory leak.

          Show
          Noel J. Bergman added a comment - Changing priority based upon JIRA definitions. This would be either critical or major, depending upon one's interpretation of the severity of the memory leak.
          Hide
          Noel J. Bergman added a comment -

          As per comments on mailing list:

          Unfortunately, I once again confirm (this time using the new memstat check in RemoteManager) that we are losing about 2MB per day. And it does seem related to the days, not the amount of mail.

          I wonder about the memory consumed by those Phoenix artifacts that showed up in the prior logs. Considering that both Stefano and Norman run with larger heaps than I do, and that at least Stefano rotates logs relatively infrequently compared to my daily rotation, it is entirely possible that they would not see such leaks for extended periods of time. Also, load testing wouldn't help in this case, since it seems to be independent of load.

          I am still prepared to release, but we will have to document this issue.

          To check for this in your environment, once a day, run memstat -gc (new to 2.3.0 RC3) in the RemoteManager, and record the available heap after the garbage collection. I have done this for the past 3 days, and see ~2MB less memory available each day.

          Show
          Noel J. Bergman added a comment - As per comments on mailing list: Unfortunately, I once again confirm (this time using the new memstat check in RemoteManager) that we are losing about 2MB per day. And it does seem related to the days, not the amount of mail. I wonder about the memory consumed by those Phoenix artifacts that showed up in the prior logs. Considering that both Stefano and Norman run with larger heaps than I do, and that at least Stefano rotates logs relatively infrequently compared to my daily rotation, it is entirely possible that they would not see such leaks for extended periods of time. Also, load testing wouldn't help in this case, since it seems to be independent of load. I am still prepared to release, but we will have to document this issue. To check for this in your environment, once a day, run memstat -gc (new to 2.3.0 RC3) in the RemoteManager, and record the available heap after the garbage collection. I have done this for the past 3 days, and see ~2MB less memory available each day.
          Hide
          Noel J. Bergman added a comment -

          This isn't, and probably never was, about MySQL. JAMES leaks memory, and we're not really sure why, yet.

          Show
          Noel J. Bergman added a comment - This isn't, and probably never was, about MySQL. JAMES leaks memory, and we're not really sure why, yet.
          Hide
          Noel J. Bergman added a comment -

          Re-opening the issue, since it still exists and was prematurely closed.

          Show
          Noel J. Bergman added a comment - Re-opening the issue, since it still exists and was prematurely closed.
          Hide
          Norman Maurer added a comment -

          This OOM disappear after noel change to a new mysql connector.

          Show
          Norman Maurer added a comment - This OOM disappear after noel change to a new mysql connector.
          Hide
          Stefano Bagnara added a comment -

          What mysql server version do you use? What connector/j version do you use?
          If you're not running the latest connector/j upgrade it, otherwise try adding "dontTrackOpenResources=true" to your jdbc url. If with the latest driver and no option the memory leak is there and adding the option the memory leak disappear then we probably miss some resultset/connection/statement close in some path of our code, otherwise tell us the results.

          Show
          Stefano Bagnara added a comment - What mysql server version do you use? What connector/j version do you use? If you're not running the latest connector/j upgrade it, otherwise try adding "dontTrackOpenResources=true" to your jdbc url. If with the latest driver and no option the memory leak is there and adding the option the memory leak disappear then we probably miss some resultset/connection/statement close in some path of our code, otherwise tell us the results.
          Hide
          Bernd Fondermann added a comment -

          Maybe we can confirm a db/mysql specific problem by running identical scenarios on file/derby/mysql and compare mem consumption.
          At best, the scenario would mimmick Noel's production load (scaled up, to accelerate the process), if he likes to disclose this information.

          Show
          Bernd Fondermann added a comment - Maybe we can confirm a db/mysql specific problem by running identical scenarios on file/derby/mysql and compare mem consumption. At best, the scenario would mimmick Noel's production load (scaled up, to accelerate the process), if he likes to disclose this information.
          Hide
          Norman Maurer added a comment -

          Ok after recheck our code it seems that we did all correct. So any idea is welcome

          Show
          Norman Maurer added a comment - Ok after recheck our code it seems that we did all correct. So any idea is welcome
          Hide
          Norman Maurer added a comment -

          After reading the bugreport and have a quick look at our code it seems that we need to add "ResultSet.close()" to our code.

          Show
          Norman Maurer added a comment - After reading the bugreport and have a quick look at our code it seems that we need to add "ResultSet.close()" to our code.
          Hide
          Norman Maurer added a comment -

          One thing to review:

          https://list.scms.waikato.ac.nz/pipermail/wekalist/2006-June/007468.html

          I do use org.gjt.mm.mysql.Driver, which is the only software in the JVM not
          part of the standard JAMES 2.3.0 distribution.

          See also: http://bugs.mysql.com/bug.php?id=20003

          Show
          Norman Maurer added a comment - One thing to review: https://list.scms.waikato.ac.nz/pipermail/wekalist/2006-June/007468.html I do use org.gjt.mm.mysql.Driver, which is the only software in the JVM not part of the standard JAMES 2.3.0 distribution. See also: http://bugs.mysql.com/bug.php?id=20003

            People

            • Assignee:
              Stefano Bagnara
              Reporter:
              Norman Maurer
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development