It's been both fun and a learning experience debugging this. I have good news and bad news:
- the good news is: it's not a memory leak,
- the bad news is: it's not a memory leak
the debugging process
Clearly permgen is one of the most wicked JVM features - it's damn hard to figure out what its
content really is (I didn't find a way to dump it from within a running process without invoking
the debugging interface, which in turn starts its own threads, etc.).
The way I approached the problem (which may be useful for future reference) is as follows:
- I wrote a short aspect that injects itself before any String.intern is called:
pointcut targetMethod(): call(String java.lang.String.intern());
before() : targetMethod()
{
final JoinPoint jp = thisJoinPoint;
System.out.println("String#intern() from: "
+ jp.getSourceLocation().getWithinType() + " => "
+ jp.getTarget());
}
- then I added a Before and After hook (executed before/after each test) that dumped memory pools:
System.out.println("Memdump#from: "
+ this.getClass().getName() + " => ");
for (MemoryPoolMXBean bean : ManagementFactory.getMemoryPoolMXBeans()) {
MemoryUsage usage = bean.getUsage();
System.out.println(
String.format(Locale.ENGLISH,
"%20s - I:%7.1f U:%7.1f M:%7.1f",
bean.getName(),
usage.getInit() / (1024 * 1024.0d),
usage.getUsed() / (1024 * 1024.0d),
usage.getMax() / (1024 * 1024.0d)));
}
the results
From the dumped output streams we have the following weave info indicating which methods run String.intern:
$ grep "String.intern(" junit4-J0-20130710_122632_726.syserr
in Type 'com.ctc.wstx.util.SymbolTable'
in Type 'com.ctc.wstx.util.SymbolTable'
in Type 'com.ctc.wstx.util.InternCache'
in Type 'org.apache.solr.response.JSONWriter'
in Type 'org.apache.lucene.codecs.lucene3x.TermBuffer'
in Type 'org.apache.lucene.codecs.lucene3x.Lucene3xFields$PreTermsEnum'
in Type 'org.apache.solr.common.luke.FieldFlag'
in Type 'org.apache.solr.search.DocSetPerf'
in Type 'org.joda.time.tz.ZoneInfoProvider'
in Type 'org.joda.time.tz.DateTimeZoneBuilder$PrecalculatedZone'
in Type 'org.joda.time.tz.DateTimeZoneBuilder$Recurrence'
in Type 'org.joda.time.chrono.GJLocaleSymbols'
in Type 'org.apache.solr.request.TestWriterPerf'
These indeed intern a lot of strings but they're typically the same so they don't amount to the growth of permgen.
This in turn is very steady over the runtime of the test JVM:
$ egrep -o -e "PS Perm Gen[^%]+" junit4-J0-20130710_122632_726.sysout
PS Perm Gen - I: 20.8 U: 15.9 M: 82.0
PS Perm Gen - I: 20.8 U: 16.1 M: 82.0
PS Perm Gen - I: 20.8 U: 34.6 M: 82.0
PS Perm Gen - I: 20.8 U: 37.7 M: 82.0
PS Perm Gen - I: 20.8 U: 37.7 M: 82.0
PS Perm Gen - I: 20.8 U: 37.9 M: 82.0
PS Perm Gen - I: 20.8 U: 37.9 M: 82.0
PS Perm Gen - I: 20.8 U: 38.0 M: 82.0
...
PS Perm Gen - I: 20.8 U: 77.3 M: 82.0
PS Perm Gen - I: 20.8 U: 77.4 M: 82.0
PS Perm Gen - I: 20.8 U: 77.4 M: 82.0
PS Perm Gen - I: 20.8 U: 77.4 M: 82.0
PS Perm Gen - I: 20.8 U: 77.4 M: 82.0
I stands for "initial", U for "used", M for "maximum". So you can see that the permgen is nearly-exhausted in this run
(it didn't OOM though). Out of curiosity I checked for class loading markers – classes are loaded throughout the whole run,
because each test loads different fragments of the code. So even at the end of the run you get things like:
Memdump#from: org.apache.solr.update.processor.ParsingFieldUpdateProcessorsTest =>
Code Cache - I: 2.4 U: 27.0 M: 48.0
PS Eden Space - I: 62.9 U: 68.7 M: 167.9
PS Survivor Space - I: 10.4 U: 0.8 M: 0.8
PS Old Gen - I: 167.5 U: 97.8 M: 341.4
PS Perm Gen - I: 20.8 U: 72.7 M: 82.0
[Loaded org.joda.time.ReadWritableInstant from file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.ReadWritableDateTime from file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.MutableDateTime from file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.field.AbstractReadableInstantFieldProperty from file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.MutableDateTime$Property from file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
[Loaded org.joda.time.chrono.GJLocaleSymbols from file:/C:/Work/lucene-solr-svn/branch_4x/solr/core/lib/joda-time-2.2.jar]
Memdump#from: org.apache.solr.update.processor.ParsingFieldUpdateProcessorsTest =>
It seems like the problem leading to the permgen is just the huge number of classes being loaded under a single class loader (and these
classes cannot be unloaded because they're either cross-referenced or something else is holding on to them).
verifying the class-number hipothesis
It was interesting to answer the question: how much permgen space would it take to load all these classes without running tests? I wrote
a small utility that parses the output log with class loading information:
...
[Loaded org.apache.lucene.index.DocTermOrds from file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/]
[Loaded org.apache.lucene.search.FieldCacheImpl$DocTermOrdsCache from file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/]
[Loaded org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache from file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/]
[Loaded org.apache.lucene.search.FieldCache$2 from file:/C:/Work/lucene-solr-svn/branch_4x/lucene/build/core/classes/java/]
...
and turns it into a custom URLClassLoader with the URLs that appear in those entries. Then the tool attempts to load all the referenced classes (and run initializers)
but does not do anything else. It also dumps the permgen state every 100 classes. The results are as follows:
# 10240 classes from 61 sources.
0 - Code Cache - I: 2.4 U: 0.6 M: 48.0
0 - PS Eden Space - I: 62.9 U: 54.1 M: 1319.1
0 - PS Survivor Space - I: 10.4 U: 2.6 M: 10.4
0 - PS Old Gen - I: 167.5 U: 0.0 M: 2680.0
0 - PS Perm Gen - I: 20.8 U: 4.1 M: 82.0
...
1400 - Code Cache - I: 2.4 U: 0.7 M: 48.0
1400 - PS Eden Space - I: 62.9 U: 18.8 M: 1319.1
1400 - PS Survivor Space - I: 10.4 U: 3.5 M: 10.4
1400 - PS Old Gen - I: 167.5 U: 0.0 M: 2680.0
1400 - PS Perm Gen - I: 20.8 U: 12.0 M: 82.0
...
6200 - Code Cache - I: 2.4 U: 1.3 M: 48.0
6200 - PS Eden Space - I: 62.9 U: 33.3 M: 1319.1
6200 - PS Survivor Space - I: 10.4 U: 10.4 M: 10.4
6200 - PS Old Gen - I: 167.5 U: 10.7 M: 2680.0
6200 - PS Perm Gen - I: 20.8 U: 45.6 M: 82.0
...
10239 - Code Cache - I: 2.4 U: 1.5 M: 48.0
10239 - PS Eden Space - I: 62.9 U: 4.8 M: 1319.1
10239 - PS Survivor Space - I: 10.4 U: 10.4 M: 10.4
10239 - PS Old Gen - I: 167.5 U: 21.7 M: 2680.0
10239 - PS Perm Gen - I: 20.8 U: 71.5 M: 82.0
which, if you forgot already, very nicely matches the result acquired from the real test run (classes plus
interned strings):
Memdump#from: org.apache.solr.util.FileUtilsTest =>
Code Cache - I: 2.4 U: 24.3 M: 48.0
PS Eden Space - I: 62.9 U: 39.6 M: 166.5
PS Survivor Space - I: 10.4 U: 1.1 M: 2.1
PS Old Gen - I: 167.5 U: 173.9 M: 341.4
PS Perm Gen - I: 20.8 U: 77.4 M: 82.0
I repeated the above results with JDK 1.7 (64 bit) and the required permgen space is smaller:
10239 - Code Cache - I: 2.4 U: 1.3 M: 48.0
10239 - PS Eden Space - I: 62.9 U: 97.5 M: 1319.1
10239 - PS Survivor Space - I: 10.4 U: 10.4 M: 10.4
10239 - PS Old Gen - I: 167.5 U: 27.5 M: 2680.0
10239 - PS Perm Gen - I: 20.8 U: 59.9 M: 82.0
which may be a hint why we're seing the problem only on 1.6 – we're running very close to the limit and 1.6
is less space-conservative.
I also ran it with jrockit (for fun):
10239 - Nursery - I: -0.0 U: 13.0 M: 2918.4
10239 - Old Space - I: 64.0 U: 62.0 M: 3072.0
10239 - Class Memory - I: 0.5 U: 68.7 M: -0.0
10239 - ClassBlock Memory - I: 0.5 U: 4.0 M: -0.0
and with J9:
10239 - class storage - I: 0.0 U: 41.3 M: -0.0
10239 - JIT code cache - I: 0.0 U: 8.0 M: -0.0
10239 - JIT data cache - I: 0.0 U: 0.3 M: -0.0
10239 - miscellaneous non-heap storage - I: 0.0 U: 0.0 M: -0.0
10239 - Java heap - I: 4.0 U: 38.3 M: 512.0
conclusions
So it's the number of classes that is the core of the problem. The workarounds in the order of difficulty:
- increase max permgen for hotspot (other JVMs should be able to do it dynamically),
- split solr core tests into multiple ant sub-calls so that they don't run in a single JVM,
- change the runner to support running tests in isolation (for example max-N tests per JVM, then relaunch)
- probably a lot more options here, depending on your current creativity levels
See: http://mail-archives.apache.org/mod_mbox/lucene-dev/201507.mbox/%3C05bb01d0c5e3%2442ed9960%24c8c8cc20%24%40thetaphi.de%3E