Summary: | Problem with String.intern() in CategoryKey | ||
---|---|---|---|
Product: | Log4j - Now in Jira | Reporter: | Vincent Shek <wsshek> |
Component: | Other | Assignee: | log4j-dev <log4j-dev> |
Status: | RESOLVED FIXED | ||
Severity: | major | ||
Priority: | P2 | ||
Version: | 1.2 | ||
Target Milestone: | --- | ||
Hardware: | Other | ||
OS: | All |
Description
Vincent Shek
2005-05-25 04:47:35 UTC
It doesn't seem like it makes much sense for intern() to be used... I find it does save a lot on memory, if you are reading values from a file or database and expect the same values to appear multiple times, but this is not that use case. In any case, the typical usage pattern is Logger.getLogger() is called once and the returned value is kept in a static variable. I don't see any other use of String.intern() in mainline, which is good. 1.2.12 candidate I don't know if this is fixable, but I think it should be looked at to understand why we are using String.intern(). So, this is the code for CategoryKey: class CategoryKey { String name; int hashCache; CategoryKey(String name) { this.name = name.intern(); hashCache = name.hashCode(); } final public int hashCode() { return hashCache; } final public boolean equals(Object rArg) { if(this == rArg) return true; if(rArg != null && CategoryKey.class == rArg.getClass()) return name == ((CategoryKey)rArg ).name; else return false; } } I think we are using the String.intern() call so that we can do the == between the two CategoryKey objects. Theoretically the String.intern() call will return the same String object for the same string value. I am assuming that using this instead of String.equals() is the speed increase. But I don't understand why this is really an issue. Only one version of the String should be added/created in the StringTable. Once a given logger has been created, no more entries to the StringTable should be added related to that logger. Are you creating that many unique Loggers over time? I am not creating many unique loggers over time, but Logger.getLogger() will still call new CategoryKey() even thought I am accessing the same logger. e.g. String.intern() will be called twice if I do the following: Logger.getLogger("logger1"); Logger.getLogger("logger1"); Right. But this "should" not matter, as per the String Javadoc for intern(): "A pool of strings, initially empty, is maintained privately by the class String. When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned. It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true." So, for any given logger, only one String object should be in the StringTable, not multiple. Unless String.equal() is returning not equal for Strings with the same values? I wonder if this problem is related to CategoryKey or something more to do with the "C:L" conversion pattern. I can't see any problem with the use of intern here either. Class CategoryKey is used in only one way: as the key to the hashtable of named Logger objects maintained by a Hierarchy. And under normal patterns of use, even a very large program is only likely to have a few hundreds of differently-named Logger objects. And as mark said, calling String.intern("foo") any number of times still creates just one entry in the interned-string table. That's the whole point of intern! Are you perhaps dynamically generating category names, eg by Logger l = Logger.getLogger("some.category." + currentDateTime); This would cause large numbers of distinctly-named Logger objects to be created, and of course all those distinct names *would* clutter the interned-string table. However it's a pretty odd way to use log4j.. I agree with both of you that in most cases, the String.intern() should not cause any problems. I have to revisit my application to make sure I am not creating so many unique loggers which led to my original problem. However, if you using a small fixed size permanent heap, a few thousands different loggers would start to create problem in the stringtable. A few thousand distinct logger names implies one monstrous app. I wouldn't ever expect to see something like that running with "a small heap". What kind of app are you writing that has "thousands" of distinct logger names but only a small heap? Note that the number of *loggers* is irrelevant. It's the *names* of those loggers that go into the String.intern pool, so a thousand Logger objects with the same name --> one entry in the intern pool. In addition, the String.intern method says this: "All literal strings and string-valued constant expressions are interned." I've checked this and it is definitely true. So if these loggers are created like Logger l = new Logger("some.literal.string") then that constant string is automatically interned anyway; it happens when the class is loaded. So the CategoryKey call to String.intern isn't adding any new entries to the constant string pool; it simply locates the literal value that was put there by the classloading process. Only when the logger name is dynamically generated like: l = new Logger("foo" + someParam) will the ConstantKey's call to String.intern force the constant string pool to get a new value. And that just isn't a common logging idiom. I really think your initial problem diagnosis is wrong and you should look again. You've clearly got a problem but I find it hard to believe it's due to calls to String.intern. NB: this is all just my personal opinion of course. NB: your example code does show that a problem can occur when many hundreds of thousands of *unique* strings are generated and interned. But that isn't the case for logging - unless you're generating logger names via some process like Logger l = Logger.getLogger("some.category." + currentDateTime); as I described earlier. If you are -- don't do that! It sounds to me that getting rid of intern doesn't hurt performance considerably. Consider these three cases: 1. You're doing: Logger.getLogger("logger1"); ... Logger.getLogger("logger1"); Using a string constant here, "logger1" was already interned and "logger1".equals("logger1") is fast. No need to explicitly do an intern() 2. You're evaluating the string, you're doing: int i = 1; Logger.getLogger("logger" + i); ... Logger.getLogger("logger" + i); In which case, the cost of intern() is MUCH more expensive than intern anyway, so why bother? 3. You're storing the evaluated string and a combination of above. Again, intern() is more expensive than equals() so why bother. I don't understand the need for intern() at all. Sounds like a premature optimization bug. (In reply to comment #10) > Logger.getLogger("logger1"); > > Using a string constant here, "logger1" was already interned true > "logger1".equals("logger1") is fast. No need to explicitly do an intern() I agree that the literal strings in code are automatically interned, and the String.equals method should be checking for identity before bothering to do a loop checking the string content [1]. There's still the overhead of a (non-virtual) method invocation and return though. However the main point of this is to optimise the comparisons executed as part of Hashtable.get when looking up the logger from the map of all loggers. And Hashtable.get may well do several *failed* comparisons before finding the right logger to return. The operation "string1" == "string2" is fast always; if the references are different then the result is false. However "string1".equals("string2") has a fair bit of overhead. The String.equals method needs to do: * if (param instanceof String) * cast param to String * if string lengths differ: return false * otherwise compare each byte [1] [1] or potentially compare hashcodes. The reference comparison starts to look nice.. How many such non-equal comparisons occur in the Hashtable lookup? Well that depends upon the hashcode collision rate inside the hashtable; obviously two strings whose hashcodes cause them to be allocated to different buckets never get compared. The java.util.HashMap class has a default loadFactory of 0.75. So at a wild guess I would think only about 30% of lookups would hit a bucket with more than one entry, and in 50% of those cases the right entry would be first. So I agree this optimisation is not critical, but it could have a measurable improvement. > 2. You're evaluating the string, you're doing: > > int i = 1; > Logger.getLogger("logger" + i); > ... > Logger.getLogger("logger" + i); > > In which case, the cost of intern() is MUCH more expensive than intern anyway, > so why bother? True. However such usage really is a little bizarre. > > 3. You're storing the evaluated string and a combination of above. Again, > intern() is more expensive than equals() so why bother. > > I don't understand the need for intern() at all. Sounds like a premature > optimization bug. A very common pattern is: Logger.getLogger(someClass.getName()); I've checked with Sun java 1.3.1 and 1.5 and the string returned by someClass.getName does get interned automatically like literal strings in the code. The question is whether that behaviour should be relied on or not. I'm not aware of anything in the java specs that mandate this string going into the intern pool (unlike literals in the code which *must* go there) but I think it likely that all Java implementations *will* do this; when the JVM is loading the raw .class file into memory it needs to store the literal strings somewhere and putting them in the intern pool seems as good a place as any. So: first of all, string comparisons in the logger Hashtable lookup will not be common (maybe 30% of lookups). And in all the scenarios where interning makes sense, the strings seem to already be interned; comparing two equal interned strings with .equals should be close to the speed of comparing them with ==. Where the two strings are actually different (maybe 15% of lookups) there are two cases: * strings have different length (most of the time) --> minor win for the interned comparison * strings have same length (pretty rare) --> major win for the interned comparison Summary: From a theoretical point I tend to agree that this use of String.intern won't have much effect and could have been omitted. Of course actually testing this might be wise -- real stats win over theory any day :-). But on the other hand, String.intern only does active harm when the user is generating large numbers of logger names dynamically - and I can't see any sane reason for anyone to do that. [1] Actually, GNU Classpath doesn't! Currently comparison of a string object with itself does a reasonable amount of work (though not a scan of the actual string data). I've sent off an email querying this. The String.intern (potentially mal-) optimization appears to be the only reason that CategoryKey exists. Switching to a String-keyed hashtable would eliminate the need to create and reclaim CategoryKey objects on the calls to getLogger(). Unfortunately, Hierarchy.Hashtable is package-visible, so it might be necessary to maintain a distinct string-keyed hashmap/hashtable in preference and preserve ht for classes squatting in our package. I've committed a performance test for Logger.getLogger on the CVS HEAD. To run: cd tests ant build ' once to build unit tests ant -f performance.xml getLogger I've run the tests with the current implementation and two alternative implementations. The test, by default, calls getLogger() for 1000 different logger names of typical length. The first pass is much more expensive than the subsequent passes since new instances of Logger will be created for each name. Subsequent passes simply retrieve the previously created loggers. The baseline results were ~3200 ms for the pass 0 and 123 ms on average for passes 1-9. The first modification was: diff -r1.5 CategoryKey.java 29c29 < this.name = name.intern(); --- > this.name = name; 43c43 < return name == ((CategoryKey) rArg).name; --- > return name.equals(((CategoryKey) rArg).name); After this change: pass 0: ~1900 ms, 77 ms average, 38 ms min for passes 1- 9 This indicates that (at least on this JVM), that initial call to String.intern adds substantial time to the creation of a new Logger and use of == is slower than name.equals() surprisingly. The second change was to eliminate the use of CategoryKey altogether: RCS file: /home/cvs/logging-log4j/src/java/org/apache/log4j/Hierarchy.java,v retrieving revision 1.62 diff -r1.62 Hierarchy.java 79c79 < Hashtable ht; --- > private Hashtable ht; 227c227 < Object o = ht.get(new CategoryKey(name)); --- > Object o = ht.get(name); 447c447 < CategoryKey key = new CategoryKey(name); --- > String key = name; 726c726 < CategoryKey key = new CategoryKey(substr); // simple constructor --- > String key = substr; // simple constructor The results for this mod were, on average, slightly better than the previous iteration at ~1800 ms and 61 ms average, 20 ms min for passes 1-9. These results were from: java version "1.4.2_07" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_07-215) Java HotSpot(TM) Client VM (build 1.4.2-50, mixed mode) on Mac OS/X. Running a JRE 1.2 on Win98 SE: Stock: Pass 0: 1000 ms, pass 1-9: avg 66, min 0 Mod 1: Pass 0: 550 ms, Pass 1-9: avg 48, min 0 Mod 2: Pass 0: 550 ms, pass 1-9: avg 67, min 0 Running JRE 1.5.0 on Win XP Stock: Pass 0: 390 ms, pass 1-9: avg 12 ms, min 0 Mod 1: Pass 0: 203 ms, pass 1-9: avg 5 ms, min 0 Mod 2: Pass 0: 172 ms, pass 1-9: avg 3 ms, min 0 My take is that anything 1.2 or later would likely not benefit from the String.intern "optimization". Unless we can find a platform where it does offer a compelling advantage, my current take would be to do Mod 1 on the log4j 1.2 branch and mark CategoryKey deprecated (it is already only visible within the package) and do Mod 2 on the CVS HEAD. Did the minor change to CategoryKey in both the CVS HEAD and 1.2 branch. Expect to eliminate CategoryKey in 2.0. Wow. This change is in the 1.2.12rc1 build. Vincent, please check that it fixes the reported problem. I will run some benchmarks to see if I get the same performance results. Thx so much. I will check out rc1 and will report my findings (but not too soon) I tried this on my Windows machine using jdk 1.4.2_07. I modified the performance.xml to reference the 1.2.11 and 1.2.12rc2 jars instead of the classes build in the cvs head. I see an improvement in the 1.2.12 version: getLogger: (using jdk 1.4.2_07, log4j 1.2.11) [java] Usage: java org.apache.log4j.GetLoggerTest passes [nameCount nameLength] [java] Pass 0: 1232 ms. [java] Average non-initial pass: 34 ms. getLogger: (using jdk 1.4.2_07, log4j 1.2.12rc2) [java] Usage: java org.apache.log4j.GetLoggerTest passes [nameCount nameLength] [java] Pass 0: 832 ms. [java] Average non-initial pass: 21 ms. Not an incredible improvement in my environment (WinXP HE, Pentium III, 652 MHz, 256 MB RAM), but an improvement. More interestingly to me is that with the 1.2.12 code, the times for the non-initial passes is highly variable (in my tests from 0 - 70ms). The avg is less though. Vincent, get a chance to try this in your environment? |