Bug 51400 - Use of "new String(byte[] b, String enc)" hits Sun JVM bottleneck
Summary: Use of "new String(byte[] b, String enc)" hits Sun JVM bottleneck
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 6
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 6.0.32
Hardware: All All
: P2 enhancement (vote)
Target Milestone: default
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-21 00:47 UTC by Dave Engberg
Modified: 2014-02-17 13:41 UTC (History)
1 user (show)



Attachments
Patch with optimizations (4.79 KB, patch)
2011-06-21 00:47 UTC, Dave Engberg
Details | Diff
Updated patch that works with Java 1.5 (5.37 KB, patch)
2011-06-21 16:09 UTC, Mark Thomas
Details | Diff
Patch to prepopulate available charsets (2.47 KB, patch)
2011-06-23 20:24 UTC, Konstantin Preißer
Details | Diff
Updated patch (5.61 KB, patch)
2011-06-27 14:07 UTC, Mark Thomas
Details | Diff
Patch for tc6.0.x (5.76 KB, patch)
2011-06-27 15:21 UTC, Mark Thomas
Details | Diff
Small patch to update C2BConverter (1.16 KB, patch)
2011-06-27 16:06 UTC, Konstantin Preißer
Details | Diff
Patch for tc6.0.x (5.73 KB, patch)
2011-06-27 16:10 UTC, Mark Thomas
Details | Diff
Patch for tc6.0.x (4.68 KB, patch)
2011-06-28 07:37 UTC, Mark Thomas
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Dave Engberg 2011-06-21 00:47:24 UTC
Created attachment 27186 [details]
Patch with optimizations

We're using Tomcat 6 for a high-volume, high-concurrency service (Evernote).  At times, we've seen a performance slowdown within the service, which we've traced to a concurrency flaw within the JVM code that translates named encodings (e.g. "utf-8") into Charsets.  This translates into a number of stuck threads trying to convert a byte array to a String or vice versa, ala:

  java.lang.Thread.State: BLOCKED (on object monitor)
       at sun.nio.cs.FastCharsetProvider.charsetForName(Unknown Source)
       - waiting to lock <0x00007ff3b4cc85b0> (a sun.nio.cs.StandardCharsets)
       at java.nio.charset.Charset.lookup2(Unknown Source)
       at java.nio.charset.Charset.lookup(Unknown Source)
       at java.nio.charset.Charset.isSupported(Unknown Source)
       at java.lang.StringCoding.lookupCharset(Unknown Source)
       at java.lang.StringCoding.decode(Unknown Source)
       at java.lang.String.<init>(Unknown Source)
       at org.apache.tomcat.util.buf.ByteChunk.toStringInternal(ByteChunk.java:499)
       at org.apache.tomcat.util.buf.StringCache.toString(StringCache.java:315)
       at org.apache.tomcat.util.buf.ByteChunk.toString(ByteChunk.java:492)
       at org.apache.tomcat.util.buf.MessageBytes.toString(MessageBytes.java:213)
       at org.apache.tomcat.util.http.MimeHeaders.getHeader(MimeHeaders.java:319)
       at org.apache.coyote.Request.getHeader(Request.java:330)
       at org.apache.catalina.connector.Request.getHeader(Request.java:1854)
       at org.apache.catalina.connector.RequestFacade.getHeader(RequestFacade.java:643)

This isn't a true deadlock, since each thread will eventually finish, but it can
significantly affect concurrency if there are a number of threads making heavy use of:
   new String(byte[] b, String encoding)
   String.getBytes()
   String.getBytes(String encoding)

This is, unfortunately, a known bottleneck within the JVM:
http://blog.inuus.com/vox/2008/05/the-mysteries-of-java-character-set-performance.html
http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6790402


To avoid this bottleneck in the JVM, we've patched our server to use the explicit Charset object for String encoding rather than the name of the charset, and then added a ConcurrentHashMap<String, Charset> to lookup charsets by encodings.

I've attached a patch with our fixes on 6.0.32

Just as a random FYI - the same issue hits MySQL's Java connector, so we'd occasionally see Tomcat and MySQL fighting over this same JVM chokepoint:  http://bugs.mysql.com/bug.php?id=61105
Comment 1 Mark Thomas 2011-06-21 11:16:15 UTC
Comment on attachment 27186 [details]
Patch with optimizations

Mark patch as patch so it can be viewed in the browser.
Comment 2 Mark Thomas 2011-06-21 14:32:27 UTC
Thanks for the report and the patch.

I've gone through the 7.0.x codebase and changed almost all of the calls to:
   new String(byte[] b, String encoding)
   String.getBytes()
   String.getBytes(String encoding)
to use a Charset. That fix will be in 7.0.17 onwards.

The 7.0.x patch is a lot larger than the patch you propose for 6.0.x. I'm currently considering what to propose for 6.0.x.
Comment 3 Konstantin Preißer 2011-06-21 15:05:53 UTC
Hello,
please note that String.getBytes(Charset charset), new String(byte[] bytes, Charset charset) etc. are available since Java 1.6 only, so these probably can't be used for Tomcat 6.

According to the second link in the report, Charset.decode() / Charset.encode() (with ByteBuffer/CharBuffer) can be used for Strings.
Comment 4 Mark Thomas 2011-06-21 15:10:00 UTC
Unfortunately, Tomcat 6 implements Servlet 2.5 and that requires Java 1.5 support. That means we can't use any of the String method that use Charset.

http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html does provide a pure Java 1.5 way of doing this.

With this in mind, I am going to go propose a variation of your patch for 6.0.x that doesn't use any 1.6 methods. Back-porting the 7.0.x patch is too big a change. I would rather focus on the known issue for 6.0.x.
Comment 5 Mark Thomas 2011-06-21 16:09:57 UTC
Created attachment 27189 [details]
Updated patch that works with Java 1.5
Comment 6 Mark Thomas 2011-06-21 16:13:44 UTC
Proposed for 6.0.x
Comment 7 Konstantin Preißer 2011-06-21 19:25:02 UTC
Hi,

could/should this also be fixed for calls to new OutputStreamWriter(OutputStream out, String charsetName)?

For example, org.apache.tomcat.util.buf.WriteConvertor (in C2BConverter.java) which seems to be used for ServletResponse.getWriter() calls the constructor of OutputStreamWriter with charsetName as String, and in the Sun Implementation this calls Charset.forName() / Charset.lookup().
Comment 8 Konstantin Kolinko 2011-06-22 00:18:40 UTC
Maybe we could cache not only found charsets, but charset misses as well.
Comment 9 Konstantin Preißer 2011-06-23 14:08:49 UTC
Hi,

would caching charset misses be a good idea, if the Encoding strings can also be received from external sources?

For example, if a client makes a POST request to a Servlet and sends this header: 

Content-Type: application/x-www-form-urlencoded; charset=this-is-a-non-existing-charset

and a Servlet makes a call to HttpServletRequest.getParameter(...), then o.a.tomcat.util.buf.B2CConverter.getCharset(String) will be called with a value of "this-is-a-non-existing-charset". If a client would make tons of requests with random, invalid charset strings and these misses would be added to a List, couldn't it lead to a memory leak? (if they would never be deleted)

However, there is static method Charset.availableCharsets() which returns a SortedMap<String, Charset> of all charsets available by the current JVM. Maybe this list could be used to build a Map of all available charsets (the aliases returned by Charset.aliases() would also have to be added)? Then missing charsets could also be found fast.

However, I think, in B2CConverter.getCharset() the encoding string should be converted to lower-case/upper-case before a lookup in the Map, to avoid multiple entries ("uTF-8", "UtF-8" etc.).
Comment 10 Christopher Schultz 2011-06-23 14:47:42 UTC
> would caching charset misses be a good idea, if the Encoding strings can also
> be received from external sources?

+1 to Konstantin Preißer's DOS concerns.

> However, there is static method Charset.availableCharsets() which returns a
> SortedMap<String, Charset> of all charsets available by the current JVM. Maybe
> this list could be used to build a Map of all available charsets (the aliases
> returned by Charset.aliases() would also have to be added)? Then missing
> charsets could also be found fast.

If you read some of the online posts linked from this BZ issue, you'll see claims that pre-populating such a cache does not have a noticeable impact on performance. Honestly, I'm okay not pre-populating things because there are probably a dozen encodings that get any significant amount of real use on the web, while Charset.availableCharsets returns 163 different obscure character sets.

I suppose it's a fairly small set of encodings, but with little benefit, there's no reason IMO to pre-populate.

> However, I think, in B2CConverter.getCharset() the encoding string should be
> converted to lower-case/upper-case before a lookup in the Map, to avoid
> multiple entries ("uTF-8", "UtF-8" etc.).

Actually, I might leave the case in-tact for performance considerations. Yes, it's true that utf-8, UTF-8, uTf-8, UTf-8, UtF-8, etc. would all be the same, I suspect that only "utf-8" and "UTF-8" will be used in the wild with any reasonable frequency. Normalizing case for every lookup is probably a waste of time, unless there are significant concerns of DOS using long, non-normalized permutations of valid encodings (longest is x-MacCentralEurope with 17 characters to play with). 17 characters is a lot of permutations (~2MiB), though.
Comment 11 Konstantin Preißer 2011-06-23 15:25:01 UTC
Hi Christopher,

(In reply to comment #10)
> If you read some of the online posts linked from this BZ issue, you'll see
> claims that pre-populating such a cache does not have a noticeable impact on
> performance. Honestly, I'm okay not pre-populating things because there are
> probably a dozen encodings that get any significant amount of real use on the
> web, while Charset.availableCharsets returns 163 different obscure character
> sets.
> 
> I suppose it's a fairly small set of encodings, but with little benefit,
> there's no reason IMO to pre-populate.
You're right; however if I read the reports correctly, this is true if charsets with valid names only are used for the lookup. But everytime when there is a loopkup for a non-existing Charset, the JVM-synchronized Charset.lookup() is called. Probably to speed this up, Konstantin Kolinko suggested to cache charset missings.

If a list with all avaliable charsets would be pre-populated, including their aliases, missing charsets could also be determined fast. 

> Actually, I might leave the case in-tact for performance considerations. Yes,
> it's true that utf-8, UTF-8, uTf-8, UTf-8, UtF-8, etc. would all be the same, I
> suspect that only "utf-8" and "UTF-8" will be used in the wild with any
> reasonable frequency. Normalizing case for every lookup is probably a waste of
> time, unless there are significant concerns of DOS using long, non-normalized
> permutations of valid encodings (longest is x-MacCentralEurope with 17
> characters to play with). 17 characters is a lot of permutations (~2MiB),
> though.
Well, on my Windows machine the longest alias (not canonical name) of a charset is "Extended_UNIX_Code_Packed_Format_for_Japanese" which consists of 39 muatble characters. The current (trunk) implementation in o.a.tomcat.util.buf.B2CConverter.getCharset() does not normalize the name, so a Client could send requests with 2^39 permutations in a Content-Type header (which would make 49 TiB of Charset strings) ;-)
Comment 12 Christopher Schultz 2011-06-23 20:02:11 UTC
> > I suppose it's a fairly small set of encodings, but with little benefit,
> > there's no reason IMO to pre-populate.
>
> You're right; however if I read the reports correctly, this is true if charsets
> with valid names only are used for the lookup. But everytime when there is a
> loopkup for a non-existing Charset, the JVM-synchronized Charset.lookup() is
> called. Probably to speed this up, Konstantin Kolinko suggested to cache
> charset missings.

Duh. I hadn't thought of spurious lookups causing their own synchronization disasters.

Perhaps the invalid-charset cache could be limited in some way: MRU caches are easy to build with the standard Java library.

> If a list with all avaliable charsets would be pre-populated, including their
> aliases, missing charsets could also be determined fast. 

True: if the encoding is not supported by the JVM, then it's invalid no matter what. In that case, case normalization is probably a good thing to do: if it's not in the case (after normalization), then it's not valid... no reason to ever call Charset.lookup() after startup.

> Well, on my Windows machine the longest alias (not canonical name) of a charset
> is "Extended_UNIX_Code_Packed_Format_for_Japanese" which consists of 39 mutable
> characters.

Wow.

> The current (trunk) implementation in
> o.a.tomcat.util.buf.B2CConverter.getCharset() does not normalize the name, so a
> Client could send requests with 2^39 permutations in a Content-Type header
> (which would make 49 TiB of Charset strings) ;-)

My math might be wrong, too, but I believe that's only 512GiB if names are 1-byte-per-char, but I think Java does 2-bytes-per-char, so it's 1TiB.

You're right, though: that's pretty huge.

+1 to case normalization.
+1 to LUT pre-population.
-1 to LUT miss caching: it's totally unnecessary given the above.
Comment 13 Konstantin Preißer 2011-06-23 20:24:24 UTC
Created attachment 27202 [details]
Patch to prepopulate available charsets

I made a example of a patch which uses a HashMap to prepopulate all available Charsets in the current JVM when the class is initialized (I'm sorry if I did something wrong, I didn't made a patch before).

The patch adds all available charsets and their aliases (by converting them to lower-case) to the Map. The case normalization solves the DoS danger when a Client makes requests with lot of different case permutations of the same charset name, and it allows detection of missing charsets without calling Charset.forName().
Comment 14 Christopher Schultz 2011-06-23 20:44:01 UTC
Thanks for the patch -- it looks just fine.

If you wanted to, you could even remove the explicit use of SortedMap and just use "Charset.availableCharsets().entrySet()" as your Iterable.
Comment 15 Mark Thomas 2011-06-27 13:18:35 UTC
The patch doesn't correctly address the DOS concerns, neither does it cache misses.
Comment 16 Mark Thomas 2011-06-27 13:51:46 UTC
I've updated the 7.0.x code to:
- address the DOS concerns
- pre-populate the cache
- ensure cache misses are efficient

I'll create a patch for 6.0.x shortly.
Comment 17 Mark Thomas 2011-06-27 14:07:00 UTC
Created attachment 27211 [details]
Updated patch

Updated patch that works with Java 5, pre-populates the cache and handles the DOS concerns. This is still focussed on the areas reported by the OP where issues were observed.
Comment 18 Konstantin Preißer 2011-06-27 14:40:57 UTC
Hi Mark,

does the patch not correctly address DoS and charset misses because I didn't use Locale.US? (Nothing would be added to the Map after initialization, so I can't see a DoS danger there or how it would not cache misses (besides some "false-positives" with incorrect charset strings in other Locales)).

However, I see that the patch applied to trunk doesn't add the charset aliases to the Map, which means Charset strings like "utf8" don't work anymore, but they worked before that change, because Charset.forName or new String(byte[], String charsetName) allow charset aliases to be used as encoding names. That is why I also added the aliases to the Map.
Comment 19 Mark Thomas 2011-06-27 14:46:55 UTC
(In reply to comment #18)
> Hi Mark,
> 
> does the patch not correctly address DoS and charset misses because I didn't
> use Locale.US?

My bad. I mis-read the patch (too  much time working with Eclipse where the old and new versions are the other way around). Your patch is fine.

> However, I see that the patch applied to trunk doesn't add the charset aliases
> to the Map, which means Charset strings like "utf8" don't work anymore, but
> they worked before that change, because Charset.forName or new String(byte[],
> String charsetName) allow charset aliases to be used as encoding names. That is
> why I also added the aliases to the Map.

Yep - I missed that too. I'll add that shortly.
Comment 20 Mark Thomas 2011-06-27 15:21:10 UTC
Created attachment 27212 [details]
Patch for tc6.0.x

Third attempt
Comment 21 Konstantin Preißer 2011-06-27 16:06:52 UTC
Created attachment 27213 [details]
Small patch to update C2BConverter

Hi Mark,

(In reply to comment #19)
> My bad. I mis-read the patch (too  much time working with Eclipse where the old
> and new versions are the other way around). Your patch is fine.

> Yep - I missed that too. I'll add that shortly.

Thanks!

Well, I don't mean to be annoying or something, but.. It seems that you forgot to convert the aliases to lower-case before adding them to the Map, which also would cause "utf8" not to work. ;-)

May I suggest to also update o.a.tomcat.util.buf.C2BConverter to use a Charset (because it seems this is used when calling HttpServletResponse.getWriter())? Please see attached (small) patch.
Comment 22 Mark Thomas 2011-06-27 16:10:54 UTC
Created attachment 27214 [details]
Patch for tc6.0.x

This just isn't my day ;)

7.0.x is patched.

Updated 6.0.x patch attached.
Comment 23 Mark Thomas 2011-06-28 07:37:05 UTC
Created attachment 27219 [details]
Patch for tc6.0.x

Updated patch based on review comments
Comment 24 Konstantin Preißer 2011-06-28 12:45:26 UTC
Hi Mark,
did you see my point about C2BConverter? I can see that you marked that patch as obsolete (it was for C2BConverter, not B2CConverter), but in trunk, C2BConverter still uses the Encoding String instead of a Charset to create a new IntermediateOutputStream (but I forgot to remove "throws UnsupportedEncodingException" in the patch).
Comment 25 Mark Thomas 2011-06-28 16:07:11 UTC
C2BConvertor patched with a slightly modified patch for 7.0.x. I don't propose back-porting this to 6.0.x
Comment 26 Konstantin Preißer 2011-06-28 18:48:32 UTC
Thanks!
Comment 27 Mark Thomas 2011-06-28 23:26:28 UTC
Fixed in 6.0.x and will be included in 6.0.33 onwards.
Comment 28 Ismael Juma 2011-07-19 15:56:24 UTC
Note that there have been changes in this area in Java 7:

"Using Charset as the parameter does not bring you any performance benefit (in fact it's slower and bloat) in most use scenarios, use it with caution."
http://blogs.oracle.com/xuemingshen/entry/faster_new_string_bytes_cs

The blog above talks about single-byte encodings, but there has been a follow-up commit for UTF-8 earlier this year.
Comment 29 piky 2014-01-07 06:05:02 UTC
Comment on attachment 27219 [details]
Patch for tc6.0.x

Help to