[TIKA-471] Avoid Charset name bottleneck when multiple threads are using HtmlParser - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2
Component/s: parser
Labels: None

Description

As reported by a user on the Nutch list, if there are lots of threads all parsing HTML documents, there's a lock contention issue caused by a JVM-wide lock used when resolving charset names:

Apparently this is a known issue with Java, and a couple of articles have
been written about it:
http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html

There is also a note in the Java bug database about scaling issues with the
class...
Please also note that the current implementation of
sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is
called very often (e.g. by new String(byte[] data,String encoding)). This
JVM-wide lock means that Java applications do not scale beyond 4 CPU cores.

In the case of my stack at this particular point in time, the
BLOCKED calls to charsetForName were generated by:

at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
...

We now have a CharsetUtils class in Tika, and we could add a cache for validated names in the isSupported() method.
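
The caching idea above could look roughly like the following. This is only a sketch under stated assumptions, not the actual Tika change: the class name `CharsetNameCache` and its layout are hypothetical, and it simply wraps the standard `java.nio.charset.Charset.isSupported()` call so that each distinct name reaches the JVM's charset lookup at most a few times instead of once per parsed document.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper for illustration; not Tika's real CharsetUtils.
final class CharsetNameCache {

    // Normalized charset name -> whether this JVM supports it.
    // ConcurrentHashMap reads do not take a global lock.
    private static final ConcurrentMap<String, Boolean> SUPPORTED =
            new ConcurrentHashMap<>();

    private CharsetNameCache() {
    }

    static boolean isSupported(String name) {
        if (name == null) {
            return false;
        }
        // Charset names are case-insensitive, so normalize the cache key.
        String key = name.trim().toLowerCase(Locale.ROOT);
        if (key.isEmpty()) {
            return false;
        }
        Boolean cached = SUPPORTED.get(key);
        if (cached != null) {
            return cached;
        }
        boolean supported;
        try {
            // Only reached on a cache miss.
            supported = Charset.isSupported(key);
        } catch (IllegalCharsetNameException e) {
            // Syntactically invalid names are treated as unsupported.
            supported = false;
        }
        SUPPORTED.putIfAbsent(key, supported);
        return supported;
    }
}
```

Because charset names in HTML come from untrusted input, an unbounded map like this one can grow without limit; a real implementation would probably want to cap the cache size or cache only names that turned out to be supported.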

People

Assignee: Jukka Zitting

Reporter: Kenneth William Krugler

Votes: 0

Watchers: 1

Dates

Created: 24/Jul/10 01:36

Updated: 08/Jul/12 12:05

Resolved: 08/Jul/12 12:05