  1. Tika
  2. TIKA-471

Avoid Charset name bottleneck when multiple threads are using HtmlParser

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:
      None

      Description

      As reported by a user on the Nutch list, if there are lots of threads all parsing HTML documents, there's a lock contention issue caused by a JVM-wide lock used when resolving charset names:

      Apparently this is a known issue with Java, and a couple articles are
      written about it:
      http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
      http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html

      There is also a note in java bug database about scaling issues with the
      class...
      Please also note that the current implementation of
      sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is
      called very often (e.g. by new String(byte[] data,String encoding)). This
      JVM-wide lock means that Java applications do not scale beyond 4 CPU cores.

      I noted in the case of my stack at this particular point in time. The
      BLOCKED calls to charsetForName were generated by:

      at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
      at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
      at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19
      at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86) 238
      ...

      We now have a CharsetUtils class in Tika, and we could add a cache for validated names in the isSupported() method.
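
A minimal sketch of what such a cache could look like, assuming a standalone helper rather than Tika's actual CharsetUtils code. The class name CharsetNameCache, the MAX_ENTRIES cap, and the lower-case key normalization are all hypothetical choices made for illustration. The idea is that each distinct name goes through the JDK lookup roughly once, and later checks are served from a ConcurrentHashMap without touching the shared lock.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not the real org.apache.tika.utils.CharsetUtils.
final class CharsetNameCache {

    // Assumed cap so that arbitrary names found in crawled HTML
    // cannot grow the cache without bound.
    private static final int MAX_ENTRIES = 1024;

    // Normalized charset name -> result of the JDK support check.
    private static final ConcurrentMap<String, Boolean> SUPPORTED =
            new ConcurrentHashMap<>();

    private CharsetNameCache() {
    }

    static boolean isSupported(String name) {
        if (name == null) {
            return false;
        }
        // Charset names are case-insensitive, so normalize the cache key.
        String key = name.trim().toLowerCase(Locale.ROOT);

        Boolean cached = SUPPORTED.get(key);
        if (cached != null) {
            return cached; // fast path: no JDK charset lookup
        }

        boolean supported;
        try {
            // Slow path: this is the call that may contend on the JDK lock.
            supported = Charset.isSupported(key);
        } catch (IllegalArgumentException e) {
            // Includes IllegalCharsetNameException for syntactically bad names.
            supported = false;
        }

        // Size check is approximate under concurrency, which is fine for a cap.
        if (SUPPORTED.size() < MAX_ENTRIES) {
            SUPPORTED.putIfAbsent(key, supported);
        }
        return supported;
    }
}
```

Negative results are cached too, since bogus charset names in HTML meta tags tend to repeat across documents. Two threads may still race on the first lookup of a name, which costs one extra JDK call but stays correct.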

        Activity

        chrismattmann Chris A. Mattmann added a comment -
        • classify
        jukkaz Jukka Zitting added a comment -

        As a followup to TIKA-322 I did some fairly significant refactoring of the charset handling code. The outcome massively reduces the number of Charset.forName() calls we make.


          People

          • Assignee:
            jukkaz Jukka Zitting
            Reporter:
            kkrugler Ken Krugler