Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1011

Exception (Null charset name) processing .mhtml file

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels:
      None

      Description

      This small test.mhtml file:

      From: <Saved by Windows Internet Explorer 8>
      Subject: Index Pages
      Date: Tue, 28 Aug 2012 09:53:28 +0300
      MIME-Version: 1.0
      Content-Type: multipart/related;
      	type="multipart/alternative";
      	boundary="----=_NextPart_000_0000_01CD8502.F991E790"
      X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157
      
      This is a multi-part message in MIME format.
      
      ------=_NextPart_000_0000_01CD8502.F991E790
      Content-Type: multipart/alternative;
      	boundary="----=_NextPart_001_0023_01CD8502.F99DCE70"
      
      
      ------=_NextPart_001_0023_01CD8502.F99DCE70
      Content-Type: text/html;
      	charset="x-user-defined"
      Content-Transfer-Encoding: quoted-printable
      

      Hits this exception when run through TikaCLI:

      <?xml version="1.0" encoding="UTF-8"?>Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.html.HtmlParser@37e67d34
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      	at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
      	at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
      	at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:138)
      	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:399)
      	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:121)
      Caused by: java.lang.IllegalArgumentException: Null charset name
      	at java.nio.charset.Charset.lookup(Charset.java:467)
      	at java.nio.charset.Charset.forName(Charset.java:540)
      	at org.apache.tika.parser.txt.CharsetDetector.setCanonicalDeclaredEncoding(CharsetDetector.java:352)
      	at org.apache.tika.parser.txt.CharsetDetector.setDeclaredEncoding(CharsetDetector.java:75)
      	at org.apache.tika.parser.txt.Icu4jEncodingDetector.detect(Icu4jEncodingDetector.java:49)
      	at org.apache.tika.detect.AutoDetectReader.detect(AutoDetectReader.java:51)
      	at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:92)
      	at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:98)
      	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:74)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	... 11 more
      

        Attachments

        1. TIKA-1011.patch
          3 kB
          Michael McCandless

          Activity

            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              mikemccand Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: