Tika / TIKA-4298

Failed to detect charset for zip entry with short non-Unicode file name

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.9.2
    • Fix Version/s: 3.0.0, 2.9.3
    • Component/s: detector
    • Labels: None

    Description

      The Japanese file names extracted from the zip file testZipEntryNameCharsetShiftSJIS.zip are garbled. The entry names are short and encoded in Shift_JIS, but the detect() method in the PackageParser class fails to detect that charset, so the names are decoded incorrectly.
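The garbling itself can be reproduced without Tika. The following is a minimal sketch, not the code in PackageParser: it encodes one of the file names from this report as Shift_JIS and then decodes the same bytes with the right and the wrong charset. The class name EntryNameMojibakeSketch and the helper decode() are hypothetical, and the sketch assumes the JDK in use ships the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EntryNameMojibakeSketch {

    // Hypothetical helper: turn the raw bytes of a zip entry name into a String
    // using whatever charset the caller believes the name was written in.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is available in this JDK (it is an extended charset).
        Charset shiftJis = Charset.forName("Shift_JIS");

        // "\u6587\u7AE0" is the two kanji of the file name from the report, written
        // as Unicode escapes so the source file encoding does not matter.
        String original = "\u6587\u7AE01.txt";

        // In Shift_JIS each kanji takes two bytes, so only four bytes of this
        // name are non-ASCII. That is very little evidence for a detector.
        byte[] rawName = original.getBytes(shiftJis);
        System.out.println("raw name length in bytes: " + rawName.length);

        // Decoding with the charset the name was written in restores it.
        System.out.println("as Shift_JIS:  " + decode(rawName, shiftJis));

        // Decoding the same bytes with a wrong charset silently yields garbage.
        System.out.println("as ISO-8859-1: " + decode(rawName, StandardCharsets.ISO_8859_1));
        System.out.println("as UTF-8:      " + decode(rawName, StandardCharsets.UTF_8));
    }
}
```

None of the wrong decodings raises an error, which is why the problem only shows up as unreadable names in the parser output.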

      $ ls -1 testZipEntryNameCharsetShiftSJIS
      shiba.png
      文章1.txt
      文章2.txt
      
      $ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
      
      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
      <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pkg.PackageParser"/>
      <meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
      <meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
      <meta name="Content-Length" content="28885"/>
      <meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
      <meta name="Content-Type" content="application/zip"/>
      <title/>
      </head>
      <body><div class="embedded" id="shiba.png"/>
      <div class="package-entry"><h1>shiba.png</h1>
      </div>
      <div class="embedded" id="���1.txt"/>
      <div class="package-entry"><h1>���1.txt</h1>
      <p>あいうえお&#13;
      かきくけこ&#13;
      </p></div>
      <div class="embedded" id="���2.txt"/>
      <div class="package-entry"><h1>���2.txt</h1>
      <p>さしすせそ&#13;
      たちつてと&#13;
      </p></div>
      </body></html>
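When the charset of the entry names is already known, it can be passed explicitly instead of being detected. The sketch below shows this with the JDK's own java.util.zip classes only. It is an illustration under stated assumptions, not the change made in TIKA-4298.patch: the class ZipNameCharsetSketch and its helpers buildZip() and listNames() are hypothetical, and the archive is built in memory rather than read from the attached file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameCharsetSketch {

    // Hypothetical helper: build a small in-memory zip whose entry names are
    // written in the given charset (stand-in for the attached test archive).
    static byte[] buildZip(Charset nameCharset, String... names) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("dummy".getBytes(StandardCharsets.US_ASCII));
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    // Hypothetical helper: list entry names, decoding them with the given charset.
    static List<String> listNames(byte[] zipBytes, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Shift_JIS is available in this JDK.
        Charset shiftJis = Charset.forName("Shift_JIS");

        // Same names as in the report; the kanji are given as Unicode escapes.
        byte[] zipBytes = buildZip(shiftJis,
                "shiba.png", "\u6587\u7AE01.txt", "\u6587\u7AE02.txt");

        // Correct charset: the Japanese names come back intact.
        System.out.println(listNames(zipBytes, shiftJis));

        // Wrong charset: the ASCII name survives, the Japanese names are garbled.
        System.out.println(listNames(zipBytes, StandardCharsets.ISO_8859_1));
    }
}
```

The ASCII-only name shiba.png reads correctly under either charset, which matches the report: only the entries with Japanese names are affected.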

      Attachments

        1. TIKA-4298.patch
          4 kB
          Mingchun Zhao
        2. testZipEntryNameCharsetShiftSJIS.zip
          28 kB
          Mingchun Zhao

          People

            Assignee: Tilman Hausherr
            Reporter: Mingchun Zhao
            Votes: 0
            Watchers: 5
