Description
The Japanese file names extracted from the zip file testZipEntryNameCharsetShiftSJIS.zip are garbled. The entry names are encoded in Shift_JIS, but the detect() method in the PackageParser class does not detect that charset correctly.
$ ls -1 testZipEntryNameCharsetShiftSJIS
shiba.png
文章1.txt
文章2.txt
$ java -jar tika-app-2.9.2.jar testZipEntryNameCharsetShiftSJIS.zip
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pkg.PackageParser"/>
<meta name="resourceName" content="testZipEntryNameCharsetShiftSJIS.zip"/>
<meta name="X-TIKA:detectedEncoding" content="ISO-8859-1"/>
<meta name="Content-Length" content="28885"/>
<meta name="X-TIKA:encodingDetector" content="UniversalEncodingDetector"/>
<meta name="Content-Type" content="application/zip"/>
<title/>
</head>
<body><div class="embedded" id="shiba.png"/>
<div class="package-entry"><h1>shiba.png</h1>
</div>
<div class="embedded" id="���1.txt"/>
<div class="package-entry"><h1>���1.txt</h1>
<p>あいうえお
かきくけこ
</p></div>
<div class="embedded" id="���2.txt"/>
<div class="package-entry"><h1>���2.txt</h1>
<p>さしすせそ
たちつてと
</p></div>
</body></html>
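The general failure mode can be reproduced outside Tika. The following is a minimal sketch using only the JDK's java.util.zip classes, not PackageParser or any Tika API; the class name ZipEntryNameCharsetDemo and its helper methods are hypothetical and exist only for illustration. It writes an in-memory zip whose entry names are encoded in Shift_JIS, then reads the names back once with Shift_JIS and once with ISO-8859-1. It assumes the running JDK ships the Shift_JIS charset (the standard full JDK does). The wrong-charset result is garbled, although not necessarily with the same replacement characters shown in the output above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; not part of Tika.
final class ZipEntryNameCharsetDemo {

    // Build an in-memory zip whose entry names are encoded with nameCharset.
    // The entries themselves are left empty; only the names matter here.
    static byte[] buildZip(Charset nameCharset, String... names) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, nameCharset)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    // Read the entry names back, decoding the raw name bytes with nameCharset.
    static List<String> listNames(byte[] zipBytes, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Charset shiftJis = Charset.forName("Shift_JIS");
        // Unicode escapes keep the source file ASCII-only:
        // \u6587\u7AE0 are the two kanji used in the file names of this report.
        byte[] zipBytes = buildZip(shiftJis,
                "shiba.png", "\u6587\u7AE01.txt", "\u6587\u7AE02.txt");

        // Decoding with the charset that was used for writing restores the names.
        System.out.println(listNames(zipBytes, shiftJis));
        // Decoding the same bytes as ISO-8859-1 garbles the non-ASCII names.
        System.out.println(listNames(zipBytes, StandardCharsets.ISO_8859_1));
    }
}
```

The ASCII-only name shiba.png survives either way because its bytes are the same in both charsets, which matches the report: only the Japanese names are affected.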
Attachments