  1. Tika
  2. TIKA-936

encoding of ZipArchiveInputStream

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.8
    • Component/s: parser
    • Labels:
      None

      Description

      When Tika extracts a zip file that was created on a Japanese Windows system, the file names extracted from the zip are garbled.

      ZipArchiveInputStream has three constructors. With the change below, which passes the encoding explicitly (SJIS), the file names were no longer garbled.

      PackageExtractor
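      public void parse(InputStream stream)
       :
       // Before: no encoding argument, so entry names are decoded with the constructor's default.
       // unpack(new ZipArchiveInputStream(stream), xhtml);
       // After: decode entry names as SJIS; the third argument enables Unicode extra fields.
       unpack(new ZipArchiveInputStream(stream, "SJIS", true), xhtml);
       :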
      

      The first constructor, which takes only the stream, uses UTF-8 as its default encoding. In my case my computer's encoding is also UTF-8, but the file names in the zip are encoded in SJIS, so they were garbled. File names will be garbled whenever the encoding this constructor assumes differs from the encoding used in the zip file.
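      To illustrate the mismatch described above, here is a minimal, self-contained sketch. It is an assumption-laden stand-in, not Tika's code: it uses the JDK's java.util.zip classes rather than Commons Compress's ZipArchiveInputStream so that it runs without extra dependencies, and the class name, method names, and sample entry name are all hypothetical. It writes a zip whose entry name is stored as Shift_JIS bytes, reads the name back with the matching charset, and shows that the same bytes decoded as UTF-8 do not give back the original name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; not part of Tika or Commons Compress.
final class ZipNameEncodingSketch {

    // Assumes the runtime ships the Shift_JIS charset (standard JDK builds do).
    static final Charset SJIS = Charset.forName("Shift_JIS");

    // A Japanese sample file name, written with Unicode escapes so the
    // source compiles the same way regardless of the source file encoding.
    static final String NAME = "\u65e5\u672c\u8a9e\u30e1\u30e2.txt";

    // Builds an in-memory zip with one entry whose name is encoded with cs.
    static byte[] zipWithName(String entryName, Charset cs) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf, cs)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write("memo".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return buf.toByteArray();
    }

    // Reads the first entry name, decoding the stored name bytes with cs.
    static String firstEntryName(byte[] zip, Charset cs) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    // Simulates the garbling: Shift_JIS name bytes decoded as if they were UTF-8.
    static String misdecode(String name) {
        return new String(name.getBytes(SJIS), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = zipWithName(NAME, SJIS);
        System.out.println("matching charset: " + firstEntryName(zip, SJIS));
        System.out.println("wrong charset:    " + misdecode(NAME));
    }
}
```

      The point of the sketch is only that the reader must know which charset the writer used for entry names; how that charset should reach the parser is the open question in this issue.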

      I would like Tika to accept an encoding parameter per file when parsing a zip. Where should the encoding be supplied: in Metadata or in ParseContext? Please support this. I am using Tika via Solr (SolrCell), so when posting a zip file to Solr I want to add an encoding parameter to the request.

        Attachments

        1. x-日本語メモ.zip
          0.2 kB
          Shinichiro Abe

              People

              • Assignee:
                chrismattmann Chris A. Mattmann
                Reporter:
                shinichiro abe Shinichiro Abe
              • Votes:
                0
                Watchers:
                6

                Dates

                • Created:
                  Updated:
                  Resolved: