TIKA-3374: Non-Unicode archive entry name is garbled

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.26
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      PackageParser retrieves archive entry names through the commons-compress ArchiveEntry#getName method and performs no automatic charset detection on them.
      Although one can set the encoding by passing an ArchiveStreamFactory(charset) through the parse context,
      this is not practical, since any number of charsets may be used in archive files.

      Instead of calling entry.getName() directly in PackageParser#parseEntry(),
      it is recommended to use entry.getRawName() and apply charset detection, which reduces the chance of producing a garbled string.
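
A minimal sketch of the idea, using only the JDK: decode the raw name bytes as strict UTF-8 and, if that fails, fall back to a caller-supplied charset. The class EntryNameDecoder and the method decodeEntryName are hypothetical names and are not part of Tika or commons-compress. The "strict UTF-8, then fallback" heuristic is only a stand-in for real charset detection, and the raw bytes are assumed to come from something like entry.getRawName().

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or commons-compress.
class EntryNameDecoder {

    // Decodes raw entry-name bytes. Strict UTF-8 is tried first; if the
    // bytes are not valid UTF-8, the caller-supplied fallback charset is used.
    // This is a simple heuristic, not a full charset detector.
    static String decodeEntryName(byte[] rawName, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawName))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the fallback charset (an assumption).
            return new String(rawName, fallback);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Unicode escapes for a short Chinese name ending in "_V4.0.doc",
        // encoded as GBK to simulate a non-Unicode zip entry name.
        byte[] raw = "\u9700\u6c42\u6587\u6863_V4.0.doc".getBytes(gbk);
        System.out.println(decodeEntryName(raw, gbk));
    }
}
```

In a real fix the fallback charset would presumably come from a detector rather than being hard-coded, since, as noted above, the charset cannot be known in advance.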

       

      The attachment is an example of a zip file that uses a non-Unicode archive entry name.

      The filename in the zip file should be 集团邮件审计系统2021年自动巡检需求文档_V4.0.doc

      but it is garbled in Tika 1.26 because PackageParser treats it as Unicode.

       

      Attachments

        1. gbk.zip
          67 kB
          Ryan Liu

          People

            Assignee: Unassigned
            Reporter: Ryan Liu (ryan421)