Tika / TIKA-4204

ChmExtractor unable to decompress file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.9.1, 3.0.0-BETA
    • Fix Version/s: 2.9.2, 3.0.0
    • Component/s: parser
    • Labels: None
    • Environment: The file I am trying to parse is attached. The entry being picked up as the content file is "/CSS/ABBContent.css".

    Description

      ChmExtractor fails with error: "TikaException: can't copy beyond array length" when calling extractChmEntry on any non-empty entry. 

      Upon inspection this turns out to be caused by lzxBlockOffset being incorrectly set.

      This is caused by the method ChmExtractor#getIndexOfContent returning the wrong entry.

      This is because ChmCommons#indexOf(List, String) returns the first entry with a name containing the string "Content". The file I am trying to parse contains a file with the name Content.css, which is the entry returned by #indexOf(...), instead of the actual content entry.
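
      The mismatch can be reproduced in isolation. The sketch below is not Tika's code: indexOfContaining is a hypothetical stand-in for a "first name containing the pattern" lookup, and the entry list is an assumed two-item listing (the CSS path from this report plus the compressed content entry name given in the CHM format description linked in this report).

```java
import java.util.List;

public class LooseMatchDemo {

    // Hypothetical stand-in: returns the index of the first name that
    // merely contains the pattern, or -1 if no name does.
    static int indexOfContaining(List<String> names, String pattern) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(pattern)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed listing order, for illustration only.
        List<String> names = List.of(
                "/CSS/ABBContent.css",
                "::DataSpace/Storage/MSCompressed/Content");

        // The stylesheet matches first, so its index is returned
        // instead of the index of the real content entry.
        System.out.println(indexOfContaining(names, "Content")); // prints 0
    }
}
```

      With the stylesheet selected, any offset derived from that entry (such as lzxBlockOffset) would point at the wrong data.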

      To fix the issue, ChmCommons#indexOf(...) should be more strict in how it detects the content entry.

      According to http://www.russotto.net/chm/chmformat.html, the name of the content entry always starts with "::DataSpace/Storage/", which could be used to restrict the match to the correct entry.
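
      A stricter lookup along these lines might look like the following. It is only a sketch of the idea, not the fix that was committed: the class and method names are hypothetical, and the additional "/Content" suffix check is an assumption layered on top of the prefix rule suggested above.

```java
import java.util.List;

public class StrictContentLookup {

    // Prefix taken from the CHM format description cited in this report.
    static final String STORAGE_PREFIX = "::DataSpace/Storage/";

    // Assumption: the content entry's name also ends with "/Content".
    static final String CONTENT_SUFFIX = "/Content";

    // Hypothetical helper: returns the index of the first name that starts
    // with the storage prefix and ends with the content suffix, or -1.
    static int indexOfContentEntry(List<String> names) {
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            if (name.startsWith(STORAGE_PREFIX) && name.endsWith(CONTENT_SUFFIX)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "/CSS/ABBContent.css",
                "::DataSpace/Storage/MSCompressed/Content");

        // The stylesheet no longer matches; the storage entry is selected.
        System.out.println(indexOfContentEntry(names)); // prints 1
    }
}
```

      Anchoring on the prefix means ordinary user files whose names happen to contain "Content" can no longer shadow the real entry.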

      Attachments

        1. 3HAC050917_TRM_RAPID_RW_6-en.chm
          5.98 MB
          Robert Fromholz

          People

            Assignee: Tim Allison (tallison)
            Reporter: Robert Fromholz (bossymr)
            Votes: 0
            Watchers: 3
