[TIKA-4204] ChmExtractor unable to decompress file - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.9.1, 3.0.0-BETA
Fix Version/s: 2.9.2, 3.0.0
Component/s: parser
Labels: None
Environment:

The file I am trying to parse is attached. The entry wrongly identified as the content entry is "/CSS/ABBContent.css".

Description

ChmExtractor fails with error: "TikaException: can't copy beyond array length" when calling extractChmEntry on any non-empty entry.

Upon inspection this turns out to be caused by lzxBlockOffset being incorrectly set.

This is caused by the method ChmExtractor#getIndexOfContent returning the wrong entry.

This is because ChmCommons#indexOf(List, String) returns the first entry with a name containing the string "Content". The file I am trying to parse contains a file with the name Content.css, which is the entry returned by #indexOf(...), instead of the actual content entry.

To fix the issue, ChmCommons#indexOf(...) should be more strict in how it detects the content entry.

According to http://www.russotto.net/chm/chmformat.html, the name of the content entry always starts with "::DataSpace/Storage/", so that prefix could be used to restrict the search to the correct entry.
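
To illustrate the difference, here is a minimal, self-contained sketch. It is not the Tika code: the class ContentEntryLookup and the methods looseIndexOf and strictIndexOf are hypothetical, and entries are modelled as plain name strings rather than Tika's directory listing entries. The entry names other than "/CSS/ABBContent.css" are assumed from the CHM format description linked above.

```java
import java.util.List;

// Hypothetical illustration only; names and structure are assumptions,
// not the actual ChmCommons implementation.
public class ContentEntryLookup {

    // Prefix that, per the CHM format description, the content entry name starts with.
    static final String STORAGE_PREFIX = "::DataSpace/Storage/";

    // Loose lookup: returns the index of the first name that merely contains the needle.
    static int looseIndexOf(List<String> names, String needle) {
        for (int i = 0; i < names.size(); i++) {
            if (names.get(i).contains(needle)) {
                return i;
            }
        }
        return -1;
    }

    // Stricter lookup: the name must also start with the storage prefix.
    static int strictIndexOf(List<String> names, String needle) {
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            if (name.startsWith(STORAGE_PREFIX) && name.contains(needle)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "/CSS/ABBContent.css",
                "::DataSpace/Storage/MSCompressed/ControlData",
                "::DataSpace/Storage/MSCompressed/Content");

        // The loose lookup stops at the stylesheet, the stricter one skips it.
        System.out.println("loose:  " + looseIndexOf(names, "Content"));
        System.out.println("strict: " + strictIndexOf(names, "Content"));
    }
}
```

With this sample list, the loose lookup picks the stylesheet at index 0, while the stricter lookup picks the storage entry at index 2. Whether a prefix check alone is sufficient for every CHM file is an open question; the actual fix may differ.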

Attachments

3HAC050917_TRM_RAPID_RW_6-en.chm, 27/Feb/24 19:03, 5.98 MB, Robert Fromholz

People

Assignee: Tim Allison

Reporter: Robert Fromholz

Votes: 0

Watchers: 3

Dates

Created: 27/Feb/24 19:02

Updated: 28/Feb/24 17:02

Resolved: 28/Feb/24 16:47