
Incorrect EOF exception in WordPerfect parser

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels:
      None

      Description

      We have a few EOF exceptions in WordPerfect files that are likely not truncated. The example I'll attach shortly opens without complaint in LibreOffice.

      1. 462321.wp
        46 kB
        Tim Allison
      2. reports.zip
        15 kB
        Tim Allison

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Triggering file. I think something is going wrong with skipUntilChar. I think we're accidentally skipping too far and then a multibyte function is incorrectly perceived, leading to the false expectation that the parser should skip 60430 bytes.

        tallison@mitre.org Tim Allison added a comment - edited
        00002FF0              0A 0C D0 08 0A 00 00 C8 00 06 00 00      ..Ð....È....
        00003000  0A 00 08 D0 D3 04 0A 00 01 00 01 00 00 00 0A 00  ...ÐÓ...........
        00003010  04 D3 C3 0C C3 0A C1 E0 C1 10 EC 13 23 00 C1 20  .ÓÃ.Ã.ÁàÁ.ì.#.Á 
        00003020  C3 02 C3                                         Ã.Ã
        

        followed by the text "1. INTRODUCTION"

        It looks like C1 E0 C1 is a complete C1 skip, then EC is interpreted as the start of a variable length multi-byte function of length 23; but from the text which appears in LibreOffice, EC should not be interpreted as the start of a variable length function.

        I wonder, Pascal Essiembre: if C1...C1...C1 were a valid skip pattern, then EC would be enclosed in the skipped content, and we could resume with C3 02 C3 and then the text.

        pascal.essiembre Pascal Essiembre added a comment -

        Found the cause. My assumption that the opening and closing bytes could not also appear between these delimiters was wrong. Since we are talking about fixed-length functions here, the number of bytes for each is known, and we should rely on that knowledge instead. I am working on a fix.
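
To illustrate the difference, here is a minimal Java sketch; it is not the actual Tika patch. It replays the bytes from offset 0x3012 of the hex dump above through two strategies: scanning forward to the next occurrence of the opening byte, and skipping a known number of bytes. The class and method names are hypothetical. The function lengths (9 bytes for C1, 3 bytes for C3, counting the opening and closing bytes) are assumptions chosen to be consistent with the dump, so check them against the WordPerfect file format documentation before relying on them.

```java
class FixedLengthSkipSketch {

    // Bytes from offset 0x3012 of the hex dump in this issue.
    static final int[] DUMP = {
        0xC3, 0x0C, 0xC3, 0x0A,
        0xC1, 0xE0, 0xC1, 0x10, 0xEC, 0x13, 0x23, 0x00, 0xC1,
        0x20, 0xC3, 0x02, 0xC3
    };

    /**
     * ASSUMPTION: total length in bytes of a fixed-length function,
     * including its opening and closing bytes. Illustrative values only.
     */
    static int fixedLength(int functionCode) {
        switch (functionCode) {
            case 0xC1: return 9;
            case 0xC3: return 3;
            default:
                throw new IllegalArgumentException(
                        "No length known for 0x" + Integer.toHexString(functionCode));
        }
    }

    /** Scans for the next byte equal to the opening byte; returns the index just past it. */
    static int skipUntilClosingByte(int[] data, int start) {
        int code = data[start];
        int i = start + 1;
        while (i < data.length && data[i] != code) {
            i++;
        }
        return i + 1;
    }

    /** Skips the whole function using its known length; returns the index just past it. */
    static int skipByKnownLength(int[] data, int start) {
        return start + fixedLength(data[start]);
    }

    public static void main(String[] args) {
        int c1 = 4; // index of the first C1 in DUMP
        int afterScan = skipUntilClosingByte(DUMP, c1);
        int afterLength = skipByKnownLength(DUMP, c1);
        System.out.printf("scan resumes at index %d (0x%02X)%n", afterScan, DUMP[afterScan]);
        System.out.printf("known length resumes at index %d (0x%02X)%n", afterLength, DUMP[afterLength]);
    }
}
```

With the scan, parsing resumes at 0x10 and then reaches EC, which is read as the start of a variable-length function. With the known length, parsing resumes at 0x20, past the embedded C1 and EC bytes.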

        tallison@mitre.org Tim Allison added a comment -

        Whoa. Awesome. Thank you!

        tallison@mitre.org Tim Allison added a comment -

        For the record, I realize you didn't sign the ongoing support agreement. Thank you!!!

        pascal.essiembre Pascal Essiembre added a comment -

        Must have got lost in the mail! I just made a pull request: https://github.com/apache/tika/pull/176

        githubbot ASF GitHub Bot added a comment -

        tballison closed pull request #176: fix for TIKA-2352 contributed by pascal.essiembre
        URL: https://github.com/apache/tika/pull/176

        ----------------------------------------------------------------
        This is an automated message from the Apache Git Service.
        To respond to the message, please log on GitHub and use the
        URL above to go to the specific comment.

        For queries about this service, please contact Infrastructure at:
        users@infra.apache.org

        githubbot ASF GitHub Bot added a comment -

        tballison commented on issue #176: fix for TIKA-2352 contributed by pascal.essiembre
        URL: https://github.com/apache/tika/pull/176#issuecomment-299023202

        Wow, that was fast. Thank you!!!


        pascal.essiembre Pascal Essiembre added a comment -

        No problem. I'd be curious to know how many problematic WP files remain in your corpus after this fix.

        tallison@mitre.org Tim Allison added a comment -

        Will let you know. There's one other fix that I have in mind before I rerun against WordPerfect. Thank you, again!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1255 (See https://builds.apache.org/job/Tika-trunk/1255/)
        Fix for TIKA-2352 contributed by pascal.essiembre (pascal.essiembre: https://github.com/apache/tika/commit/e7b0cadfeaf76bf6bd96ad35b3510633368a1a07)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP5DocumentAreaExtractor.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/wordperfect/WP6DocumentAreaExtractor.java

        TIKA-2352 – via Pascal Essiembre. This closes #176 (tallison: https://github.com/apache/tika/commit/19348811a9ff9e89cb309334cfba010a62b72600)

        • (edit) CHANGES.txt
        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build tika-2.x-windows #204 (See https://builds.apache.org/job/tika-2.x-windows/204/)
        TIKA-2352 – bug fix for WordPerfect parser via Pascal Essiembre. Pull (tallison: rev babb2534e163b182b3c55f5e02188302b5c4d07e)

        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6DocumentAreaExtractor.java
        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build tika-2.x #251 (See https://builds.apache.org/job/tika-2.x/251/)
        TIKA-2352 – bug fix for WordPerfect parser via Pascal Essiembre. Pull (tallison: rev babb2534e163b182b3c55f5e02188302b5c4d07e)

        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP6DocumentAreaExtractor.java
        tallison@mitre.org Tim Allison added a comment -

        Yes, that fixed several problems, with no new exceptions. I'm attaching the relevant reports. It looks like there may be some rare(ish) EOF exceptions in WordPerfect 5.1 files, and there may be some areas for improvement in application/x-quattro-pro; version=9.

        We should ignore EOF exceptions on files from Common Crawl that are near 1 MB, which typically means they were truncated and legitimately hit EOF (e.g. the one exception for application/vnd.wordperfect; version=6.x).

        Thank you, again!
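
The rule of thumb about truncated Common Crawl files could be sketched in Java as a simple size check. This is only an illustration: the class and method names are hypothetical, the limit of 1 MiB (1,048,576 bytes) is an assumption about what "near 1MB" means, and the tolerance is an arbitrary value picked for the example.

```java
class TruncationHeuristicSketch {

    // ASSUMPTION: the crawl truncates files at 1 MiB.
    static final long CRAWL_LIMIT_BYTES = 1L << 20;

    // Hypothetical tolerance for "near" the limit.
    static final long TOLERANCE_BYTES = 4_096;

    /** Returns true when the file size is close enough to the limit to suspect truncation. */
    static boolean likelyTruncated(long fileSizeBytes) {
        return Math.abs(fileSizeBytes - CRAWL_LIMIT_BYTES) <= TOLERANCE_BYTES;
    }

    /** An EOF exception is worth investigating only when the file was probably not truncated. */
    static boolean shouldReportEof(long fileSizeBytes) {
        return !likelyTruncated(fileSizeBytes);
    }

    public static void main(String[] args) {
        System.out.println(shouldReportEof(46L * 1024)); // a 46 kB file such as 462321.wp
        System.out.println(shouldReportEof(1L << 20));   // a file cut off exactly at the limit
    }
}
```

Under these assumptions, an EOF exception on a 46 kB file is reported, while one on a file of exactly 1 MiB is treated as an expected consequence of truncation.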

        tallison@mitre.org Tim Allison added a comment -

        "There's one other fix that I have in mind before I rerun against WordPerfect."

        Re-checked the code, and it already covered what I thought was an edge case before the most recent patch. Please ignore that.

        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build tika-2.x #252 (See https://builds.apache.org/job/tika-2.x/252/)
        TIKA-2352 – bug fix for WordPerfect parser via Pascal Essiembre. Pull (tallison: rev fe3971a69e203f38214071f6df65430d835592a0)

        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP5DocumentAreaExtractor.java
        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build tika-2.x-windows #205 (See https://builds.apache.org/job/tika-2.x-windows/205/)
        TIKA-2352 – bug fix for WordPerfect parser via Pascal Essiembre. Pull (tallison: rev fe3971a69e203f38214071f6df65430d835592a0)

        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/wordperfect/WP5DocumentAreaExtractor.java
        pascal.essiembre Pascal Essiembre added a comment -

        I had time to look further at one of the files in the lists: "govdocs1\318\318891.wp". It puzzles me and I feel I must be missing something obvious. LibreOffice opens it fine.

        It is read just fine until the last page, where there is an isolated "1" in the middle of the page. The sequence of interest is "31 02 02 DA D0 04 D0", which can be broken down as follows:

        31 - The number "1"
        02 - Control character indicating to print a page number
        02 - Control character indicating to print a page number
        DA - Variable-length function (218) for a "box group"
        D0 - Subfunction code 208. INVALID, possible values range from 0 to 6.
        04 D0 - Function length 53252 (0xD004; two bytes, low byte first). INVALID, greater than the number of bytes left in the file.

        So I do not know why this invalid function code is there and how LibreOffice interprets it fine. It may be the 0x02 also throwing things off... since it is the only place those characters are found in the document and it goes wrong after that.

        In other contexts (non-WP documents), ASCII defines 0x02 as STX (Start of Text): the first character of the message text, which may also be used to terminate the message heading.

        Since there is a page number in the middle, it could be that the page/document is ended there and a new one is appended? If so, not sure then how 0x02 should be treated in relation to that.
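
The breakdown above can be reproduced with a short Java sketch. It decodes the two-byte length as a little-endian word and applies the two sanity checks described in the comment: the subfunction must lie between 0 and 6, and the length must not exceed the bytes remaining. The class and method names are hypothetical, the 0 to 6 range is taken from the comment rather than from a specification, and the remaining-byte count in `main` is a made-up value for illustration.

```java
class VariableLengthHeaderSketch {

    // Highest subfunction for the 0xDA "box group" function, per the comment above.
    static final int MAX_BOX_SUBFUNCTION = 6;

    /** Combines two bytes stored low byte first into one unsigned 16-bit value. */
    static int littleEndianWord(int low, int high) {
        return ((high & 0xFF) << 8) | (low & 0xFF);
    }

    /** Returns true when both the subfunction and the declared length look sane. */
    static boolean isPlausible(int subfunction, int length, long bytesRemaining) {
        boolean subfunctionOk = subfunction >= 0 && subfunction <= MAX_BOX_SUBFUNCTION;
        boolean lengthOk = length <= bytesRemaining;
        return subfunctionOk && lengthOk;
    }

    public static void main(String[] args) {
        int[] seq = {0x31, 0x02, 0x02, 0xDA, 0xD0, 0x04, 0xD0};
        int subfunction = seq[4];
        int length = littleEndianWord(seq[5], seq[6]);
        long bytesRemaining = 2_000; // hypothetical value, not measured from the file
        System.out.println("subfunction = " + subfunction);
        System.out.println("length = " + length);
        System.out.println("plausible = " + isPlausible(subfunction, length, bytesRemaining));
    }
}
```

A parser that runs such checks before skipping could fall back to treating the bytes as ordinary content instead of running into EOF; whether that matches what LibreOffice does is not established in this thread.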

        pascal.essiembre Pascal Essiembre added a comment -

        FYI, "commoncrawl2_likely_broken\W4\W4YNRCMM3TPKQSU24LS6T2PEVWD2FU7Y" and "commoncrawl2\4L\4LCO3UGXCLRSHCKSNB2DDW3MNLE7KP3N" definitely look broken. Neither contains any words when you open it in a text editor (you should see "some"). One cannot be opened by LibreOffice and the other appears empty when opened.

        pascal.essiembre Pascal Essiembre added a comment -

        I also checked some of the Quattro Pro ones, and those I checked did not look too healthy either. If you can find some that open correctly in LibreOffice (or another application), let me know which ones and I will investigate further.


          People

          • Assignee:
            Unassigned
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
            Watchers:
            4
