TIKA-2170: Tika 1.13 ForkParser fails intermittently with very large MS Word docx

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels: None
    • Environment: Windows 10

      Description

      If the ForkParser is run repeatedly in a for-loop against a single large Microsoft Word DOCX file, it fails intermittently. Sometimes it fails on the very first iteration; sometimes it completes several iterations before failing. The results are inconsistent from run to run.

      A small test application is attached. For the test, I use a Word DOCX file containing the full text of "War and Peace" (2.8 MB, 1141 pages of text).
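
      The attached TikaForkParserExample.java is not reproduced here. As a rough, library-free sketch of the loop described above, the harness below runs one attempt repeatedly and records which iterations throw. The names RepeatHarness, Attempt, and failedIterations are hypothetical and do not come from the attachment. In the real test, the attempt body would presumably call ForkParser.parse on the DOCX file; the stand-in here only simulates a failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical harness for illustration; this is not the attached test application.
class RepeatHarness {

    /** One parse attempt. In the real test this would wrap a ForkParser call (assumption). */
    interface Attempt {
        void run(int iteration) throws Exception;
    }

    /** Runs the attempt the given number of times and returns the indices of iterations that threw. */
    static List<Integer> failedIterations(int iterations, Attempt attempt) {
        List<Integer> failed = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            try {
                attempt.run(i);
            } catch (Exception e) {
                // Keep looping so that failures on later iterations are recorded too.
                failed.add(i);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Stand-in for the parse call: throws on odd iterations to simulate flakiness.
        List<Integer> failed = failedIterations(6, i -> {
            if (i % 2 == 1) {
                throw new IllegalStateException("simulated parse failure");
            }
        });
        System.out.println("Failed iterations: " + failed);
    }
}
```

      Recording the failing iteration indices, instead of stopping at the first exception, makes it easier to see whether failures cluster on the first pass or appear only after several successful ones.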

        Attachments

        1. TIKA_2170.patch
          8 kB
          Tim Allison
        2. TikaForkParserExample.java
          3 kB
          Tim Kingsbury
        3. War and Peace.docx
          2.81 MB
          Tim Kingsbury

              People

              • Assignee: Unassigned
              • Reporter: Tim Kingsbury (tkingsbury@lenovo.com)
