TIKA-2170: Tika 1.13 ForkParser fails intermittently with very large MS Word docx


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.15, 2.0.0
    • Component/s: parser
    • Labels: None
    • Environment: Windows 10

    Description

      If ForkParser is run repeatedly in a for-loop against a single large Microsoft Word DOCX file, it fails intermittently. Sometimes it fails on the very first iteration; sometimes it completes several iterations before failing. The results are inconsistent from run to run.

      A small test application is attached (TikaForkParserExample.java). For the test, I use a Word DOCX containing the full text of "War and Peace" (2.8 MB, 1141 pages of text).
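
For readers without the attachment, the repeat-run pattern described above can be sketched roughly as follows. This is not the attached TikaForkParserExample.java. `RepeatRunner`, `Task`, and `Result` are hypothetical names introduced for illustration, and the parse step is a stand-in callback so the sketch needs no Tika dependency. It keeps going after a failure so that the iterations that fail are recorded rather than aborting the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: repeats a task and records which iterations fail.
final class RepeatRunner {

    /** One unit of work, e.g. a single parse of the test document. */
    @FunctionalInterface
    interface Task {
        void run(int iteration) throws Exception;
    }

    /** Outcome of a repeated run: the iteration count and the indices that failed. */
    static final class Result {
        final int iterations;
        final List<Integer> failedIterations;

        Result(int iterations, List<Integer> failedIterations) {
            this.iterations = iterations;
            this.failedIterations = failedIterations;
        }
    }

    /** Runs the task the given number of times, continuing after failures. */
    static Result run(int iterations, Task task) {
        List<Integer> failed = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            try {
                task.run(i);
            } catch (Exception e) {
                // Record the failing iteration and carry on with the next one.
                failed.add(i);
                System.err.println("iteration " + i + " failed: " + e);
            }
        }
        return new Result(iterations, failed);
    }

    public static void main(String[] args) {
        // Stand-in task that fails on iteration 3. In the reported scenario the
        // callback would instead open the DOCX and call
        // ForkParser.parse(stream, handler, metadata, context).
        Result r = run(10, i -> {
            if (i == 3) {
                throw new IllegalStateException("simulated failure");
            }
        });
        System.out.println(r.failedIterations.size() + " of " + r.iterations + " iterations failed");
    }
}
```

With a real parse call plugged in, the list of failed iterations makes it easy to see whether failures cluster at the start or appear only after several successful runs, which is the inconsistency the report describes.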

      Attachments

        1. TIKA_2170.patch (8 kB, Tim Allison)
        2. TikaForkParserExample.java (3 kB, Tim Kingsbury)
        3. War and Peace.docx (2.81 MB, Tim Kingsbury)


            People

              Assignee: Unassigned
              Reporter: Tim Kingsbury (tkingsbury@lenovo.com)
              Votes: 0
              Watchers: 3
