TIKA-2170

Tika 1.13 ForkParser fails intermittently with very large MS Word docx

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels: None
    • Environment: Windows 10

      Description

      If the ForkParser is run in a for-loop over and over against a single large Microsoft Word DOCX file, it fails intermittently. Sometimes it will fail on the very first iteration. Sometimes it will run through several iterations before failing. Results are inconsistent.

      A small test application is enclosed. For the test, I use a Word DOCX containing the full text of "War and Peace" (2.8 MB, 1,141 pages of text).

      1. TIKA_2170.patch (8 kB, Tim Allison)
      2. TikaForkParserExample.java (3 kB, Tim Kingsbury)
      3. War and Peace.docx (2.81 MB, Tim Kingsbury)

          Activity

          tkingsbury@lenovo.com Tim Kingsbury added a comment -

          Sample Java app to demonstrate problem

          tallison@mitre.org Tim Allison added a comment - edited

          I'm able to reproduce this problem. The ForkServer is shutting down with exit value 0, which is a good thing.

          I think the problem is that the ForkServer shuts down if it hasn't received or sent any data in 5 seconds.

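              // ForkServer's monitor thread, as quoted from the Tika source.
              public void run() {
                  try {
                      // "active" is flipped back to true whenever data is received or sent.
                      while (active) {
                          active = false;
                          Thread.sleep(5000);
                      }
                      // No stream activity during the last 5 seconds: exit normally.
                      System.exit(0);
                  } catch (InterruptedException e) {
                  }
              }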

          When I remove the call to sleep() in your example code, I'm not able to reproduce the problem.

          Even without your call to sleep, though, the ForkServer will shut down whenever a parser goes more than 5 seconds without any stream activity. For example, a parser might slurp the entire input stream and then spend a long time parsing it before writing any output.
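
One way to see why a timeout like this can bite intermittently is to model the loop quoted earlier in this comment. The sketch below is a hypothetical, simplified model: the class `WatchdogModel`, its method, and the timestamps are invented for illustration and are not taken from Tika. It assumes the monitor checks once per pulse whether any stream activity fell inside the preceding window. Under that assumption, a silent stretch shorter than one pulse never triggers a shutdown, one of at least two pulses always does, and anything in between depends on how the gap lines up with the monitor's sleep.

```java
/** Hypothetical model of a flag-and-sleep inactivity monitor; not Tika code. */
final class WatchdogModel {

    /**
     * Returns the time (ms) at which the monitor would shut the process down,
     * or -1 if it survives until endMillis.
     *
     * Assumption: the monitor wakes at pulse, 2*pulse, ... and shuts down if no
     * activity timestamp fell inside the window that just ended.
     */
    static long shutdownTime(long pulseMillis, long[] activityMillis, long endMillis) {
        for (long start = 0; start + pulseMillis <= endMillis; start += pulseMillis) {
            long end = start + pulseMillis;
            boolean active = false;
            for (long t : activityMillis) {
                if (t >= start && t < end) {
                    active = true;
                    break;
                }
            }
            if (!active) {
                return end; // a whole window passed with no reads or writes
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Both runs contain the same 7-second silent gap, shifted by 2 seconds.
        long[] early = {1000, 4000, 11000};
        long[] late = {1000, 6000, 13000};

        System.out.println(shutdownTime(5000, early, 15000));  // 10000
        System.out.println(shutdownTime(5000, late, 15000));   // -1
        System.out.println(shutdownTime(30000, early, 15000)); // -1
    }
}
```

In this model the two runs differ only in when the quiet period starts relative to the monitor's sleep, yet only one of them is shut down. That is consistent with, though not proof of, the run-to-run variation reported in this issue.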

          We could parameterize the amount of sleep before shutting-down-on-no-stream-activity if that would help.
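
A parameterized monitor could look roughly like the sketch below. This is an illustration only, not the patch attached to this issue: `InactivityWatchdog`, `pulseMillis`, `touch()`, and `onIdle` are hypothetical names, and the shutdown action is passed in as a callback so the sketch can run without calling `System.exit`.

```java
/** Hypothetical stand-in for the monitor thread quoted above; not Tika's actual class. */
final class InactivityWatchdog implements Runnable {
    private final long pulseMillis;  // configurable replacement for the fixed 5000
    private final Runnable onIdle;   // the quoted code calls System.exit(0) at this point
    private volatile boolean active = true;

    InactivityWatchdog(long pulseMillis, Runnable onIdle) {
        if (pulseMillis <= 0) {
            throw new IllegalArgumentException("pulseMillis must be positive");
        }
        this.pulseMillis = pulseMillis;
        this.onIdle = onIdle;
    }

    /** The I/O path would call this whenever bytes are read or written. */
    void touch() {
        active = true;
    }

    @Override
    public void run() {
        try {
            while (active) {
                active = false;
                Thread.sleep(pulseMillis);
            }
            // A full pulse passed without touch() being called.
            onIdle.run();
        } catch (InterruptedException e) {
            // Restore the interrupt flag rather than swallowing it.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Short pulse so the demo finishes quickly; a caller with slow parsers
        // would pass something larger, such as 30000.
        new InactivityWatchdog(100, () -> System.out.println("idle, shutting down")).run();
    }
}
```

The only behavioral change from the quoted loop is that the sleep interval comes from the constructor, so a caller who expects long silent parsing phases can choose a larger value.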

          tkingsbury@lenovo.com Tim Kingsbury added a comment -

          Hey Tim,
          The 5-second sleep on my side is definitely not required (that was just an experiment). It's interesting that you guys time out after 5 seconds of inactivity. How long you spend parsing a very large file is going to be hardware-dependent. Perhaps your machine is a bit faster than mine? I see it fail about 20% of the time at random. If I try with a smaller file, things become stable.

          >>We could parameterize the amount of sleep before shutting-down-on-no-stream-activity if that would help.

          That sounds like a perfect solution to me. I would bet that if I could bump the timeout up to 30 seconds, the problem might go away entirely.

          -Tim

          tallison@mitre.org Tim Allison added a comment -

          >>That sounds like a perfect solution to me. I would bet that if I could bump the timeout up to 30 seconds, the problem might go away entirely.

          If only someone had suggested and implemented this earlier...

          Will do.

          tkingsbury@lenovo.com Tim Kingsbury added a comment -

          Awesome
          Do you have any sort of a wild guess for an ETA?

          tallison@mitre.org Tim Allison added a comment -

          Wednesday in trunk, but it won't make it into 1.14, which is soon to be released.

          tallison@mitre.org Tim Allison added a comment -

          This changes the <init> signature of some package-private classes. If there are any objections or other recommendations, I'm happy to modify. If I don't hear anything back by, say, November 9, I'll commit this.

          tkingsbury@lenovo.com Tim Kingsbury added a comment -

          Cool. I'll be eagerly awaiting Tika 1.15 in the coming months.

          tkingsbury@lenovo.com Tim Kingsbury added a comment -

          Hey Tim, one more question: when the ForkParser exits abnormally, two files (.jar and .tmp) are left behind in the user's Temp folder. If we experience a large number of failures over a week, we risk running out of disk space. Should I log this as a separate bug?

          tallison@mitre.org Tim Allison added a comment -

          Yes. Please re-open TIKA-1933. I'm not sure there is a quick fix, but I can look again.

          tallison@mitre.org Tim Allison added a comment -

          Please reopen if you're still having problems after increasing the timeout.

          hudson Hudson added a comment -

          UNSTABLE: Integrated in Jenkins build Tika-trunk #1140 (See https://builds.apache.org/job/Tika-trunk/1140/)
          TIKA-2170 – allow users to configure timeout for ForkServer (tallison: rev e8bf985040e8b4fe6975468a6b912edd2d3e03d5)

          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkServer.java
          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkClient.java
          • (edit) CHANGES.txt
          • (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTest.java
          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkParser.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1141 (See https://builds.apache.org/job/Tika-trunk/1141/)
          TIKA-2170 – fix unit test to allow for different exceptions to be (tallison: rev 2e325cb7103f3e7090433264a52f9787200c526e)

          • (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTest.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #75 (See https://builds.apache.org/job/tika-2.x-windows/75/)
          TIKA-2170 allow configuration of timeout for ForkServer (tallison: rev 7adfe1cb5490bed1c912c21ccb29f56244485017)

          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkParser.java
          • (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTest.java
          • (edit) CHANGES.txt
          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkServer.java
          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkClient.java
          TIKA-2170 fix unit test to allow for different exceptions depending on (tallison: rev 7df6fe4be6ac92fcf3ed7773ae83a1061ee8db02)
          • (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #174 (See https://builds.apache.org/job/tika-2.x/174/)
          TIKA-2170 allow configuration of timeout for ForkServer (tallison: rev 7adfe1cb5490bed1c912c21ccb29f56244485017)

          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkServer.java
          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkClient.java
          • (edit) CHANGES.txt
          • (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTest.java
          • (edit) tika-core/src/main/java/org/apache/tika/fork/ForkParser.java
          TIKA-2170 fix unit test to allow for different exceptions depending on (tallison: rev 7df6fe4be6ac92fcf3ed7773ae83a1061ee8db02)
          • (edit) tika-core/src/test/java/org/apache/tika/fork/ForkParserTest.java

            People

            • Assignee: Unassigned
            • Reporter: tkingsbury@lenovo.com Tim Kingsbury
            • Votes: 0
            • Watchers: 3
