TIKA-3103

Tesseract fails to respect timeouts and clean up after itself


    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.24.1
    • Fix Version/s: None
    • Component/s: ocr
    • Labels: None

      Description

      We're using the Tika Server with OCR:

      java -jar /opt/tika/tika-server-1.24.1.jar -p 9998 -spawnChild -JXmx500m

       

      Two undesirable things happen:

      1. The CPU runs at 100% for >10 minutes, long after any Tika requests have finished.

      These processes show up in top as "tesseract" (OCR) and keep all CPU cores at 100%.

      They eventually die (or finish?), but the machine is unusable in the meantime.

      Expected behaviour: Tika cleans up the processes it spawns, at the latest once its timeout limit has passed (which is 2 minutes, I believe?).

      2. The temp directory fills up with files like:

      root@38acd588ee22:/# ll /tmp/
      total 197320
      drwxrwxrwt 1 root root 24576 May 20 09:35 ./
      drwxr-xr-x 1 root root 4096 May 20 08:40 ../
      -rw-r--r-- 1 root root 9273920 May 20 08:56 TIKA_streamstore_11144988934311367241.tmp
      -rw-r--r-- 1 root root 8938048 May 20 08:57 TIKA_streamstore_11649337406504198407.tmp
      -rw-r--r-- 1 root root 9478720 May 20 08:56 TIKA_streamstore_13551529918743702933.tmp
      -rw-r--r-- 1 root root 9151040 May 20 08:57 TIKA_streamstore_13568226047805501311.tmp
      -rw-r--r-- 1 root root 7701056 May 20 08:56 TIKA_streamstore_13908373602714189455.tmp

      -rw-r--r-- 1 root root 33367 May 20 08:55 apache-tika-11167866320029165062.tmp
      -rw-r--r-- 1 root root 44353 May 20 08:54 apache-tika-1152515137515755865.tmp
      -rw-r--r-- 1 root root 245279 May 20 08:52 apache-tika-12106368488659105236.tmp
      -rw-r--r-- 1 root root 1759 May 20 08:47 apache-tika-12291680472524021463.tmp

       

      These files are never removed and slowly fill up the disk.

      Expected behaviour: Tika cleans up disk space after itself.
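
      As a stopgap for the temp files, something like the following sketch could be run periodically (for example from cron). It is only an illustration, not part of Tika: the function name, the 10-minute age threshold and the default directory are assumptions, and the two file name patterns are simply taken from the listing above. It only deletes files that have not been modified recently, on the assumption that files still in use by a running request are younger than the threshold.

```python
import time
from pathlib import Path

# File name patterns taken from the /tmp listing in this report.
PATTERNS = ("TIKA_streamstore_*.tmp", "apache-tika-*.tmp")


def remove_stale_tika_temp_files(tmp_dir="/tmp", max_age_seconds=600):
    """Delete matching files last modified more than max_age_seconds ago.

    Returns the sorted names of the files that were removed.
    The threshold is an assumption; pick one longer than your slowest request.
    """
    cutoff = time.time() - max_age_seconds
    removed = []
    for pattern in PATTERNS:
        for path in Path(tmp_dir).glob(pattern):
            try:
                if path.is_file() and path.stat().st_mtime < cutoff:
                    path.unlink()
                    removed.append(path.name)
            except FileNotFoundError:
                # The file vanished between listing and deletion; nothing to do.
                pass
    return sorted(removed)
```

      Calling `remove_stale_tika_temp_files()` with the defaults would scan /tmp and leave any file younger than ten minutes, and any file not matching the two patterns, untouched.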

       

      These bugs are critical for us. What's the best way to avoid them?
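
      To make the expected behaviour for the runaway processes concrete, here is a minimal sketch of running an external command with a hard timeout and killing its whole process group when the timeout expires. This is not Tika's code and says nothing about how Tika launches tesseract internally; the function name is hypothetical and the approach is POSIX-only.

```python
import os
import signal
import subprocess


def run_with_hard_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or None if it had to be killed.

    The child is started in its own session, so its process group id equals
    its pid and any helper processes it spawns can be killed along with it.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole group, not just the direct child.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            # The process exited on its own just after the timeout fired.
            pass
        proc.wait()  # Reap the child so no zombie is left behind.
        return None
```

      For example, `run_with_hard_timeout(["tesseract", "page.png", "out"], 120)` would give a single OCR call at most two minutes, where `page.png` and `out` are placeholder names.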

       

            People

            • Assignee: Unassigned
            • Reporter: piskvorky Radim Rehurek