TIKA-1404: tika-app server leaking temporary files when converting Word97 (doc)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5, 1.7
    • Fix Version/s: 1.7
    • Component/s: cli
    • Labels: None
    • Environment: Linux (observed on CentOS 6.5 and SuSE SLES 11)

    Description

      When converting Word97 documents (*.doc), tika-server reproducibly leaves behind temporary files.

      Steps to reproduce:

      • Start tika-app-1.5.jar in --server mode
      • Send a *.doc file to the server for conversion
      • Stop tika-server using CTRL+C or kill -15

      For example:

      lukas@host:~> java -jar tika-app-1.5.jar -v --server --port 8077 --text
      
      # ...
      
      lukas@host:/tmp> ls -lah apache-tika-*
      ls: cannot access apache-tika-*: No such file or directory
      lukas@host:/tmp>
      lukas@host:/tmp> netcat 127.0.0.1 8077 < simple_word97.doc
      Simple Word-97 Document
      Lorem Ipsum.
      lukas@host:/tmp> ls -lah apache-tika-*
      -rw-r--r-- 1 lukas users 22K 2014-08-29 15:48 apache-tika-2457738389388821864.tmp
      
      # after conversion is done, tmp file handles are still open
      
      lukas@host:/tmp> lsof | grep tika
      java   29857   lukas   32r   REG   104,2  28628386  4571740 /home/lukas/tika-app-1.5.jar
      java   29857   lukas   85r   REG   104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp
      java   29857   lukas   86r   REG   104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp
      
      # stop tika-server...
      
      ^C
      lukas@host:~>
      
      # ...
      
      lukas@host:/tmp> lsof | grep tika
      lukas@host:/tmp>
      

      No exceptions are thrown and the plain text is extracted correctly from the document, but a temporary file is still left behind after every single conversion.
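
      The manual check in the transcript above could be automated roughly as follows. This is a minimal sketch, not part of Tika: the function names are hypothetical, the host and port default to the values used in the transcript, and it assumes that the server treats end of input on the socket as end of document and that the temporary files land in the directory Python reports as the system temp directory.

```python
import glob
import os
import socket
import tempfile


def list_tika_temp_files(tmp_dir=None):
    """Return the sorted paths of apache-tika-*.tmp files in tmp_dir."""
    tmp_dir = tmp_dir or tempfile.gettempdir()
    return sorted(glob.glob(os.path.join(tmp_dir, "apache-tika-*.tmp")))


def convert_via_server(doc_path, host="127.0.0.1", port=8077, timeout=30.0):
    """Send a document to a listening server and return the text it replies with."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with open(doc_path, "rb") as doc:
            conn.sendall(doc.read())
        # Half-close the socket so the server sees end of input
        # (assumption: this is how the server detects the end of the document).
        conn.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def leaked_after_conversion(doc_path, tmp_dir=None, **server):
    """Return temp files that exist after a conversion but did not exist before it."""
    before = set(list_tika_temp_files(tmp_dir))
    convert_via_server(doc_path, **server)
    after = set(list_tika_temp_files(tmp_dir))
    return sorted(after - before)
```

      Calling leaked_after_conversion("simple_word97.doc") against a running server would return the paths of any new apache-tika-*.tmp files, which makes the leak easy to check in a loop or in a regression test.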

      This is a major issue in a production environment that converts thousands of documents a day. Our temp directories fill up rapidly, and we had to configure cron jobs to clean up after Tika on most of our production servers. I wasn't able to reproduce this issue using tika-app-1.5.jar in non-server mode, but booting up a JVM for every single conversion is just too slow.
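
      For illustration, a cleanup job of the kind mentioned above might look like the sketch below. It is not the script we actually run: the function name, the one-hour age threshold, and the use of the system temp directory are all assumptions. The age threshold is there so that files belonging to a conversion still in progress are left alone.

```python
import glob
import os
import tempfile
import time


def remove_stale_tika_temp_files(tmp_dir=None, max_age_seconds=3600, now=None):
    """Delete apache-tika-*.tmp files older than max_age_seconds; return deleted paths."""
    tmp_dir = tmp_dir or tempfile.gettempdir()
    now = time.time() if now is None else now
    removed = []
    for path in sorted(glob.glob(os.path.join(tmp_dir, "apache-tika-*.tmp"))):
        try:
            age = now - os.path.getmtime(path)
            if age > max_age_seconds:
                os.remove(path)
                removed.append(path)
        except FileNotFoundError:
            # The file vanished between listing and removal; nothing to do.
            continue
    return removed


if __name__ == "__main__":
    for path in remove_stale_tika_temp_files():
        print("removed", path)
```

      Note that on Linux, removing a file that a running process still holds open only unlinks its name. The disk space is released once the process closes the handle, which in the transcript above only happens when the server is stopped.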

          People

            Assignee: Nick Burch (nick)
            Reporter: Lukas Graf (lukasgraf)