Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-1404

tika-app server leaking temporary files when converting Word97 (doc)

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.5, 1.7
    • 1.7
    • cli
    • None
    • Linux (observed on CentOS 6.5 and SuSE SLES 11)

    Description

      When converting Word97 documents (*.doc), tika-server reproducibly leaves behind temporary files.

      Steps to reproduce:

      • Start tika-app-1.5.jar in --server mode
      • Send a *.doc file to server for conversion
      • Stop tika-server using CTRL+C or kill -15

      For example:

      lukas@host:~> java -jar tika-app-1.5.jar -v --server --port 8077 --text
      
      # ...
      
      lukas@host:/tmp> ls -lah apache-tika-*
      ls: cannot access apache-tika-*: No such file or directory
      lukas@host:/tmp>
      lukas@host:/tmp> netcat 127.0.0.1 8077 < simple_word97.doc
      Simple Word-97 Document
      Lorem Ipsum.
      lukas@host:/tmp> ls -lah apache-tika-*
      -rw-r--r-- 1 lukas users 22K 2014-08-29 15:48 apache-tika-2457738389388821864.tmp
      
      # after conversion is done, tmp file handles are still open
      
      lukas@host:/tmp> lsof | grep tika
      java   29857   lukas   32r   REG   104,2  28628386  4571740 /home/lukas/tika-app-1.5.jar
      java   29857   lukas   85r   REG   104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp
      java   29857   lukas   86r   REG   104,2     22528  8604717 /tmp/apache-tika-2457738389388821864.tmp
      
      # stop tika-server...
      
      ^C
      lukas@host:~>
      
      # ...
      
      lukas@host:/tmp> lsof | grep tika
      lukas@host:/tmp>
      

      No exceptions are thrown, and the plaintext is being extracted correctly from the document, but temporary files are still left behind every single time.

      This obviously is a major issue in a production environment when converting thousands of documents a day. Our temp directories are filling up rapidly, and we had to configure cron jobs to clean up after Tika on most of our production servers. I wasn't able to reproduce this issue using tika-app-1.5.jar in non-server mode. However, booting up a JVM for every single conversion is just too slow.

      Attachments

        1. simple_word97.doc
          22 kB
          Lukas Graf

        Activity

          People

            nick Nick Burch
            lukasgraf Lukas Graf
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: