Tika / TIKA-1776

Tika stops converting at this PDF document


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.10
    • Fix Version/s: None
    • Component/s: batch
    • Labels: None
    • Environment:

      Description

      Hi and thank you all for this great project,

      I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs and convert them from PDF to XML. That works well and takes at most 1-2 minutes, even for big files. But for over 15 hours now the process has been hanging at 0% CPU load on one file:
      http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
      which is only 5 MB in size but contains text, scans, and CAD plans.

      I run "get_xml()" from the following class (located in tika_app.rb):
      -----------------------------
      require 'rubygems'
      require 'stringio'
      require 'open4'

      class TikaApp
        def initialize(document)
          filename = File.basename(document)
          t = Time.now
          puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
          @document = document
          java_cmd = 'java'
          java_args = '-server -Djava.awt.headless=true'
          tika_path = "tika-app.jar"
          @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
        end

        def get_xml
          run_tika('--xml')
        end

        def get_metadata
          run_tika('--metadata --json')
        end

        private

        def run_tika(option)
          final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
          pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
          stdout_result = stdout.read.strip
          stderr_result = stderr.read.strip
          unless strip_stderr(stderr_result).empty?
          end

          stdout_result
        ensure
          stdin.close
          stdout.close
          stderr.close
        end

        def strip_stderr(s)
          s.gsub(/^(info|warn) - .*$/i, '').strip
        end
      end
      -----------------------------

      The Tika command that this function builds looks like this:
      java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
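
      The issue does not record why the process stalled, so the following is only a sketch under an assumption, not a confirmed diagnosis. A wrapper that reads stdout to the end before it touches stderr can block at 0% CPU if the child fills the stderr pipe first: the child waits for stderr to be drained while the parent waits for stdout to close. The helper below, `run_with_timeout`, is hypothetical (it is not part of the class above and not a Tika API). It uses Ruby's standard `open3` library instead of `open4`, drains both pipes in separate threads, and kills the child after a caller-chosen number of seconds.

```ruby
require 'open3'
require 'timeout'

# Hypothetical helper, not from the original report.
# cmd is an argv array, so no shell quoting is involved.
# Returns [stdout, stderr, Process::Status]; raises Timeout::Error on timeout.
def run_with_timeout(cmd, timeout: 300)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Read both pipes concurrently so neither can fill up and block the child.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # join returns nil if the child is still running after `timeout` seconds.
    unless wait_thr.join(timeout)
      Process.kill('KILL', wait_thr.pid)
      wait_thr.join
      [out_reader, err_reader].each(&:join)
      raise Timeout::Error, "command exceeded #{timeout}s"
    end

    [out_reader.value, err_reader.value, wait_thr.value]
  end
end

# Possible use with the command from this report (paths are the reporter's):
#   xml, warnings, status = run_with_timeout(
#     ['java', '-Djava.awt.headless=true', '-jar', 'tika-app.jar', '--xml',
#      File.expand_path('~/data/00149624.pdf')],
#     timeout: 300
#   )
```

      Passing the command as an array avoids the shell, and `File.expand_path` resolves the leading `~`, which a shell does not expand inside single quotes. The 300-second limit is an arbitrary placeholder.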

            People

            • Assignee: Unassigned
            • Reporter: tranquillo tranquillo
            • Votes: 0
            • Watchers: 2