  1. Tika
  2. TIKA-1776

Tika stops converting at this PDF document


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.10
    • Fix Version/s: None
    • Component/s: batch
    • Labels: None

    Description

      Hi and thank you all for this great project,

      I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs and convert them from PDF to XML. That works pretty well and takes at most 1-2 minutes even for big files. But for over 15 hours now the process has been hanging at 0% CPU load on one file:
      http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
      which is only 5 MB in size but contains text, scans and CAD plans.

      I run "get_xml()" from the following class (located in tika_app.rb):
      -----------------------------
      require 'rubygems'
      require 'stringio'
      require 'open4'

      class TikaApp
        def initialize(document)
          filename = File.basename(document)
          t = Time.now
          puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
          @document = document
          java_cmd = 'java'
          java_args = '-server -Djava.awt.headless=true'
          tika_path = "tika-app.jar"
          @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
        end

        def get_xml
          run_tika('--xml')
        end

        def get_metadata
          run_tika('--metadata --json')
        end

        private

        def run_tika(option)
          final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
          pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
          stdout_result = stdout.read.strip
          stderr_result = stderr.read.strip
          unless strip_stderr(stderr_result).empty?
          end

          stdout_result
        ensure
          stdin.close
          stdout.close
          stderr.close
        end

        def strip_stderr(s)
          s.gsub(/^(info|warn) - .*$/i, '').strip
        end
      end
      ----------
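
A hang at 0% CPU is consistent with a pipe deadlock in the wrapper rather than with Tika itself, although this report does not confirm that. run_tika reads stdout to the end before it touches stderr, so a child that fills the stderr pipe buffer blocks, never closes stdout, and both sides wait forever. The sketch below is one way to rule that out. It uses Open3 from the standard library instead of open4, and the helper name run_with_timeout and its return shape are hypothetical, not part of tika_app.rb.

```ruby
require 'open3'

# Hypothetical helper (not part of tika_app.rb): runs a command given as an
# argument array, drains stdout and stderr in parallel so that neither pipe
# can fill up and block the child, and kills the child after `seconds`.
def run_with_timeout(cmd, seconds)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close

    # One reader thread per pipe, started before we wait for the child.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join(limit) returns nil if the limit elapses first.
    timed_out = wait_thr.join(seconds).nil?
    if timed_out
      begin
        Process.kill('KILL', wait_thr.pid)
      rescue Errno::ESRCH
        # The child exited on its own just after the deadline.
      end
      wait_thr.join
    end

    { stdout: out_reader.value, stderr: err_reader.value, timed_out: timed_out }
  end
end
```

With this helper, run_tika could pass the java command as an array and treat timed_out as a signal to skip the file and log it, so a single document cannot stall a batch of thousands. The timeout value is a judgment call; the 1-2 minutes reported above for large files gives a lower bound.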

      The Tika command built by this function looks like this:
      java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
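
If the path is passed literally as shown, note that POSIX shells do not expand ~ inside single quotes, so Tika would receive the string ~/data/00149624.pdf rather than a path in the home directory. The path may simply have been abbreviated for this report, so treat this as a possibility only. A minimal sketch of building the same command as an argument array, which avoids shell quoting altogether, could look like this; the helper name tika_argv is hypothetical.

```ruby
# Hypothetical helper: builds the Tika command line as an argument array.
# Passing an array to a process-spawning call bypasses the shell, so no
# quoting is needed; File.expand_path resolves a leading "~" explicitly.
def tika_argv(document, option = '--xml', jar = 'tika-app.jar')
  ['java', '-server', '-Djava.awt.headless=true', '-jar', jar] +
    option.split +
    [File.expand_path(document)]
end
```

Calling tika_argv('~/data/00149624.pdf') yields the same flags as the command above, with the document path expanded to an absolute one.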


          People

            Assignee: Unassigned
            Reporter: tranquillo
            Votes: 0
            Watchers: 2
