Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.10
Fix Version/s: None
Component/s: None
Description
Hi, and thank you all for this great project.
I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs and convert them to XML. That works well and takes at most 1-2 minutes, even for big files. But for over 15 hours now the process has been hanging at 0% CPU load on one file:
http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
which is only 5 MB in size but contains text, scans, and CAD plans.
I run "get_xml()" from the following class (located in tika_app.rb):
-----------------------------
require 'rubygems'
require 'stringio'
require 'open4'

class TikaApp
  def initialize(document)
    filename = File.basename(document)
    t = Time.now
    puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
    @document = document
    java_cmd = 'java'
    java_args = '-server -Djava.awt.headless=true'
    tika_path = "tika-app.jar"
    @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
  end

  def get_xml
    run_tika('--xml')
  end

  def get_metadata
    run_tika('--metadata --json')
  end

  private

  def run_tika(option)
    final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
    pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
    # stdout is read to EOF before stderr is touched
    stdout_result = stdout.read.strip
    stderr_result = stderr.read.strip
    unless strip_stderr(stderr_result).empty?
      # (empty in the report: remaining stderr output is ignored)
    end
    stdout_result
  ensure
    stdin.close
    stdout.close
    stderr.close
  end

  # Drops "INFO - ..." and "WARN - ..." lines from Tika's stderr output
  def strip_stderr(s)
    s.gsub(/^(info|warn) - .*$/i, '').strip
  end
end
----------
The Tika command built by this class looks like this:
java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
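A hang at 0% CPU with code shaped like run_tika is consistent with a pipe deadlock: stdout is read to EOF before stderr is read at all, so if the child process writes more to stderr than the operating system's pipe buffer can hold, the child blocks on that write while the parent keeps waiting on stdout. Whether that is what happens with this particular PDF is an assumption, not something the report confirms. The following is a minimal sketch of one way to rule it out, using Ruby's standard open3 library in place of open4. The helper name run_with_timeout and the 300-second limit are hypothetical and chosen only for illustration.

```ruby
require 'open3'
require 'timeout'

# Hypothetical helper, not part of tika_app.rb: runs a command given as an
# argument array and returns [stdout, stderr, status].
def run_with_timeout(argv, seconds)
  Open3.popen3(*argv) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Drain both pipes in parallel so neither can fill up and block the child.
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join returns nil if the child is still running after the limit.
    unless wait_thr.join(seconds)
      Process.kill('KILL', wait_thr.pid)
      wait_thr.join
      [out_reader, err_reader].each(&:join)
      raise Timeout::Error, "command still running after #{seconds}s"
    end

    [out_reader.value, err_reader.value, wait_thr.value]
  end
end

# Usage with the command from this report (300 is an illustrative limit):
#   argv = ['java', '-server', '-Djava.awt.headless=true', '-jar', 'tika-app.jar',
#           '--xml', File.expand_path('~/data/00149624.pdf')]
#   xml, warnings, status = run_with_timeout(argv, 300)
```

Passing the arguments as an array avoids building a quoted shell string. The path is expanded with File.expand_path because a shell does not expand "~" inside single quotes; if the path in the command above was only abbreviated for the report, that step is unnecessary.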