[TIKA-1776] tika stop converting at this pdf document - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Problem
Affects Version/s: 1.10
Fix Version/s: None
Component/s: batch
Labels: None
Environment:

Intel Core i5, 4 GB RAM, notebook
OS: Debian 8, x64, GNOME
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-linux]


Description

Hi, and thank you all for this great project.

I use https://github.com/offenesdresden/ratsinfo-scraper to download thousands of PDFs and convert them from PDF to XML. That works pretty well and needs at most 1-2 minutes even for big files. But for over 15 hours now, the process has been hanging with CPU load = 0% at one file:
http://ratsinfo.dresden.de/getfile.php?id=149624&type=do
which is just 5 MB large, but contains text, scans and CAD plans.

I run "get_xml()" from the following class (located in tika_app.rb):
-----------------------------
require 'rubygems'
require 'stringio'
require 'open4'

class TikaApp
  def initialize(document)
    filename = File.basename(document)
    t = Time.now
    puts t.strftime("%H:%M:%S") + ": analyze #{filename}"
    @document = document
    java_cmd = 'java'
    java_args = '-server -Djava.awt.headless=true'
    tika_path = "tika-app.jar"
    @tika_cmd = "#{java_cmd} #{java_args} -jar '#{tika_path}'"
  end

  def get_xml
    run_tika('--xml')
  end

  def get_metadata
    run_tika('--metadata --json')
  end

  private

  def run_tika(option)
    final_cmd = "#{@tika_cmd} #{option} '#{@document}'"
    pid, stdin, stdout, stderr = Open4::popen4(final_cmd)
    stdout_result = stdout.read.strip
    stderr_result = stderr.read.strip
    unless strip_stderr(stderr_result).empty?
    end

    stdout_result
  ensure
    stdin.close
    stdout.close
    stderr.close
  end

  def strip_stderr(s)
    s.gsub(/^(info|warn) - .*$/i, '').strip
  end
end
----------

The Tika command built by this function looks like this:
java -server -Djava.awt.headless=true -jar 'tika-app.jar' --xml '~/data/00149624.pdf'
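
The `run_tika` method above reads all of stdout before it reads any of stderr, and it sets no time limit on the Java process. If a child process writes enough to stderr to fill the pipe while the parent is still waiting on stdout, both sides can block with no CPU use. Whether that is what happens with this PDF is not confirmed in this report. The following is only a minimal sketch, assuming Ruby's standard `open3` library instead of `open4`; the helper name `run_with_timeout` and the 300-second default are hypothetical and not part of ratsinfo-scraper.

```ruby
require 'open3'

# Hypothetical helper (not part of the TikaApp class in this report).
# Runs a command given as an argument array, drains stdout and stderr in
# separate threads so that neither pipe can fill up and stall the child,
# and kills the child if it runs longer than `timeout` seconds.
def run_with_timeout(cmd_args, timeout: 300)
  Open3.popen3(*cmd_args) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    out_reader = Thread.new { stdout.read }
    err_reader = Thread.new { stderr.read }

    # Thread#join returns nil when the time limit expires first.
    unless wait_thr.join(timeout)
      Process.kill('KILL', wait_thr.pid)
      wait_thr.join
      [out_reader, err_reader].each { |t| t.kill; t.join }
      raise "command timed out after #{timeout}s: #{cmd_args.join(' ')}"
    end

    # Returns [stdout text, stderr text, Process::Status].
    [out_reader.value, err_reader.value, wait_thr.value]
  end
end

# Possible call for the file from this report. The argument array bypasses
# the shell, so "~" has to be expanded explicitly:
#
#   xml, err, status = run_with_timeout(
#     ['java', '-server', '-Djava.awt.headless=true', '-jar', 'tika-app.jar',
#      '--xml', File.expand_path('~/data/00149624.pdf')],
#     timeout: 300
#   )
```

Passing the command as an array also sidesteps the single-quote handling in `final_cmd`, since no shell is involved. With a time limit in place, a batch run can log the failing document and move on to the next one instead of waiting indefinitely.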

People

Assignee: Unassigned

Reporter: tranquillo

Votes: 0

Watchers: 2

Dates

Created: 20/Oct/15 07:03

Updated: 21/Oct/15 01:36

Resolved: 21/Oct/15 01:36