[SOLR-7139] SolrContentHandler for TIKA is broken by TikaOCR (caused by multiple startDocument events) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 4.10.3
Fix Version/s: 4.10.4, 5.1, 6.0
Component/s: contrib - Solr Cell (Tika extraction)
Labels: None

Description

While testing my large scale Tika/SolrCell indexing (great work on /extraction guys, really really appreciate it) on my 40M image dataset, I was pulling my frickin' hair out trying to figure out why the TesseractOCR extracted content wasn't actually making it into the index. Well I figured it out lol (many many System.out.printlns later) - it's the disabling of div tags (=>ignored) in the default solrconfig.xml. This basically renders TesseractOCR output in SolrCell useless since it is surrounded by a div tag.
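The issue title attributes the breakage to multiple startDocument events. The following is a minimal sketch of how that can lose extracted text. It is not the actual SolrContentHandler code or the attached patch: `ResettingHandler`, `GuardedHandler`, and the replayed event order (OCR sub-document first, then the enclosing document) are hypothetical and chosen only for illustration.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StartDocumentDemo {

    /** Hypothetical handler that clears its buffer on every startDocument event. */
    static class ResettingHandler extends DefaultHandler {
        protected final StringBuilder buffer = new StringBuilder();

        @Override
        public void startDocument() {
            // Anything collected before this event is discarded.
            buffer.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length);
        }

        String text() {
            return buffer.toString();
        }
    }

    /** Hypothetical variant that resets only on the first startDocument event. */
    static class GuardedHandler extends ResettingHandler {
        private boolean started = false;

        @Override
        public void startDocument() {
            if (!started) {
                started = true;
                super.startDocument();
            }
        }
    }

    /** Replays an assumed event sequence: OCR sub-document, then the enclosing document. */
    static String replay(ResettingHandler handler) throws SAXException {
        handler.startDocument();
        emit(handler, "ocr text ");
        handler.endDocument();
        handler.startDocument(); // second startDocument on the same handler
        emit(handler, "image metadata");
        handler.endDocument();
        return handler.text();
    }

    private static void emit(DefaultHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    public static void main(String[] args) throws SAXException {
        System.out.println("resetting: [" + replay(new ResettingHandler()) + "]");
        System.out.println("guarded:   [" + replay(new GuardedHandler()) + "]");
    }
}
```

With this assumed sequence, the resetting handler ends up with only `image metadata`, while the guarded one keeps `ocr text image metadata`. Ignoring repeated startDocument events is just one possible guard; the fix actually committed for this issue may work differently.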

Attachments

SOLR-7139.Mattmann.022115.patch.txt (22/Feb/15 05:41, 1 kB, Chris A. Mattmann)
SOLR-7139.patch (26/Feb/15 14:19, 5 kB, Uwe Schindler)

Issue Links

relates to:
TIKA-1445 Figure out how to add Image metadata extraction to Tesseract parser (Resolved)
SOLR-7137 Upgrade to Tika 1.7 in 4_10_3 branch (Closed)

People

Assignee: Uwe Schindler

Reporter: Chris A. Mattmann

Votes: 0

Watchers: 4

Dates

Created: 22/Feb/15 03:56

Updated: 09/May/16 18:47

Resolved: 26/Feb/15 14:39