[SOLR-7137] Upgrade to Tika 1.7 in 4_10_3 branch - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 4.10.3
Fix Version/s: None
Component/s: contrib - Solr Cell (Tika extraction)
Labels: None

Description

I have been trying out SolrCell as an alternative for ingesting around 40M images using Tesseract OCR and Tika. I noticed that in 4.10.3 Tika is pinned to 1.5. With Tika 1.5 in SolrCell 4.10.3, only about 5,600 images out of a 50,000-image subset are ingested when I run a series of 50K cURL commands against the extract handler. I suspected this had something to do with some of the extracted characters being oddball characters (4@#@#/ ^^^^), since Tesseract does not always extract the right text. But then I remembered that Tesseract didn't land in Tika until 1.7.
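
For anyone trying to reproduce the ingest run, here is a rough sketch of one request to the extract handler, written in Python rather than cURL. It assumes the standard `/update/extract` path and the `literal.id` and `commit` parameters of Solr Cell; the base URL, the function names, and the choice to send the raw file bytes as the request body are illustrative assumptions, not the exact commands used in the run described above.

```python
import mimetypes
import urllib.parse
import urllib.request


def extract_url(base_url, doc_id, commit=False):
    """Build a Solr Cell extract URL for one document (base_url is an assumption)."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    query = urllib.parse.urlencode(params)
    return base_url.rstrip("/") + "/update/extract?" + query


def post_file(base_url, path, doc_id, commit=False):
    """POST the raw file bytes to the extract handler and return the HTTP status."""
    # Fall back to a generic type when the extension is unknown.
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        extract_url(base_url, doc_id, commit),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `post_file` once per image and counting the non-200 responses would be one way to see which of the 50K requests fail outright, as opposed to being accepted with empty content.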

Regardless, I thought I'd upgrade the 4.10.x branch to Tika 1.7. The attached patch for this is trivial (Tika + compress updates). Now all 50K images in the 50K subset are ingested, but I'm noticing something else weird. Although Tesseract is called, and although I can verify for certain images that text is extracted by running Tesseract from the command line on the same file, all I get in the "content" field of SolrCell is a run of "\n \n \n \n \n \n" text. So the text is extracted, weird characters included, but it doesn't make it into Solr. Extremely odd.
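
To gauge how many documents are affected, a small check along these lines could flag the ones whose "content" holds only whitespace. It is a sketch under assumptions: documents are taken to be dicts as returned by a JSON query, the `id` key and both function names are hypothetical, and the field may be single- or multi-valued.

```python
def is_blank(value):
    """True when a field value holds nothing but whitespace (or is missing)."""
    if value is None:
        return True
    if isinstance(value, (list, tuple)):
        # Multi-valued field: blank only if every value is blank.
        return all(is_blank(item) for item in value)
    return str(value).strip() == ""


def blank_ids(docs, field="content"):
    """Return the ids of documents whose extracted text is effectively empty."""
    return [doc.get("id") for doc in docs if is_blank(doc.get(field))]
```

Comparing the flagged ids against the command-line Tesseract output for the same files would separate images that truly contain no text from those whose text is lost on the way into Solr.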

Attachments

SOLR-7137.Mattmann.022115.patch.txt
21/Feb/15 23:48
0.8 kB
Chris A. Mattmann

Issue Links

is related to

SOLR-7139 SolrContentHandler for TIKA is broken by TikaOCR (caused by multiple startDocument events) (Closed)

relates to

SOLR-6488 Upgrade to TIKA 1.6 (Closed)

SOLR-6991 Update to Apache TIKA 1.7 (Closed)

People

Assignee: Uwe Schindler
Reporter: Chris A. Mattmann
Votes: 0
Watchers: 1

Dates

Created: 21/Feb/15 23:47
Updated: 22/Feb/15 20:07
Resolved: 22/Feb/15 19:41