[SOLR-6475] SOLR-5517 broke the ExtractingRequestHandler / Tika content-type detection. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.7
Fix Version/s: None
Component/s: contrib - Solr Cell (Tika extraction)
Labels:

Description

Hi,

As discussed with "hoss" on IRC, I'm creating this issue about a problem we recently ran into:

Our company uses Solr to index user-generated files (PDFs, etc.) for full-text search, using the ExtractingRequestHandler / Tika.
Since we upgraded to Solr 4.9, indexing fails with the following exception: "Must specify a Content-Type header with POST requests" (in solr/servlet/SolrRequestParsers.java, line 684 in the 4.9 source).

This behavior was introduced with SOLR-5517, but as the Solr wiki itself states, Tika needs the content type to be empty or absent in order to trigger auto-detection of the content / MIME type.

Since the two features block each other, although each is correct behavior on its own, "hoss" suggested that Tika should be fixed to trigger auto-detection on content-type "application/octet-stream" as well, and I strongly agree with this proposal.

Test case:
Just use the example from the ExtractingRequestHandler wiki page:

curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"  --data-binary @tutorial.html  [-H 'Content-type:text/html']

but don't send the Content-Type header. Alternatively, you could use the "SimplePostTool (post.jar)" mentioned in the wiki, but I guess this would be broken now, too.

Proposed solution:
Fix the Tika content-type guessing so that it also triggers auto-detection when the content type is "application/octet-stream".
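
To make the proposal concrete, the decision rule could look roughly like the sketch below. It is written in Python only for brevity and is not Solr or Tika code: the names `should_auto_detect` and `GENERIC_BINARY_TYPE` are hypothetical, and where such a check would actually live in the extraction code is left open.

```python
from typing import Optional

# Hypothetical constant and helper for illustration; not part of Solr or Tika.
GENERIC_BINARY_TYPE = "application/octet-stream"


def should_auto_detect(content_type: Optional[str]) -> bool:
    """Return True when the declared content type carries no useful
    information, so the MIME type should be guessed from the content."""
    if content_type is None:
        # Header absent: the case that already triggers detection today.
        return True
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Empty value (current behavior) or the generic binary type (proposed).
    return media_type == "" or media_type == GENERIC_BINARY_TYPE
```

With a rule like this, a client could satisfy the Content-Type requirement from SOLR-5517 by sending "application/octet-stream" and still get auto-detection, while an explicit type such as "text/html" would continue to be honored.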

Thanks,
Dominik

People

Assignee: Unassigned

Reporter: Dominik Geelen

Votes: 0

Watchers: 4

Dates

Created: 03/Sep/14 06:14

Updated: 18/Oct/14 17:50