Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.7
Fix Version/s: None
Description
Hi,
as discussed with "hoss" on IRC, I'm creating this issue about a problem we recently ran into:
Our company uses Solr to index user-generated files for fulltext searching (PDFs, etc.) by using the ExtractingRequestHandler / Tika.
Since our recent upgrade to Solr 4.9, the indexing process has been throwing the following exception: "Must specify a Content-Type header with POST requests" (in solr/servlet/SolrRequestParsers.java, line 684 in the 4.9 source).
This behavior was introduced with SOLR-5517. However, as the Solr wiki states, Tika needs the Content-Type to be empty or absent in order to trigger auto-detection of the content/MIME type.
The two features block each other, although each is basically correct behavior on its own. "hoss" therefore suggested that Tika should be fixed to trigger auto-detection on content-type "application/octet-stream" as well, and I strongly agree with this proposal.
Test case:
Just use the example from the ExtractingRequestHandler wiki page:
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" --data-binary @tutorial.html [-H 'Content-type:text/html']
but don't send the Content-Type header, obviously. Alternatively, you could use the "SimplePostTool (post.jar)" mentioned in the wiki, but I guess that is broken now, too.
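For anyone who wants to script the test case, here is a minimal Python sketch that posts a file body with no Content-Type header at all. The helper name post_without_content_type is made up for illustration. The host, port, path and file name in the commented usage are simply taken from the curl example above and assume a local Solr example instance.

```python
import http.client


def post_without_content_type(host, port, path, body):
    """POST raw bytes with no Content-Type header; return (status, response body)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        # http.client adds Host, Accept-Encoding and Content-Length by itself,
        # but never a Content-Type, so passing no headers leaves it absent.
        conn.request("POST", path, body=body)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


# Usage (assumes a Solr instance on localhost:8983 and a local tutorial.html):
# with open("tutorial.html", "rb") as f:
#     status, reply = post_without_content_type(
#         "localhost", 8983,
#         "/solr/update/extract?literal.id=doc5&defaultField=text",
#         f.read())
# print(status, reply[:200])
```

Against an affected Solr version I would expect this request to be rejected with the exception quoted above, but I have only described the curl variant, so treat that as an expectation rather than a verified result.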
Proposed solution:
Fix the Tika content-type guessing so that it also triggers auto-detection when the content-type is "application/octet-stream".
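To make the proposal concrete, here is a rough sketch of the decision I have in mind, written in Python only for brevity. This is not Tika or Solr code. The names GENERIC_TYPES and needs_autodetection are hypothetical, and where exactly such a check would live is something I have not looked into.

```python
# Declared types that say nothing useful about the payload (hypothetical list).
GENERIC_TYPES = {"application/octet-stream"}


def needs_autodetection(content_type):
    """Return True when the declared content type carries no real information."""
    if content_type is None:
        return True
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "" or media_type in GENERIC_TYPES
```

With a check like this, a client could satisfy the new requirement by sending "application/octet-stream" and still get the content type detected automatically, while an explicit type such as "text/html" would be used as given.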
Thanks,
Dominik