SOLR-6475

SOLR-5517 broke the ExtractingRequestHandler / Tika content-type detection.

    Description

      Hi,

      As discussed with "hoss" on IRC, I'm creating this issue about a problem we recently ran into:

      Our company uses Solr to index user-generated files (PDFs, etc.) for full-text search via the ExtractingRequestHandler / Tika.
      Since we upgraded to Solr 4.9, the indexing process throws the following exception: "Must specify a Content-Type header with POST requests" (in solr/servlet/SolrRequestParsers.java, line 684 in the 4.9 source).

      This behavior was introduced with SOLR-5517. However, as the Solr wiki states, Tika needs the Content-Type to be empty or absent in order to trigger auto-detection of the content / MIME type.

      Both behaviors are reasonable on their own, but they block each other. "hoss" therefore suggested that Tika should be fixed to also trigger auto-detection on the content type "application/octet-stream", and I strongly agree with this proposal.

      Test case:
      Just use the example from the ExtractingRequestHandler wiki page:

      curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text"  --data-binary @tutorial.html  [-H 'Content-type:text/html']
      

      but don't send the Content-Type header. Alternatively, you could use the "SimplePostTool (post.jar)" mentioned in the wiki, but I guess this is broken now, too.

      Proposed solution:
      Fix the Tika content-type guessing so that it also triggers auto-detection when the content type is "application/octet-stream".
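
      A minimal sketch of the check this proposal implies, not the actual Solr or Tika code. The class name ContentTypeSketch and the method shouldAutoDetect are hypothetical and only illustrate which Content-Type values would be treated as "unknown, please auto-detect":

```java
import java.util.Locale;

// Hypothetical helper for illustration only; it is not part of Solr or Tika.
class ContentTypeSketch {

    // Assumed rule: a missing, empty, or generic "application/octet-stream"
    // Content-Type means the caller does not know the real type, so the
    // extractor should fall back to auto-detection.
    static boolean shouldAutoDetect(String contentType) {
        if (contentType == null) {
            return true;
        }
        // Ignore parameters such as "; charset=UTF-8" and compare case-insensitively.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() || base.equals("application/octet-stream");
    }

    public static void main(String[] args) {
        String[] samples = { null, "", "application/octet-stream", "text/html" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + shouldAutoDetect(sample));
        }
    }
}
```

      With a rule like this, clients could satisfy the SOLR-5517 check by sending "application/octet-stream" and still get auto-detection, while an explicit type such as "text/html" would be passed through unchanged.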

      Thanks,
      Dominik

          People

            Assignee: Unassigned
            Reporter: Dominik Geelen (dernop)