Tika / TIKA-2724

Tika does not recognize HTTP 3xx status codes when passed fileUrl


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18
    • Fix Version/s: None
    • Component/s: server
    • Labels: None

    Description

      When the fileUrl passed to the Tika server results in a 3xx HTTP status code, the server still returns a 200 response.

      How to reproduce the issue: Run the Tika server with the -enableUnsecureFeatures and -enableFileUrl options. Then send the server a fileUrl whose target responds with a 3xx status code (301 in the example below). Here is a sample curl session:

      $ curl -v google.com
      * Rebuilt URL to: google.com/
      * Trying 216.58.216.142...
      * TCP_NODELAY set
      * Connected to google.com (216.58.216.142) port 80 (#0)
      > GET / HTTP/1.1
      > Host: google.com
      > User-Agent: curl/7.54.0
      > Accept: */*
      >
      < HTTP/1.1 301 Moved Permanently
      < Location: http://www.google.com/
      < Content-Type: text/html; charset=UTF-8
      < Date: Wed, 05 Sep 2018 15:31:51 GMT
      < Expires: Fri, 05 Oct 2018 15:31:51 GMT
      < Cache-Control: public, max-age=2592000
      < Server: gws
      < Content-Length: 219
      < X-XSS-Protection: 1; mode=block
      < X-Frame-Options: SAMEORIGIN
      <
      <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
      <TITLE>301 Moved</TITLE></HEAD><BODY>
      <H1>301 Moved</H1>
      The document has moved
      <A HREF="http://www.google.com/">here</A>.
      </BODY></HTML>
      * Connection #0 to host google.com left intact
      
      $ curl -XPUT -H 'fileUrl:http://google.com' localhost:9998/rmeta/text -v
      * Trying ::1...
      * TCP_NODELAY set
      * Connected to localhost (::1) port 9998 (#0)
      > PUT /rmeta/text HTTP/1.1
      > Host: localhost:9998
      > User-Agent: curl/7.54.0
      > Accept: */*
      > fileUrl:http://google.com
      >
      < HTTP/1.1 200 OK
      < Content-Type: application/json
      < Date: Wed, 05 Sep 2018 15:25:12 GMT
      < Transfer-Encoding: chunked
      < Server: Jetty(8.y.z-SNAPSHOT)
      <
      * Connection #0 to host localhost left intact
      [{"Content-Encoding":"UTF-8","Content-Type":"text/html; charset\u003dUTF-8","Content-Type-Hint":"text/html; charset\u003dUTF-8","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.html.HtmlParser"],"X-TIKA:content":"\n\n\n\n\n\n\n\n\nGoogle\n\n Search Images Maps Play YouTube News Gmail Drive More »\nWeb History | Settings | Sign in\n\n\n \n\n\n\n\n\t \t\n\n\tAdvanced searchLanguage tools\n\n\n\n\nGoogle offered in: Fran�ais \n\n\nAdvertising�ProgramsBusiness Solutions+GoogleAbout GoogleGoogle.ca\n\n© 2018 - Privacy - Terms\n\n\n","X-TIKA:parse_time_millis":"11","dc:title":"Google","title":"Google"}]

       

      I am using the Tika server to pull files from S3 and parse them, but when S3 responds with a redirect, Tika neither follows the redirect nor returns an error code.

      See https://docs.aws.amazon.com/AmazonS3/latest/dev/Redirects.html
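
      Until the server handles this itself, one possible client-side workaround is to check the URL's status before handing it to Tika. The following is only a minimal sketch in POSIX shell, assuming curl is installed and the server listens on localhost:9998 as in the session above. The function names and the TIKA_ENDPOINT variable are hypothetical and are not part of Tika or curl.

```shell
#!/bin/sh
# Hypothetical pre-flight check; none of these names come from Tika or curl.

# Classify an HTTP status code (pure function, no network access).
classify_status() {
  case "$1" in
    2[0-9][0-9]) echo ok ;;
    3[0-9][0-9]) echo redirect ;;
    *)           echo error ;;
  esac
}

# Print only the status code for a URL. curl does not follow redirects
# unless -L is given, so a 3xx response is reported as-is.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Pass the URL to the Tika server only if it answers 2xx directly.
# TIKA_ENDPOINT (an assumption) defaults to the endpoint used above.
parse_if_direct() {
  url=$1
  endpoint=${TIKA_ENDPOINT:-localhost:9998/rmeta/text}
  status=$(fetch_status "$url")
  kind=$(classify_status "$status")
  if [ "$kind" = ok ]; then
    curl -s -X PUT -H "fileUrl:$url" "$endpoint"
  else
    echo "refusing to parse $url: HTTP $status ($kind)" >&2
    return 1
  fi
}
```

      With these functions sourced, `parse_if_direct http://google.com` would refuse to call Tika as long as that URL keeps answering 301 as shown in the session above. The check costs one extra request per file, and the status could in principle change between the check and Tika's own fetch, so it is a stopgap rather than a fix.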

       

          People

            Assignee: Unassigned
            Reporter: Mohsen (mohsen3)