TIKA-1196

JAX-RS server only responds to queries to/from http://localhost

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • Component/s: server
    • Environment: Mac OS X, Windows Server 2008

      Description

      I'm not sure if this is a problem with the Tika JAX-RS server, or with how it uses CXF under the hood. Anyway:

      I have a large text extraction job (10-15 million documents) that I'm using the web service for. It would be nice to be able to distribute this horizontally across multiple nodes to speed up the processing. I had thought to have a job queue with a couple consumers, farming out PUT requests across several Tika web service endpoints.
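The fan-out idea described above could be sketched roughly as follows. This is only an illustration and not part of Tika: the class name `TikaEndpointPool` and the node URLs in the usage note are hypothetical, and the sketch covers only the choice of which endpoint a consumer sends its next PUT request to.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not a Tika class): hands out Tika endpoints in
// round-robin order so several queue consumers can share them.
class TikaEndpointPool {
    private final List<URI> endpoints = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    TikaEndpointPool(String... baseUrls) {
        if (baseUrls.length == 0) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        for (String url : baseUrls) {
            endpoints.add(URI.create(url));
        }
    }

    // Thread-safe: each call returns the next endpoint, wrapping around.
    // floorMod keeps the index non-negative even if the counter overflows.
    URI nextEndpoint() {
        int index = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }
}
```

A consumer would call `nextEndpoint()` before each PUT, for example on a pool built from `http://node1:9998/tika` and `http://node2:9998/tika` (made-up host names). This only works once each server accepts requests addressed to something other than localhost.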

      But the JAX-RS web service will only respond to queries made to http://localhost:9998/tika.

      I can't call http://hostname:9998/tika – even if it's still a local operation.

      Here is a list of things I've tried:

      • I changed line 89 of TikaServerCLI.java to compute the name of the host at runtime. No go: the server starts up, and immediately terminates.
      • I changed line 89 of TikaServerCLI.java to be a hostname (not a FQDN), and re-compiled:
        • mvn compile -rf :tika-server compiles successfully. Start up the server, and it terminates, just like when I tried to compute the hostname at runtime
        • mvn install from the topmost Tika directory gets the service responding to both http://hostname:9998/tika and http://hostname.domain.net:9998/tika (Seemed weird, this is why I was thinking it was further up the chain in CXF?)

      In a perfect world:

      1. The server should respond to any valid calls that make sense:
        • 127.0.0.1
        • localhost
        • hostname
        • host.domain.tld
        • ip_address
      2. A hostname invocation parameter could be used to limit what the service responds to when it's started up. (A very optional, nice-to-have.)

      Attachments

      1. tika-1196.patch (0.8 kB, Rian Stockbower)
      2. tika-1196b.patch (2 kB, Rian Stockbower)
      3. tika-1196c.patch (2 kB, Rian Stockbower)

          Activity

          Nick Burch added a comment -

          I think the problem isn't the hostname, it's the interface bound to. By default it comes up just listening on the loopback 127.0.0.1 interface and not any others:

          $ netstat -nl | grep 9998
          tcp6 0 0 127.0.0.1:9998 :::* LISTEN

          However, I can call it just fine by loopback IP, so it's not the host header:

          $ curl -T test.pdf http://127.0.0.1:9998/meta
          "dcterms:modified","2003-01-28T18:56:48Z"
          "meta:creation-date","2001-12-28T17:15:34Z"
          "meta:save-date","2003-01-28T18:56:48Z"
          "dc:creator","gbish"
          "Last-Modified","2003-01-28T18:56:48Z"
          (snip)

          So, I think we need to provide an option to bind to all interfaces instead of just loopback
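To illustrate the difference between the two kinds of binding, here is a small sketch that uses plain JDK sockets rather than Tika or CXF code. The class and method names are made up for this example. A socket bound to 127.0.0.1 is reachable only from the same machine, while one bound to the wildcard address 0.0.0.0 listens on every interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Illustration only: binds a throwaway socket and inspects its local address.
class BindDemo {
    // Binds to the given host on an OS-chosen free port (port 0) and returns
    // the address the socket is actually listening on.
    private static InetAddress boundAddress(String host) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(InetAddress.getByName(host), 0));
            return socket.getInetAddress();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when a server bound to this host is reachable only via loopback.
    static boolean isLoopbackOnly(String host) {
        return boundAddress(host).isLoopbackAddress();
    }

    // True when a server bound to this host listens on all interfaces.
    static boolean listensOnAllInterfaces(String host) {
        return boundAddress(host).isAnyLocalAddress();
    }
}
```

With this sketch, `isLoopbackOnly("127.0.0.1")` is true and `listensOnAllInterfaces("0.0.0.0")` is true, which matches the netstat output above: the listener is on 127.0.0.1 only.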

          Rian Stockbower added a comment -

          I'm new to this, but it looks like there's no way to call JAXRSServerFactoryBean.setAddress with anything except a literal string. There doesn't seem to be a way to give it a list of valid hostnames.

          http://cxf.apache.org/javadoc/latest/org/apache/cxf/endpoint/AbstractEndpointFactory.html

          Nick Burch added a comment -

          What about "*"? That's often used to mean "listen on everything".

          Otherwise, you'll probably need to join the CXF User mailing list, and ask them there what we should be doing to listen across all interfaces

          Rian Stockbower added a comment -

          Unfortunately that didn't work. I've just emailed the CXF user list.

          Sergey Beryozkin added a comment -

          Hi. A number of fixes have been applied to CXF code dealing with the HTTP host resolution across multiple releases.
          I think users sometimes use "0.0.0.0" instead of the host name, or simply use the relative address. Rian, can you please try CXF 2.7.7?
          Cheers, Sergey

          Rian Stockbower added a comment -

          That worked, Sergey. Changing localhost to 0.0.0.0 now lets me hit the service using any valid address.

          Rian Stockbower added a comment -

          I've attached a patch file that just changes localhost to 0.0.0.0, which allows users to hit the endpoint using any valid IP or hostname.

          Attempting to move the JAX-RS server to CXF 2.7.8 is a little beyond my skill.

          Nick Burch added a comment -

          I'm not sure if that should be an option, or if it's OK to change the default?

          Rian Stockbower added a comment -

          It seems weird to restrict access to the endpoint to only loopback addresses.

          That said, I'm working on something a little more interesting/robust.

          Rian Stockbower added a comment -

          Disregard my first patch. This one changes the default behavior to make the service respond to any valid hostname/ip address. It also adds a CLI parameter to control the address with instructions for the user on how to restrict usage to only loopback addresses.

          Rian Stockbower added a comment -

          Patch C fixes a careless error where the default port was always used, regardless of what was specified by the user.
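For readers who have not opened the patches, the kind of logic involved might look like the sketch below. It is not the patch itself: the option names `--host` and `--port` and the class name are assumptions made for this example, and the defaults (localhost, 9998) are taken from the URL in the report. The point is that both the host and the port supplied by the user must end up in the address string, which is where the port bug crept in.

```java
// Hypothetical sketch, not the actual TikaServerCLI code.
class ServerAddress {
    static final String DEFAULT_HOST = "localhost"; // pass 0.0.0.0 to listen on all interfaces
    static final int DEFAULT_PORT = 9998;

    // Builds the endpoint address from command-line arguments, falling back
    // to the defaults for any option that is not supplied.
    static String fromArgs(String[] args) {
        String host = DEFAULT_HOST;
        int port = DEFAULT_PORT;
        // Each option takes one value, so stop while a value can still follow.
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].equals("--host")) {
                host = args[++i];
            } else if (args[i].equals("--port")) {
                port = Integer.parseInt(args[++i]);
            }
        }
        // Use the parsed port here, not DEFAULT_PORT: always using the
        // default is the mistake described above.
        return "http://" + host + ":" + port + "/";
    }
}
```

With no arguments this yields `http://localhost:9998/`; with `--host 0.0.0.0 --port 8080` it yields `http://0.0.0.0:8080/`.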

          Sergey Beryozkin added a comment -

          Rian, thanks for the patch. I'd prefer going for a 'host' option only and keep the default to 'localhost' as Nick also suggested.
          You are right that it does not make much sense for cases where clients are not collocated, but in those cases we most likely also have to care about secure HTTPS. Making sure the server can run in secure mode is a separate issue IMHO (it can be done by configuring CXF Jetty connectors, or by supporting WAR deployments with the containers taking care of HTTPS).
          Thanks. Sergey

          Rian Stockbower added a comment -

          I can put it back to localhost, but I'm not sure why that's desirable. (Other than that's the way it was.) What's the reasoning behind having it limited to loopback addresses by default? This is not the behavior I would expect as a user. As a user, I would expect it to work like a web service: it does something when I make a semantically valid call to it.

          From an operational perspective, there's some added complexity as well: when I deploy this to N nodes, I'll have to have my invocation script compute the local hostname before launching the service. Admittedly this is a small problem, but I don't see why it needs to be a problem at all.

          What am I missing here?
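The host calculation mentioned above is small in Java. The sketch below is an assumption about how a launcher might do it, not code from Tika; the class name and the fallback parameter are invented for the example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical launcher helper: works out the local host name so it can be
// passed to the server at startup.
class LocalHostName {
    // Returns the machine's host name, or the given fallback when the
    // name cannot be resolved.
    static String resolve(String fallback) {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return fallback;
        }
    }
}
```

A deployment script would call something like `LocalHostName.resolve("localhost")` on each of the N nodes before starting the service. Binding to 0.0.0.0 avoids this step entirely, which is the operational point being made.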

          Sergey Beryozkin added a comment - edited

          IMHO what needs to be decided is which matters more for Tika Server: supporting all the possible host variations out of the box, or expecting users to do more work when the server is accessed remotely. If security is not an issue for the server, then it does not make much sense to keep localhost as the default. But if it is, then opening it up completely by default does not seem right. It would seem reasonable to me for users to have to do more work in such cases, with the host calculation requiring the least effort and setting up the server certificates the most.

          Rian Stockbower added a comment -

          Those are more or less my thoughts. I'll solicit comments from the Tika users mailing list.

          Rian Stockbower added a comment -

          Radio silence from the Tika mailing list. Can we get my latest patch rolled in?

          Sergey Beryozkin added a comment -

          Rian, for now I've introduced a 'host' property that defaults to 'localhost'. I won't have any issue if the team settles on 0.0.0.0 as the default; I just can't make that decision, given that I've not been actively involved.

          Thanks, Sergey

          Rian Stockbower added a comment -

          Sounds reasonable. Thanks, Sergey.

          -Rian


            People

            • Assignee: Sergey Beryozkin
            • Reporter: Rian Stockbower
            • Votes: 1
            • Watchers: 3
