Tika / TIKA-1198

Consider optionally utilizing CXF JAX-RS Attachment support

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5
    • Component/s: server
    • Labels:
      None

      Description

      CXF offers fairly extensive support for multiparts:
      http://cxf.apache.org/docs/jax-rs-multiparts.html

      Perhaps some of that could help the server offer more options for uploading and downloading files.

        Activity

        Sergey Beryozkin created issue -
        Sergey Beryozkin added a comment -

        I've added initial support for multiparts:
        r1548195

        This is CXF-specific, but the CXF-specific code is kept to a minimum. IMHO CXF is very good in the way it treats multiparts, but I have no problem at all with the team replacing it with something more portable.

        Note: in TikaResource I've removed the calls to TikaInputStream.getFile(), which seemed to be adding extra processing time. All the tests pass, and the calls can be added back if needed.

        Sergey Beryozkin added a comment - - edited

        Single-part multiparts are supported. It would be easy to support ones with many parts, or a proper multipart/form-data payload with multiple files, but I'm not sure that makes sense for TikaResource.

        Dave Meikle added a comment -

        Sergey - this change appears to break the current guidance on our Wiki about the commands to use when calling Tika Server (i.e. curl -T or curl -X PUT -d). It is also confusing that it is enabled on some resources but not all.

        I noticed this when running some test scripts I have against the latest build, so I was wondering: is there a way to retain this compatibility?

        Sergey Beryozkin added a comment -

        Can you please clarify how the change has affected it? Thanks

        Dave Meikle added a comment -

        One example would be running the following command based on guidance from our wiki:
        curl -T test.txt http://localhost:9998/tika

        Whilst this works with tika-server from 1.4, in the current 1.5-SNAPSHOT it returns no response to the client and raises the following exception from the tika-server:

        Dec 31, 2013 10:48:03 AM org.apache.cxf.jaxrs.utils.JAXRSUtils readFromMessageBody
        WARNING: No message body reader has been found for request class Attachment, ContentType : application/octet-stream.
        Dec 31, 2013 10:48:03 AM org.apache.cxf.jaxrs.impl.WebApplicationExceptionMapper toResponse
        WARNING: javax.ws.rs.WebApplicationException
        at org.apache.cxf.jaxrs.utils.JAXRSUtils.readFromMessageBody(JAXRSUtils.java:1253)
        at org.apache.cxf.jaxrs.utils.JAXRSUtils.processParameter(JAXRSUtils.java:782)
        at org.apache.cxf.jaxrs.utils.JAXRSUtils.processParameters(JAXRSUtils.java:741)
        at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.processRequest(JAXRSInInterceptor.java:254)
        at org.apache.cxf.jaxrs.interceptor.JAXRSInInterceptor.handleMessage(JAXRSInInterceptor.java:90)
        at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:272)
        at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.serviceRequest(JettyHTTPDestination.java:355)
        at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:319)
        at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:72)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1088)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1024)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:370)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:651)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
        at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
        at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:722)

        Jeremy added a comment -

        I agree, the old command produces the same error with the snapshot. The new curl command is:
        curl -X PUT --form upload=@file http....

        It's working with that.
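
To make concrete what the form-based upload changes on the wire, here is a minimal sketch of a single-part multipart/form-data body of the kind a client such as curl produces for a `--form upload=@file` argument. The field name `upload` comes from the command above; the class name, the boundary string, the file name and the part's `text/plain` type are illustrative assumptions, and real clients generate their own boundary.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a one-file multipart/form-data payload by hand.
final class MultipartBodySketch {

    // Value for the request's Content-Type header; the boundary must match the body.
    static String contentTypeHeader(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    // One part: delimiter line, part headers, blank line, content, closing delimiter.
    static byte[] singleFilePart(String boundary, String field, String fileName,
                                 String partType, byte[] content) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + partType + "\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        byte[] h = head.getBytes(StandardCharsets.US_ASCII);
        byte[] t = tail.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[h.length + content.length + t.length];
        System.arraycopy(h, 0, out, 0, h.length);
        System.arraycopy(content, 0, out, h.length, content.length);
        System.arraycopy(t, 0, out, h.length + content.length, t.length);
        return out;
    }

    public static void main(String[] args) {
        String boundary = "sketch-boundary"; // assumption: any token not occurring in the content
        byte[] body = singleFilePart(boundary, "upload", "test.txt", "text/plain",
                "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("Content-Type: " + contentTypeHeader(boundary));
        System.out.print(new String(body, StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is that the form upload always carries an explicit `multipart/form-data` Content-Type, whereas the plain `curl -T` upload carries only the raw file bytes.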

        Sergey Beryozkin added a comment - - edited

        Dave, I missed your comment with the exception trace, sorry about that.

        After seeing Jeremy's comment I tested the JAX-RS server and I can confirm it all works as expected.

        Note that "curl -T somefile targetURI" does not set Content-Type, which explains the exception you are seeing. TikaServer has two resource methods accepting PUT payloads on the same path: one specifically for multipart/form-data, and another for all other payload types, which uses a wildcard to match every possible type. When no Content-Type is available, the method with the more specific JAX-RS Consumes value (multipart/form-data) is chosen. The error mentions application/octet-stream because the spec says that, if no Content-Type is available, application/octet-stream is to be used when reading the stream, and that happens after method selection has completed.
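
The selection behaviour described here can be sketched as a small stand-alone model. This is not CXF's code or the spec algorithm in full, only a simplification of the two steps named above: pick the most specific matching Consumes value, then default the media type when reading the body. The class and method names (`putAnything`, `putMultipart`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of "select first, default the Content-Type later".
final class MethodSelectionSketch {

    // Hypothetical stand-in for a resource method and its Consumes value.
    static final class Candidate {
        final String name;
        final String consumes;

        Candidate(String name, String consumes) {
            this.name = name;
            this.consumes = consumes;
        }

        boolean isWildcard() {
            return "*/*".equals(consumes);
        }

        // A request without Content-Type rules out no candidate at selection time.
        boolean matches(String contentType) {
            if (contentType == null || isWildcard()) {
                return true;
            }
            String base = contentType.split(";")[0].trim(); // drop parameters such as boundary
            return consumes.equalsIgnoreCase(base);
        }
    }

    // Among matching candidates, a concrete Consumes value beats the wildcard.
    static Candidate select(List<Candidate> candidates, String contentType) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (!c.matches(contentType)) {
                continue;
            }
            if (best == null || (best.isWildcard() && !c.isWildcard())) {
                best = c;
            }
        }
        return best;
    }

    // The default is applied only when the body is read, after selection.
    static String typeForReading(String contentType) {
        return contentType == null ? "application/octet-stream" : contentType;
    }

    public static void main(String[] args) {
        List<Candidate> samePath = Arrays.asList(
                new Candidate("putAnything", "*/*"),
                new Candidate("putMultipart", "multipart/form-data"));
        for (String ct : new String[] {null, "text/xml", "multipart/form-data; boundary=x"}) {
            System.out.println(ct + " -> " + select(samePath, ct).name
                    + ", body read as " + typeForReading(ct));
        }
    }
}
```

In this model a request with no Content-Type lands on the multipart method and is then read as application/octet-stream, which matches the "No message body reader has been found for request class Attachment" warning in the trace.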

        Two fixes are possible:

        1. Use the -H curl parameter. For example, I started a server (using the newly added -Pserver profile) and posted a pom.xml to it, adding '-H "Content-Type: text/xml"', and all worked fine. So the actual 'fix' is to update the docs and recommend setting Content-Type when no multiparts are used.

        2. Have the TikaServer resource method that accepts multiparts listen on a unique path, say "http://localhost:9998/tika/form".

        Option 2 is less 'disruptive', but option 1 is marginally cleaner IMHO, as clients PUT-ing something to the server are expected to set Content-Type.

        I'm fine with implementing Option 2 too. Perhaps it can be done regardless, but users should still be encouraged to set content types: this can optimize the parsing, i.e. the server can use the supplied Content-Type instead of doing detection at the parser level.
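
For clients written in Java rather than curl, option 1 amounts to setting the header explicitly on the PUT. The following is a minimal sketch using the JDK's `java.net.http` API (Java 11 or later); the URL and the `text/xml` type are taken from the examples in this thread, while the class name, helper name and request body are illustrative assumptions. It only builds the request, it does not contact a server.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: a PUT request that always carries an explicit Content-Type.
final class ContentTypePutSketch {

    static HttpRequest putWithContentType(String url, String contentType, String body) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", contentType) // what '-H "Content-Type: ..."' does in curl
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = putWithContentType(
                "http://localhost:9998/tika", "text/xml", "<project/>");
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Content-Type: "
                + request.headers().firstValue("Content-Type").orElse("(none)"));
    }
}
```

To actually send it against a running server, the request could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.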

        So, shall we add "/form" to the resource method accepting multipart/form-data, or keep things as they are?

        Cheers, Sergey

        Sergey Beryozkin added a comment - - edited

        I'm treating the fact that Content-Type (if absent from the request) is defaulted to application/octet-stream after method selection as a bug in the JAX-RS 2.0 spec text, and I'm seeking clarification before deciding how to handle it. Basically, if the defaulting were done before method selection, adding a new method specifically accepting multiparts would have no effect on the existing tests that use "curl -T" with no Content-Type set.

        We do not have to wait for this spec issue to be sorted out before Tika 1.5 is released, though. Adding "/form" to the method accepting multipart/form-data would make it work as expected without affecting the existing tests.
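
Why the "/form" qualifier sidesteps the problem can be seen in a small model where the path is matched before the Consumes value. This is a simplified sketch, not the Tika or CXF code: the route names are hypothetical, and only the "/tika" and "/tika/form" paths come from this thread.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model: match the path first, then prefer the most specific Consumes value.
final class FormPathSketch {

    // Hypothetical stand-in for a resource method bound to a path.
    static final class Route {
        final String path;
        final String consumes;
        final String name;

        Route(String path, String consumes, String name) {
            this.path = path;
            this.consumes = consumes;
            this.name = name;
        }
    }

    // Returns the chosen route name, or null when nothing matches.
    static String dispatch(List<Route> routes, String path, String contentType) {
        String best = null;
        boolean bestIsWildcard = true;
        for (Route r : routes) {
            if (!r.path.equals(path)) {
                continue; // a different path never competes
            }
            boolean wildcard = "*/*".equals(r.consumes);
            boolean matches = contentType == null || wildcard
                    || r.consumes.equalsIgnoreCase(contentType.split(";")[0].trim());
            if (!matches) {
                continue;
            }
            if (best == null || (bestIsWildcard && !wildcard)) {
                best = r.name;
                bestIsWildcard = wildcard;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Route> routes = Arrays.asList(
                new Route("/tika", "*/*", "putAnything"),
                new Route("/tika/form", "multipart/form-data", "putMultipart"));
        // No Content-Type on /tika: only the wildcard method is a candidate.
        System.out.println(dispatch(routes, "/tika", null));
        // Form upload on the dedicated path.
        System.out.println(dispatch(routes, "/tika/form", "multipart/form-data; boundary=x"));
    }
}
```

With the multipart method on its own path, a `curl -T` style request without Content-Type has only the wildcard method to choose from, so the existing commands keep working.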

        Dave Meikle added a comment -

        Hi Sergey - Thanks for taking a look at this. I agree that encouraging users to set Content Types helps optimise processing, but to my mind changing how a feature like this works between minor releases is like breaking binary compatibility in the code API.

        I think adding "/form" would be a good short-term fix that provides the feature whilst retaining compatibility with the older guidance on the Wiki. It would also allow us to prepare a Tika 1.5 RC. I am thinking we should also pre-empt a future change by starting to update the guidance around Content Types.

        Cheers,
        Dave

        Sergey Beryozkin added a comment -

        Hi Dave, yes, I agree.
        All methods accepting multipart/form-data now have "/form" Path qualifiers.
        Please try the snapshots/trunk.

        Cheers, Sergey

        Sergey Beryozkin added a comment -

        We've got an early agreement that it makes sense to default Content-Type to application/octet-stream earlier than the spec currently suggests. I could fix it in CXF right now, but that would leave it somewhat 'exposed' to TCK test restrictions if JAX-RS 2.1 does not actually fix it. As such, I think we can indeed settle on a unique path for multipart/form-data payloads, which supports the cases where the client does not provide a Content-Type.

        Cheers, Sergey

        Sergey Beryozkin made changes -
        Field: Original Value -> New Value
        Status: Open [ 1 ] -> Resolved [ 5 ]
        Fix Version/s: (none) -> 1.5 [ 12324552 ]
        Resolution: (none) -> Fixed [ 1 ]
        Sergey Beryozkin made changes -
        Assignee: (none) -> Sergey Beryozkin [ sergey_beryozkin ]
        Jukka Zitting made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee: Sergey Beryozkin
          • Reporter: Sergey Beryozkin
          • Votes: 0
          • Watchers: 4
